kylebgorman / pynini

Read-only mirror of Pynini
http://pynini.opengrm.org
Apache License 2.0

Disambiguate with epsilon symbols on input tape. #87

Open Mirmu opened 4 days ago

Mirmu commented 4 days ago

Hi!

This is not really an issue but rather a question / feature request.

I have a wFST which associates each input string with multiple (weighted) output strings, and from that I'd like to build an FST that maps each unique accepted input string to its lowest-cost output string.

I feel that something like pn.disambiguate or pn.determinize(*, det_type="disambiguate") would fit the bill. But the original FST contains arcs such as "eps: output_symbol", and those two functions treat epsilon as a standard symbol. Would you know if something is available in Pynini / OpenFst that ignores input epsilon arcs (or is it achievable by other means)?

Any help / pointers would be super helpful, thanks a lot 🙏

PS: thanks for the Pynini library, it's a life saver.

kylebgorman commented 4 days ago

So it's interesting to think about what a determinization algorithm that "knows" about epsilons would look like, but I don't think we have a vision of that; I suspect that it's insoluble in the case where there are eps/output arcs, just like it is in many other cases involving transducers.

What you can do is to use other means to move around the epsilons, then determinize or disambiguate afterwards. A few pointers:

A third possibility is to use label encoding to "hide" epsilons and then determinize, and then decode. This is heuristic but it works pretty well. For an instance of this, see the implementation of optimize here or in chapter 4 (?) of the Pynini book.
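
To make the encode / determinize / decode idea concrete, here is a toy, unweighted sketch in plain Python. It is not Pynini's implementation, and every name in it (`encode`, `determinize`, `decode`, `encode_determinize_decode`, the `(src, ilabel, olabel, dst)` arc tuples, the `EPS` marker) is hypothetical and chosen only for illustration. The point it shows: once each input:output pair is replaced by a single opaque label, an "eps:y" arc is just another symbol, so ordinary subset construction applies. In Pynini the corresponding steps would presumably go through `pynini.EncodeMapper`, `Fst.encode`, `pynini.determinize`, and `Fst.decode`.

```python
from collections import deque

EPS = "<eps>"  # hypothetical marker for an epsilon label


def encode(arcs):
    """Replaces each (ilabel, olabel) pair with one opaque integer label."""
    table = {}
    encoded = []
    for src, ilabel, olabel, dst in arcs:
        key = table.setdefault((ilabel, olabel), len(table) + 1)
        encoded.append((src, key, dst))
    return encoded, table


def determinize(arcs, start, finals):
    """Plain subset construction over an epsilon-free, unweighted acceptor."""
    start_set = frozenset([start])
    ids = {start_set: 0}
    queue = deque([start_set])
    det_arcs = []
    det_finals = set()
    while queue:
        subset = queue.popleft()
        if subset & finals:
            det_finals.add(ids[subset])
        # Group the destinations reachable from this subset by label.
        by_label = {}
        for src, label, dst in arcs:
            if src in subset:
                by_label.setdefault(label, set()).add(dst)
        for label in sorted(by_label):
            target = frozenset(by_label[label])
            if target not in ids:
                ids[target] = len(ids)
                queue.append(target)
            det_arcs.append((ids[subset], label, ids[target]))
    return det_arcs, det_finals


def decode(arcs, table):
    """Restores the original (ilabel, olabel) pairs on each arc."""
    inverse = {key: pair for pair, key in table.items()}
    return [(src, *inverse[label], dst) for src, label, dst in arcs]


def encode_determinize_decode(arcs, start, finals):
    encoded, table = encode(arcs)
    det_arcs, det_finals = determinize(encoded, start, set(finals))
    return decode(det_arcs, table), det_finals


# Two identical a:x arcs leave state 0; each is followed by an eps:output arc.
arcs = [
    (0, "a", "x", 1),
    (0, "a", "x", 2),
    (1, EPS, "y", 3),
    (2, EPS, "z", 3),
]
det_arcs, det_finals = encode_determinize_decode(arcs, start=0, finals={3})
```

In this example the two a:x arcs are merged, but the result still has both an eps:y and an eps:z arc leaving the same state. The machine is deterministic over encoded pairs, not over input labels, which is why the approach is only a heuristic for the original question.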