kylebgorman / pynini

Read-only mirror of Pynini
http://pynini.opengrm.org
Apache License 2.0

Acceptance of input string and symbols longer than one character #72

Closed gMontoyaSpeech closed 7 months ago

gMontoyaSpeech commented 8 months ago

I recently found this repo, and I think it has everything I'm looking for in my project. This is exciting!

I've been following a beginner's tutorial (https://gist.github.com/kylebgorman/7d406f577ef1922b2dd3a5ac52752dea), but I still have some fundamental questions I need to clear up before continuing.

  1. After checking some previously opened issues, I saw that the add_symbol method is used to add a symbol to the symbol table:

    sym = pynini.SymbolTable()
    sym.add_symbol("a", ord("a"))

Some of my input symbols consist of two characters (e.g., "E+"). Unfortunately, these can't be passed to ord because it only accepts a single character. How should I handle this? When I build an automaton that accepts "E+" as a single symbol, a seemingly random number is assigned to it:

    pynini.accep("[E+][d]") ---> This yields:
    0 1 983040 983040
    1 2 100 100
    3

The mismatch between the number Pynini generates and the number I should use to add E+ with add_symbol is kind of a blocker for me.

  2. After building an FST that looks like this:

    sheep = acceptor("b") + acceptor("a").plus

How can I pass an input string to see whether the automaton accepts or rejects it? For example, I know that the string "dad" is not going to be accepted. So I was wondering if there's a method like sheep.read_input("dad") that returns False or something similar. I already checked the available methods but couldn't find anything like it. I saw the method read_from_string, yet it didn't give me the expected results (unless I used it incorrectly).

Thanks all for your help!

kylebgorman commented 8 months ago

There's no right answer as to how you should number it. Just using the next available number, which add_symbol does by default, is totally appropriate. (Note that 0 is reserved for epsilon and negative integers are reserved for internal use, so neither should be used for ordinary symbols.) I'd do something like this:

    symbols = pynini.SymbolTable()  # I thought `sym` implies a single label, not a table thereof.
    symbols.add_symbol("<epsilon>")  # Takes key 0 first, so that later symbols start at 1.
    e_plus = symbols.add_symbol("E+")
    myfst.add_arc(state, pynini.Arc(e_plus, e_plus, ...))

Does that make sense? Alternatively, the string compiler can be passed the symbol table as its token_type, in which case it reads the string as space-delimited symbols from that table. E.g.:

    myfst = pynini.accep("E+ E+ E+", token_type=symbols)

It should then parse the string as expected. See here for low-level documentation.
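
To make the numbering concrete, here is a plain-Python stand-in for what a symbol table does. It is only a sketch, not Pynini's SymbolTable class, and the helper names add_symbol and tokens_to_labels are made up for illustration. Label 0 goes to `<epsilon>` first so that real symbols start at 1, and a space-delimited string is then mapped to labels, which is roughly what passing the table as token_type asks the string compiler to do.

```python
from typing import Dict, List

# Plain-Python stand-in for a symbol table (not Pynini's SymbolTable).
table: Dict[str, int] = {}


def add_symbol(symbol: str) -> int:
    """Returns the symbol's label, assigning the next free integer if new."""
    if symbol not in table:
        table[symbol] = len(table)
    return table[symbol]


def tokens_to_labels(string: str) -> List[int]:
    """Maps a space-delimited string to labels; unknown tokens raise KeyError."""
    return [table[token] for token in string.split()]


add_symbol("<epsilon>")  # Takes label 0, which is reserved for epsilon.
e_plus = add_symbol("E+")
d = add_symbol("d")
labels = tokens_to_labels("E+ E+ d")
print(e_plus, d, labels)
```

The point is that the particular integers don't matter, as long as the same table is used everywhere the symbols are read or written.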

Just compose/intersect the input string with the automaton, and test whether there are any paths through the resulting machine. (By default it "connects", i.e., "trims", the machine after composition, so if there are no paths, the machine has no states.) Here's a simple helper:

    def is_accepted(string: str, automaton: pynini.Fst) -> bool:
        lattice = pynini.intersect(string, automaton)
        # This is our convention for testing for an empty machine.
        return lattice.start() != pynini.NO_STATE_ID

Alternatively, you can use one of the helper functions in pynini.lib.rewrite. Here's an example:

    from pynini.lib import rewrite

    accepted = rewrite.matches(string, string, automaton)

The helper functions are quite general and can be used for transducers and not just acceptors, which is why you have to pass the string argument twice. As the documentation says, it returns "whether an input-output pair is generated by a rule". Or you can wrap that function further (with a lambda, or partial function application, or a wrapper function) so you don't have to repeat the argument, if you want.
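
For intuition about what "any paths through the machine" means, here is a plain-Python sketch that walks a hand-written transition table for the sheep language (b followed by one or more a). It does not use Pynini. The table, the state numbering, and the name sheep_accepts are assumptions made for this illustration, and it relies on the machine being deterministic.

```python
from typing import Dict, Set, Tuple

# Hand-written transition table for b a+ (the "sheep" language):
# state 0 --b--> state 1 --a--> state 2, with an "a" self-loop on state 2.
TRANSITIONS: Dict[Tuple[int, str], int] = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 2,
}
START = 0
FINAL: Set[int] = {2}


def sheep_accepts(string: str) -> bool:
    """Returns True if the string labels a path from START to a final state."""
    state = START
    for char in string:
        if (state, char) not in TRANSITIONS:
            return False  # No arc to follow, so no path exists.
        state = TRANSITIONS[(state, char)]
    return state in FINAL


print(sheep_accepts("baaa"), sheep_accepts("dad"), sheep_accepts("b"))
```

Intersecting the string with the automaton answers the same question for arbitrary (including nondeterministic) machines without hand-coding the walk.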

gMontoyaSpeech commented 8 months ago

Hi @kylebgorman Thank you very much for the quick answer. I'm not sure if I totally understood your answer so I'll try them out and come back here very soon with the outcome. Makes me very happy to see that the issues here are quickly answered :)

gMontoyaSpeech commented 8 months ago

Hi again!

After spending another day reading and following different tutorials about this module, I think I'm finally getting my head wrapped around how it works and how I could use it. However, I still have some questions about weights. I noticed that the majority of the examples are about FSTs, and hardly any documentation about WFSTs has been written (or is it just difficult to find?). So fundamental issues like how to add weights to transitions are not well covered. How could I do that?

As far as I remember, OpenFst has a plain-text file format along these lines: <init-state> <next-state> <isymbol> <osymbol> <weight>. The only way I've found so far to set a weight is Arc(ilabel, olabel, weight, nextstate). I have lots of arcs in my automata, so would it be easier to generate a plain-text file and then import it into Pynini?

I also have another question about how to use the SymbolTable elements. Let's suppose that my SymbolTable is composed of the following symbols: {"<epsilon>","a","b","c","t","@","f","E","l"} and I wish to build something like this: [image]

The returning arcs would be the set of all elements of the SymbolTable except the one that is needed to move to the next state. For example, if the current state is 1, then the returning arc should include {"<epsilon>","b","c","t","@","f","E","l"} since the symbol "a" is the accepted one to continue.

I hope these questions are clear.

Thanks in advance!

kylebgorman commented 8 months ago

There are a lot of ways to add weights. The accep function takes an optional weight argument if you're converting a string to an acceptor. If you have an already built automaton and want to add one weight to it, you can use pynini.lib.pynutil.add_weight. The pynini.string_map and pynini.string_file functions, which compile an acceptor or transducer from a TSV-file format (one string or input/output string pair per line) can also use weights.

Finally, that "AT&T" (we call it) plain text format is also supported using the lower-level pywrapfst interface, which comes with Pynini; pywrapfst.Compiler is the name of the class that makes automata from this. Note however that this produces instances of a less derived class than pynini.Fst and you may have to downcast them using pynini.Fst.from_pywrapfst if you want to use them with Pynini.
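
If you go the text-format route with many arcs, generating the file from Python data is straightforward. A minimal sketch, assuming whitespace-separated fields in the order given in the question (source state, destination state, input symbol, output symbol, weight), with each final state on a line of its own. The function name to_att_text and the sample arcs are hypothetical, and compiling the resulting text is not shown here.

```python
from typing import Iterable, List, Tuple

# (source state, destination state, input symbol, output symbol, weight)
ArcTuple = Tuple[int, int, str, str, float]


def to_att_text(arcs: Iterable[ArcTuple], final_states: Iterable[int]) -> str:
    """Renders arcs and final states as AT&T-format text, one item per line."""
    lines: List[str] = []
    for src, dst, isym, osym, weight in arcs:
        lines.append(f"{src} {dst} {isym} {osym} {weight}")
    for state in final_states:
        lines.append(str(state))  # A final state with the default weight.
    return "\n".join(lines) + "\n"


# Hypothetical two-arc weighted transducer.
arcs = [(0, 1, "b", "b", 0.5), (1, 1, "a", "a", 1.25)]
text = to_att_text(arcs, final_states=[1])
print(text, end="")
```

In this format the source state on the first line is taken as the start state, and if the arcs are written with symbols rather than integers, the compiler also needs the matching symbol tables.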

Just like OpenFst itself, the algorithms (except those involved in conversion between strings and automata) don't care about symbol tables, taking them to be a pure convenience for human users. They only use the underlying integers on the arcs, not the symbols. It's up to you to keep the symbol tables in sync. There are functions like pywrapfst.compact_symbol_table or pywrapfst.merge_symbol_table, and there are methods to attach or detach symbol tables to automata.
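
Since only the integers matter, the "every symbol except the accepted one" arcs from the question can be computed directly from the label set. A rough sketch in plain Python, not Pynini. The label numbering is hypothetical, and because the image in the question is not available, it assumes the returning arcs lead back to state 0.

```python
from typing import Dict, List, Tuple

# Hypothetical symbol-to-label mapping; label 0 is epsilon.
SYMBOLS: Dict[str, int] = {
    "<epsilon>": 0, "a": 1, "b": 2, "c": 3, "t": 4,
    "@": 5, "f": 6, "E": 7, "l": 8,
}


def arcs_for_state(state: int, accepted: str) -> List[Tuple[int, int, int]]:
    """Returns (source, destination, label) triples for one state.

    The accepted symbol advances to state + 1; every other label returns to
    state 0 (an assumption about where the returning arcs lead).
    """
    arcs = [(state, state + 1, SYMBOLS[accepted])]
    for symbol, label in SYMBOLS.items():
        if symbol != accepted:
            arcs.append((state, 0, label))
    return arcs


arcs = arcs_for_state(1, "a")
print(len(arcs))
```

Each triple could then be turned into an arc with whatever weight is needed. Whether label 0 belongs among the returning arcs is a modelling choice; it is included here only because the question lists `<epsilon>` in that set.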

gMontoyaSpeech commented 8 months ago

Thanks for the answer :)

I'd already read about pynini.lib.pynutil.add_weight, but unfortunately it assigns the weight to the last state. I'm trying to build a WFST (I'm going beyond FSAs) that has an independent weight per arc, so that option doesn't really help in this case. pynini.string_map sounds appealing if it's possible to have a different weight per line. Would that be possible? Could you point me to an example showing how this function is used and what the TSV looks like?

kylebgorman commented 8 months ago

Read the tests in the repo, they’re extensive.

If you want to enter separate weights per arc the pywrapfst Compiler is your best bet, or just build the WFSA in pure Python using the methods.

gMontoyaSpeech commented 8 months ago

Thanks for the advice. I'll give pywrapfst a go now that I know it suits my needs better. But ... this feels like being back at square one, since I've already spent a good amount of time going through the Pynini documentation and tutorials. Just for my understanding: if I build my WFST using AT&T files + pywrapfst and then use pynini.Fst.from_pywrapfst to cast it to a Pynini class, what kind of features are lost? Why did you say in your previous answer that this would be downcasting? Lastly, could I post questions related to pywrapfst here, or should I discuss them somewhere else? Thanks a mil!

kylebgorman commented 8 months ago

Pywrapfst is just the backend that Pynini is written on top of. I wrote both of them and they share a huge amount of API, maybe 80%. Pynini just adds a few specific operators and the ability to coerce between strings and FSTs. I am shaky on the "downcasting" vs. "upcasting" but pynini.Fst is a subclass of pywrapfst.VectorFst. I don't see why it would be an issue to move between the two APIs as needed.

If you don't want to use Pywrapfst directly like I suggested, just use the Pynini FST methods like add_state and add_arc to build up FSTs state by state and arc by arc.

To be honest, this is supposed to be for reporting bugs, though I do get a lot of questions here. I actually created a discussion forum for these libraries, but it didn't receive a single post in over a year, so I discontinued it. Politely and with all respect, if you want to ask developers a question, I'd suggest email...and in this case it sounds like you really ought to try something out first and see how far you get; send a Gist or something like that if you get stuck. There is also a book and a paper describing Pynini and related libraries if you're interested:

K. Gorman. 2016. Pynini: A Python library for weighted finite-state grammar compilation. In Proceedings of the ACL Workshop on Statistical NLP and Weighted Automata, pages 75-80.

K. Gorman & R. Sproat. 2021. Finite-State Text Processing. Morgan & Claypool.

There is also an OpenFst forum and people have sometimes posted Pynini and Pywrapfst questions there: https://www.openfst.org/twiki/bin/view/Forum/FstForum

gMontoyaSpeech commented 8 months ago

Thank you very much for these links. I'll check them out thoroughly. I do know that this platform is intended for reporting bugs rather than posting questions, and that's why I asked what I did. I already took a look at the forum, but it seems to be mostly about OpenFst rather than Pywrapfst. It's a pity that you discontinued the forum for Pynini/pywrapfst. Since you said that you're the person who developed these tools, may I assume that I could contact you via email in case I get stuck? Thank you very much :)

kylebgorman commented 8 months ago

Yes please do!

kylebgorman commented 7 months ago

Closing; you're welcome to contact me off-thread.