kylebgorman / pynini

Read-only mirror of Pynini
http://pynini.opengrm.org
Apache License 2.0

What is the correct epsilon handling in manually constructed symbol tables? #9

Closed wrznr closed 5 years ago

wrznr commented 5 years ago

Depicting the following simple transduction,

test = pynini.transducer("a", pynini.string_map(["a","b","c"]))

results in: (image: debug2)

Using a pre-defined symbol table however,

test = pynini.transducer("a", pynini.string_map(["a","b","c"], input_token_type=symbol_table, output_token_type=symbol_table), input_token_type=symbol_table, output_token_type=symbol_table)

results in: (image: debug)

The symbol table does not contain <epsilon>, and ! is mapped to 0:

print(syms.member("<epsilon>")) # False
print(syms.member("!")) # True
print(syms.find("!")) # 0

It is constructed in this function.

Since ! is the first symbol added to the symbol table, the behavior somehow makes sense. But shouldn't epsilon be a reserved, specially treated symbol? What am I missing?

kylebgorman commented 5 years ago

The issue here, I suspect, is that arc label 0 is interpreted as epsilon, regardless of how you label it. Labels [1, …] are available, but 0 is reserved for epsilon and negative labels are reserved for implementation details.

This was an early decision made in OpenFst (though I believe it was borrowed from AT&T’s FSM library much earlier) and it’s too late to do anything about it.

I believe this is documented somewhere though perhaps not as prominently as it ought to be.

Symbol tables have no effect on interpretation of FST operations except that rational operations like composition etc. will attempt to merge non-compatible symbol tables (see http://www.opengrm.org/twiki/bin/view/GRM/PyniniSymbolTableDoc).

wrznr commented 5 years ago

Thanks for your hints. Manually adding <epsilon> as the first symbol did the trick: I had a number of transducers containing both <epsilon> and ! (most probably through symbol table merging), which had a rather weird structure after optimization. Now they are optimized correctly! (I am really looking forward to the global symbol handling option.)
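
For anyone hitting the same problem, here is a rough sketch of the fix in plain Python. It does not use the Pynini API: a dict stands in for the symbol table, and the helper name `build_symbol_table` is made up for illustration. The only point is that <epsilon> takes label 0 before any real symbol is added, so ! ends up on a positive label.

```python
EPSILON = "<epsilon>"  # conventional name; only the label 0 is what matters


def build_symbol_table(symbols):
    """Map each symbol to an integer label, reserving 0 for epsilon.

    Hypothetical helper: a plain dict stands in for a real symbol table.
    """
    table = {EPSILON: 0}  # claim label 0 first
    for symbol in symbols:
        if symbol not in table:
            # Next free label; always >= 1 because epsilon already holds 0.
            table[symbol] = len(table)
    return table


syms = build_symbol_table(["!", "a", "b", "c"])
print(EPSILON in syms)  # True
print(syms["!"])        # 1
```

With a real symbol table the idea should be the same: add <epsilon> first (or explicitly with key 0), then add the remaining symbols.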

kylebgorman commented 5 years ago

All the pieces are now ready for releasing that, I believe.

I am traveling this week and away from the computer with the needed 2FA but I will be back at it on Friday. Will ping this thread then.
