roedoejet / g2p

Grapheme-to-Phoneme transductions that preserve input and output indices, and support cross-lingual g2p!
https://g2p-studio.herokuapp.com

Possible issue with tokenization #401

Open adelgiudice opened 6 days ago

adelgiudice commented 6 days ago

Here is the code I'm using in Python:

from g2p import make_g2p
TRANSDUCER = make_g2p('eng', 'eng-ipa')
test = TRANSDUCER("We'll let go, she'll stop.")
test.pretty_edges()

and the output:

[('W', 'w'),
 ('e', 'i'),
 ("'", "'"),
 (' ', ' '),
 ('l', 'l'),
 ('e', 'ɛ'),
 ('t', 't'),
 (' ', ' '),
 ('g', 'ɡ'),
 ('o', 'o'),
 ('o', 'ʊ'),
 (',', ','),
 (' ', ' '),
 ('s', 'ʃ'),
 ('h', 'ʃ'),
 ('e', 'i'),
 ("'", "'"),
 (' ', ' '),
 ('s', 's'),
 ('t', 't'),
 ('o', 'ɑ'),
 ('p', 'p'),
 ('.', '.')]

It looks like we'll and she'll are being tokenized as we' and she'. I'm not sure how or why this is happening, whether it's expected behavior, or how to go about debugging it.

roedoejet commented 6 days ago

Confirmed. I can reproduce this and also get "wi' lɛt ɡoʊ, ʃi' stɑp." for test.output_string. We'll look into this - thank you for raising it!

adelgiudice commented 5 days ago

I noticed a (likely) related issue: 's and 'd get converted to 'ɛs and 'ɛd.

joanise commented 5 days ago

Upon further analysis, this is caused by an interaction with the tokenizer: the tokenizer is not aware that she'll and he's are words in the lexicon; it just tokenizes she'll to [word she, punct ', word ll]. The transducer then looks up each token in the lexicon, where it finds she but not ll.

Similarly, he's ends up looking up s separately in the lexicon, where I assume ɛs is the pronunciation of the letter itself.
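
You can check the tokenizer output directly. A quick sketch (I'm assuming make_tokenizer and tokenize_text are the entry points here, which I believe they are, but adjust if the names differ):

>>> from g2p import make_tokenizer
>>> tokenizer = make_tokenizer("eng")
>>> [unit["text"] for unit in tokenizer.tokenize_text("she'll")]
['she', "'", 'll']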

With a non-tokenizing transducer, you get this:

>>> import g2p
>>> t = g2p.make_g2p("eng", "eng-ipa", tokenize=False)
>>> t("she'll").pretty_edges()
[('s', 'ʃ'), ('h', 'ʃ'), ('e', 'i'), ("'", 'i'), ('l', 'l'), ('l', 'l')]

We'll have to do some more analysis to see if we can get the tokenizing transducer to handle these contractions.
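
In the meantime, a rough workaround sketch (the g2p_sentence helper below is just made up for illustration, not part of the g2p API): split the sentence yourself with a regex that keeps word-internal apostrophes, then run each word through the non-tokenizing transducer, which handles these contractions correctly:

import re

from g2p import make_g2p

transducer = make_g2p("eng", "eng-ipa", tokenize=False)

def g2p_sentence(text):
    # Capture words, keeping apostrophes that sit inside a word (e.g. she'll);
    # non-word parts (spaces, punctuation) pass through unchanged.
    parts = re.split(r"(\w+(?:'\w+)*)", text)
    return "".join(
        transducer(part).output_string if re.match(r"\w", part) else part
        for part in parts
    )

print(g2p_sentence("We'll let go, she'll stop."))

Note that this only recovers the output string; you lose the sentence-level edges that the tokenizing transducer would give you.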

joanise commented 5 days ago

Also, there's a related side bug you're seeing here: when a word is not in the lexicon, the output is empty (which is the expected behaviour), but the word is simply removed from the pretty edges, instead of showing deletion edges. For example, if you try with "oov", you'll see this:

>>> t("she oov is").pretty_edges()
[('s', 'ʃ'), ('h', 'ʃ'), ('e', 'i'), (' ', ' '), (' ', ' '), ('i', 'ɪ'), ('s', 'z')]

where we might have preferred something like

[('s', 'ʃ'), ('h', 'ʃ'), ('e', 'i'), (' ', ' '), ('o', ''), ('o', ''), ('v', ''), (' ', ' '), ('i', 'ɪ'), ('s', 'z')]

to show in the graph that oov got deleted because it was not found in the lexicon.
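
Until that's fixed, one way to detect which characters were silently dropped is to diff the input side of the edges against the raw input. A sketch using difflib (missing_spans is an illustrative helper, not part of g2p):

import difflib

from g2p import make_g2p

# same non-tokenizing transducer as above
t = make_g2p("eng", "eng-ipa", tokenize=False)

def missing_spans(raw_input, edges):
    # Rebuild the input side of the graph and diff it against the raw input;
    # "delete" opcodes mark the characters that got no edge at all.
    edge_input = "".join(in_char for in_char, _ in edges)
    matcher = difflib.SequenceMatcher(a=raw_input, b=edge_input, autojunk=False)
    return [
        raw_input[i1:i2]
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag == "delete"
    ]

text = "she oov is"
print(missing_spans(text, t(text).pretty_edges()))  # e.g. ['oov']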