Open adelgiudice opened 6 days ago
Confirmed. I can reproduce this and also get "wi' lɛt ɡoʊ, ʃi' stɑp." for the test.output_string. We'll look into this - thank you for raising!
I also noticed a (likely) associated issue with 's and 'd getting converted to 'ɛs and 'ɛd.
Upon further analysis, this is caused by an interaction with the tokenizer: the tokenizer is not aware that she'll and he's are words in the lexicon, so it just tokenizes she'll to [word she, punct ', word ll]. The transducer then looks up each word in the lexicon separately, where it finds she but not ll.
Similarly, he's ends up looking up s separately in the lexicon, where I assume ɛs is the pronunciation of the letter itself.
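To make the interaction concrete, here is a minimal sketch of the behaviour described above. It does not use g2p's actual tokenizer or lexicon: TOY_LEXICON, naive_tokenize and lookup_per_token are hypothetical names, and the pronunciations are toy values chosen only to mirror the outputs reported in this thread.

```python
import re

# Toy lexicon: hypothetical entries for illustration only, not g2p's real data.
TOY_LEXICON = {
    "she": "ʃi",
    "she'll": "ʃil",
    "he": "hi",
    "he's": "hiz",
    "s": "ɛs",  # pronunciation of the letter name
}


def naive_tokenize(text):
    """Split text into (kind, value) pairs; an apostrophe counts as punctuation."""
    return [
        ("word" if piece.isalpha() else "punct", piece)
        for piece in re.findall(r"[a-z]+|[^a-z]", text.lower())
    ]


def lookup_per_token(text):
    """Look up each word token on its own; words missing from the lexicon map to ''."""
    out = []
    for kind, value in naive_tokenize(text):
        out.append(TOY_LEXICON.get(value, "") if kind == "word" else value)
    return "".join(out)


print(naive_tokenize("she'll"))    # [('word', 'she'), ('punct', "'"), ('word', 'll')]
print(lookup_per_token("she'll"))  # ʃi'
print(lookup_per_token("he's"))    # hi'ɛs
```

Even though she'll and he's are in the toy lexicon, they are never looked up as whole words, which is the same pattern as the ʃi' and 'ɛs outputs above.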
With a non-tokenizing transducer, you get this:
>>> import g2p
>>> t = g2p.make_g2p("eng", "eng-ipa", tokenize=False)
>>> t("she'll")
<g2p.transducer.TransductionGraph object at 0x00000275D881EB00>
>>> t("she'll").pretty_edges()
[('s', 'ʃ'), ('h', 'ʃ'), ('e', 'i'), ("'", 'i'), ('l', 'l'), ('l', 'l')]
We'll have to do some more analysis to see if we can find a solution for the tokenizing transducer to handle these contractions.
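One direction that might be worth exploring, purely as a sketch and not a description of how g2p works or will work: after tokenizing, merge a word + punct + word run back into a single word whenever the joined form is in the lexicon. The function merge_lexicon_tokens and the token format below are hypothetical.

```python
def merge_lexicon_tokens(tokens, lexicon):
    """Greedily merge word + punct + word runs whose joined form is in the lexicon."""
    merged = []
    i = 0
    while i < len(tokens):
        window = tokens[i:i + 3]
        kinds = [kind for kind, _ in window]
        joined = "".join(value for _, value in window)
        if kinds == ["word", "punct", "word"] and joined in lexicon:
            merged.append(("word", joined))  # treat the contraction as one word
            i += 3
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical token stream for "she'll stop" and a toy lexicon.
tokens = [("word", "she"), ("punct", "'"), ("word", "ll"),
          ("punct", " "), ("word", "stop")]
toy_lexicon = {"she", "she'll", "stop"}
print(merge_lexicon_tokens(tokens, toy_lexicon))
# [('word', "she'll"), ('punct', ' '), ('word', 'stop')]
```

Runs whose joined form is not in the lexicon are left split, so ordinary punctuation between words would be unaffected under this assumption.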
Also, there's a related side bug you're seeing here: when a word is not in the lexicon, the output is empty (which is the expected behaviour), but the word is simply removed from the pretty edges, instead of showing deletion edges. For example, if you try with "oov", you'll see this:
>>> t("she oov is").pretty_edges()
[('s', 'ʃ'), ('h', 'ʃ'), ('e', 'i'), (' ', ' '), (' ', ' '), ('i', 'ɪ'), ('s', 'z')]
where we might have preferred something like
[('s', 'ʃ'), ('h', 'ʃ'), ('e', 'i'), (' ', ' '), ('o', ''), ('o', ''), ('v', ''), (' ', ' '), ('i', 'ɪ'), ('s', 'z')]
to show in the graph that oov got deleted because it was not found in the lexicon.
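As a rough illustration of the preferred output, the sketch below builds edges for a toy input and emits an explicit deletion edge for every character of a word that is not found. TOY_ALIGNMENTS and edges_with_deletions are hypothetical and stand in for the real lexicon lookup; this is not g2p's implementation.

```python
# Toy character-level alignments; hypothetical, not g2p's lexicon format.
TOY_ALIGNMENTS = {
    "she": [("s", "ʃ"), ("h", "ʃ"), ("e", "i")],
    "is": [("i", "ɪ"), ("s", "z")],
}


def edges_with_deletions(text):
    """Build (input, output) edges, mapping each character of an unknown word to ''."""
    edges = []
    for i, word in enumerate(text.split(" ")):
        if i > 0:
            edges.append((" ", " "))  # keep the space between words
        if word in TOY_ALIGNMENTS:
            edges.extend(TOY_ALIGNMENTS[word])
        else:
            edges.extend((ch, "") for ch in word)  # explicit deletion edges
    return edges


print(edges_with_deletions("she oov is"))
```

With this toy data the result matches the preferred edge list above, with ('o', ''), ('o', ''), ('v', '') standing in for the deleted word.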
Here is the code I'm using in Python:
and the output:
We'll and she'll might be tokenized as we' and she'. I'm not sure how or why this is happening or whether this is understood behavior. I'm not sure how to debug it.