roedoejet / g2p

Grapheme-to-Phoneme transductions that preserve input and output indices, and support cross-lingual g2p!
https://g2p-studio.herokuapp.com

Danish g2p does not process "frygt" correctly #29

Closed joanise closed 4 years ago

joanise commented 4 years ago

The "r" in "frygt" incorrectly stays as is in the eng-arpabet output.

$ g2p convert frygt dan eng-arpabet
INFO - Server initialized for eventlet.
F rUW Y T

This causes errors when running readalongs align on the Danish UDHR.

A similar error occurs with "undertrykkelse":

$ g2p convert undertrykkelse dan eng-arpabet
INFO - Server initialized for eventlet.
UW N D EH Y T rUW K K EH L S EH

Source for these two words: https://www.ohchr.org/EN/UDHR/Pages/Language.aspx?LangID=dns
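One way to catch this kind of leak automatically is to check every output token against the ARPABET symbol set. The following is only a sketch: `invalid_tokens` is a hypothetical helper, not part of g2p, and the symbol list is the 39-phoneme set used by CMUdict without stress digits.

```python
# The 39 ARPABET phoneme symbols used by CMUdict (stress digits omitted).
ARPABET = set(
    "AA AE AH AO AW AY B CH D DH EH ER EY F G HH IH IY JH K L M N NG "
    "OW OY P R S SH T TH UH UW V W Y Z ZH".split()
)


def invalid_tokens(output):
    """Return the space-separated tokens that are not ARPABET symbols.

    Hypothetical helper for illustration; not a g2p function.
    """
    return [tok for tok in output.split() if tok not in ARPABET]


print(invalid_tokens("F rUW Y T"))
print(invalid_tokens("UW N D EH Y T rUW K K EH L S EH"))
```

For both outputs above, the only token flagged is `rUW`, which is the passed-through "r" glued to the following vowel.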

roedoejet commented 4 years ago

Thanks. I would be surprised if my hastily made Danish rules accounted for the UDHR, but this specific problem is fixed now anyway: 2eaf2e5

dhdaines commented 4 years ago

I realize this specific problem is fixed, but is there a way to make g2p not pass through unconverted input characters? It seems to come up repeatedly with various languages...

roedoejet commented 4 years ago

Hm. Currently not, but yes, I could see this being useful for debugging. I think it would get annoying outside of debugging, since lots of characters typically don't get converted (punctuation, numbers, etc.). I also think it's better to get a visibly wrong output like UW N D EH Y T rUW K K EH L S EH than to silently drop the character and get UW N D EH Y T UW K K EH L S EH instead, which would still be wrong and ultimately harder to debug. Maybe -debug could give a list of untouched characters too, instead of just not letting them pass through?
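As a rough illustration of the "list of untouched characters" idea, here is a toy greedy longest-match converter that passes unmatched characters through but also reports them. Everything in it is an assumption made for the example: `convert_with_report` and `TOY_RULES` are hypothetical, the rules are not the real Danish mapping, and this is not how g2p itself applies rules.

```python
def convert_with_report(text, rules):
    """Greedy longest-match conversion that also reports untouched characters.

    `rules` maps input strings to output strings. Returns (output, untouched),
    where `untouched` lists the (index, character) pairs that matched no rule
    and were passed through unchanged.
    """
    # Try longer rules first; ignore empty keys, which would never advance.
    keys = sorted((k for k in rules if k), key=len, reverse=True)
    out, untouched, i = [], [], 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(rules[key])
                i += len(key)
                break
        else:
            # No rule matched at position i: pass through, but remember it.
            untouched.append((i, text[i]))
            out.append(text[i])
            i += 1
    return "".join(out).strip(), untouched


# Toy rules (hypothetical) that deliberately lack a rule for "r".
TOY_RULES = {"f": "F ", "y": "UW ", "g": "Y ", "t": "T "}

output, untouched = convert_with_report("frygt", TOY_RULES)
print(output)
print(untouched)
```

With these toy rules the output still contains the stray "r", so the error stays visible, while `untouched` pinpoints it as index 1 of the input.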

littell commented 4 years ago

It should probably be a parameter in the API. For orthography conversion you usually want OOVs to pass through, whereas for G2P you may want an error flag of some sort to tell the caller there's something strange about the input, so that it can try some other way of getting a pronunciation for that token.

This relates to another thing: does gi-to-pi know the input and output vocabularies of each mapping, so that it can identify when improper inputs are submitted or improper outputs are generated? The original G2P subsystem in readalongs had both inventories and mappings (with inventories being generated from the mapping if no explicit inventory file was provided, which was usually the case, eng being the notable exception). That wasn't used for catching errors (although it should have been, we just never got to it), but it was used in systems like the language id system.
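Generating inventories from a mapping could look roughly like the sketch below. It assumes each rule is a dict with "in" and "out" keys; the function names and the toy rules are hypothetical and only illustrate the idea, not the readalongs or g2p code.

```python
def derive_inventories(rules):
    """Build (input_inventory, output_inventory) from a list of rules.

    Each rule is assumed to be a dict with "in" and "out" keys.
    """
    inputs = {rule["in"] for rule in rules}
    outputs = {rule["out"] for rule in rules}
    return inputs, outputs


def out_of_inventory(text, input_inventory):
    """Return the sorted characters of `text` that appear in no rule input."""
    known_chars = set("".join(input_inventory))
    return sorted({ch for ch in text if ch not in known_chars})


# Toy rules (hypothetical) with no rule for "r".
toy_rules = [
    {"in": "f", "out": "F"},
    {"in": "y", "out": "UW"},
    {"in": "g", "out": "Y"},
    {"in": "t", "out": "T"},
]

input_inventory, output_inventory = derive_inventories(toy_rules)
print(out_of_inventory("frygt", input_inventory))
```

The same derived output inventory could be used on the other side, to flag any output symbol that no rule is able to produce.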


roedoejet commented 4 years ago

> It should probably be a parameter in the API. For orthography conversion you usually want OOVs to pass through, whereas for G2P you may want an error flag of some sort to tell the caller there's something strange about the input, so that it can try some other way of getting a pronunciation for that token.

Yes, that makes sense. I guess we'll just add a custom exception for that? Any other way you want it handled? What should we call the API parameter? strict something or other?
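A minimal sketch of what a custom exception plus a `strict` parameter could look like. The names `UnconvertedInputError`, `convert`, and `strict` are placeholders for discussion, and the single-character toy rules stand in for a real mapping; none of this is an existing g2p API.

```python
class UnconvertedInputError(ValueError):
    """Raised in strict mode when some input characters match no rule."""

    def __init__(self, text, characters):
        super().__init__(f"no rule matched {characters!r} in {text!r}")
        self.text = text
        self.characters = characters


def convert(text, rules, strict=False):
    """Convert `text` character by character using `rules`.

    With strict=False, unmatched characters pass through unchanged.
    With strict=True, any unmatched character raises UnconvertedInputError.
    """
    out, unmatched = [], []
    for ch in text:
        if ch in rules:
            out.append(rules[ch])
        else:
            unmatched.append(ch)
            out.append(ch)
    if strict and unmatched:
        raise UnconvertedInputError(text, unmatched)
    return " ".join(out)


RULES = {"f": "F", "y": "UW", "g": "Y", "t": "T"}  # toy rules, no "r"

print(convert("frygt", RULES))  # pass-through behaviour
try:
    convert("frygt", RULES, strict=True)
except UnconvertedInputError as err:
    # A caller could fall back to another pronunciation source here.
    print("fallback needed for:", err.characters)
```

Defaulting `strict` to False would keep the current pass-through behaviour for orthography conversion, while G2P callers could opt in and catch the exception.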

> This relates to another thing: does gi-to-pi know the input and output vocabularies of each mapping, so that it can identify when improper inputs are submitted or improper outputs are generated?

No, it's currently not required to provide an inventory. I think we could do this, but it would be nice not to have to do it in the mapping files themselves, just so we don't have unnecessary rules (i.e. x -> x) making the mappings large. I can see adding an inventory key to each mapping in the config, though, and then pointing that to a separate CSV or JSON file or something, also in the folder. Then we could also define normalization mappings that could handle basic normalization for OOV characters.
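To make the config idea concrete, here is a sketch of resolving an optional `inventory` key to a JSON file in the same folder as the mapping. The key name, the file names, and `load_inventory` are all assumptions for illustration; the current config format has no such key.

```python
import json
import tempfile
from pathlib import Path


def load_inventory(config, folder):
    """Load the inventory named by the (hypothetical) `inventory` config key.

    Returns a set of symbols, or None if the mapping declares no inventory.
    """
    name = config.get("inventory")
    if name is None:
        return None
    with open(Path(folder) / name, encoding="utf-8") as f:
        return set(json.load(f))


# Demo with a temporary folder standing in for a mapping folder.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "dan_inventory.json").write_text(
        json.dumps(["f", "r", "y", "g", "t"]), encoding="utf-8"
    )
    config = {"mapping": "dan_to_ipa.json", "inventory": "dan_inventory.json"}
    inventory = load_inventory(config, folder)

print(sorted(inventory))
```

Keeping the inventory in its own file means the mapping needs no identity rules, and mappings without the key keep working unchanged.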