Closed. joanise closed this 4 years ago
Thanks, I would be surprised if my hastily made Danish rules would account for the UDHR, but this specific problem is fixed now anyways: 2eaf2e5
I realize this specific problem is fixed, but is there a way to make g2p not pass through unconverted input characters? It seems to come up repeatedly with various languages...
Hm. Currently not, but yes, I could see this being useful for debugging. I think it would get annoying outside of debugging, since there are lots of characters that typically don't get converted (punctuation, numbers, etc.). And I think it's better to get an error like UW N D EH Y T rUW K K EH L S EH than to silently not pass it through and get UW N D EH Y T UW K K EH L S EH instead, which would be wrong and ultimately harder to debug. Maybe -debug could give a list of untouched characters too, instead of just not letting them pass through?
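To make the idea of reporting untouched characters concrete, here is a minimal sketch. It does not use the real g2p API: `convert_with_report` and `toy_rules` are hypothetical names, and the rules are a made-up fragment rather than the actual Danish mapping. Unmatched characters still pass through, but their positions are collected so a debug mode could print them.

```python
def convert_with_report(text, rules):
    """Apply rules longest-match-first.

    Returns (output, untouched), where untouched is a list of
    (index, character) pairs for input that no rule matched.
    Hypothetical helper, not part of g2p.
    """
    # Try longer rule inputs first; skip empty keys to avoid looping forever.
    keys = sorted((k for k in rules if k), key=len, reverse=True)
    out, untouched = [], []
    i = 0
    while i < len(text):
        for k in keys:
            if text.startswith(k, i):
                out.append(rules[k])
                i += len(k)
                break
        else:
            # No rule matched: pass the character through, but record it.
            out.append(text[i])
            untouched.append((i, text[i]))
            i += 1
    return "".join(out), untouched


# Made-up rule fragment with no rule for "r".
toy_rules = {"f": "F ", "y": "UW ", "g": "G ", "t": "T "}
output, untouched = convert_with_report("frygt", toy_rules)
print(output)     # the "r" is still visible in the output
print(untouched)  # and is also reported explicitly
```

With this shape the default behaviour stays the same, and the extra list is only consulted when someone is debugging a mapping.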
It should probably be a parameter in the API, like for orthography conversion you usually want OOVs to pass through, whereas for G2P you may want an error flag of some sort to tell the caller there's something strange about the input, so that it can try some other way of getting a pronunciation for that token.
This relates to another thing: does gi-to-pi know the input and output vocabularies of each mapping, so as to be able to identify when improper inputs are submitted or improper outputs are generated? The original G2P subsystem in readalongs had both inventories and mappings (with inventories being generated from the mapping if no explicit inventory file was provided, which was usually the case, eng being the notable exception). That wasn't used for catching errors (although it should have been, we just never got to it), but it was used in systems like the language ID system.
> It should probably be a parameter in the API, like for orthography conversion you usually want OOVs to pass through, whereas for G2P you may want an error flag of some sort to tell the caller there's something strange about the input, so that it can try some other way of getting a pronunciation for that token.
Yes, that makes sense. I guess we'll just add a custom exception for that? Any other way you want it handled? What should we call the API parameter? strict something or other?
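As a rough illustration of the proposal, the sketch below shows one way a strict flag and a custom exception could fit together. Everything here is an assumption for discussion: `convert`, `strict`, and `UnconvertedInputError` are hypothetical names, not the g2p API, and the converter is a toy that maps one character at a time.

```python
class UnconvertedInputError(ValueError):
    """Hypothetical exception: some input characters matched no rule."""

    def __init__(self, token, chars):
        super().__init__(f"unconverted characters {chars!r} in {token!r}")
        self.token = token
        self.chars = chars


def convert(token, rules, strict=False):
    """Toy per-character converter (not the g2p implementation).

    With strict=False, unmatched characters pass through unchanged,
    which is what orthography conversion usually wants.
    With strict=True, any unmatched character raises, so the caller
    can fall back to another source of pronunciations.
    """
    out, leftover = [], []
    for ch in token:
        if ch in rules:
            out.append(rules[ch])
        else:
            leftover.append(ch)
            out.append(ch)
    if strict and leftover:
        raise UnconvertedInputError(token, leftover)
    return " ".join(out)


rules = {"a": "AH"}  # made-up rule
print(convert("ab", rules))  # lenient: "b" passes through
try:
    convert("ab", rules, strict=True)
except UnconvertedInputError as err:
    print("fallback needed for", err.token, "because of", err.chars)
```

Carrying the offending characters on the exception lets the caller decide per token whether to retry with another mapping or skip the word.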
> This relates to another thing, does gi-to-pi know the input and output vocabularies of each mapping, so to be able to identify when improper inputs are submitted or improper outputs are generated?
No, it's currently not required to provide an inventory. I think we could do this, but it would be nice to not have to do it in the mapping files themselves, just so we don't have unnecessary rules (i.e. x -> x) making the mappings large. I can see adding an inventory key to each mapping in the config though, and then pointing that to a separate CSV or JSON file or something, also in the folder. Then we could also define normalization mappings that could handle basic normalization for OOV characters.
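A sketch of how an optional inventory file might be used, under several assumptions: the `inventory` config key, the JSON-list file format, and the helper names `load_inventory` and `out_of_inventory` are all hypothetical, chosen only to illustrate the idea. The demo writes a throwaway inventory file to a temporary folder.

```python
import json
import tempfile
from pathlib import Path


def load_inventory(mapping_dir, mapping_config):
    """Return the declared inventory as a set, or None if none is declared.

    Assumes a hypothetical "inventory" key naming a JSON list of symbols
    stored in the same folder as the mapping.
    """
    name = mapping_config.get("inventory")
    if name is None:
        return None
    with open(Path(mapping_dir) / name, encoding="utf-8") as f:
        return set(json.load(f))


def out_of_inventory(symbols, inventory):
    """List the symbols that are not in the inventory (empty if no inventory)."""
    if inventory is None:
        return []
    return [s for s in symbols if s not in inventory]


with tempfile.TemporaryDirectory() as folder:
    # Made-up output inventory for the demo.
    inv_file = Path(folder) / "arpabet_inventory.json"
    inv_file.write_text(json.dumps(["F", "G", "T", "UW"]), encoding="utf-8")
    config = {"mapping": "rules.csv", "inventory": "arpabet_inventory.json"}

    inventory = load_inventory(folder, config)
    bad = out_of_inventory("F rUW G T".split(), inventory)
    print(bad)  # symbols the output inventory does not recognize
```

Keeping the inventory in its own file means the mapping itself needs no identity rules, and mappings without an inventory simply skip the check.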
The "r" in "frygt" incorrectly stays as is in the eng-arpabet output.
This causes errors in readalongs align with the Danish UDHR.
A similar error occurs with "undertrykkelse":
Source for these two words: https://www.ohchr.org/EN/UDHR/Pages/Language.aspx?LangID=dns