roedoejet / g2p

Grapheme-to-Phoneme transductions that preserve input and output indices, and support cross-lingual g2p!
https://g2p-studio.herokuapp.com
Other
119 stars 26 forks source link

fix: use NFC if possible in moh_equiv.json #382

Closed MENGZHEGENG closed 2 weeks ago

MENGZHEGENG commented 3 weeks ago

Pull request template for adding a new language

github-actions[bot] commented 3 weeks ago
CLI load time: 0:00.05
PR head 59e0a0887d69ab47f050caa753b6042c709d1105
Imports that take more than 0.1 s:
import time: self [us] | cumulative | imported package
codecov[bot] commented 3 weeks ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 93.18%. Comparing base (077afc2) to head (59e0a08).

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #382 +/- ## ======================================= Coverage 93.18% 93.18% ======================================= Files 17 17 Lines 2437 2437 Branches 544 544 ======================================= Hits 2271 2271 Misses 95 95 Partials 71 71 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

joanise commented 3 weeks ago

This PR does not change the repo: the second commit undoes the first one. Please check the commits and push an update with what you really meant to push here. Also, once you have this implemented, let's have Akwiratekha' review the changes to make sure they're appropriate.

MENGZHEGENG commented 3 weeks ago

This PR does not change the repo: the second commit undoes the first one. Please check the commits and push an update with what you really meant to push here. Also, once you have this implemented, let's have Akwiratekha' review the changes to make sure they're appropriate.

Thanks Eric for catching this. I've made a new commit and I will invite Akwiratekha' as a reviewer

joanise commented 3 weeks ago

Hum, I'm looking at the changes, and it looks like this change does not force an update to g2p update, which means this PR is actually a no-op, having no effective change on the mapping.

You might want to know that the NFC or NFD normalization happens when we load the mapping too, which means that just changing it in the input file will not change the results.

Also, in the issue you created for moh, you mention some inconsistent output, but I don't know if it's connected to this PR. When a PR is meant to fix the output for some input, you should add test cases covering that input to confirm the PR really makes the expected change.

If the one line change is the only change in this PR, I would not merge it, unless you can confirm it's fixing something with relevant test cases.

MENGZHEGENG commented 3 weeks ago

Hum, I'm looking at the changes, and it looks like this change does not force an update to g2p update, which means this PR is actually a no-op, having no effective change on the mapping.

You might want to know that the NFC or NFD normalization happens when we load the mapping too, which means that just changing it in the input file will not change the results.

Also, in the issue you created for moh, you mention some inconsistent output, but I don't know if it's connected to this PR. When a PR is meant to fix the output for some input, you should add test cases covering that input to confirm the PR really makes the expected change.

If the one line change is the only change in this PR, I would not merge it, unless you can confirm it's fixing something with relevant test cases.

Hi Eric,

The reason I create the PR is when I proprocessed some Mohawk text, I got the feeling that NFC/NFD conversion is not applied before executing moh_equiv.json. I have some à in NFC form in the text, and it is not converted to àː as desired, but leaves as à.

The other issue comes with working with syllables in Mohawl. For example, when èn is the end of one syllable, it seems it will not trigger moh_equiv.json to convert it to ènː. I don't know how to change it however, so this PR is not related to that issue.

joanise commented 3 weeks ago

Ah, OK, I get it. But this won't fix those problems. For à at the end of a word, the problem is that the context after being [^:] means there must be one more letter. If the mapping should happen when it is word final, context_after will need to be fixed. "context_after": "([^:]|$)" might work? Create unit test cases in g2p/tests/public/data/moh.psv with expected correct output, run them and see them fail, then make the fix, run g2p update and see the tests pass. Then you'll know you have a working solution.

For èn I didn't fully analyse yet, but I would expect it's a similar problem with the context, not with normalization.

MENGZHEGENG commented 3 weeks ago

Hum, I'm looking at the changes, and it looks like this change does not force an update to g2p update, which means this PR is actually a no-op, having no effective change on the mapping.

You might want to know that the NFC or NFD normalization happens when we load the mapping too, which means that just changing it in the input file will not change the results.

Also, in the issue you created for moh, you mention some inconsistent output, but I don't know if it's connected to this PR. When a PR is meant to fix the output for some input, you should add test cases covering that input to confirm the PR really makes the expected change.

If the one line change is the only change in this PR, I would not merge it, unless you can confirm it's fixing something with relevant test cases.

Thanks Eric for the insightful suggestion! I will try it out later, and may change the title of the PR accordingly.

akwiratekha commented 2 weeks ago

Hi hi, I am off today and was yesterday. I will go over the emails and what you need on Thursday. Akwiratékha'

On Jun 25, 2024, at 4:36 PM, MENGZHEGENG @.***> wrote:

Hum, I'm looking at the changes, and it looks like this change does not force an update to g2p update, which means this PR is actually a no-op, having no effective change on the mapping.

You might want to know that the NFC or NFD normalization happens when we load the mapping too, which means that just changing it in the input file will not change the results.

Also, in the issue you created for moh, you mention some inconsistent output, but I don't know if it's connected to this PR. When a PR is meant to fix the output for some input, you should add test cases covering that input to confirm the PR really makes the expected change.

If the one line change is the only change in this PR, I would not merge it, unless you can confirm it's fixing something with relevant test cases.

Thanks Eric for the insightful suggestion! I will try it out later, and may change the title of the PR accordingly.

— Reply to this email directly, view it on GitHub https://github.com/roedoejet/g2p/pull/382#issuecomment-2189921520, or unsubscribe https://github.com/notifications/unsubscribe-auth/AS457QQ7L4WI7YNR6UNL2DTZJHIDBAVCNFSM6AAAAABJYBSTYCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOBZHEZDCNJSGA. You are receiving this because your review was requested.

akwiratekha commented 2 weeks ago

Good morning,

For g2p, Aidan had me work on the rules awhile back to update them. I have updated them but me and Aidan never got to having a meeting to update the rules overall. I did have one bug in the code which I can't fix, and then there are some underlying morphological rules that can't be figured out by g2p to generate the correct phoneme. I could go into more details if yous would like, as one of the two morphological problems will need to be correct for TTS with Kawennón:nis.

Also, à: [ɑ̀ː]at the end of a word is almost non-existant, unless you found an example. As for èn: at the end of the word itself is rare, én: is much more common [ʌ̃́ː],

On Jun 25, 2024, at 4:32 PM, Eric Joanis @.***> wrote:

Ah, OK, I get it. But this won't fix those problems. For à at the end of a word, the problem is that the context after being [^:] means there must be one more letter. If the mapping should happen when it is word final, context_after will need to be fixed. "context_after": "([^:]|$)" might work? Create unit test cases in g2p/tests/public/data/moh.psv with expected correct output, run them and see them fail, then make the fix, run g2p update and see the tests pass. Then you'll know you have a working solution.

For èn I didn't fully analyse yet, but I would expect it's a similar problem with the context, not with normalization.

— Reply to this email directly, view it on GitHub https://github.com/roedoejet/g2p/pull/382#issuecomment-2189916053, or unsubscribe https://github.com/notifications/unsubscribe-auth/AS457QWZWG4EVGZNBOHOOKLZJHHUTAVCNFSM6AAAAABJYBSTYCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOBZHEYTMMBVGM. You are receiving this because your review was requested.

MENGZHEGENG commented 2 weeks ago

Good morning, For g2p, Aidan had me work on the rules awhile back to update them. I have updated them but me and Aidan never got to having a meeting to update the rules overall. I did have one bug in the code which I can't fix, and then there are some underlying morphological rules that can't be figured out by g2p to generate the correct phoneme. I could go into more details if yous would like, as one of the two morphological problems will need to be correct for TTS with Kawennón:nis. Also, à: [ɑ̀ː]at the end of a word is almost non-existant, unless you found an example. As for èn: at the end of the word itself is rare, én: is much more common [ʌ̃́ː], On Jun 25, 2024, at 4:32 PM, Eric Joanis @.***> wrote: Ah, OK, I get it. But this won't fix those problems. For à at the end of a word, the problem is that the context after being [^:] means there must be one more letter. If the mapping should happen when it is word final, context_after will need to be fixed. "context_after": "([^:]|$)" might work? Create unit test cases in g2p/tests/public/data/moh.psv with expected correct output, run them and see them fail, then make the fix, run g2p update and see the tests pass. Then you'll know you have a working solution. For èn I didn't fully analyse yet, but I would expect it's a similar problem with the context, not with normalization. — Reply to this email directly, view it on GitHub <#382 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AS457QWZWG4EVGZNBOHOOKLZJHHUTAVCNFSM6AAAAABJYBSTYCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOBZHEYTMMBVGM. You are receiving this because your review was requested.

Hi Akwiratekha', thank you so much for your feedback! It would be of great help if you and Aidan could find an opportunity to update the rulls of Mohawk in the coming weeks. For the word end case, my process now is to first break the words into syllables and then apply g2p on the syllables. As a result, it leads to cases where à (and the other mentioned IPAa) become the end of the syllable (but not the end of the word).

Thanks!

joanise commented 2 weeks ago

Hi Tim, I'm going to close this PR, as it doesn't actually solve the problem you reported. Once you've found a working solution to the problem, you can reopen a new PR with it.