jacksonllee / pycantonese

Cantonese Linguistics and NLP
https://pycantonese.org
MIT License

Jyutping to IPA support #44

Open rjrobben opened 9 months ago

rjrobben commented 9 months ago

Feature you are interested in and your specific question(s):

Is there a method that converts Jyutping to IPA? I know there is already a Jyutping-to-TIPA method; it would be great to have a Jyutping-to-IPA one as well.

What you are trying to accomplish with this feature or functionality: I am currently helping to prepare the data for training the Cantonese part of a multilingual PL-BERT for the open-source StyleTTS2 model. We need a grapheme-to-phoneme (G2P) library for the zh-yue/zh language, using the Wikipedia dataset.

We have yet to find a G2P library of good enough quality that fits the StyleTTS2 format; we tried espeak-ng and some deep-learning libraries. So we are attempting to use the pycantonese characters_to_jyutping method and then convert the result with jyutping_to_ipa.

Additional context:

jacksonllee commented 9 months ago

Hello, thank you for reaching out here! Coincidentally, a while ago other colleagues also asked about a Jyutping-to-IPA conversion function. I then managed to put together a draft implementation for jyutping_to_ipa() and make it available at a branch of this repo. For now, to use this new function, one would have to pip-install pycantonese from the GitHub source as follows:

pip install git+https://github.com/jacksonllee/pycantonese.git@jyutping-to-ipa

Sample usage:

>>> import pycantonese
>>> pycantonese.jyutping_to_ipa('gwong2dung1waa2')  # 廣東話 Cantonese
['kʷɔŋ25', 'tʊŋ55', 'waː25']
>>> pycantonese.jyutping_to_ipa('gwong2dung1waa2', as_list=False)
'kʷɔŋ25 tʊŋ55 waː25'

For details such as Jyutping-to-IPA mapping tables, customization, and documentation notes, please see the source code of the branch: https://github.com/jacksonllee/pycantonese/compare/jyutping-to-ipa.
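For the dataset-preparation use case described above, the two conversions can be chained. The following is only a minimal sketch, not part of pycantonese: text_to_ipa and the Segmenter alias are hypothetical names, and the sketch assumes that characters_to_jyutping yields (word, jyutping) pairs whose second element is None or empty when no romanization is available. The converters are passed in as arguments so that the sketch runs without the branch installed.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# A segmenter returns (word, jyutping) pairs; the Jyutping may be None or
# empty when a word cannot be romanized (assumption, see the lead-in).
Segmenter = Callable[[str], Iterable[Tuple[str, Optional[str]]]]


def text_to_ipa(
    text: str,
    to_jyutping: Segmenter,
    to_ipa: Callable[[str], List[str]],
) -> List[Tuple[str, Optional[List[str]]]]:
    """Chain characters -> Jyutping -> IPA, keeping unconvertible words as None."""
    results: List[Tuple[str, Optional[List[str]]]] = []
    for word, jyutping in to_jyutping(text):
        if not jyutping:
            # No romanization available (e.g. punctuation or non-Cantonese text).
            results.append((word, None))
            continue
        results.append((word, to_ipa(jyutping)))
    return results


# With pycantonese installed from the jyutping-to-ipa branch, the call would
# look like this (not executed here):
#   import pycantonese
#   text_to_ipa(
#       "廣東話",
#       pycantonese.characters_to_jyutping,
#       pycantonese.jyutping_to_ipa,
#   )
```

Keeping the original word next to its IPA makes it easy to spot which parts of a Wikipedia sentence were dropped during conversion.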

Hope this helps!

rjrobben commented 8 months ago

Hello! Great thanks for your reply. It really helps a lot in preparing the dataset!

The added function jyutping_to_ipa() works most of the time, but we just encountered this error while parsing the Wikipedia zh-yue dataset. Do you have any insights into the issue?

File ~/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pycantonese/jyutping/ipa.py:133, in jyutping_to_ipa(jp_str, as_list, onsets, nuclei, codas, tones)
     75 def jyutping_to_ipa(
     76     jp_str: str,
     77     as_list: bool = True,
   (...)
     82     tones: Optional[Dict[str, str]] = None,
     83 ) -> Union[List[str], str]:
     84     """Convert Jyutping romanization into IPA.
     85 
     86     The Jyutping-to-IPA mapping is based on Matthews and Yip (2011: 461-463).
   (...)
    131     ['tsʰi˥']
    132     """
--> 133     jp_parsed_list = parse_jyutping(jp_str)
    134     ipa_list = []
    136     for jp_parsed in jp_parsed_list:

File ~/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pycantonese/jyutping/parse_jyutping.py:168, in parse_jyutping(jp_str)
    165     onset = cv
    167     if onset not in ONSETS:
--> 168         raise ValueError("onset error -- " + repr(jp))
    170     jp_parsed_list.append(Jyutping(onset, nucleus, coda, tone))
    172 return jp_parsed_list

ValueError: onset error -- 'vi1'

The input contains some Japanese characters; I am not sure whether that is the cause of the error.

But for unrecognised characters, the method usually just returns an empty string, so the error might not be caused by unrecognised characters?

Thank you again for your kind help!

jacksonllee commented 8 months ago

The stack trace shows that the error, ValueError: onset error -- 'vi1', was raised by the pycantonese.parse_jyutping function when it was called within the new jyutping_to_ipa function. This means your input string to jyutping_to_ipa contains "vi1", but "v" isn't a valid onset in Jyutping romanization. You'll have to handle this "v" in some way, either by skipping the offending string or by replacing the "v" with a legal Jyutping onset. You'll have to decide whether the following suggestion makes sense in your context, but one possibility is to replace "vi1" with "fi1", especially if your data contains code-mixing between Cantonese and another language, where a "v" in the source language would most likely be phonetically realized as an "f" in Cantonese.
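Both options mentioned above (replacing the "v" or skipping the string) can be sketched as follows. This is only an illustration under stated assumptions: replace_v_onset and to_ipa_or_none are hypothetical helpers, not pycantonese functions, and the input is assumed to be lowercase Jyutping. The converter is passed in as an argument; with the branch installed it would be pycantonese.jyutping_to_ipa.

```python
import re
from typing import Callable, List, Optional

# A "v" that is not preceded by another letter starts a syllable, so it is
# the one that parse_jyutping rejects as an onset.
_SYLLABLE_INITIAL_V = re.compile(r"(?<![a-z])v")


def replace_v_onset(jp_str: str, replacement: str = "f") -> str:
    """Replace every syllable-initial "v" with a legal onset ("f" by default)."""
    return _SYLLABLE_INITIAL_V.sub(replacement, jp_str)


def to_ipa_or_none(
    jp_str: str, convert: Callable[[str], List[str]]
) -> Optional[List[str]]:
    """Return the converted syllables, or None if the string is rejected."""
    try:
        return convert(jp_str)
    except ValueError:
        # The traceback above shows ValueError for an invalid onset; the
        # caller can then drop the string or log it for inspection.
        return None
```

Whether "f" is the right substitute depends on the data, as noted above; the replacement argument leaves that choice to the caller.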

CharlotteHann commented 1 month ago

Hi Jackson! Just wondering what the numbers represent after the IPA transcription?

jacksonllee commented 1 month ago

Just wondering what the numbers represent after the IPA transcription?

The numbers represent tone as Chao tone numerals, i.e., pitch levels on a scale from 1 (lowest) to 5 (highest). For instance, "55" means the high-level tone, i.e., tone 1 in Jyutping.
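If IPA tone letters are preferred over digits, each pitch digit from 1 to 5 can be swapped for the corresponding letter. The helper below is a hypothetical post-processing sketch, not a pycantonese function; it assumes the only digits in the string are the tone digits at the end of each syllable, as in the sample output earlier in this thread.

```python
# Chao's scale runs from 1 (lowest pitch) to 5 (highest pitch); each level
# has a dedicated IPA tone letter.
_CHAO_DIGIT_TO_LETTER = {"1": "˩", "2": "˨", "3": "˧", "4": "˦", "5": "˥"}


def chao_digits_to_letters(ipa_syllable: str) -> str:
    """Replace each pitch digit 1-5 with its tone letter; keep other characters."""
    return "".join(_CHAO_DIGIT_TO_LETTER.get(ch, ch) for ch in ipa_syllable)


print(chao_digits_to_letters("kʷɔŋ25"))  # kʷɔŋ˨˥
print(chao_digits_to_letters("tʊŋ55"))   # tʊŋ˥˥
```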

CharlotteHann commented 1 month ago

Thanks for your reply! I am currently doing something similar to rjrobben, but I am trying to map articulatory features to each phoneme. For Jyutping, there is the parse_jyutping method to separate the phonemes into onset, nucleus and coda. I am wondering if you would have something similar for IPA?

jacksonllee commented 1 month ago

for Jyutping, there is the parse_jyutping method to separate the phonemes into onset, nucleus and coda. I am wondering if you would have something similar for IPA

If I understand what you're trying to do, it can be done in a two-step process: (1) use parse_jyutping to break up Jyutping into onset/nucleus/coda/tone, and (2) map each Jyutping onset/nucleus/coda/tone to the equivalent IPA symbol. For (2), there isn't a function that directly does exactly that, but that's essentially what the new jyutping_to_ipa function does under the hood. jyutping_to_ipa uses several Python dictionaries to map Jyutping symbols to IPA (e.g., _ONSETS), so you could grab these dicts and do whatever downstream transformations you might want in your use case.
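The two-step process can be sketched as follows. Everything here is illustrative: Syllable is a hypothetical stand-in for the parsed object that parse_jyutping returns (the traceback above shows it being built from onset, nucleus, coda and tone), and the four small tables are not the library's own dicts. They cover only the symbols that appear in the 'gwong2dung1waa2' sample earlier in this thread.

```python
from typing import Dict, NamedTuple


class Syllable(NamedTuple):
    """Hypothetical stand-in for one parsed Jyutping syllable."""
    onset: str
    nucleus: str
    coda: str
    tone: str


# Illustrative, partial tables taken from the sample output in this thread.
# A real pipeline would reuse the complete dicts from the branch instead.
ONSET_TO_IPA = {"gw": "kʷ", "d": "t", "w": "w", "": ""}
NUCLEUS_TO_IPA = {"o": "ɔ", "u": "ʊ", "aa": "aː"}
CODA_TO_IPA = {"ng": "ŋ", "": ""}
TONE_TO_IPA = {"1": "55", "2": "25"}


def syllable_to_ipa_parts(syl: Syllable) -> Dict[str, str]:
    """Map each component separately so that features can be attached per part."""
    return {
        "onset": ONSET_TO_IPA[syl.onset],
        "nucleus": NUCLEUS_TO_IPA[syl.nucleus],
        "coda": CODA_TO_IPA[syl.coda],
        "tone": TONE_TO_IPA[syl.tone],
    }


parts = syllable_to_ipa_parts(Syllable("gw", "o", "ng", "2"))
print(parts)                   # one IPA symbol per syllable position
print("".join(parts.values())) # kʷɔŋ25
```

Because the IPA symbols stay keyed by syllable position, a table of articulatory features can be looked up per component rather than by re-parsing the joined IPA string.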

I should be making a new release of pycantonese soon, so that folks who'd like to use the new jyutping_to_ipa function won't have to install the package from the GitHub source code here.

CharlotteHann commented 1 month ago

Hi Jackson, thanks for your reply again. I have checked _ONSETS; however, the mapping does not seem to match the LSHK scheme, which suggests that phonemes 'i', 'o', 'u' and 'e' could have a short or long IPA phoneme. I am wondering if this is taken care of by jyutping_to_ipa?

jacksonllee commented 1 month ago

which suggests that phonemes 'i', 'o', 'u' and 'e' could have a short or long IPA phoneme. I am wondering if this is taken care of by jyutping_to_ipa?

Vowel length is not contrastive in Cantonese (except for the borderline case of Jyutping "aa" versus "a", for which my mappings use [aː] and [ɐ], respectively, to capture the differences in both vowel length and quality). For a basic/canonical/regular IPA transcription, vowel length shouldn't be, or at least doesn't need to be, part of it. My choice of the exact symbols for Jyutping-to-IPA conversion is based on Matthews and Yip (2011), already documented here. If for whatever reason (e.g., if the transcription you need isn't "basic/canonical/regular" but is meant for, say, showing a specific speaker's speech features) you want to override any of the pre-defined mappings, then jyutping_to_ipa has keyword arguments that allow you to do so, already documented here.

CharlotteHann commented 1 month ago

Thank you very much!