jacksonllee / pycantonese

Cantonese Linguistics and NLP
https://pycantonese.org
MIT License
359 stars, 39 forks

New output style request of pycantonese.characters_to_jyutping #51

Closed hgneng closed 1 month ago

hgneng commented 1 month ago

Feature you are interested in and your specific question(s): I would like pycantonese.characters_to_jyutping to support a new output style, something like this:

>>> pycantonese.characters_to_jyutping('香港人講廣東話', style='single_jyutping_list')
['hoeng1', 'gong2', 'jan4', 'gong2', 'gwong2', 'dung1', 'waa2']

What you are trying to accomplish with this feature or functionality: I want to get a simple character-to-Jyutping list. The original output style makes further processing a bit harder.

hgneng commented 1 month ago

Currently, I use the following code to achieve my purpose:

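import re
from typing import List

import pycantonese


def _cantonese_character_to_jyutping(text: str) -> List[str]:
    # characters_to_jyutping returns (word, jyutping) pairs, one per segmented word
    ret = []
    for _word, jyutping_word in pycantonese.characters_to_jyutping(text):
        if jyutping_word is None:  # guard in case a segment has no Jyutping
            continue
        # Split e.g. 'hoeng1gong2jan4' into ['hoeng1', 'gong2', 'jan4']
        ret.extend(re.findall(r'[a-zA-Z]+[0-9]+', jyutping_word))
    return ret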

jacksonllee commented 1 month ago

Hello! It appears that the output of the characters_to_jyutping function provides additional information (word segmentation, plus the Chinese/Cantonese characters for the given word segmentation) that you're not interested in, and that it's just a few lines of code of your own (which you already have figured out) to post-process the result of characters_to_jyutping for what you want. So I'm not sure if it's worth adding options to characters_to_jyutping as you've suggested.

Alternatively, to get what you would like, combining characters_to_jyutping with the implemented parse_jyutping function would also work (so that you don't have to do regex parsing on your own to break up the Jyutping string by syllables):

In [1]: import pycantonese

In [2]: pycantonese.__version__
Out[2]: '3.4.0'

In [3]: result = []

In [4]: for _, jyutpings in pycantonese.characters_to_jyutping('香港人講廣東話'):
   ...:     for jp in pycantonese.parse_jyutping(jyutpings):
   ...:         result.append(str(jp))
   ...:

In [5]: result
Out[5]: ['hoeng1', 'gong2', 'jan4', 'gong2', 'gwong2', 'dung1', 'waa2']
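
For reference, the flattening step discussed in this thread can be sketched as a small standalone helper. This is only an illustration: `flatten_jyutping` is a hypothetical name, not part of pycantonese, and the sample pairs are assumed to mirror the (word, jyutping) shape that characters_to_jyutping returns. It uses a regex instead of parse_jyutping so that it runs without pycantonese installed.

```python
import re
from typing import Iterable, List, Optional, Tuple

# One Jyutping syllable: lowercase letters followed by a tone digit 1-6.
_SYLLABLE = re.compile(r"[a-z]+[1-6]")


def flatten_jyutping(pairs: Iterable[Tuple[str, Optional[str]]]) -> List[str]:
    """Flatten (word, jyutping) pairs into one list of syllables.

    Hypothetical helper for illustration; not a pycantonese function.
    """
    syllables: List[str] = []
    for _word, jyutping in pairs:
        if jyutping is None:  # assumed: segment without a romanization
            continue
        syllables.extend(_SYLLABLE.findall(jyutping))
    return syllables


# Sample pairs assumed to match the output shape of characters_to_jyutping.
sample = [
    ("香港人", "hoeng1gong2jan4"),
    ("講", "gong2"),
    ("廣東話", "gwong2dung1waa2"),
]
print(flatten_jyutping(sample))
```

In real use, the `pairs` argument would be the return value of pycantonese.characters_to_jyutping(text).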

hgneng commented 1 month ago

Thank you for your reply. It doesn't matter whether the new output style is supported. I'm just not very familiar with Python, and it cost me a few extra minutes to ask an AI how to do it. In fact, after some more investigation, I found that I need the original style with word segmentation.
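
If both the word segmentation and the individual syllables are needed, one possible approach is to keep each word and attach its syllable list to it. The sketch below is an assumption-laden illustration: `syllables_by_word` is a hypothetical helper, not a pycantonese function, and the sample pairs are assumed to have the same (word, jyutping) shape as the output of characters_to_jyutping.

```python
import re
from typing import Iterable, List, Optional, Tuple

# One Jyutping syllable: lowercase letters followed by a tone digit 1-6.
_SYLLABLE = re.compile(r"[a-z]+[1-6]")


def syllables_by_word(
    pairs: Iterable[Tuple[str, Optional[str]]]
) -> List[Tuple[str, List[str]]]:
    """Keep the word segmentation, but split each word's Jyutping by syllable.

    Hypothetical helper for illustration; not a pycantonese function.
    """
    # A missing romanization (None) becomes an empty syllable list.
    return [(word, _SYLLABLE.findall(jyutping or "")) for word, jyutping in pairs]


# Sample pairs assumed to match the output shape of characters_to_jyutping.
sample = [
    ("香港人", "hoeng1gong2jan4"),
    ("講", "gong2"),
    ("廣東話", "gwong2dung1waa2"),
]
for word, syllables in syllables_by_word(sample):
    print(word, syllables)
```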