jacksonllee / pycantonese

Cantonese Linguistics and NLP
https://pycantonese.org
MIT License

Looking for "name conversion" #20

Closed ghost closed 3 years ago

ghost commented 3 years ago

Hello,

Please excuse any ignorance on the topic of Cantonese romanization, as I know neither the language nor its pronunciation rules. My use case is merely transliterating actor and role names for movies and drama series.

I've tried several packages, but I keep seeing the same differences. For these romanizations, tones are not used, but this is minor. Right now I have only one example at hand, because I don't use it often and forgot about the earlier cases, but I will try to find more if this is not a "structural" problem, but specific to the character.

The example is Bruce Leung / Leung Siu Lung which comes up as:

In [4]: pc.characters2jyutping('梁小龍')
Out[4]: ['loeng4', 'siu2', 'lung4']

The loeng4 versus leung puzzles me, and I also cannot find any sources / documentation that could explain the difference to me, because basically I don't know what to search for :) As a speaker of Dutch, English and German, I would not pronounce loeng and leung the same in any of these languages.

Are these name conversions based on different rules or even a different system entirely?

jacksonllee commented 3 years ago

Hello @vylmen, there are indeed multiple romanization schemes going on here. For the surname 梁, both "loeng4" and "Leung" are intended to denote the same pronunciation for Hong Kong Cantonese.

For the purpose of transliterating actor and actress names, if you want accurate results (in the sense of matching what the respective actor/actress actually uses in real life), then your best bet is probably to do some lookup through Wikipedia (like you did for the actor 梁小龍) due to the unpredictable nature of such romanization; programmatically, I imagine it's possible to grab a Wikipedia page's HTML and target the appropriate tags and attributes to retrieve the romanized name.

If scraping Wikipedia is not a good option for you: based on your description, am I right in thinking you'd want something intuitive and accessible for non-linguists (i.e., not something technical like the IPA)? I can see how the Jyutping system might be confusing (letters like "z", "c", and "j" in Jyutping do not sound the way one would expect from an English pronunciation perspective). Anecdotally, native speakers of English attempting to learn Cantonese have found the Yale romanization system more English-friendly. If you'd like to give it a shot, PyCantonese has a jyutping2yale function.
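Whichever scheme is used, the original question notes that tones are not needed for this use case. As a minimal sketch of dropping them, the snippet below strips the trailing tone digit from Jyutping syllables such as the ones in the output above. The helper names `strip_tones` and `display_name` are hypothetical and not part of PyCantonese; the sketch only assumes that each syllable ends in a single tone digit from 1 to 6.

```python
import re

# Assumption: each Jyutping syllable ends in exactly one tone digit (1-6).
_TONE = re.compile(r"[1-6]$")


def strip_tones(syllables):
    """Return the syllables with any trailing tone digit removed."""
    return [_TONE.sub("", s) for s in syllables]


def display_name(syllables):
    """Join tone-less syllables into a capitalized, space-separated name."""
    return " ".join(s.capitalize() for s in strip_tones(syllables))


print(display_name(["loeng4", "siu2", "lung4"]))  # Loeng Siu Lung
```

This only changes how the Jyutping is displayed. It does not turn "loeng" into the conventional spelling "Leung", which still has to be looked up per person.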

Please let me know if you need more information!

ghost commented 3 years ago

Thank you very much for the detailed explanation.

It's sort of what I was expecting. As with Korean, there have been multiple systems, and people born in the era of an older system keep using it, because changing the spelling of their name makes it look different and is expensive (passports, birth certificates, etc.). And it certainly doesn't help if the system itself is not consistent.

What I want to use it for is to prevent page duplication. Pages are created with romanized URLs and titles to help non-native speakers find the page. Sometimes a person already exists under his/her Cantonese name but is featured in a Mandarin title, and a duplicate page is created. This is compounded by the fact that some people are stored under their English name with either a Cantonese or a Mandarin surname. Additionally, if a person appears in productions in both languages (dubbed or not), it's convenient to have both romanizations listed.
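One possible way to catch such duplicates, sketched here under assumptions, is to key every page on the person's name in Chinese characters and treat the romanized titles as aliases of that key. The function names, the `(hanzi, title)` record shape, and the sample data are all hypothetical; the sketch also assumes the characters are stored in a single script (traditional or simplified), since mixed scripts would need normalizing first.

```python
from collections import defaultdict


def build_index(pages):
    """Group romanized page titles by the name in Chinese characters.

    `pages` is an iterable of (hanzi, title) pairs (hypothetical shape).
    """
    index = defaultdict(set)
    for hanzi, title in pages:
        index[hanzi].add(title)
    return index


def find_duplicates(index):
    """Return only the names that appear under more than one title."""
    return {
        hanzi: sorted(titles)
        for hanzi, titles in index.items()
        if len(titles) > 1
    }


# Illustrative sample data: one person stored under three different titles.
pages = [
    ("梁小龍", "Leung Siu Lung"),   # conventional Cantonese spelling
    ("梁小龍", "Bruce Leung"),      # English given name + Cantonese surname
    ("梁小龍", "Liang Xiaolong"),   # Mandarin pinyin without tone marks
]

duplicates = find_duplicates(build_index(pages))
print(duplicates["梁小龍"])
```

Keying on the characters avoids having to predict which romanization a page author chose; the romanized forms then only serve as searchable aliases.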

And as for the character they play - I would use transliteration here too, if the series/movie is in Cantonese (not currently my personal focus, but that can change). For character names it would seem logical to use Jyutping, but I'll do more research given the sources you quoted on what is sensible. Once again, thanks for the valuable input!

jacksonllee commented 3 years ago

You're welcome! I'm closing this issue for now. If you have more specific questions later, please feel free to open new issues.