Your work is really great! I want to do some LDA work on Cantonese data. Can I use your library to do the tokenizing? How? Thanks a lot!
Thanks for using this library! What exactly do you mean by "tokenizing" in this case? The HKCanCor data incorporated in the library are already word-segmented (e.g., "芝加哥 好 大風", with three words separated by spaces here). Do you mean you are looking for character-segmented data (e.g., "芝 加 哥 好 大 風", with six Chinese characters separated by spaces)?
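To make the distinction concrete, here is a minimal sketch in plain Python (no pycantonese calls; the example string is the one above):

```python
text = "芝加哥 好 大風"  # word-segmented: three words separated by spaces

# word-segmented view: split on the spaces already in the data
words = text.split()
print(words)  # ['芝加哥', '好', '大風']

# character-segmented view: every Chinese character on its own
chars = list(text.replace(" ", ""))
print(chars)  # ['芝', '加', '哥', '好', '大', '風']
```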
Sorry for being unclear. Actually I want to do some data mining on Cantonese lyrics, and the first objective is topic modeling with LDA. But the Python library nltk doesn't provide any function for segmenting Chinese text into words, so I wonder if your library can do that. For example, for the following lyric: 時間轉眼流走人越大越少朋友 I want to cut the sentence into: 時間/轉眼/流走/人/越/大/越/少/朋友 Thanks a lot!
I see -- so you're talking about the word segmentation problem (tokenization would imply the word boundaries are already marked in some way and the task would take advantage of that information, which isn't the case here). PyCantonese doesn't do this at the moment. That said, for word segmentation in Python, I'm aware of the wordsegment library. It appears to be adaptable to non-English languages if you have the unigrams and bigrams plus their counts from some corpus of the language you're working with. I haven't tried it myself, though. (PyCantonese could certainly help get the n-grams and counts from the HKCanCor data; see the sketch below.)
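For instance, getting unigram and bigram counts out of HKCanCor might look like this (a minimal sketch, assuming the pycantonese API with `hkcancor()` and a `words()` method on the corpus object; wiring the counts into wordsegment is left out):

```python
import collections

import pycantonese

# Load the HKCanCor corpus that ships with PyCantonese.
corpus = pycantonese.hkcancor()

# words() yields the word-segmented tokens of the corpus.
words = corpus.words()

# Count unigrams and bigrams; segmenters such as wordsegment expect
# frequency tables in roughly this shape. (Zipping the flat token list
# glosses over utterance boundaries -- a simplification here.)
unigram_counts = collections.Counter(words)
bigram_counts = collections.Counter(zip(words, words[1:]))

print(unigram_counts.most_common(5))
print(bigram_counts.most_common(5))
```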
Great! I have solved that problem, but now I have another one. Given a string of Chinese characters, how can I get just the Jyutping? For instance, input: 喂遲啲去唔去 output: wai3 ci4 di1 heoi3 m4 heoi3 Thanks a lot!
Conversion between characters and Jyutping is on the to-do list for PyCantonese. Meanwhile, would other online tools for this purpose accomplish what you're after? A quick Google search returns a handful of options. Hope this helps! I'm closing this issue for now, though please do feel free to open an issue for other questions/comments about PyCantonese.
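Once that lands, usage could look roughly like this (a minimal sketch, assuming a `characters_to_jyutping` function that returns (word, jyutping) pairs, which is what later PyCantonese releases ended up providing; the exact name and return shape are assumptions here, so check the current docs):

```python
import pycantonese

# characters_to_jyutping segments the input into words and returns
# (word, jyutping) pairs; jyutping is None when no reading is found.
pairs = pycantonese.characters_to_jyutping("喂遲啲去唔去")
print(" ".join(jp for _, jp in pairs if jp is not None))
```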
Oh... It seems your library is the most functional one for Cantonese. But anyway, thank you very much.