Your work is really great! I want to do some LDA work on Cantonese data. Can I use your library to do the tokenizing? How? Thanks a lot!
Thanks for using this library! What exactly do you mean by "tokenizing" in this case? The HKCanCor data incorporated in the library are already word-segmented (e.g., "芝加哥 好 大風", with three words separated by spaces here). Do you mean you are looking for character-segmented data (e.g., "芝 加 哥 好 大 風", with six Chinese characters separated by spaces)?
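To make the distinction concrete, here is a minimal sketch in plain Python (no pycantonese calls; the example string is the one above):

```python
text = "芝加哥 好 大風"  # word-segmented: three words separated by spaces

# word-segmented view: split on the spaces already in the data
words = text.split()
print(words)  # ['芝加哥', '好', '大風']

# character-segmented view: every Chinese character on its own
chars = list(text.replace(" ", ""))
print(chars)  # ['芝', '加', '哥', '好', '大', '風']
```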
Sorry for being unclear. Actually I want to do some data mining on Cantonese lyrics, and the first objective is topic modeling with LDA. But the Python library nltk doesn't provide any function for segmenting Chinese text into words, so I wonder if your library can do that. For example, for the following lyric: 時間轉眼流走人越大越少朋友 I want to cut the sentence into: 時間/轉眼/流走/人/越/大/越/少/朋友 Thanks a lot!
I see -- so you're talking about the word segmentation problem (tokenization would imply the word boundaries are already marked in some way and the task would take advantage of that information, which isn't the case here). PyCantonese doesn't do this at the moment. That said, for word segmentation in Python, I'm aware of the wordsegment library. It appears to be adaptable to non-English languages if you have the unigrams and bigrams plus their counts from some corpus of the language you're working with. I haven't tried it myself, though. (PyCantonese could certainly help get the n-grams and counts from the HKCanCor data; see the sketch below.)
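For instance, getting unigram and bigram counts out of HKCanCor might look like this (a minimal sketch, assuming the pycantonese API with `hkcancor()` and a `words()` method on the corpus object; wiring the counts into wordsegment is left out):

```python
import collections

import pycantonese

# Load the HKCanCor corpus that ships with PyCantonese.
corpus = pycantonese.hkcancor()

# words() yields the word-segmented tokens of the corpus.
words = corpus.words()

# Count unigrams and bigrams; segmenters such as wordsegment expect
# frequency tables in roughly this shape. (Zipping the flat token list
# glosses over utterance boundaries -- a simplification here.)
unigram_counts = collections.Counter(words)
bigram_counts = collections.Counter(zip(words, words[1:]))

print(unigram_counts.most_common(5))
print(bigram_counts.most_common(5))
```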
Great! I have solved that problem, but now I have another one. Given a string of Chinese characters, how can I get just the Jyutping? For instance, input: 喂遲啲去唔去 output: wai3 ci4 di1 heoi3 m4 heoi3 Thanks a lot!
Conversion between characters and Jyutping is on the to-do list for PyCantonese. Meanwhile, would other online tools for this purpose accomplish what you're after? A quick Google search returns a handful of options. Hope this helps! I'm closing this issue for now, though please do feel free to open an issue for other questions/comments about PyCantonese.
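Once that lands, usage could look roughly like this (a minimal sketch, assuming a `characters_to_jyutping` function that returns (word, jyutping) pairs, which is what later PyCantonese releases ended up providing; the exact name and return shape are assumptions here, so check the current docs):

```python
import pycantonese

# characters_to_jyutping segments the input into words and returns
# (word, jyutping) pairs; jyutping is None when no reading is found.
pairs = pycantonese.characters_to_jyutping("喂遲啲去唔去")
print(" ".join(jp for _, jp in pairs if jp is not None))
```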
Oh... It seems your library is the most functional one for Cantonese. But anyway, thank you very much.