jacksonllee / pycantonese

Cantonese Linguistics and NLP
https://pycantonese.org
MIT License
354 stars 38 forks

Any way to turn off the segmentation and just do the jyutping char by char? #24

Closed cwlinghk closed 3 years ago

cwlinghk commented 3 years ago

The speed is a bit slow, and I am just looking for the jyutping, maybe the most frequent jyutping of each character? Thanks

jacksonllee commented 3 years ago

Hello, I imagine you're referring to the characters_to_jyutping function? There's no way to turn off the word segmentation done as part of this function. A single character can have more than one pronunciation (like the 蛋 example in the linked docs), and getting the (hopefully correct) word segmentation is the right way to resolve that ambiguity; using straight-up char+jyutping frequency would be suboptimal.
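To make the ambiguity point concrete, here is a minimal sketch comparing a char-by-char lookup with a word-based lookup. It is not pycantonese's actual data or algorithm: the two small dictionaries, the greedy longest-match segmenter, and the function names are all hypothetical and exist only for illustration. The readings for 蛋 (daan2 on its own or in 雞蛋, daan6 in 蛋糕) are an assumption that follows the example the docs are understood to use.

```python
# Toy data for illustration only (assumed, not taken from pycantonese).
CHAR_TO_JYUTPING = {"雞": "gai1", "蛋": "daan2", "糕": "gou1"}
WORD_TO_JYUTPING = {"雞蛋": "gai1daan2", "蛋糕": "daan6gou1"}


def char_by_char(text):
    """Look up each character independently, ignoring word context."""
    return [(ch, CHAR_TO_JYUTPING.get(ch)) for ch in text]


def word_based(text, max_word_len=2):
    """Greedy longest-match segmentation, then one lookup per word.

    This is a simplified stand-in for a real word segmenter.
    """
    result = []
    i = 0
    while i < len(text):
        # Try multi-character words first, longest candidate first.
        for length in range(min(max_word_len, len(text) - i), 1, -1):
            chunk = text[i:i + length]
            if chunk in WORD_TO_JYUTPING:
                result.append((chunk, WORD_TO_JYUTPING[chunk]))
                i += length
                break
        else:
            # No known word starts here: fall back to the single character.
            ch = text[i]
            result.append((ch, CHAR_TO_JYUTPING.get(ch)))
            i += 1
    return result


print(char_by_char("蛋糕"))  # 蛋 gets its standalone reading, daan2
print(word_based("蛋糕"))    # the word entry supplies daan6 for 蛋
```

In this toy setup the char-by-char path assigns 蛋 the same reading everywhere, while the word-based path picks the reading that belongs to the word it found, which is the reason segmentation cannot simply be switched off.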

The slowness is due to how the data is parsed and loaded the first time you call characters_to_jyutping (working on mitigating this issue soon...). The data is then cached for the rest of the Python session, so any subsequent characters_to_jyutping calls should be much faster.
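The load-once-then-cache behavior described above can be sketched with the standard library alone. This is an assumption-laden illustration, not how pycantonese is implemented: load_data, to_jyutping, timed, and the one-entry dictionary are hypothetical, and time.sleep stands in for the expensive parsing step.

```python
import functools
import time

LOAD_COUNT = 0  # counts how many times the slow load actually runs


@functools.lru_cache(maxsize=1)
def load_data():
    """Hypothetical stand-in for parsing a large data set (slow, runs once)."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    time.sleep(0.2)  # simulate the one-time parsing cost
    return {"廣東話": "gwong2dung1waa2"}


def to_jyutping(word):
    """Look up a word; the first call pays for the load, later calls do not."""
    return load_data().get(word)


def timed(word):
    """Return the lookup result together with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = to_jyutping(word)
    return result, time.perf_counter() - start


first_result, first_seconds = timed("廣東話")
second_result, second_seconds = timed("廣東話")
print(f"first call:  {first_seconds:.3f}s")
print(f"second call: {second_seconds:.3f}s")
```

Under these assumptions only the first call pays the loading cost, so when converting many strings in one session the startup delay is amortized across all the calls.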

Let me know if you have other questions!