jacksonllee / pycantonese

Cantonese Linguistics and NLP
https://pycantonese.org
MIT License

Segmenter removes space of English words in code-mixed sentence #43

Open shivanraptor opened 11 months ago

shivanraptor commented 11 months ago

Describe the bug: The segmenter removes the space between English words in a code-mixed sentence. For example, take this sentence:

這是Career Centre

To reproduce: Run the following code:

import pycantonese
from pycantonese.word_segmentation import Segmenter

# Segmenter with default settings, passed explicitly via `cls`
segmenter = Segmenter()
pyseg = pycantonese.segment("這是Career Centre", cls=segmenter)
# Print one segmented word per line
for word in pyseg:
    print(word)

The output is:

這是
CareerCentre

Expected behavior: The expected output is either:

這是
Career Centre

or

這是
Career
Centre


shivanraptor commented 11 months ago

After digging through the old issues, I thought this had been fixed in https://github.com/jacksonllee/pycantonese/issues/32#issuecomment-1268983221, but it hasn't been.

laubonghaudoi commented 11 months ago

That's mainly because https://github.com/jacksonllee/pycantonese/pull/35 is still unresolved, so no update has been released yet.

shivanraptor commented 10 months ago

I guess I have to wait then.

pengzhendong commented 3 months ago

You can replace each space with an uncommon punctuation mark, such as "▁", before segmenting, and then skip that mark in the output.

https://github.com/pengzhendong/g2p-mix/blob/dd19bee513cc13230c41ef66e479de695afa0e2c/g2p_mix/g2p_mix.py#L43
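
The placeholder workaround described above can be sketched roughly as follows. This is a minimal illustration, not the code from the linked repository: the helper name `segment_keeping_word_breaks` and the `segment_fn` parameter are hypothetical, and how the PyCantonese segmenter actually tokenizes "▁" is not confirmed in this thread, so the post-processing handles the marker whether it comes back as its own token or inside a longer one.

```python
from typing import Callable, List

# U+2581; assumed here to be rare enough in ordinary text to act as a marker.
PLACEHOLDER = "▁"


def segment_keeping_word_breaks(
    text: str,
    segment_fn: Callable[[str], List[str]],
    placeholder: str = PLACEHOLDER,
) -> List[str]:
    """Segment `text`, treating each original space as a word boundary.

    `segment_fn` is any callable that maps a string to a list of words
    (hypothetical parameter, for illustration only).
    """
    if placeholder in text:
        raise ValueError("placeholder already occurs in the input text")

    # Hide the spaces from the segmenter so it cannot silently drop them.
    masked = text.replace(" ", placeholder)

    words: List[str] = []
    for token in segment_fn(masked):
        # The marker may come back as its own token or inside a longer one;
        # splitting on it covers both cases and discards the marker itself.
        words.extend(piece for piece in token.split(placeholder) if piece)
    return words
```

With PyCantonese installed, the intended call would be `segment_keeping_word_breaks("這是Career Centre", pycantonese.segment)`. If the segmenter keeps the marker, this yields "Career" and "Centre" as separate words, matching the second expected output in the report.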