jacksonllee / pycantonese

Cantonese Linguistics and NLP
https://pycantonese.org
MIT License

Segmenting mixed Chinese-English sentences joins all the English words together #32

Closed laubonghaudoi closed 1 year ago

laubonghaudoi commented 1 year ago

Input:

import pycantonese
pycantonese.pos_tag(pycantonese.segment("我今晚會 have dinner at home"))

Output:

[('我', 'PRON'), ('今晚', 'ADV'), ('會', 'AUX'), ('havedinnerathome', 'VERB')]

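As you can see, havedinnerathome has become a single verb, so there is no way to reconstruct the original sentence. Would it be possible to segment while preserving the spaces between English words?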

jacksonllee commented 1 year ago

Would it be possible to segment while preserving the spaces between English words?

Technically, this should be doable in pycantonese.segment, but since you then want to use pycantonese.pos_tag, you will run into another problem:

In [1]: import pycantonese

In [2]: pycantonese.pos_tag(['我', '今晚', '會', 'have', 'dinner', 'at', 'home'])
Out[2]: 
[('我', 'PRON'),
 ('今晚', 'ADV'),
 ('會', 'AUX'),
 ('have', 'VERB'),
 ('dinner', 'ADP'),
 ('at', 'ADV'),
 ('home', 'VERB')]

Because pycantonese is designed specifically for Cantonese, the POS tags for the English words will be wrong. Would that be a big problem for your use case?

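If you only need word segmentation to respect the spaces, and are happy to set the tagging problem aside for now, I can look into how to do it.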
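One possible way to sidestep the mistagged English words is to tag only the non-ASCII tokens and give the ASCII ones a placeholder tag. The sketch below is an illustration, not part of pycantonese: `tag_mixed` and its `placeholder` parameter are hypothetical names, and `tagger` is assumed to be any callable with the same shape as pycantonese.pos_tag (a list of words in, a list of (word, tag) pairs out). Note that the tagger then sees the Chinese words without their English neighbours, which may change some tags.

```python
from typing import Callable, List, Sequence, Tuple

Tagged = Tuple[str, str]


def tag_mixed(
    tokens: Sequence[str],
    tagger: Callable[[List[str]], List[Tagged]],
    placeholder: str = "X",  # hypothetical default; any marker tag would do
) -> List[Tagged]:
    """Tag non-ASCII tokens with `tagger`; give ASCII tokens a placeholder tag."""
    # Only the non-ASCII (here: Chinese-character) tokens go to the tagger.
    chinese = [tok for tok in tokens if not tok.isascii()]
    tagged_chinese = iter(tagger(chinese)) if chinese else iter(())

    result: List[Tagged] = []
    for tok in tokens:
        if tok.isascii():
            result.append((tok, placeholder))
        else:
            # Consume the tagger's output in the original token order.
            result.append(next(tagged_chinese))
    return result
```

With pycantonese installed, this could be called as `tag_mixed(tokens, pycantonese.pos_tag)`.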

laubonghaudoi commented 1 year ago

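Thanks a lot. So you mean that before calling pycantonese.segment(), I have to split the English words on spaces myself, is that right? The main reason I ask is that I'm currently writing a typo corrector (https://github.com/CanCLID/typo-corrector) that relies on part-of-speech tags to fix wrongly written characters and then puts the whole sentence back together, so the spaces between English words need to be preserved. The POS tags of the English words don't need to be very accurate, though; accuracy matters more for the Chinese-character words. Written Cantonese found online often mixes Chinese and English, which makes it a bit troublesome to process.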
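For the "put the sentence back together" step, a minimal sketch is shown here, assuming the tokens already keep English words separate. `join_tokens` is a hypothetical helper, and the spacing rule (a space wherever at least one neighbouring token is ASCII, none between two Chinese tokens) is a convention chosen for illustration; it cannot recover the original spacing in every case.

```python
from typing import Sequence


def join_tokens(tokens: Sequence[str]) -> str:
    """Join tokens, adding a space at any boundary that touches an ASCII token."""
    parts = []
    prev = None
    for tok in tokens:
        # Chinese-Chinese boundaries get no space; any boundary with an
        # ASCII (English) token on either side gets exactly one space.
        if prev is not None and (prev.isascii() or tok.isascii()):
            parts.append(" ")
        parts.append(tok)
        prev = tok
    return "".join(parts)


sentence = join_tokens(["我", "今晚", "會", "have", "dinner", "at", "home"])
# sentence == "我今晚會 have dinner at home"
```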

jacksonllee commented 1 year ago

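So you mean that before calling pycantonese.segment(), I have to split the English words on spaces myself, is that right?

If I've understood correctly, do you mean something like this?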

In [1]: import pycantonese

In [2]: import itertools

In [3]: user_input = "我今晚會 have dinner at home"

In [4]: list(itertools.chain.from_iterable(pycantonese.segment(x) for x in user_input.split()))
Out[4]: ['我', '今晚', '會', 'have', 'dinner', 'at', 'home']

laubonghaudoi commented 1 year ago

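Yes, exactly. I ended up implementing it myself. I just wanted to know whether pycantonese could do this out of the box, so that it doesn't have to be implemented separately. Or is there some other purpose behind joining the words together like this?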

jacksonllee commented 1 year ago

Or is there some other purpose behind joining the words together like this?

pycantonese.segment drops all whitespace in the user input because I ran into this in my own testing:

In [1]: import pycantonese

In [2]: pycantonese.segment('我 今晚會')  # with an accidental space
Out[2]: ['我', ' ', '今晚', '會']  # note: not the behavior in production now. I saw outputs like this and decided to sanitize the user input by removing all whitespace before applying word segmentation.

Now that I'm looking into the implementation of pycantonese.segment again, I think I see how I can update it to satisfy what both you and I have brought up (i.e., keeping English words as separated in the output per this GitHub issue, as well as what I've just described in this comment re: not showing superfluous, space-only words in the output). Between this week and the next, I should hopefully be able to update pycantonese and make a new release to resolve this issue. Stay tuned!
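To make the two requirements concrete, here is a rough sketch of one way they could be reconciled before segmentation. It is not the actual pycantonese implementation; `chunk_mixed` is a hypothetical helper. Whitespace is used as a boundary between English words but never emitted as a token, and Chinese runs separated only by whitespace are merged so that a segmenter can later decide the word boundaries itself.

```python
import re
from typing import List

# One run of ASCII characters, or one run of non-ASCII characters.
_RUN = re.compile(r"[\x00-\x7f]+|[^\x00-\x7f]+")


def chunk_mixed(text: str) -> List[str]:
    """Split text into English words and merged Chinese runs, dropping whitespace."""
    chunks: List[str] = []
    prev_chinese = False
    for piece in text.split():  # split() never yields whitespace-only pieces
        for run in _RUN.findall(piece):
            is_chinese = not run.isascii()
            if is_chinese and prev_chinese:
                # Chinese runs separated only by whitespace are glued back together.
                chunks[-1] += run
            else:
                chunks.append(run)
            prev_chinese = is_chinese
    return chunks


chunks = chunk_mixed("我今 晚會 have dinner at home")
# chunks == ['我今晚會', 'have', 'dinner', 'at', 'home']
```

Each non-ASCII chunk could then be passed to a segmenter such as pycantonese.segment, while the ASCII chunks are kept as they are.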

laubonghaudoi commented 1 year ago

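Got it, so that's the reason. At least we can now confirm that merging the English words into havedinnerathome is not intended behavior and is indeed a bug. Thanks a lot for fixing this; it should save a lot of effort on the typo corrector later on :D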

jacksonllee commented 1 year ago

I've just resolved this issue by updating the upstream main branch. The new main branch behaves as desired:

In [1]: import pycantonese

In [2]: pycantonese.segment("我今晚會 have dinner at home")
Out[2]: ['我', '今晚', '會', 'have', 'dinner', 'at', 'home']

In [3]: pycantonese.segment("我今 晚會 have dinner at home")
Out[3]: ['我', '今晚', '會', 'have', 'dinner', 'at', 'home']

I was thinking of making a new release after resolving this issue, but on second thought, I'm going to hold off for a bit: #33 is still up in the air, and the new Python 3.11 is coming in a month or so, so I'd like to wait for its Docker images etc. to be available for CI build support.

jacksonllee commented 1 year ago

I forgot to mention that you've been acknowledged in the readme. Thanks for reporting this issue!