jacksonllee / pycantonese

Cantonese Linguistics and NLP
https://pycantonese.org
MIT License
354 stars 38 forks

Optimize tagger logic using numpy #35

Open ZhanruiLiang opened 1 year ago

ZhanruiLiang commented 1 year ago

For #33. This uses numpy to vectorize the computations instead of doing vector/matrix operations with dicts. The observed speedup is around 3x, measured on the use case of https://github.com/CanCLID/typo-corrector. train_tagger.py shows a similar improvement.
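A rough sketch of the idea, not the code in this PR: a perceptron-style tagger can sum per-feature weights for each tag with nested dict loops, or look up rows of a dense weight matrix and sum them in one numpy call. The tag set, feature names, weights, and function names below are all made up for illustration.

```python
from collections import defaultdict

import numpy as np

TAGS = ["NOUN", "VERB", "PART"]  # hypothetical tag set

# Dict-based weights: feature -> {tag: weight} (illustrative values only).
dict_weights = {
    "word=sik6": {"VERB": 2.0, "NOUN": 0.5},
    "prev_tag=PRON": {"VERB": 1.0, "PART": -0.5},
    "suffix=6": {"NOUN": 0.25},
}


def predict_with_dicts(features):
    """Accumulate tag scores one (feature, tag) pair at a time."""
    scores = defaultdict(float)
    for feat in features:
        for tag, weight in dict_weights.get(feat, {}).items():
            scores[tag] += weight
    return max(TAGS, key=lambda tag: scores[tag])


# The same weights as a dense (n_features, n_tags) matrix.
feature_index = {feat: i for i, feat in enumerate(dict_weights)}
tag_index = {tag: j for j, tag in enumerate(TAGS)}
weight_matrix = np.zeros((len(feature_index), len(TAGS)))
for feat, tag_weights in dict_weights.items():
    for tag, weight in tag_weights.items():
        weight_matrix[feature_index[feat], tag_index[tag]] = weight


def predict_with_numpy(features):
    """Select the rows of the active features and sum them in one call."""
    rows = np.array(
        [feature_index[f] for f in features if f in feature_index],
        dtype=np.intp,
    )
    scores = weight_matrix[rows].sum(axis=0)
    return TAGS[int(np.argmax(scores))]
```

Both functions break ties in favor of the earlier tag in TAGS, so they should agree on the same input. A dense matrix trades memory for speed; whether that trade pays off depends on how sparse the real feature set is.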

To make the tests pass, I had to fix a bug in prev, prev2 = self.START. It should really be prev2, prev = self.START, as prev2 is supposed to be the tag of the (i-2)th word.
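To illustrate why the unpacking order matters, here is a small standalone sketch, not the actual tagger code. It assumes START holds two padding tags ordered oldest first; the values of START and the helper tag_history are hypothetical.

```python
# Hypothetical padding tags for the two positions before the first word,
# ordered oldest first: position i-2, then position i-1.
START = ("-START-", "-START2-")


def tag_history(tags):
    """Return the (prev2, prev) pair seen at each position, left to right."""
    # prev2 is the tag at i-2 and prev the tag at i-1, so they must be
    # unpacked in that order to line up with START.
    prev2, prev = START
    history = []
    for tag in tags:
        history.append((prev2, prev))
        # Shift the window: the old prev becomes prev2, the new tag becomes prev.
        prev2, prev = prev, tag
    return history
```

With the reversed unpacking (prev, prev2 = START), the first word would see the two padding tags swapped, so any feature built from them would differ from what the same positions produce elsewhere in the context.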

I also had to increase the training iteration count to make the tests pass. The test 135 is changed to 136 because 5 happens to have some weights in the trained model. The iteration count is now 50% and the accuracy is over 98%.

ZhanruiLiang commented 1 year ago

Doc test is failing with:

Example at /home/circleci/project/docs/source/parsing.rst, line 26, column 1 did not evaluate as expected:
Expected:
    *X:    你         食         咗        飯          未        呀        ?
    %mor:  PRON|nei5  VERB|sik6  PART|zo2  NOUN|faan6  ADV|mei6  PART|aa4  ?
    <BLANKLINE>
    *X:    食         咗        喇         !
    %mor:  VERB|sik6  PART|zo2  PART|laa1  !
    <BLANKLINE>
    *X:    你         聽日           得         唔      得閒           呀        ?
    %mor:  PRON|nei5  ADV|ting1jat6  VERB|dak1  ADV|m4  ADJ|dak1haan4  PART|aa4  ?
    <BLANKLINE>
Got:
    *X:    你         食         咗        飯          未        呀        ?
    %mor:  PRON|nei5  VERB|sik6  PART|zo2  NOUN|faan6  ADV|mei6  PART|aa4  ?
    <BLANKLINE>
    *X:    食         咗        喇         !
    %mor:  VERB|sik6  PART|zo2  PART|laa1  !
    <BLANKLINE>
    *X:    你         聽日           得        唔      得閒           呀        ?
    %mor:  PRON|nei5  ADV|ting1jat6  AUX|dak1  ADV|m4  ADJ|dak1haan4  PART|aa4  ?
    <BLANKLINE>

Any suggestion? I'm not sure whether "得" in this context is VERB or AUX.

laubonghaudoi commented 1 year ago

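I think AUX is definitely wrong here, but it isn't VERB either. Not sure what @jacksonllee thinks. I'm not familiar with PyCantonese's tagging standards, but I think "得" should be treated as an abbreviation of "得閒" and therefore also tagged as ADJ?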

jacksonllee commented 1 year ago

Hello! Just a heads-up that I see this coming through and I'll be able to take a look at this PR this week. Thanks for the contribution!

laubonghaudoi commented 11 months ago

Is there any progress on this? I hope it can resolve https://github.com/jacksonllee/pycantonese/issues/33 so that a new version can be released, which would also resolve https://github.com/jacksonllee/pycantonese/issues/43

ZhanruiLiang commented 10 months ago

No update since my last comment. As I remember, the problem is that test data are too flaky and the updated code doesn't produce the same output.