Open AlienKevin opened 2 months ago
The differences are due to two main reasons:
For reference, here are the relevant commits in the pycantonese codebase:
Perhaps it's possible (ideal?) to re-do the CHAT format conversion work by pulling data from https://github.com/fcbond/hkcancor and keep everything (including the preprocessing/conversion code) properly versioned so that we would be able to track these things -- a project for another day :-)
I see, thanks for your detailed answer.
The version of HKCanCor published on HuggingFace by NTU is different from the version offered by this library in at least four undocumented ways:
"一路一路剝.
that starts with a quote.

Would it be possible to explain these differences, and any preprocessing steps done by PyCantonese, somewhere in the docs?
You can use the following script to compare the two versions of the corpus:
Outputs:
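
For anyone who wants to run a similar check, here is a minimal sketch of how such a comparison could be structured. It is not the script from this issue. It assumes each version of the corpus has already been loaded into a plain list of utterances, each utterance being a list of token strings. The function name `compare_utterances` and the toy data are invented for illustration, and the leading-quote example only mimics the kind of difference described above; it does not say which version actually contains the quote.

```python
from difflib import SequenceMatcher


def compare_utterances(a, b):
    """Return (tag, a_chunk, b_chunk) for every region where a and b differ.

    a, b: lists of utterances, each utterance a list of token strings.
    tag is one of "replace", "delete", "insert" (difflib opcode names).
    """
    # Join tokens so each utterance becomes one hashable, printable string.
    a_keys = [" ".join(utt) for utt in a]
    b_keys = [" ".join(utt) for utt in b]
    # autojunk=False avoids difflib's heuristic that ignores very frequent items.
    matcher = SequenceMatcher(a=a_keys, b=b_keys, autojunk=False)
    return [
        (tag, a_keys[i1:i2], b_keys[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Toy stand-ins for the two corpus versions (hypothetical data, not real corpus content).
version_a = [["喂"], ['"', "一路", "一路", "剝"], ["係", "啊"]]
version_b = [["喂"], ["一路", "一路", "剝"], ["係", "啊"], ["好"]]

diffs = compare_utterances(version_a, version_b)
for tag, a_chunk, b_chunk in diffs:
    print(tag, a_chunk, b_chunk)
```

Aligning whole utterances first, rather than comparing position by position, keeps one missing or extra utterance from making every later utterance look different.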