jacksonllee / pycantonese

Cantonese Linguistics and NLP
https://pycantonese.org
MIT License

Add entries to pos map #22

Closed: j-chim closed 3 years ago

j-chim commented 3 years ago

Hi, I was working with the hkcancor dataset and saw the mapping here. Many thanks for providing this resource!

This PR slightly modifies _MAP to include edge-case POS tags (mostly morpheme-related ones) that I think should have explicit entries, rather than being bucketed into "X" by default.
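As a minimal sketch of the fallback behaviour described above: the dict below is a hypothetical, abbreviated stand-in for _MAP, and `to_ud` is a made-up helper name. The tag-to-value pairs are illustrative assumptions, not the library's actual entries.

```python
# Hypothetical stand-in for the _MAP dict discussed in this PR.
# Keys and values are illustrative assumptions, not PyCantonese's real contents.
_MAP = {
    "N": "NOUN",
    "V": "VERB",
    "A": "ADJ",
}


def to_ud(tag: str) -> str:
    """Return the mapped tag, or the catch-all "X" when the tag has no entry."""
    return _MAP.get(tag, "X")


# A tag without an explicit entry silently falls into the "X" bucket.
before = to_ud("Bg")

# Adding an explicit entry (the value here is an assumption) changes only that tag.
_MAP["Bg"] = "ADJ"
after = to_ud("Bg")
```

With a default like this, a missing key never raises an error, which is why an unmapped tag can go unnoticed until someone compares the map against the tagging scheme.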

jacksonllee commented 3 years ago

Hello, thank you for the pull request. Where did you actually find the POS tags you're trying to add here? The _MAP dict already contains all and only the POS tags actually used in the HKCanCor data. None of your added ones are found in the data.

j-chim commented 3 years ago

There are some that only exist in the paper/tagging scheme but not in the corpus (the morpheme-related ones, eg "Bg", "g", "Qg"). I think it would make sense to include them for the sake of coverage, although the changes are minimal in practice.

The remaining two are really just edge cases, possibly introduced in v2 of the corpus:

This is based on the corpus hosted on GitHub, and I found the same entries in the data downloaded directly from the website.
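The comparison described in this comment (tags documented in the scheme versus tags present in a corpus copy versus keys in the map) can be sketched with set differences. All three sets below are small, made-up samples for illustration; only "Bg", "g", and "Qg" are taken from the comment, and none of this reads the real HKCanCor data.

```python
# Illustrative samples only; these are assumptions, not the real tag inventories.
mapped_tags = {"N", "V", "A"}                        # keys of a hypothetical map
documented_tags = {"N", "V", "A", "Bg", "g", "Qg"}   # tags listed in the tagging scheme
observed_tags = {"N", "V", "A"}                      # tags seen in one corpus copy

# Tags the scheme documents but the corpus copy never uses.
documented_but_unused = documented_tags - observed_tags

# Tags known from either source that have no explicit entry in the map.
unmapped = (documented_tags | observed_tags) - mapped_tags

print(sorted(documented_but_unused))
print(sorted(unmapped))
```

Running the same comparison against two copies of the corpus would surface tags that exist in one copy but not the other.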

jacksonllee commented 3 years ago

Thank you for the pointers. I wasn't aware of differences between the HKCanCor included in PyCantonese and that from the fcbond/hkcancor repo, and therefore the {N1, XJA, XO} tags were unknown to me. I got the HKCanCor data about 6 years ago and did a lot of heavy scripting to transform it into the CHAT data format currently used in PyCantonese. Unfortunately, I'm unable to locate whatever I used to do the transformation, and so probably won't be able to update the HKCanCor data here for any inconsistencies.

For all the tags you're adding (both the new tags found only in the "upstream" HKCanCor data and those documented in its paper/website but unused in the data), they're unlikely to have any practical effect on part-of-speech tagging, since the POS tagger would never see them in the data anyway. That said, precisely because these tags have no effect and just sit there in the _MAP dict, I don't see any harm in including them for completeness!

LGTM. Thank you for your pull request again.

jacksonllee commented 3 years ago

@j-chim Just wanted to let you know that your contribution is now acknowledged in the readme (I also updated the new X-initial edge case tags to better match what the data would suggest, rather than a generic catch-all X):

https://github.com/jacksonllee/pycantonese/commit/85a2a8286c9d24e8cf6ae1621d79ab0ac73b862a

Thanks again!