j-chim closed 3 years ago
Hello, thank you for the pull request. Where did you actually find the POS tags you're trying to add here? The _MAP dict already contains all and only the POS tags actually used in the HKCanCor data. None of your added ones are found in the data.
There are some that exist only in the paper/tagging scheme but not in the corpus (the morpheme-related ones, e.g., "Bg", "g", "Qg"). I think it would make sense to include them for the sake of coverage, although the changes are minimal in practice.
The remaining two are really just edge cases, possibly introduced in v2 of the corpus:
This is based on the corpus hosted on GitHub, and I found the same entries in the data downloaded directly from the website.
Thank you for the pointers. I wasn't aware of differences between the HKCanCor included in PyCantonese and that from the fcbond/hkcancor repo, and therefore the {N1, XJA, XO} tags were unknown to me. I got the HKCanCor data about 6 years ago and did a lot of heavy scripting to transform it into the CHAT data format currently used in PyCantonese. Unfortunately, I'm unable to locate whatever I used to do the transformation, and so probably won't be able to update the HKCanCor data here for any inconsistencies.
For all these tags you're adding (both the new tags only found in the "upstream" HKCanCor data, as well as those documented in its paper/website but actually unused in the data), practically they're unlikely to have any effect in part-of-speech tagging, since the POS tagger would never see them in the data anyway. This being said, precisely because these tags have no effect and just sit there in the _MAP dict, I don't see any harm including them for completeness!
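For illustration, the "all and only" relationship between the map and the corpus can be checked with two set differences. This is only a minimal sketch: `compare_coverage`, `tag_map`, and `observed_tags` are hypothetical names, and the entries are placeholders rather than the real _MAP contents or real corpus tags.

```python
def compare_coverage(tag_map, observed_tags):
    """Return (unmapped, unused) tag sets.

    unmapped: tags seen in the data but missing from the map.
    unused:   map keys that never occur in the data.
    """
    keys = set(tag_map)
    observed = set(observed_tags)
    return observed - keys, keys - observed


# Placeholder values for illustration only (not the actual PyCantonese map).
tag_map = {"N": "NOUN", "V": "VERB", "Bg": "ADJ"}
observed_tags = ["N", "V", "N", "XJA"]

unmapped, unused = compare_coverage(tag_map, observed_tags)
print(sorted(unmapped))  # tags that would fall through to a default
print(sorted(unused))    # entries that never apply to this data
```

An entry that lands in `unused` is harmless in the sense described above: it is never looked up for this data, so it cannot change any tagging result.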
LGTM. Thank you for your pull request again.
@j-chim Just wanted to note that your contribution has been noted in the readme now (I also updated the new X-initial edge case tags to better match what the data would suggest rather than a generic catch-all X):
https://github.com/jacksonllee/pycantonese/commit/85a2a8286c9d24e8cf6ae1621d79ab0ac73b862a
Thanks again!
Hi, I was working with the hkcancor dataset and saw the mapping here. Many thanks for providing this resource!
This PR slightly modifies the _MAP to include edge case POS tags (mostly morpheme-related ones) that I think we should consider as an explicit entry, rather than being bucketed to "X" by default.
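To make the difference concrete, here is a rough sketch of a lookup that buckets unknown tags into "X" by default, and of how an explicit entry changes the result. The helper name `map_tag` and the mapping targets shown are assumptions for illustration, not the actual PyCantonese code or the mappings proposed in this PR.

```python
def map_tag(tag, tag_map):
    """Look up a corpus tag, falling back to the catch-all "X"."""
    return tag_map.get(tag, "X")


# Hypothetical subset of a tag map before the change.
before = {"N": "NOUN", "V": "VERB"}

# Same map with one explicit morpheme-related entry added.
# The target "ADJ" is a placeholder, not a claim about the real mapping.
after = {**before, "Bg": "ADJ"}

print(map_tag("Bg", before))  # falls back to the default bucket
print(map_tag("Bg", after))   # uses the explicit entry
print(map_tag("N", after))    # existing entries are unaffected
```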