unDocUMeantIt / koRpus

An R Package for Text Analysis
GNU General Public License v3.0
45 stars, 6 forks

Error in russian pos tags: Invalid tag(s) found: Mc---d #18

Closed. yaqinwang closed this issue 5 years ago.

yaqinwang commented 5 years ago

Hello, I was trying to analyse a Russian text using koRpus:

    tagged.text <- treetag("sample_text.txt", treetagger="manual", lang="en", TT.options=list(path="~/bin/treetagger/", preset="ru"), doc_id="sample")

However, there was an error message when I ran the command:

"Invalid tag(s) found: Mc---d This is probably due to a missing tag in kRp.POS.tags() and needs to be fixed. It would be nice if you could forward the above warning dump as a bug report to the package maintaner!"

I've read about a previous issue concerning the Russian tags and downloaded the newest release as suggested. However, the problem still exists and I got the above message again.

Is there any solution? Thanks in advance.

unDocUMeantIt commented 5 years ago

if you're using a recent version of koRpus, this is actually not an error but just a warning, i.e., the calculation still finishes; you only end up with a text object that is missing a global word class for all tokens that were tagged with the unknown tag.

if you could tell me what the tag "Mc---d" tags exactly (what does it stand for?), i will add it to the russian language package to fix the issue.

yaqinwang commented 5 years ago

Thank you for your quick reply. I just found out how to fix the problem. The input text was saved in ANSI encoding by default, but it should have been UTF-8. I changed the encoding and everything ran smoothly. I looked it up in the tagset, and the tag "Mc--d" stands for the numeral dative.
If you don't mind, I would also like to ask another naive question. In the tagged results, a lot of tokens are tagged as "unknown". I'm wondering whether the unknown word class would affect the results of lexical diversity measures such as TTR and MATTR. Thanks again!
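For readers who hit the same warning, the encoding fix described above could be scripted in base R along these lines. This is only a sketch: the helper name `to_utf8` and the file names are hypothetical, and `"CP1251"` is an assumption about what "ANSI" meant on the machine in question (it is a common Windows code page for Cyrillic text, but the thread does not say which one was used).

```r
# Hypothetical helper: re-encode a text file to UTF-8 before tagging it.
# ASSUMPTION: the source encoding is CP1251; adjust `from` to match
# the actual encoding of your file.
to_utf8 <- function(infile, outfile, from = "CP1251") {
  # Read the lines as they are stored on disk
  lines <- readLines(infile, warn = FALSE)
  # Convert from the assumed source encoding to UTF-8
  converted <- iconv(lines, from = from, to = "UTF-8")
  # iconv() returns NA for lines it cannot convert
  if (anyNA(converted)) {
    stop("Some lines could not be converted from ", from, " to UTF-8")
  }
  # Write the UTF-8 bytes unchanged, without re-encoding to the locale
  writeLines(converted, outfile, useBytes = TRUE)
  invisible(outfile)
}

# Example call (file names are placeholders):
# to_utf8("sample_text.txt", "sample_text_utf8.txt")
```

The converted file could then be passed to treetag() in place of the original one.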

unDocUMeantIt commented 5 years ago

the odd thing is that the tag you reported was "Mc---d", but "Mc--d" is both defined and documented (i.e., koRpus knows the tag with two dashes, not three). no idea how TreeTagger came up with that tag.

but you should be fine, the lexical diversity measures don't care about POS tags. you could actually just use tokenize() instead of treetag() if lex.div() is all you want afterwards.
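To illustrate why POS tags play no role here, the following base R sketch computes TTR and MATTR from a plain vector of tokens. It is not koRpus's implementation: the function names `ttr` and `mattr`, the lowercasing step, and the toy token vector are assumptions made for this example only.

```r
# Type-token ratio: number of distinct tokens divided by number of tokens.
# ASSUMPTION: tokens are compared case-insensitively.
ttr <- function(tokens) {
  tokens <- tolower(tokens)
  length(unique(tokens)) / length(tokens)
}

# Moving-average TTR: the mean TTR over every window of `window`
# consecutive tokens.
mattr <- function(tokens, window = 100) {
  tokens <- tolower(tokens)
  n <- length(tokens)
  if (n < window) {
    stop("The text has fewer tokens than the window size")
  }
  starts <- seq_len(n - window + 1)
  window_ttrs <- vapply(starts, function(i) {
    w <- tokens[i:(i + window - 1)]
    length(unique(w)) / window
  }, numeric(1))
  mean(window_ttrs)
}

# Toy input: only the tokens themselves are needed, no tags at all
tokens <- c("the", "cat", "saw", "the", "dog", "and",
            "the", "dog", "saw", "the", "cat")
ttr(tokens)
mattr(tokens, window = 5)
```

Both functions only ever look at the token strings, so a token tagged as "unknown" counts exactly like any other token.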

yaqinwang commented 5 years ago

Yeah, I also found it strange. I suppose it was probably caused by the ANSI encoding of the original text (or something else?). Anyway, the problem is solved. Thanks a lot!