unDocUMeantIt / koRpus

An R Package for Text Analysis
GNU General Public License v3.0

Error with Russian - This is probably due to a missing tag in kRp.POS.tags() and needs to be fixed #13

Closed pwwolff closed 6 years ago

pwwolff commented 6 years ago

I'm trying to run an analysis of Russian literary texts. TreeTagger is installed and works from the command line:

cat ~/Projects/R_RussianNLP/Texts/Chechov.txt | cmd/tree-tagger-russian

When I run it through koRpus, however, I get the following error message:

tagged.text <- treetag("./Texts/Chechov.txt", lang="ru", treetagger="manual", TT.options=list(path="~/Downloads/TreeTagger", preset="ru"))

Error: Invalid tag(s) found: P--nsaa, P--msga, P--fsla, Vmip3s-a-e, P--fsaa, P-3msdn, P--nsin, P--fsia, P--nsnn, P--msda, P-----r, Vmip3p-m-e, Afpmpnf, Afpmpgf, Vmip3s-m-e, P--fsna, P-3msnn, P-2-snn, Mc---d, P-2-sgn, P--nsna, P-2-sdn, P-----a, P--msia, P-2-san, P-3nsnn, Vmif3s-m-p, P--fsga, P-1-snn, P-2-pin, Ncmsnnp, Vmgp---a-e, P----dn, P-2-pdn, P---pna, Afpmpaf, Afcmsnf, P-1-sgn, Vmip1p-a-e, P-2-pan, Vmip2p-a-e, Vmip1s-a-e, P--msna, P-2-pnn, P-3-pdn, P---pda, P-1-pan, P---paa, Rc, P-1-san, P-1-pdn, P-2-pgn, P--msaa, P-3msin, P----an, P--nsgn, Vmps-smpfpg, Vmpp-smafeg, Vmpp-smmfeg, P---pga, Vmip3p-a-e, P-3fsnn, Vmpp-p-afen, P---pla, Afpmplf, Vmip1s-m-e, Vmip2s-a-e, P-3fsdn, Afpnsns, Vmgp---m-e, Vmps-sfpfpn, P--msla, Afpfsns, P-3fsin, Afpmsns, P--nsdn, P-3msan, P--nsln, P-3-pan, Vmgs---a-p, P----in, Afpmpif This is probably due to a missing tag in kRp.POS.tags() and needs to be fixed. It would be nice if you could forward the above error dump as a bug report to the package maint

Am I doing something wrong here?

unDocUMeantIt commented 6 years ago

you're doing nothing wrong. the problem was reported to me on the devel mailing list a while ago, and i actually got in touch with the maintainer of the Russian TreeTagger parameter file. unfortunately, i still don't have enough information on all possible Russian tags. the list of tags actually in use is obviously more elaborate than the official tagset documentation, and we can't implement undocumented tags.

background is that koRpus uses a list of all possible tags to be able to add tag-specific extra information (e.g., a generic word class like noun, verb etc.). when TreeTagger returns tags that koRpus doesn't recognize, it throws the above error. to be more precise, it used to throw it, because this behaviour was changed with the latest development release.

as of koRpus 0.11-3, you only get a warning and koRpus adds "unknown" or "NA" to tags it does not understand. so you are able to keep working with the tagged object, but for the unrecognized tags it will only contain the raw POS tag and no extra information.
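
to illustrate the idea, here is a minimal base R sketch of such a tag lookup with a fallback. it is not the actual koRpus code: `known_tags`, its three entries and `classify_tags()` are made up for this example, and only the tag `P--nsaa` is taken from the error dump above.

```r
# Hypothetical tag table mapping a POS tag to a generic word class.
# The entries are illustrative only, not the real Russian tagset.
known_tags <- c(Ncmsnn = "noun", Vmip3s = "verb", Afpmsnf = "adjective")

# Look up the word class for each tag. Unrecognized tags trigger a
# warning and are labelled "unknown"; the older behaviour described
# above would correspond to calling stop() here instead.
classify_tags <- function(tags, tag_table = known_tags) {
  wclass <- unname(tag_table[tags])
  is_unknown <- is.na(wclass)
  if (any(is_unknown)) {
    warning(
      "Invalid tag(s) found: ",
      paste(unique(tags[is_unknown]), collapse = ", ")
    )
    wclass[is_unknown] <- "unknown"
  }
  data.frame(tag = tags, wclass = wclass, stringsAsFactors = FALSE)
}

# "P--nsaa" is not in the table, so it keeps its raw tag
# and gets the word class "unknown"
result <- classify_tags(c("Ncmsnn", "P--nsaa", "Vmip3s"))
print(result)
```

the raw tag is kept in the `tag` column either way, so nothing returned by the tagger is lost.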

note that there have been some changes to the handling of language support in 0.11, i.e., you will have to install the language package koRpus.lang.ru and load it to be able to use Russian texts. see ?available.koRpus.lang() and ?install.koRpus.lang() for more info on this, and the README.md for information on how to install 0.11-3.
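
a rough sketch of those steps, assuming koRpus 0.11-3 or later is already installed and the language package is available from the default repository. the file and TreeTagger paths are the ones from the original report and will differ on other machines.

```r
# Assumes koRpus >= 0.11-3 is already installed (see the README.md)
library(koRpus)

# Show which language packages can be installed
available.koRpus.lang()

# Install the Russian language package, then load it
install.koRpus.lang("ru")
library(koRpus.lang.ru)

# Same call as in the original report; with 0.11-3 unknown tags
# should lead to a warning instead of an error
tagged.text <- treetag(
  "./Texts/Chechov.txt",
  lang = "ru",
  treetagger = "manual",
  TT.options = list(path = "~/Downloads/TreeTagger", preset = "ru")
)
```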

pwwolff commented 6 years ago

Thanks for the quick reply. Indeed, it makes sense that this should throw a warning rather than an error.

In that case I'll pull the latest dev release and try my luck with that.