unDocUMeantIt / koRpus

An R Package for Text Analysis
GNU General Public License v3.0

Invalid tag(s) #5

Closed basbaccarne closed 7 years ago

basbaccarne commented 7 years ago

When I try to run the following code on a (Dutch) character object I get the error below:

library("koRpus")
set.kRp.env(TT.cmd="c://treetagger/bin/tag-dutch.bat", lang="en", preset="en")
output <- treetag(text, format="obj", TT.options=list(path="c://TreeTagger", preset="en"))


Error: Invalid tag(s) found: 300, 254, 400, 154, 219, 210, PUNCT, 000, 600, 333, 700, 256, 303, 500, 450, 6105, 441, 454, 720, 370, 010, 247, 330, 001, 251, 410, 252
  This is probably due to a missing tag in kRp.POS.tags() and
  needs to be fixed. It would be nice if you could forward the
  above error dump as a bug report to the package maintaner!


I tried several approaches, but none of them got me any closer to a fix. I don't know whether the problem lies in my own configuration or in the package code, but since the error message asks to forward the error dump as a bug report, I'm posting it here.

unDocUMeantIt commented 7 years ago

thanks for reporting!

well, apart from the really strange tags in the error message (numbers as tags?) this is not how you should tag dutch texts. if you set the preset to "en" (which is "english") but use the dutch tagging script, TreeTagger will likely return tags that are incompatible with the english tagset.

there is a dutch language package: https://reaktanz.de/R/pckg/koRpus.lang.nl/

please let me know if you need assistance to get it working.
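
The mismatch described above can be illustrated with a small base-R sketch. It is not how koRpus implements its check internally; it only assumes that a language preset comes with a fixed set of known tags and that anything outside that set gets reported. The function name `find_unknown_tags` is made up for this example, and `known_tags` is a tiny illustrative subset, not a complete tagset of any preset.

```r
# Hypothetical helper (not part of koRpus): return every tag that is
# missing from the tagset the chosen preset expects.
find_unknown_tags <- function(returned_tags, known_tags) {
  unique(returned_tags[!returned_tags %in% known_tags])
}

# Assumption: a few Penn-Treebank-style tags standing in for the
# tagset of the "en" preset. The real tagset is much larger.
known_tags <- c("NN", "VB", "DT")

# The first few tags from the error dump in this thread.
returned_tags <- c("300", "254", "400", "300", "154")

unknown <- find_unknown_tags(returned_tags, known_tags)
if (length(unknown) > 0) {
  message("Tags not in the expected tagset: ", paste(unknown, collapse = ", "))
}
```

If the tagger and the preset agree on a tagset, `unknown` is empty; here every returned tag falls outside the expected set, which is the situation the error message reports.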

basbaccarne commented 7 years ago

With the koRpus.lang.nl package, it is indeed possible to select nl as a language, but the output still produces the same error:

library("koRpus")
library("koRpus.lang.nl")
set.kRp.env(TT.cmd="c://treetagger/bin/tag-dutch.bat", lang="nl", preset="nl")
output <- treetag(text, format="obj", TT.options=list(path="c://TreeTagger", preset="nl"))


        token   tag      lemma
 1         Ze   300         ze
 2       zien   254       zien
 3        wat   400        wat
 4         we   300         we
 5     liever   154       lief
 6  verborgen   219  verborgen
 7     houden   210     houden
 8          , PUNCT          ,
 9    brengen   254    brengen
10       orde   000       orde
11         in   600         in
12       onze   333       onze
13      chaos   000      chaos
14         en   700         en
15     zetten   256     zetten
16        ons   303        ons
17         al   500         al
18       eens   500       eens
19        een   450        een
20    spiegel   000    spiegel
21       voor  6105       voor
23         We   300         we
24      geven   254      geven
25   weinigen   441     weinig
26       meer   454       meer
27     inkijk   000     inkijk
28        dan   720        dan
29         de   370         de
30 poetshulp*   010 poetshulp*
32         Ik   000         Ik
33       kijk   247     kijken
35        hun   330        hun
36       ogen   001        oog
38         ik   300         ik
39       weet   251      weten
40     meteen   500     meteen
41       welk   410       welk
42      vlees   000      vlees
46       kuip   000       kuip
47        heb   252     hebben
Error: Invalid tag(s) found: 300, 254, 400, 154, 219, 210, PUNCT, 000, 600, 333, 700, 256, 303, 500, 450, 6105, 441, 454, 720, 370, 010, 247, 330, 001, 251, 410, 252
  This is probably due to a missing tag in kRp.POS.tags() and
  needs to be fixed. It would be nice if you could forward the
  above error dump as a bug report to the package maintaner!

unDocUMeantIt commented 7 years ago

try this (i.e., omit the batch script):

set.kRp.env(TT.cmd="manual", TT.options=list(path="c://treetagger", preset="nl"), lang="nl")

and also omit the "TT.options" in your treetag() call, they're already set by set.kRp.env.

basbaccarne commented 7 years ago

This produces

Error in paste(TT.splitter, "perl ", TT.tokenizer, TT.tknz.opts, TT.call.file, : object 'TT.call.file' not found

unDocUMeantIt commented 7 years ago

ouch, now you've discovered a genuine bug in treetag() :-D at least, in the windows version.

i hope i fixed it, see commit 98c405978174f7928dfdead4d5b3ca39467616e6. can you test the package from the "develop" branch as described in this section: https://github.com/unDocUMeantIt/koRpus/tree/develop#installation-via-github ?

if that's not an option i could also build a windows package for testing, if you send me your e-mail address (mine's in the package description).

basbaccarne commented 7 years ago

The object 'TT.call.file' is now found, but the output produces the first error again:

library(devtools)
install_github("unDocUMeantIt/koRpus", ref="develop", force = TRUE)
library("koRpus")
library("koRpus.lang.nl")
set.kRp.env(TT.cmd="manual", TT.options=list(path="c://treetagger", preset="nl"), lang="nl")
output <- treetag(testobject, format="obj", TT.options=list(path="c://TreeTagger", preset="nl"))

        token   tag     lemma
1          Ze   300        ze
2        zien   254      zien
3         wat   400       wat
4          we   300        we
5      liever   154      lief
6   verborgen   219 verborgen
7      houden   210    houden
8           , PUNCT         ,
9     brengen   254   brengen
[...]
41       welk   410      welk
42      vlees   000     vlees
46       kuip   000 <unknown>
47        heb   252    hebben
Error: Invalid tag(s) found: 300, 254, 400, 154, 219, 210, PUNCT, 000, 600, 333, 700, 256, 303, 500, 450, 6105, 441, 454, 720, 370, 010, 247, 330, 001, 251, 410, 252
  This is probably due to a missing tag in kRp.POS.tags() and
  needs to be fixed. It would be nice if you could forward the
  above error dump as a bug report to the package maintaner!

unDocUMeantIt commented 7 years ago

thank you for testing, good to know we're one bug down.

now, that error is really odd. two things:

  1. can you send me the file you're tagging for debugging purposes? i would like to replicate the problem on my side.
  2. can you set debug=TRUE in your treetag() call and post the output (that is treetag(testobject, format="obj", debug=TRUE), you don't need to repeat TT.options)? it should include the full TreeTagger command that is being executed in the background. you should be able to copy&paste all of this command and run it in a windows cmd.exe shell -- if that returns numbers as tags already, then the problem could be on TreeTagger's side (i.e. your local TreeTagger configuration)

unDocUMeantIt commented 7 years ago

oh wait, i think i found the root of the problem -- are you using the dutch2 parameter set trained on the eindhoven corpus? i don't speak dutch, but it looks to me like that one uses a totally different tagset definition: http://tst-centrale.org/images/stories/producten/documentatie/ehc_handleiding_nl.pdf

to check this, can you temporarily replace your parameter file with the first alternative from the TreeTagger webpage: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/dutch-par-linux-3.2-utf8.bin.gz
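
A quick way to tell whether a parameter file uses a numeric tagset is to look at the tags themselves: the ones in the error dump above are almost all purely numeric. The following is a minimal base-R sketch of that check. It assumes the raw tagger output is available as tab-separated token/tag/lemma lines; the sample lines are copied from the output posted earlier in this thread, and the names `raw_lines` and `is_numeric_tag` as well as the 0.5 threshold are arbitrary choices made for this example.

```r
# Assumption: one tab-separated "token<TAB>tag<TAB>lemma" line per token.
raw_lines <- c(
  "Ze\t300\tze",
  "zien\t254\tzien",
  ",\tPUNCT\t,",
  "orde\t000\torde"
)

# Split every line at the tabs and stack the pieces into a character
# matrix, so that tags such as "000" stay strings and are not turned
# into the number 0.
fields <- do.call(rbind, strsplit(raw_lines, "\t", fixed = TRUE))
colnames(fields) <- c("token", "tag", "lemma")

# A tag counts as numeric if it consists of digits only.
is_numeric_tag <- grepl("^[0-9]+$", fields[, "tag"])
share_numeric <- mean(is_numeric_tag)

# Arbitrary threshold: warn if more than half of the tags are numeric.
if (share_numeric > 0.5) {
  message("Most tags are numeric; the parameter file probably uses ",
          "a different tagset than the language package expects.")
}
```

This only hints at a tagset mismatch; swapping the parameter file, as suggested above, is the actual test.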

basbaccarne commented 7 years ago

That was it! This version of dutch-utf8.par works like a charm. What a relief, thank you for solving this problem and for the quick responses.

unDocUMeantIt commented 7 years ago

ok, that's a relief :-) i'm closing the issue, then.

btw, if you feel up for it, you could contribute the missing tag definitions to the koRpus.lang.nl package, so users can use both parameter files. it's not really complicated -- let me know and i'll walk you through the process.

basbaccarne commented 7 years ago

This might be interesting to incorporate in one of our student projects. Can you contact me at bastiaan.baccarne@ugent.be?
