Closed basbaccarne closed 7 years ago
thanks for reporting!
well, apart from the really strange tags in the error message (numbers as tags?) this is not how you should tag dutch texts. if you set the preset to "en" (which is "english") but use the dutch tagging script, TreeTagger will likely return incompatible tags.
there is a dutch language package: https://reaktanz.de/R/pckg/koRpus.lang.nl/
please let me know if you need assistance to get it working.
With the koRpus.lang.nl package, it is indeed possible to select nl as a language, but the output still produces the same error:
library("koRpus")
library("koRpus.lang.nl")
set.kRp.env(TT.cmd="c://treetagger/bin/tag-dutch.bat", lang="nl", preset="nl")
output <- treetag(text, format="obj", TT.options=list(path="c://TreeTagger", preset="nl"))
token tag lemma
1 Ze 300 ze
2 zien 254 zien
3 wat 400 wat
4 we 300 we
5 liever 154 lief
6 verborgen 219 verborgen
7 houden 210 houden
8 , PUNCT ,
9 brengen 254 brengen
10 orde 000 orde
11 in 600 in
12 onze 333 onze
13 chaos 000 chaos
14 en 700 en
15 zetten 256 zetten
16 ons 303 ons
17 al 500 al
18 eens 500 eens
19 een 450 een
20 spiegel 000 spiegel
21 voor 6105 voor
23 We 300 we
24 geven 254 geven
25 weinigen 441 weinig
26 meer 454 meer
27 inkijk 000 inkijk
28 dan 720 dan
29 de 370 de
30 poetshulp* 010 poetshulp*
32 Ik 000 Ik
33 kijk 247 kijken
35 hun 330 hun
36 ogen 001 oog
38 ik 300 ik
39 weet 251 weten
40 meteen 500 meteen
41 welk 410 welk
42 vlees 000 vlees
46 kuip 000 kuip
47 heb 252 hebben
Error: Invalid tag(s) found: 300, 254, 400, 154, 219, 210, PUNCT, 000, 600, 333, 700, 256, 303, 500, 450, 6105, 441, 454, 720, 370, 010, 247, 330, 001, 251, 410, 252
This is probably due to a missing tag in kRp.POS.tags() and needs to be fixed. It would be nice if you could forward the above error dump as a bug report to the package maintaner!
try this (i.e., omit the batch script):
set.kRp.env(TT.cmd="manual", TT.options=list(path="c://treetagger", preset="nl"), lang="nl")
and also omit the "TT.options" in your treetag() call, they're already set by set.kRp.env.
This produces
Error in paste(TT.splitter, "perl ", TT.tokenizer, TT.tknz.opts, TT.call.file, :
object 'TT.call.file' not found
ouch, now you've discovered a genuine bug in treetag() :-D at least, in the windows version.
i hope i fixed it, see commit 98c405978174f7928dfdead4d5b3ca39467616e6. can you test the package from the "develop" branch as described in this section: https://github.com/unDocUMeantIt/koRpus/tree/develop#installation-via-github ?
if that's not an option i could also build a windows package for testing, if you send me your e-mail address (mine's in the package description).
The object 'TT.call.file' is now found, but the output produces the first error again:
library(devtools)
install_github("unDocUMeantIt/koRpus", ref="develop", force = TRUE)
library("koRpus")
library("koRpus.lang.nl")
set.kRp.env(TT.cmd="manual", TT.options=list(path="c://treetagger", preset="nl"), lang="nl")
output <- treetag(testobject, format="obj", TT.options=list(path="c://TreeTagger", preset="nl"))
token tag lemma
1 Ze 300 ze
2 zien 254 zien
3 wat 400 wat
4 we 300 we
5 liever 154 lief
6 verborgen 219 verborgen
7 houden 210 houden
8 , PUNCT ,
9 brengen 254 brengen
[...]
41 welk 410 welk
42 vlees 000 vlees
46 kuip 000 <unknown>
47 heb 252 hebben
Error: Invalid tag(s) found: 300, 254, 400, 154, 219, 210, PUNCT, 000, 600, 333, 700, 256, 303, 500, 450, 6105, 441, 454, 720, 370, 010, 247, 330, 001, 251, 410, 252
This is probably due to a missing tag in kRp.POS.tags() and
needs to be fixed. It would be nice if you could forward the
above error dump as a bug report to the package maintaner!
thank you for testing, good to know we're one bug down.
now, that error is really odd. two things:
can you set debug=TRUE in your treetag() call and post the output (that is, treetag(testobject, format="obj", debug=TRUE); you don't need to repeat TT.options)? it should include the full TreeTagger command that is being executed in the background. you should be able to copy&paste all of this command and run it in a windows cmd.exe shell -- if that returns numbers as tags already, then the problem could be on TreeTagger's side (i.e. your local TreeTagger configuration).
oh wait, i think i found the root of the problem -- are you using the dutch2 parameter set trained on the eindhoven corpus? i don't speak dutch, but it looks to me like that one uses a totally different tagset definition: http://tst-centrale.org/images/stories/producten/documentatie/ehc_handleiding_nl.pdf
to check this, can you temporarily replace your parameter file with the first alternative from the TreeTagger webpage: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/dutch-par-linux-3.2-utf8.bin.gz
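A quick way to tell which kind of tagset a parameter file emits is to look at the tags themselves: the output posted above is almost entirely numeric, and those are exactly the tags the "Invalid tag(s)" error lists. The following is only a minimal base-R sketch, not part of koRpus. The helper name looks_numeric_tagset and the 0.5 threshold are assumptions made for illustration.

```r
# Hypothetical helper (not part of koRpus): report whether most of the
# distinct tags in a tagger's output are purely numeric, as in the
# output posted in this thread.
looks_numeric_tagset <- function(tags, threshold = 0.5) {
  tags <- unique(as.character(tags))
  if (length(tags) == 0) {
    return(FALSE)
  }
  # share of distinct tags that consist of digits only
  numeric_share <- mean(grepl("^[0-9]+$", tags))
  numeric_share > threshold
}

# Some of the tags copied from the error message in this thread
reported <- c("300", "254", "400", "154", "219", "210", "PUNCT", "000", "600")
looks_numeric_tagset(reported)

# Placeholder alphabetic tags (assumed for illustration, not a real tagset)
looks_numeric_tagset(c("NOUN", "VERB", "PUNCT"))
```

If the first call returns TRUE for your own output, the parameter file probably uses a tagset that the installed language package does not define, and swapping the parameter file as suggested above is the simpler test.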
That was it!
This version of dutch-utf8.par works like a charm.
What a relief, thank you for solving this problem and the quick responses.
ok, that's a relief :-) i'm closing the issue, then.
btw, if you feel up for it, you could contribute the missing tag definitions to the koRpus.lang.nl package, so users can use both parameter files. it's not really complicated -- let me know and i'll walk you through the process.
This might be interesting to incorporate in one of our student projects. Can you contact me at bastiaan.baccarne@ugent.be?
When I try to run the following code on a (Dutch) character object I get the error below:
library("koRpus")
set.kRp.env(TT.cmd="c://treetagger/bin/tag-dutch.bat", lang="en", preset="en")
output <- treetag(text, format="obj", TT.options=list(path="c://TreeTagger", preset="en"))
Error: Invalid tag(s) found: 300, 254, 400, 154, 219, 210, PUNCT, 000, 600, 333, 700, 256, 303, 500, 450, 6105, 441, 454, 720, 370, 010, 247, 330, 001, 251, 410, 252
This is probably due to a missing tag in kRp.POS.tags() and needs to be fixed.
It would be nice if you could forward the above error dump as a bug report to the package maintaner!
I tried different solutions but none of them got me closer to a solution. I don't know if this is a problem related to my own configuration or the package code, but since the error message asked to forward the error dump as a bug report, I'm putting it here.