nlanderson9 closed this issue 5 years ago
i can't replicate this on linux, where i get this:
doc_id token tag lemma lttr wclass desc stop stem idx sntc
1 <NA> This DT this 4 determiner NA NA NA 1 NA
2 <NA> is VBZ be 2 verb NA NA NA 2 NA
3 <NA> a DT a 1 determiner NA NA NA 3 NA
4 <NA> test NN test 4 noun NA NA NA 4 NA
can you please set debug=TRUE and post the content of the resulting test object again? it should contain the raw results of TreeTagger, so we can try to narrow down the issue.
Hopefully this is what you're looking for! Thanks for your help.
text = "This is a test"
test = treetag(text, treetagger = "manual", format = "obj", TT.tknz=FALSE, lang="en", debug=TRUE, TT.options = list(path="./TreeTagger", preset="en"))
# split=[[:space:]]
# ign.comp=-
# heuristics=abbr
# heur.fix=c("’", "'"), c("’", "'")
# sentc.end=., !, ?, ;, :
# detect=FALSE, FALSE
# clean.raw=
# perl=FALSE
# stopwords=
# stemmer=
#
# TT.tokenizer: koRpus::tokenize()
# tempfile: /var/folders/p_/57zxn85107lbm6p75fpfpf080000gn/T//RtmpTcx2XX/tokenize82d91a849087.txt
# file: /var/folders/p_/57zxn85107lbm6p75fpfpf080000gn/T//RtmpTcx2XX/tempTextFromObject82d93dd7ac99.txt
# TT.lookup.command:
# TT.pre.tagger: grep -v '^$' |
# TT.tagger: ./TreeTagger/bin/tree-tagger
# TT.opts: -token -lemma -sgml -pt-with-lemma -quiet
# TT.params: ./TreeTagger/lib/english.par
# TT.filter.command: | perl -pe 's/\tV[BDHV]/\tVB/;s/IN\/that/\tIN/;'
#
# sys.tt.call: cat /var/folders/p_/57zxn85107lbm6p75fpfpf080000gn/T//RtmpTcx2XX/tokenize82d91a849087.txt | grep -v '^$' | ./TreeTagger/bin/tree-tagger -token -lemma -sgml -pt-with-lemma -quiet ./TreeTagger/lib/english.par | perl -pe 's/\tV[BDHV]/\tVB/;s/IN\/that/\tIN/;'
The output of test is:
# token tag lemma
# [1,] "This" "DT" "this"
# [2,] "is" "VBZ" "be"
# [3,] "a" "DT" "a"
# [4,] "test " "NN" "<unknown>"
yes, thanks.
hm, this actually looks like a TreeTagger problem. to verify this, could you repeat the call including debug=TRUE and -- without closing the current R session -- run the command shown right after sys.tt.call: in the terminal (beginning with cat)? if that also returns <unknown> for the last lemma, then this needs to be fixed on the TreeTagger level, not koRpus.
Yep, it does look like a TreeTagger issue. If I run:
cat /var/folders/p_/57zxn85107lbm6p75fpfpf080000gn/T//RtmpTcx2XX/tokenize82d977159114.txt | grep -v '^$' | ./TreeTagger/bin/tree-tagger -token -lemma -sgml -pt-with-lemma -quiet ./TreeTagger/lib/english.par | perl -pe 's/\tV[BDHV]/\tVB/;s/IN\/that/\tIN/;'
the output I get is:
This DT this
is VBZ be
a DT a
test NN <unknown>
Thank you for your help in working through this!
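To check whether a stray character is already present in the tokenized temporary file that sys.tt.call pipes into TreeTagger, one option is to look at the last character of each line. The following is only a sketch in base R: last_char_codes is a hypothetical helper, not part of koRpus, and the demo file stands in for the real tokenize*.txt path printed in the debug output.

```r
# Hypothetical helper (not part of koRpus): for every non-empty line of a
# text file, report the Unicode code point of its final character.
last_char_codes <- function(path) {
  # warn = FALSE suppresses the warning for a missing final newline
  lines <- readLines(path, warn = FALSE, encoding = "UTF-8")
  lines <- lines[nzchar(lines)]
  codes <- vapply(lines, function(x) {
    cp <- utf8ToInt(x)      # integer code points of the line
    cp[length(cp)]          # keep only the last one
  }, integer(1), USE.NAMES = FALSE)
  data.frame(text = lines, last_code = codes, stringsAsFactors = FALSE)
}

# Demo on a throwaway file that mimics one token per line,
# with a trailing space after the final token (an assumption for illustration)
demo <- tempfile(fileext = ".txt")
writeLines(c("This", "is", "a", "test "), demo)
res <- last_char_codes(demo)
print(res)
```

A last_code of 32 is an ordinary space; other values (for example 9 for a tab or 160 for a non-breaking space) would point to a different kind of stray character.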
i'll close this issue, since there's nothing to be done with regard to koRpus.
maybe updating your TreeTagger installation can help.
When I run koRpus::treetag, the final word always results in an "<unknown>" lemma even if it should be known. Also, the word doesn't appear to be consistently classified correctly according to POS. If the final word is replaced with something else, the same result happens to the new final word (whereas the previous final word now works fine). In this example, "again" is classified as a noun.
I'm using:
It looks like it may be adding an extra space to the end of the final word (based on the token output and lttr result), but I don't know if this is causing the issue.
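The extra-space suspicion can be checked directly on the token column. The sketch below uses only base R and a hand-copied vector of the tokens from the debug output above (with the trailing space in "test "); it does not call koRpus or TreeTagger, and the variable names are illustrative.

```r
# Tokens copied by hand from the debug output; the last one carries a trailing space
tokens <- c("This", "is", "a", "test ")

# Flag tokens that end in a whitespace character
has_trailing_ws <- grepl("[[:space:]]$", tokens)

# Strip whitespace on the right and compare character counts
trimmed <- trimws(tokens, which = "right")
check <- data.frame(
  token           = tokens,
  nchar_raw       = nchar(tokens),
  nchar_trimmed   = nchar(trimmed),
  has_trailing_ws = has_trailing_ws,
  stringsAsFactors = FALSE
)
print(check)
```

If nchar_raw is larger than nchar_trimmed for the last token only, that would be consistent with the letter count being off by one for the final word. Whether the extra character is also what makes TreeTagger return "<unknown>" is a separate question that this check does not answer.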