unDocUMeantIt / koRpus

An R Package for Text Analysis
GNU General Public License v3.0

Incorrect lemma and tag/wclass for final word when using koRpus::treetag #16

Closed nlanderson9 closed 5 years ago

nlanderson9 commented 5 years ago

When I run koRpus::treetag, the final word always gets an "<unknown>" lemma, even when it should be known. The word's part of speech (tag/wclass) also doesn't appear to be classified consistently.

text = "This is a test"
test = treetag(text, treetagger = "manual", format = "obj", TT.tknz=FALSE, lang="en", TT.options = list(path="./TreeTagger", preset="en"))

#   doc_id token tag     lemma lttr     wclass desc stop stem idx sntc
# 1   <NA>  This  DT      this    4 determiner   NA   NA   NA   1   NA
# 2   <NA>    is VBZ        be    2       verb   NA   NA   NA   2   NA
# 3   <NA>     a  DT         a    1 determiner   NA   NA   NA   3   NA
# 4   <NA> test   NN <unknown>    5       noun   NA   NA   NA   4   NA

If the final word is replaced with something else, the same thing happens to the new final word, while the previous final word is now handled correctly. In this example, "again" is classified as a noun.

text = "This is a test again"

#   doc_id  token tag     lemma lttr     wclass desc stop stem idx sntc
# 1   <NA>   This  DT      this    4 determiner   NA   NA   NA   1   NA
# 2   <NA>     is VBZ        be    2       verb   NA   NA   NA   2   NA
# 3   <NA>      a  DT         a    1 determiner   NA   NA   NA   3   NA
# 4   <NA>   test  NN      test    4       noun   NA   NA   NA   4   NA
# 5   <NA> again   NN <unknown>    6       noun   NA   NA   NA   5   NA

I'm using:

sessionInfo()
# R version 3.5.2 (2018-12-20)
# Platform: x86_64-apple-darwin15.6.0 (64-bit)
# Running under: macOS Mojave 10.14.2

packageVersion("koRpus")
# [1] ‘0.11.5’

It looks like an extra space may be getting added to the end of the final word (based on the token output and the lttr result), but I don't know whether this is causing the issue.
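The trailing-space suspicion can be checked directly in R. The following is a minimal sketch under stated assumptions: the data frame `tagged` is a hand-built stand-in for the relevant columns of the output above, not the object returned by `treetag()`, and the name `suspicious` is only illustrative. It compares the reported `lttr` value with the character count of each token after surrounding whitespace is stripped.

```r
# Hand-built stand-in for the token and lttr columns shown above
# (assumption: the last token carries the suspected trailing space).
tagged <- data.frame(
  token = c("This", "is", "a", "test "),
  lttr  = c(4L, 2L, 1L, 5L),
  stringsAsFactors = FALSE
)

# Character count of each token once surrounding whitespace is removed.
trimmed_len <- nchar(trimws(tagged$token))

# Rows whose reported length exceeds the trimmed length
# contain extra whitespace in the token.
suspicious <- tagged[tagged$lttr > trimmed_len, ]
print(suspicious)
```

If only the final row is flagged, that would be consistent with the extra character being whitespace attached to the last token rather than a miscount elsewhere.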

unDocUMeantIt commented 5 years ago

i can't replicate this on linux, where i get this:

  doc_id token tag lemma lttr     wclass desc stop stem idx sntc
1   <NA>  This  DT  this    4 determiner   NA   NA   NA   1   NA
2   <NA>    is VBZ    be    2       verb   NA   NA   NA   2   NA
3   <NA>     a  DT     a    1 determiner   NA   NA   NA   3   NA
4   <NA>  test  NN  test    4       noun   NA   NA   NA   4   NA

can you please set debug=TRUE and post the content of the resulting test object again? it should contain the raw results of TreeTagger, so we can try to narrow down the issue.

nlanderson9 commented 5 years ago

Hopefully this is what you're looking for! Thanks for your help.

text = "This is a test"
test = treetag(text, treetagger = "manual", format = "obj", TT.tknz=FALSE, lang="en", debug=TRUE, TT.options = list(path="./TreeTagger", preset="en"))
# split=[[:space:]]
#         ign.comp=-
#         heuristics=abbr
#         heur.fix=c("’", "'"), c("’", "'")
#         sentc.end=., !, ?, ;, :
#         detect=FALSE, FALSE
#         clean.raw=
#         perl=FALSE
#         stopwords=
#         stemmer=
#  
#         TT.tokenizer:  koRpus::tokenize() 
#               tempfile: /var/folders/p_/57zxn85107lbm6p75fpfpf080000gn/T//RtmpTcx2XX/tokenize82d91a849087.txt 
#         file:  /var/folders/p_/57zxn85107lbm6p75fpfpf080000gn/T//RtmpTcx2XX/tempTextFromObject82d93dd7ac99.txt 
#         TT.lookup.command:   
#         TT.pre.tagger:  grep -v '^$' | 
#         TT.tagger:  ./TreeTagger/bin/tree-tagger 
#         TT.opts:  -token -lemma -sgml -pt-with-lemma -quiet 
#         TT.params:  ./TreeTagger/lib/english.par 
#         TT.filter.command:  | perl -pe 's/\tV[BDHV]/\tVB/;s/IN\/that/\tIN/;' 
# 
#         sys.tt.call:  cat  /var/folders/p_/57zxn85107lbm6p75fpfpf080000gn/T//RtmpTcx2XX/tokenize82d91a849087.txt |  grep -v '^$' | ./TreeTagger/bin/tree-tagger -token -lemma -sgml -pt-with-lemma -quiet ./TreeTagger/lib/english.par | perl -pe 's/\tV[BDHV]/\tVB/;s/IN\/that/\tIN/;' 

The output of test is:

#      token   tag   lemma      
# [1,] "This"  "DT"  "this"     
# [2,] "is"    "VBZ" "be"       
# [3,] "a"     "DT"  "a"        
# [4,] "test " "NN"  "<unknown>"

unDocUMeantIt commented 5 years ago

yes, thanks.

hm, this actually looks like a TreeTagger problem. to verify this, could you repeat the call including debug=TRUE and -- without closing the current R session -- run the command shown right after sys.tt.call: in the terminal (beginning with cat)? if that also returns <unknown> for the last lemma, then this needs to be fixed on the TreeTagger level, not in koRpus.

nlanderson9 commented 5 years ago

Yep, it does look like a TreeTagger issue. If I run:

cat  /var/folders/p_/57zxn85107lbm6p75fpfpf080000gn/T//RtmpTcx2XX/tokenize82d977159114.txt |  grep -v '^$' | ./TreeTagger/bin/tree-tagger -token -lemma -sgml -pt-with-lemma -quiet ./TreeTagger/lib/english.par | perl -pe 's/\tV[BDHV]/\tVB/;s/IN\/that/\tIN/;'

the output I get is:

This    DT      this
is      VBZ     be
a       DT      a
test    NN      <unknown>

Thank you for your help in working through this!
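Because the same pipeline reproduces <unknown> outside of R, the file that `cat` feeds into TreeTagger is a natural next thing to look at. The thread does not do this, so the following is only a sketch of how such a check might look: `has_trailing_ws` is a hypothetical helper, and `demo` is a throwaway file with made-up content standing in for the tokenize*.txt temp file named in the debug output. It flags lines that end in whitespace and prints the last raw bytes of the file.

```r
# Hypothetical helper: TRUE for each line ending in whitespace
# (space, tab, or carriage return).
has_trailing_ws <- function(lines) {
  grepl("[[:space:]]$", lines)
}

# Throwaway demo file standing in for the tokenizer's temp file
# (assumption: one token per line, the last one followed by a space).
demo <- tempfile(fileext = ".txt")
writeLines(c("This", "is", "a", "test "), demo)

# Lines that end in whitespace.
lines <- readLines(demo)
print(lines[has_trailing_ws(lines)])

# Raw bytes at the end of the file:
# 20 is a space, 0d a carriage return, 0a a newline.
bytes <- readBin(demo, what = "raw", n = file.size(demo))
print(tail(bytes, 4))

unlink(demo)
```

To run the check against the real input, `demo` would be replaced by the path printed after `tempfile:` in the debug output, while the R session that created it is still open.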

unDocUMeantIt commented 5 years ago

i'll close this issue, since there's nothing to be done with regard to koRpus.

maybe updating your TreeTagger installation can help.