unDocUMeantIt / koRpus

An R Package for Text Analysis
GNU General Public License v3.0

character vector "measure" seems to be ignored by lex.div; Error in x[["end"]] : subscript out of bounds; Error in 1:lastValidIndex : result would be too long a vector #35

Closed tobleu closed 3 years ago

tobleu commented 3 years ago

I want to calculate lexical diversity for several texts with koRpus' lex.div function. I am using the options "keep.tokens=TRUE, type.index=TRUE"; the texts are relatively short (10-150 words). From time to time I get error messages like this:

MTLDMA.char: Calculate MTLD-MA values
  |=====================================                                     |  50%
Error in 1:lastValidIndex : result would be too long a vector
In addition: Warning messages:
1: Text is relatively short (<100 tokens), results are probably not reliable! 
2: MSTTR: Skipped calculation, segment size is 100, but the text has only 70 tokens! 
3: MATTR: Skipped calculation, window size is 100, but the text has only 70 tokens! 
4: In min(which(all.factorEnds > curr.token)) :
  no non-missing arguments to min; returning Inf

The affected file is here: uF04.txt. It was tagged with TreeTagger, and the tagged results were then passed to lex.div.

While trying to avoid these errors, I reran the analysis for the failed calculations with different parameters, such as:

keep.tokens=TRUE, type.index=TRUE, measure=c("TTR", "MSTTR", "MATTR", "C", "R", "CTTR", "U", "S", "K", "Maas", "HD-D", "MTLD")

which results in a different error (even with the window and segment sizes additionally reduced to 20):

MTLD.char: Calculate MTLD values
  |==========================================================================| 100%
Error in x[["end"]] : subscript out of bounds
In addition: Warning messages:
1: Text is relatively short (<100 tokens), results are probably not reliable! 
2: MSTTR: Skipped calculation, segment size is 100, but the text has only 70 tokens! 
3: MATTR: Skipped calculation, window size is 100, but the text has only 70 tokens! 

Reducing the set of measures to a minimum (even just "TTR") still produces the same error messages, along with progress bars for all the measures that should have been excluded.

Unfortunately I can't trace the error, so I need your help. Thanks a lot in advance!

unDocUMeantIt commented 3 years ago

the problem lies in the char argument in combination with a very short text. if you set char=FALSE you should be fine.

the text is way too short for a proper calculation of data for characteristic plots. the first general warning about texts shorter than 100 tokens is there for a reason.
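
a minimal sketch of that workaround, not a tested fix. it assumes the text is German, that the koRpus.lang.de language package is installed, and that the file is tokenized with koRpus' own tokenize() instead of TreeTagger; only the file name comes from the report. the characteristic-curve data is controlled by the char argument, which is separate from measure, so restricting measure alone does not stop the *.char progress bars:

```r
library(koRpus)
library(koRpus.lang.de)  # assumption: the text is German

# Assumption: built-in tokenizer instead of the reporter's TreeTagger setup.
tagged <- tokenize("uF04.txt", format = "file", lang = "de")

ld <- lex.div(
  tagged,
  # measures to report; MSTTR and MATTR are left out because the default
  # segment/window size of 100 exceeds the length of a 70-token text
  measure = c("TTR", "C", "R", "CTTR", "U", "S", "K", "Maas", "HD-D", "MTLD"),
  char = FALSE,        # skip the characteristic-curve calculations
  keep.tokens = TRUE,
  type.index = TRUE
)

ld  # print the resulting lexical diversity object
```

the general warning about texts shorter than 100 tokens should still appear, and the resulting values should be read with that caveat in mind.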

i think it's perhaps time for me to drop this calculation from the defaults, as most users shouldn't need it anyway.

unDocUMeantIt commented 3 years ago

fixed with unDocUMeantIt/koRpus@3e6f04b599617b459f26b9bf8cda7dc5650c621b