hi,
so far i'm not able to reproduce the issue running the example (with two small corrections: i replaced the relative path to the treetagger file and added a missing quote ending the sample text).
can you give some more details on your setup, e.g.
- operating system
- version of R
- versions of koRpus, koRpus.lang.fr, sylly, and sylly.lang.fr
- version of TreeTagger

can you examine tagged.results, does it look like you'd expect? e.g., have a look at tagged.results[["token"]], are there any indications of problematic character encodings?

is there a particular reason for not using UTF-8 as the encoding? recent versions of R use UTF-8 as the default encoding, so i don't yet see why you force Latin1 on the character string.
also, as long as you haven't done any custom changes to the tree-tagger-french script, you should be fine with calling

set.kRp.env(TT.cmd="manual", TT.options=list(path="<PATH TO TREETAGGER>", preset="fr"), lang="fr")

with the proper path replaced, of course ;)
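for example, on macOS that could look like this (the path below is only a placeholder, use your own TreeTagger installation directory):

```r
library(koRpus.lang.fr)
# configure koRpus to call TreeTagger manually with the french preset
set.kRp.env(
  TT.cmd="manual",
  TT.options=list(path="~/TreeTagger", preset="fr"),
  lang="fr"
)
# sanity check: show what koRpus will actually use
get.kRp.env(TT.cmd=TRUE)
get.kRp.env(TT.options=TRUE)
```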
on a side note:
I understand that some data structure was modified. For example, treetag()'s output used to be named tagged.txt@TT.res, and it has been changed to tagged.txt@tokens
that is correct. however, unless you are manipulating those object slots manually, this shouldn't really cause any problems for you. when you call readability(), it's from the same koRpus package version as treetag(). users should actually never have to worry about changing object structures, but use the getter/setter methods to access the respective data. when the object structure changes internally, the methods are updated as well, so your code keeps working and you still get the expected results. see for example ?taggedText for a list of available methods.
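to illustrate with the tagged.results object from your example:

```r
# use the getter method instead of reaching into the @tokens slot directly
tokens_df <- taggedText(tagged.results)
head(tokens_df[["token"]])
```

this keeps working even if the internal slot names change between package versions.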
Hi there,
thank you for your prompt answer, here are some details/answers to your comments:
can you give some more details on your setup, e.g. operating system
MacOS High Sierra version 10.13.6
version of R
R version 4.0.4 (2021-02-15) -- "Lost Library Book" running with RStudio version 1.1.453
versions of koRpus, koRpus.lang.fr, sylly, and sylly.lang.fr
koRpus version 0.13-5, koRpus.lang.fr version 0.1-2, sylly version 0.1-6, sylly.fr version 0.1-2
version of TreeTagger
TreeTagger for MacOSX version 3.2.3
can you examine tagged.results, does it look like you'd expect? e.g., have a look at tagged.results[["token"]], are there any indications of problematic character encodings?
Yes, it looks OK to me, but just to be sure, I attached a screenshot of the tagged.results structure for you to have a look.
Also, here is the tagged.results@tokens data frame:
structure(list(doc_id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), class = "factor", .Label = "1"), token = c("La", "salive", "est", "un", "liquide", "produit", "par", "des", "glandes", "spéciales", "situées", "à", "plusieurs", "endroits", "dans", "la", "bouche", ".", "Elle", "permet", "d'", "enrober", "les", "aliments", "d'", "eau", "afin", "de", "permettre", "un", "passage", "facile", "dans", "l'", "œsophage", "et", "de", "faciliter", "le", "travail", "de", "digestion", "dans", "l'", "estomac", ".", "La", "salive", "contient", "aussi", "une", "enzyme", "qui", "permet", "de", "commencer", "la", "digestion", "de", "l'", "amidon", "des", "plantes", ".", "Mais", "il", "faut", "que", "l'", "aliment", "qui", "contient", "l'", "amidon", "soit", "cuit", "car", "il", "n'", "est", "pas", "possible", "de", "digérer", "l'", "amidon", "cru", ".", "L'", "amidon", "est", "une", "forme", "de", "sucre", "trop", "gros", "pour", "être", "absorbé", "directement", "dans", "les", "intestins", ".", "Il", "faut", "donc", "séparer", "un", "à", "un", "ses", "composants", ";", "c'", "est", "le", "travail", "de", "cette", "enzyme", "appelée", "amylase", ".", "En", "mastiquant", "du", "pain", ",", "qui", "contient", "de", "l'", "amidon", ",", "pendant", "quelques", "instants", ",", "on", "s'", "aperçoit", "que", "le", "goût", "devient", "sucré", ":", "l'", "amylase", "a", "commencé", "son", "travail", "et", "des", "molécules", "de", "sucre", "sont", "libérées", "."), tag = structure(c(4L, 9L, 27L, 4L, 9L, 25L, 17L, 18L, 9L, 2L, 25L, 17L, 13L, 9L, 17L, 4L, 9L, 65L, 14L, 27L, 17L, 24L, 4L, 9L, 17L, 9L, 7L, 17L, 24L, 4L, 9L, 2L, 17L, 4L, 9L, 7L, 17L, 24L, 4L, 9L, 17L, 9L, 17L, 4L, 9L, 65L, 4L, 9L, 27L, 3L, 4L, 9L, 16L, 27L, 17L, 24L, 4L, 9L, 17L, 4L, 9L, 18L, 9L, 65L, 7L, 14L, 27L, 7L, 4L, 9L, 16L, 27L, 4L, 9L, 30L, 25L, 7L, 14L, 3L, 27L, 3L, 2L, 17L, 24L, 4L, 9L, 25L, 65L, 4L, 9L, 27L, 4L, 9L, 17L, 9L, 3L, 2L, 17L, 24L, 25L, 3L, 17L, 4L, 9L, 65L, 14L, 27L, 3L, 24L, 10L, 17L, 10L, 5L, 9L, 54L, 12L, 27L, 4L, 9L, 17L, 12L, 9L, 2L, 9L, 65L, 17L, 26L, 18L, 9L, 54L, 16L, 27L, 17L, 4L, 9L, 54L, 17L, 13L, 9L, 54L, 14L, 14L, 27L, 7L, 4L, 9L, 27L, 25L, 54L, 4L, 9L, 27L, 25L, 5L, 9L, 7L, 18L, 9L, 17L, 9L, 27L, 25L, 65L), .Label = c("ABR", "ADJ", "ADV", "DET:ART", "DET:POS", "INT", "KON", "NAM", "NOM", "NUM", "PRO", "PRO:DEM", "PRO:IND", "PRO:PER", "PRO:POS", "PRO:REL", "PRP", "PRP:det", "SYM", "VER:cond", "VER:futu", "VER:impe", "VER:impf", "VER:infi", "VER:pper", "VER:ppre", "VER:pres", "VER:simp", "VER:subi", "VER:subp", "word.kRp", "no.kRp", "abbr.kRp", "unk.kRp", "ADP", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "PART", "PRON", "PROPN", "SCONJ", "VERB", "X", "#", "$", "''", "(", ")", ",", ":", "PUN", "PUN:cit", "``", ",kRp", "(kRp", ")kRp", "''kRp", "-kRp", "hon.kRp", "p.kRp", "PUNCT", "SENT", ".kRp", "hoff.kRp"), class = "factor"), lemma = c("le", "salive", "être", "un", "liquide", "produire", "par", "du", "glande", 
"spécial", "situer", "à", "plusieurs", "endroit", "dans", "le", "bouche", ".", "elle", "permettre", "de", "enrober", "le", "aliment", "de", "eau", "afin", "de", "permettre", "un", "passage", "facile", "dans", "le", "œsophage", "et", "de", "faciliter", "le", "travail", "de", "digestion", "dans", "le", "estomac", ".", "le", "salive", "contenir", "aussi", "un", "enzyme", "qui", "permettre", "de", "commencer", "le", "digestion", "de", "le", "amidon", "du", "plante", ".", "mais", "il", "falloir", "que", "le", "aliment", "qui", "contenir", "le", "amidon", "être", "cuire", "car", "il", "ne", "être", "pas", "possible", "de", "digérer", "le", "amidon", "croire", ".", "le", "amidon", "être", "un", "forme", "de", "sucre", "trop", "gros", "pour", "être", "absorber", "directement", "dans", "le", "intestin", ".", "il", "falloir", "donc", "séparer", "un", "à", "un", "son", "composant", ";", "ce", "être", "le", "travail", "de", "ce", "enzyme", "appelé", "amylase", ".", "en", "mastiquer", "du", "pain", ",", "qui", "contenir", "de", "le", "amidon", ",", "pendant", "quelque", "instant", ",", "on", "se", "apercevoir", "que", "le", "goût", "devenir", "sucrer", ":", "le", "amylase", "avoir", "commencer", "son", "travail", "et", "du", "molécule", "de", "sucre", "être", "libérer", "."), lttr = c(2L, 6L, 3L, 2L, 7L, 7L, 3L, 3L, 7L, 9L, 7L, 1L, 9L, 8L, 4L, 2L, 6L, 1L, 4L, 6L, 2L, 7L, 3L, 8L, 2L, 3L, 4L, 2L, 9L, 2L, 7L, 6L, 4L, 2L, 8L, 2L, 2L, 9L, 2L, 7L, 2L, 9L, 4L, 2L, 7L, 1L, 2L, 6L, 8L, 5L, 3L, 6L, 3L, 6L, 2L, 9L, 2L, 9L, 2L, 2L, 6L, 3L, 7L, 1L, 4L, 2L, 4L, 3L, 2L, 7L, 3L, 8L, 2L, 6L, 4L, 4L, 3L, 2L, 2L, 3L, 3L, 8L, 2L, 7L, 2L, 6L, 3L, 1L, 2L, 6L, 3L, 3L, 5L, 2L, 5L, 4L, 4L, 4L, 4L, 7L, 11L, 4L, 3L, 9L, 1L, 2L, 4L, 4L, 7L, 2L, 1L, 2L, 3L, 10L, 1L, 2L, 3L, 2L, 7L, 2L, 5L, 6L, 7L, 7L, 1L, 2L, 10L, 2L, 4L, 1L, 3L, 8L, 2L, 2L, 6L, 1L, 7L, 8L, 8L, 1L, 2L, 2L, 8L, 3L, 2L, 4L, 7L, 5L, 1L, 2L, 7L, 1L, 8L, 3L, 7L, 2L, 3L, 9L, 2L, 5L, 4L, 8L, 1L), wclass = structure(c(4L, 9L, 13L, 4L, 9L, 13L, 11L, 11L, 9L, 2L, 13L, 11L, 5L, 9L, 11L, 4L, 9L, 24L, 5L, 13L, 11L, 13L, 4L, 9L, 11L, 9L, 7L, 11L, 13L, 4L, 9L, 2L, 11L, 4L, 9L, 7L, 11L, 13L, 4L, 9L, 11L, 9L, 11L, 4L, 9L, 24L, 4L, 9L, 13L, 3L, 4L, 9L, 5L, 13L, 11L, 13L, 4L, 9L, 11L, 4L, 9L, 11L, 9L, 24L, 7L, 5L, 13L, 7L, 4L, 9L, 5L, 13L, 4L, 9L, 13L, 13L, 7L, 5L, 3L, 13L, 3L, 2L, 11L, 13L, 4L, 9L, 13L, 24L, 4L, 9L, 13L, 4L, 9L, 11L, 9L, 3L, 2L, 11L, 13L, 13L, 3L, 11L, 4L, 9L, 24L, 5L, 13L, 3L, 13L, 10L, 11L, 10L, 5L, 9L, 22L, 5L, 13L, 4L, 9L, 11L, 5L, 9L, 2L, 9L, 24L, 11L, 13L, 11L, 9L, 22L, 5L, 13L, 11L, 4L, 9L, 22L, 11L, 5L, 9L, 22L, 5L, 5L, 13L, 7L, 4L, 9L, 13L, 13L, 22L, 4L, 9L, 13L, 13L, 5L, 9L, 7L, 11L, 9L, 11L, 9L, 13L, 13L, 24L), .Label = c("abbreviation", "adjective", "adverb", "article", "pronoun", "interjection", "conjunction", "name", "noun", "numeral", "preposition", "symbol", "verb", "word", "number", "unknown", "adposition", "auxiliary", "determiner", "particle", "other", "punctuation", "comma", "fullstop"), class = "factor"), desc = structure(c(4L, 9L, 27L, 4L, 9L, 25L, 17L, 18L, 9L, 2L, 25L, 17L, 13L, 9L, 17L, 4L, 9L, 63L, 14L, 27L, 17L, 24L, 4L, 9L, 17L, 9L, 7L, 17L, 24L, 4L, 9L, 2L, 17L, 4L, 9L, 7L, 17L, 24L, 4L, 9L, 17L, 9L, 17L, 4L, 9L, 63L, 4L, 9L, 27L, 3L, 4L, 9L, 16L, 27L, 17L, 24L, 4L, 9L, 17L, 4L, 9L, 18L, 9L, 63L, 7L, 14L, 27L, 7L, 4L, 9L, 16L, 27L, 4L, 9L, 30L, 25L, 7L, 14L, 3L, 27L, 3L, 2L, 17L, 24L, 4L, 9L, 25L, 63L, 4L, 9L, 27L, 4L, 9L, 17L, 9L, 3L, 2L, 17L, 24L, 25L, 3L, 17L, 4L, 9L, 63L, 14L, 27L, 3L, 24L, 10L, 17L, 10L, 5L, 9L, 52L, 12L, 27L, 4L, 
9L, 17L, 12L, 9L, 2L, 9L, 63L, 17L, 26L, 18L, 9L, 52L, 16L, 27L, 17L, 4L, 9L, 52L, 17L, 13L, 9L, 52L, 14L, 14L, 27L, 7L, 4L, 9L, 27L, 25L, 52L, 4L, 9L, 27L, 25L, 5L, 9L, 7L, 18L, 9L, 17L, 9L, 27L, 25L, 63L), .Label = c("abreviation", "adjective", "adverb", "article", "possessive pronoun (ma, ta, ...)", "interjection", "conjunction", "proper name", "noun", "numeral", "pronoun", "demonstrative pronoun", "indefinite pronoun", "personal pronoun", "possessive pronoun (mien, tien, ...)", "relative pronoun", "preposition", "preposition plus article (au,du,aux,des)", "symbol", "verb conditional", "verb futur", "verb imperative", "verb imperfect", "verb infinitive", "verb past participle", "verb present participle", "verb present", "verb simple past", "verb subjunctive imperfect", "verb subjunctive present", "Word (kRp internal)", "Number (kRp internal)", "Abbreviation (kRp internal)", "Unknown (kRp internal)", "Adposition (universal POS tags)", "Auxiliary (universal POS tags)", "Coordinating conjunction (universal POS tags)", "Determiner (universal POS tags)", "Interjection (universal POS tags)", "Noun (universal POS tags)", "Particle (universal POS tags)", "Pronoun (universal POS tags)", "Proper noun (universal POS tags)", "Subordinating conjunction (universal POS tags)", "Verb (universal POS tags)", "Not assigned a real POS category (universal POS tags)", "Punctuation", "End quote", "Opening bracket", "Closing bracket", "Comma", "punctuation", "punctuation citation", "Quote", "Comma (kRp internal)", "Opening bracket (kRp internal)", "Closing bracket (kRp internal)", "Quote (kRp internal)", "Punctuation (kRp internal)", "Headline begins (kRp internal)", "Paragraph (kRp internal)", "Punctuation (universal POS tags)", "Sentence ending punctuation", "Sentence ending punctuation (kRp internal)", "Headline ends (kRp internal)"), class = "factor"), stop = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), stem = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), idx = 1:163, sntc = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L)), row.names = c(NA, -163L), class = "data.frame")
is there a particular reason for not using UTF-8 as the encoding? recent versions of R use UTF-8 as the default encoding, so i don't yet see why you force Latin1 on the character string.
Over the years, I have encountered many quirks when using R with French text. Forcing Latin1 is the best way I found so far to manage accents properly.
also, as long as you haven't done any custom changes to the tree-tagger-french script, you should be fine with calling set.kRp.env(TT.cmd="manual", TT.options=list(path="<PATH TO TREETAGGER>", preset="fr"), lang="fr")
Thanks for the tip, I used to set my wd to TreeTagger's path, but your solution works better indeed :-)
unless you are manipulating those object slots somehow manually, this shouldn't really cause any problems for you. when you call readability(), it's from the same koRpus package version as treetag(). users should actually never have to worry about changing object structures, but use getter/setter methods to access the respective data. when the object structure changes internally, the methods are also updated and you can still use your code and get the expected results. see for example ?taggedText for a list of available methods.
Thanks for this one as well. I used to call tagged.txt@tokens to get a data frame with the output results that I could then modify, but I just switched to the neater way using taggedText().
Also, here are a couple of functions I tried to run in order to identify the problem. Maybe the outputs/error messages will be informative to you...
readability(tagged.results)

Hyphenation (language: fr)
Error in if (raw >= 90) { : argument is of length zero
In addition: Warning messages:
1: Bormuth: Missing word list, hence not calculated.
2: Dale-Chall: Missing word list, hence not calculated.
ARI(tagged.results)

Automated Readability Index (ARI)
Parameters: default
Grade:

Text language: fr
lex.div(tagged.results)

Language: "fr"
MTLDMA: Calculate MTLD-MA values
TTR.char: Calculate TTR values
C.char: Calculate C values
R.char: Calculate R values
CTTR.char: Calculate CTTR values
U.char: Calculate U values
S.char: Calculate S values
Maas.char: Calculate Maas values
lgV0.char: Calculate lgV0 values
lgeV0.char: Calculate lgeV0 values
K.char: Calculate K values
HDD.char: Calculate HD-D values
MTLD.char: Calculate MTLD values
MTLDMA.char: Calculate MTLD-MA values
Error in 1:lastValidIndex : result would be too long a vector
In addition: Warning message:
In min(which(all.factorEnds > curr.token)) : no non-missing arguments to min; returning Inf
ok, so all packages are up to date, that's a good start.
it's really odd that you don't get any output from ARI(), because the error you get from readability() seems to be triggered by the Flesch formula (trying to look up the grading from the raw value). therefore i assume something goes really wrong early on in the readability calculations.
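just to illustrate the mechanics of that error (a simplified sketch, flesch_grade() is made up for illustration and not the actual koRpus code): if the raw Flesch value ends up empty, the grade lookup has nothing to compare.

```r
# simplified sketch of a grade lookup on a raw Flesch score
flesch_grade <- function(raw) {
  if (raw >= 90) {
    "very easy"
  } else if (raw >= 60) {
    "standard"
  } else {
    "difficult"
  }
}
flesch_grade(75)          # works: "standard"
flesch_grade(numeric(0))  # Error in if (raw >= 90) { : argument is of length zero
```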
so far i don't have any hypothesis as to what could be the issue. the errors don't look at all familiar.
we'll have to track down the step where it fails. there are a few things you could try:
- call hyphen() on the tagged text -- is that still ok, or do the problems start here already?
- install the previous release and check whether the error goes away: devtools::install_github("unDocUMeantIt/koRpus", ref="0.13-1")
hopefully at least one of these steps leads to a different outcome, and we can go on from there.
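roughly, those two checks could look like this in your session (hyph.results is just an illustrative name):

```r
# step 1: hyphenate the tagged text and see whether this step already fails
hyph.results <- hyphen(tagged.results)
hyph.results  # does it show plausible syllable counts for the french tokens?

# step 2: if hyphenation looks fine, temporarily go back to the older release
# and re-run your readability() call
# devtools::install_github("unDocUMeantIt/koRpus", ref="0.13-1")
```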
Hi,
I am sorry it took me so long to reply to your last comment.
Thanks for all your debugging suggestions, I tried all of these tests and none of them was conclusive.
However, I think I found the cause of the bug (even though I don't really understand why...)
When doc_id is set as an integer, it causes ARI() to return no results. Same with readability() and a bunch of other functions such as FOG(), FORECAST(), ...
tagged.results_bug <- treetag(as.character("Les arbres ont l'air plus bleu que vert. Je pense que ce sont des fleurs."), format = "obj", apply.sentc.end = TRUE, sentc.end = c(".", "!", "?"), add.desc = TRUE, doc_id = 1)
ARI(tagged.results_bug)

Automated Readability Index (ARI)
Parameters: default
Grade:

Text language: fr
Warning message:
Text is relatively short (<100 tokens), results are probably not reliable!
The bug is resolved when doc_id is set as a character (or not called at all)
tagged.results_ok <- treetag(as.character("Les arbres ont l'air plus bleu que vert. Je pense que ce sont des fleurs."), format = "obj", apply.sentc.end = TRUE, sentc.end = c(".", "!", "?"), add.desc = TRUE, doc_id = as.character(1))
ARI(tagged.results_ok)

Automated Readability Index (ARI)
Parameters: default
Grade: -0.95

Text language: fr
Warning message:
Text is relatively short (<100 tokens), results are probably not reliable!
When doc_id is set as an integer, it causes ARI() to return no results. Same with readability() and a bunch of other functions such as FOG(), FORECAST(), ...
so, did you manually set doc_id to integer values? the sample code from your first post didn't do so.
as the documentation for treetag() explains, doc_id is expected to be a character string, but so far there were no proper checks for that. i hope this commit fixes the issue: unDocUMeantIt/koRpus@a67dbe7406875088c448ee07f9afa270c56505a0
you can install from the current develop branch to check it out:
devtools::install_github("unDocUMeantIt/koRpus", ref="develop")
Hi there,
thanks for the updated code.
In my original code doc_id was set to an integer because it corresponds to the iteration of a for loop (I am running TreeTagger on a large set of texts, each one processed independently per loop iteration).
When I created my dummy example, I got rid of this parameter for the sake of simplicity, and did not realize that it also eliminated the bug. Sorry about that!
Thanks again for the help though!! And thanks for your great package
In my original code doc_id was set to an integer because it corresponds to the iteration of a for{} loop (I am running treetagger on a large set of texts, each one independently per loop iteration).
check out the tm.plugin.koRpus package -- it adds classes and methods for tagging and analyzing whole corpora of texts.
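in the meantime, if you keep your loop, coercing the counter should avoid the problem on the current release as well (my.texts is just a placeholder for however you store your texts):

```r
# hypothetical loop over a collection of texts; the important part is
# passing doc_id as a character string, not as the integer counter itself
results <- vector("list", length(my.texts))
for (i in seq_along(my.texts)) {
  results[[i]] <- treetag(
    my.texts[[i]],
    format = "obj",
    doc_id = as.character(i)  # not doc_id = i
  )
}
```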
Hi there,
first, thanks for this great package! I have been using it for more than a year for my research and it is very useful.
However, since I updated to the latest version, I can't get readability() to run anymore.
I understand that some data structure was modified. For example, treetag()'s output used to be named tagged.txt@TT.res, and it has been changed to tagged.txt@tokens. So maybe, I am simply not calling the kRp.text object properly when using readability()...
Would you have some insight??
Below is a simple reproducible version of my code:
Thank you very much for your help!
```r
# installs and loads Fr package
install.koRpus.lang("fr")
library("koRpus.lang.fr")

# sets environment for treetagger - FRENCH VERSION
set.kRp.env(TT.cmd = "cmd/tree-tagger-french", lang="fr", format = "obj", encoding = "Latin1")

# runs TreeTagger on a short text
tagged.results <- treetag("La salive est un liquide produit par des glandes spéciales situées à plusieurs endroits dans la bouche. Elle permet d'enrober les aliments d'eau afin de permettre un passage facile dans l'œsophage et de faciliter le travail de digestion dans l'estomac. La salive contient aussi une enzyme qui permet de commencer la digestion de l'amidon des plantes. Mais il faut que l'aliment qui contient l'amidon soit cuit car il n'est pas possible de digérer l'amidon cru. L'amidon est une forme de sucre trop gros pour être absorbé directement dans les intestins. Il faut donc séparer un à un ses composants ; c'est le travail de cette enzyme appelée amylase. En mastiquant du pain, qui contient de l'amidon, pendant quelques instants, on s'aperçoit que le goût devient sucré : l'amylase a commencé son travail et des molécules de sucre sont libérées.", format = "obj", apply.sentc.end = TRUE, sentc.end = ".", add.desc = TRUE)

# runs readability indexes estimation
readability(tagged.results)
```
Here is the ERROR MESSAGE that I get:
Error in if (raw >= 90) { : argument is of length zero