hi jen,
the problem seems to be the tokenizing: tokenize() currently doesn't treat most punctuation as part of a word token and splits those characters off wherever they occur (have a look at the resulting tt object).
as a quick workaround, you could use treetag(), which relies on TreeTagger's own tokenizer instead. i'll have to think about a proper solution to this, because tokenize()'s rules have also had advantages over treetag() with other texts in the past.
alternatively, you should be able to import your texts with any tokenizer available to you and then read the result in via readTagged() (available since koRpus 0.12-1).
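roughly, that could look like the sketch below; the TreeTagger path and the file names are placeholders you would need to adapt to your own setup:

```r
library(koRpus)
library(koRpus.lang.en)

## point koRpus at a local TreeTagger installation (path is a placeholder)
set.kRp.env(TT.cmd="~/treetagger/cmd/tree-tagger-english", lang="en")

## let TreeTagger handle the tokenizing instead of tokenize()
tagged <- treetag("your_document.txt")

## or, with koRpus >= 0.12-1: tokenize/tag with any external tool and
## import the result (file name is again just a placeholder)
# tagged <- readTagged("your_document_tagged.txt", lang="en")

ARI(tagged)
flesch.kincaid(tagged)
FOG(tagged)
```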
Our corpus includes documents that contain URLs and, more rarely, sequences of punctuation, and these cause some of the readability measures to fail.
For a minimal example, I use a character vector of length 3: the first element is plain text that should work, to confirm I am calling the functions correctly; the second contains a couple of words and a URL; and the third contains a couple of words and a sequence of punctuation. Our actual documents are much longer, but these examples reproduce the issues we have encountered. Below is the output for ARI, flesch.kincaid, and FOG: ARI works in all three cases, while the other two fail in different ways. flesch.kincaid fails on the first two examples and works on the third; FOG works on the first and third examples. A sketch of the calls follows.
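The example strings here are stand-ins rather than our actual documents, but they have the same shape as the texts that trigger the problem:

```r
library(koRpus)
library(koRpus.lang.en)

## stand-in texts: element 1 is plain prose, element 2 adds a URL,
## element 3 adds a run of punctuation (our real documents are much longer)
texts <- c(
  "This is a short plain sentence that should pose no problems at all.",
  "More details are available at https://www.example.com/docs/page.html today.",
  "Well that went badly !!??!! indeed."
)

for (txt in texts) {
  tagged <- tokenize(txt, format="obj", lang="en")
  ## try() keeps the loop running when a measure errors out
  try(print(ARI(tagged)))
  try(print(flesch.kincaid(tagged)))
  try(print(FOG(tagged)))
}
```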
The transcript:
Automated Readability Index (ARI)
  Parameters: default
       Grade: -8.65

Text language: en
Warning message:
Text is relatively short (<100 tokens), results are probably not reliable!

Automated Readability Index (ARI)
  Parameters: default
       Grade: 19.71

Text language: en
Warning message:
Text is relatively short (<100 tokens), results are probably not reliable!

Automated Readability Index (ARI)
  Parameters: default
       Grade: -5.8

Text language: en
Warning message:
Text is relatively short (<100 tokens), results are probably not reliable!

Flesch-Kincaid Grade Level
  Parameters: default
       Grade: -2.62
         Age: 2.38

Text language: en
Warning message:
Text is relatively short (<100 tokens), results are probably not reliable!

Gunning Frequency of Gobbledygook (FOG)
  Parameters: default
       Grade: 0.8

Text language: en
Warning message:
Text is relatively short (<100 tokens), results are probably not reliable!

Gunning Frequency of Gobbledygook (FOG)
  Parameters: default
       Grade: 1.2

Text language: en
Warning message:
Text is relatively short (<100 tokens), results are probably not reliable!
My sessionInfo():

Matrix products: default
BLAS:   r-project/R-3.6.2/lib/libRblas.so
LAPACK: r-project/R-3.6.2/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] koRpus.lang.en_0.1-3 koRpus_0.11-5        sylly_0.1-5

loaded via a namespace (and not attached):
[1] compiler_3.6.2    tools_3.6.2       data.table_1.12.8 sylly.en_0.1-3
I hope this information is helpful. Thank you for your time.
Jen