unDocUMeantIt / koRpus

An R Package for Text Analysis
GNU General Public License v3.0

URLs and sequences of punctuation in documents cause some readability measures to fail #22

Closed JenniferSLyon closed 4 years ago

JenniferSLyon commented 4 years ago

Our corpus contains some documents that contain URLs and more rarely sequences of punctuation that cause some of the readability measures to fail.

For a minimal example, I present a length-3 character vector. The first element is a control case that should work, to confirm I am calling the functions correctly; the second contains a couple of words and a URL; the third contains a couple of words and a sequence of punctuation. Our actual documents are much longer, but these examples reproduce the issues we have encountered. I then show output from these examples for ARI, flesch.kincaid, and FOG: ARI works in all cases, while the other two fail in different ways. flesch.kincaid fails on the first two examples and works on the third; FOG works on the first and third examples but fails on the second.

The transcript:

require("koRpus.lang.en", quietly=T)
koRpus::set.kRp.env(lang="en")
responses <- c("hi mom",
               "this fails http://www.thedailybeast.com/articles/2012/06/28/did-chief-justice-roberts-take-a-cue-from-two-centuries-ago.html",
               "this fails ,,..$%#@")
tt <- lapply(responses, koRpus::tokenize, format="obj")
ARI(tt[[1]])

Automated Readability Index (ARI) Parameters: default Grade: -8.65

Text language: en Warning message: Text is relatively short (<100 tokens), results are probably not reliable!

ARI(tt[[2]])

Automated Readability Index (ARI) Parameters: default Grade: 19.71

Text language: en Warning message: Text is relatively short (<100 tokens), results are probably not reliable!

ARI(tt[[3]])

Automated Readability Index (ARI) Parameters: default Grade: -5.8

Text language: en Warning message: Text is relatively short (<100 tokens), results are probably not reliable!

flesch.kincaid(tt[[1]])
Hyphenation (language: en)
Error in validObject(.Object) :
  invalid class “kRp.hyphen” object: invalid object for slot "hyphen" in class "kRp.hyphen": got class "NULL", should be or extend class "data.frame"
In addition: Warning message:
In mean.default(hyph.df$syll, na.rm = TRUE) :
  argument is not numeric or logical: returning NA

flesch.kincaid(tt[[2]])
Hyphenation (language: en)
  |=============================================================         |  88%
Error in all.patterns[[word.length]] : subscript out of bounds

flesch.kincaid(tt[[3]])
Hyphenation (language: en)
  |======================================================================| 100%

Flesch-Kincaid Grade Level Parameters: default Grade: -2.62 Age: 2.38

Text language: en Warning message: Text is relatively short (<100 tokens), results are probably not reliable!

FOG(tt[[1]])
Hyphenation (language: en)

Gunning Frequency of Gobbledygook (FOG) Parameters: default Grade: 0.8

Text language: en Warning message: Text is relatively short (<100 tokens), results are probably not reliable!

FOG(tt[[2]])
Hyphenation (language: en)
  |==========================================================            |  83%
Error in all.patterns[[word.length]] : subscript out of bounds

FOG(tt[[3]])
Hyphenation (language: en)

Gunning Frequency of Gobbledygook (FOG) Parameters: default Grade: 1.2

Text language: en Warning message: Text is relatively short (<100 tokens), results are probably not reliable!

sessionInfo()

R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS

Matrix products: default
BLAS:   r-project/R-3.6.2/lib/libRblas.so
LAPACK: r-project/R-3.6.2/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] koRpus.lang.en_0.1-3 koRpus_0.11-5        sylly_0.1-5

loaded via a namespace (and not attached):
[1] compiler_3.6.2    tools_3.6.2       data.table_1.12.8 sylly.en_0.1-3

I hope this information is helpful. Thank you for your time.

Jen

unDocUMeantIt commented 4 years ago

hi jen,

the problem seems to be the tokenizing. tokenize() currently doesn't treat most punctuation as part of a word token and splits off each character at every occurrence (look at the resulting tt object).
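to see this for yourself, you can inspect the token table directly. a minimal sketch, assuming koRpus and koRpus.lang.en are installed (the exact column layout depends on your koRpus version):

```r
library(koRpus)
library(koRpus.lang.en)
set.kRp.env(lang="en")

# tokenize the third problem string and look at the resulting tokens:
# each punctuation character ends up as its own token rather than being
# dropped or kept as part of a word, which later trips up hyphenation
tt3 <- tokenize("this fails ,,..$%#@", format="obj")
taggedText(tt3)
```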

as a quick workaround, you could use treetag(), which uses TreeTagger's tokenizer instead. i'll have to think about a proper solution to this, because tokenize()'s rules have also had advantages over treetag() with other texts in the past.
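another stopgap is to pre-clean the strings before tokenizing. a minimal base-R sketch, where the regexes are my own assumptions tuned to the two failure cases above, not part of koRpus:

```r
# strip URLs and runs of punctuation so tokenize() only sees plain words
clean_text <- function(x) {
  x <- gsub("https?://\\S+", " ", x)    # drop URLs
  x <- gsub("[[:punct:]]{2,}", " ", x)  # drop punctuation sequences
  trimws(gsub("\\s+", " ", x))          # collapse leftover whitespace
}

responses <- c("hi mom",
               "this fails http://www.thedailybeast.com/articles/2012/06/28/did-chief-justice-roberts-take-a-cue-from-two-centuries-ago.html",
               "this fails ,,..$%#@")
clean_text(responses)
# -> c("hi mom", "this fails", "this fails")
```

this changes the text being measured, of course, so it only makes sense if URLs and stray punctuation shouldn't count toward readability in your corpus anyway.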

alternatively, you should be able to import your texts using any tokenizer available to you by calling readTagged() (available in koRpus 0.12-1).