quanteda / quanteda.textstats

Implement additional LD measures #27

Open kbenoit opened 5 years ago

kbenoit commented 5 years ago

These would include:

See McCarthy, Philip M, and Scott Jarvis. 2010. “MTLD, Vocd-D, and HD-D: a Validation Study of Sophisticated Approaches to Lexical Diversity Assessment.” Behavior Research Methods 42(2): 381–92.

Also for testing the implementations

Related to quanteda/quanteda#1508

jiongweilua commented 5 years ago

I built a simple function for computing the D of vocd_d in this commit

Some issues I encountered:

Also see McKee, G., Malvern, D., & Richards, B. (2000). Measuring vocabulary diversity using dedicated software. Literary and Linguistic Computing, 15(3), 323-338.

I think it's the original paper for vocd-D.

kbenoit commented 5 years ago

On the first, you can add this function:

library("quanteda")

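# sample `size` tokens from each document of a tokens object
tokens_samplefrom <- function(x, size, replace = FALSE) {
    attrs <- attributes(x)  # keep types, docvars and other metadata
    # sample the integer token IDs within each document
    result <- lapply(unclass(x), sample, size = size, replace = replace)
    attributes(result) <- attrs
    # internal (unexported) quanteda function that re-indexes the types
    quanteda:::tokens_recompile(result)
}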

toks <- tokens(c("a b c d e f", "q r s t u v w x"))
set.seed(100)
tokens_samplefrom(toks, size = 3)
## tokens from 2 documents.
## text1 :
## [1] "b" "f" "c"
## 
## text2 :
## [1] "q" "t" "s"

nls() is fine because it's in the (always loaded) stats package.
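To make the nls() step concrete, here is a minimal base-R sketch of fitting D, assuming the commonly described vocd-D procedure: draw repeated random samples at each size from 35 to 50 tokens, average the TTR at each size, and fit the curve TTR = (D/N) * (sqrt(1 + 2N/D) - 1). The names ttr_model and estimate_d, the defaults, and the toy data are hypothetical and not taken from the commit above. The "port" algorithm with a lower bound is only there to keep D positive during fitting.

```r
# Theoretical TTR curve used by vocd-D, for sample size N and parameter D
ttr_model <- function(N, D) (D / N) * (sqrt(1 + 2 * N / D) - 1)

# Hypothetical helper: estimate D for one document given as a character vector
estimate_d <- function(tokens, sizes = 35:50, n_draws = 100) {
    stopifnot(length(tokens) >= max(sizes))
    # mean TTR over n_draws random samples (without replacement) at each size
    mean_ttr <- vapply(sizes, function(n) {
        mean(replicate(n_draws, length(unique(sample(tokens, n))) / n))
    }, numeric(1))
    # least-squares fit of D; the lower bound keeps D positive
    fit <- nls(mean_ttr ~ ttr_model(sizes, D), start = list(D = 50),
               algorithm = "port", lower = 1)
    unname(coef(fit)["D"])
}

# Toy document with a skewed (Zipf-like) word distribution
set.seed(42)
vocab <- paste0("w", 1:500)
toks_chr <- sample(vocab, 400, replace = TRUE, prob = 1 / seq_along(vocab))
d_hat <- estimate_d(toks_chr)
print(d_hat)
```

A real implementation would work on the tokens object (for example via tokens_samplefrom) rather than a character vector. Published descriptions also tend to repeat the whole procedure several times and average the resulting D values.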

jiongweilua commented 5 years ago

Prof. @kbenoit,

See commit e0f90d0 for my outline code for vocd-D, after incorporating tokens_samplefrom and apply, and commit 67b11c6 for my outline code for HD-D.

Would be great if you could:

kbenoit commented 5 years ago

For tests, or examples with anything stochastic, use set.seed().
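As a small illustration of that advice: resetting the seed before each stochastic call makes the result reproducible, so a test can compare two runs exactly. The helper sample_ttr and the toy vector are hypothetical and stand in for any sampling-based measure.

```r
# Hypothetical stochastic helper: TTR of a random sample of n tokens
sample_ttr <- function(tokens, n) length(unique(sample(tokens, n))) / n

toy_tokens <- c("a", "b", "a", "c", "b", "d", "e", "a", "f", "g")

set.seed(100)
first <- sample_ttr(toy_tokens, 5)
set.seed(100)
second <- sample_ttr(toy_tokens, 5)

# same seed, same random draw, same value
stopifnot(identical(first, second))
```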

On the HD-D code: I will return to the LD stuff, but if @koheiw and I can agree on the structure of a new function (see https://github.com/quanteda/quanteda/pull/1520#issuecomment-447529304), that will make writing those functions different (and easier). Let's wait on that issue before I return to this code. In the meantime, I will try to take a look at McCarthy & Jarvis (2007) to understand HD-D. I think there is code for this on the Internet somewhere, perhaps in the vocd software?
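For orientation, a rough sketch of how HD-D is usually described: for each type, use the hypergeometric distribution to get the probability that it appears at least once in a random sample of 42 tokens, sum those probabilities, and divide by the sample size to get an expected TTR. The function name hdd and the character-vector input are hypothetical, and the default of 42 is an assumption based on that usual description, so it should be checked against the paper. This is not the code in the commits above.

```r
# Hypothetical sketch of HD-D for one document given as a character vector
hdd <- function(tokens, sample_size = 42) {
    stopifnot(length(tokens) >= sample_size)
    freqs <- as.vector(table(tokens))  # frequency of each type
    n_tokens <- length(tokens)
    # P(type occurs at least once in a sample of sample_size tokens)
    p_present <- 1 - dhyper(0, m = freqs, n = n_tokens - freqs, k = sample_size)
    # expected number of types in the sample, as a proportion of its size
    sum(p_present) / sample_size
}

# Sanity checks at the two extremes
all_distinct <- hdd(paste0("w", 1:100))  # every token is a different type
all_same <- hdd(rep("a", 100))           # a single type repeated
stopifnot(abs(all_distinct - 1) < 1e-9, abs(all_same - 1 / 42) < 1e-12)
```

Because the probabilities come directly from dhyper(), nothing here is stochastic, so no set.seed() is needed for this measure.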

kbenoit commented 5 years ago

Working branch for this is dev-MTLD.

jiongweilua commented 5 years ago

@kbenoit Acknowledged!