Implement additional LD measures

quanteda / quanteda.textstats

Textual statistics for quanteda

GNU General Public License v3.0

Implement additional LD measures #27

Open kbenoit opened 5 years ago

kbenoit commented 5 years ago

These would include:

(vocd-)D
HD-D

See McCarthy, Philip M, and Scott Jarvis. 2010. “MTLD, Vocd-D, and HD-D: a Validation Study of Sophisticated Approaches to Lexical Diversity Assessment.” Behavior Research Methods 42(2): 381–92.

Also for testing the implementations

Related to quanteda/quanteda#1508

jiongweilua commented 5 years ago

I built a simple function for computing the D of vocd_d in this commit

Some issues I encountered:

I was trying to find a simple way to sample features from a tokens object, but tokens_sample seems to support only sampling documents, since size cannot be > ndoc(x). Is this something we ought to fix in tokens_sample?
As 'D' is a parameter to be estimated, I relied on the stats::nls function - is this an okay dependency or must we find an alternative way?

Also see McKee, G., Malvern, D., & Richards, B. (2000). Measuring vocabulary diversity using dedicated software. Literary and linguistic computing, 15(3), 323-338.

Think it's the original paper for vocd-D
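For readers following along, the nls-based estimate of D could be sketched roughly as below. This is not the code from the commit above. It assumes the curve usually given for vocd-D, TTR = (D/N) * (sqrt(1 + 2N/D) - 1), fitted to mean TTRs from random samples of 35 to 50 tokens. The function name vocd_d_sketch, the starting value, and the toy token vector are all made up for illustration, and the sketch does a single run, whereas the vocd procedure as usually described repeats the fit and averages the resulting D values.

```r
# Hypothetical helper (not part of quanteda): estimate D for one document,
# given as a character vector of tokens.
vocd_d_sketch <- function(toks, sizes = 35:50, n_samples = 100) {
    stopifnot(length(toks) >= max(sizes))
    # mean type-token ratio at each sample size, sampling without replacement
    mean_ttr <- vapply(sizes, function(n) {
        mean(replicate(n_samples, length(unique(sample(toks, n))) / n))
    }, numeric(1))
    dat <- data.frame(N = sizes, TTR = mean_ttr)
    # least-squares fit of the assumed curve; the "port" algorithm is used
    # only so that D can be bounded away from zero (start value is a guess)
    fit <- stats::nls(TTR ~ (D / N) * (sqrt(1 + 2 * N / D) - 1),
                      data = dat, start = list(D = 50),
                      algorithm = "port", lower = 0.01)
    unname(stats::coef(fit)["D"])
}

# toy document: 300 tokens drawn from 80 types with skewed frequencies
set.seed(100)
toks <- sample(paste0("w", 1:80), 300, replace = TRUE, prob = 1 / (1:80))
d_hat <- vocd_d_sketch(toks)
d_hat
```

Because the mean TTRs come from random samples, d_hat changes from run to run unless the seed is fixed.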

kbenoit commented 5 years ago

On the first, you can add this function:

library("quanteda")

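tokens_samplefrom <- function(x, size, replace = FALSE) {
    # keep the tokens attributes (types, docvars, ...) so they can be restored
    attrs <- attributes(x)
    # sample within each document; indexing via sample.int() avoids the
    # sample() gotcha where a length-one numeric x is treated as 1:x
    result <- lapply(unclass(x), function(doc) {
        doc[sample.int(length(doc), size = size, replace = replace)]
    })
    attributes(result) <- attrs
    # tokens_recompile() is internal to quanteda, hence the ':::'
    quanteda:::tokens_recompile(result)
}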

toks <- tokens(c("a b c d e f", "q r s t u v w x"))
set.seed(100)
tokens_samplefrom(toks, size = 3)
## tokens from 2 documents.
## text1 :
## [1] "b" "f" "c"
## 
## text2 :
## [1] "q" "t" "s"

nls() is fine because it's in the (always loaded) stats package.

jiongweilua commented 5 years ago

Prof. @kbenoit,

See commit e0f90d0 for my outline code for vocd-D after incorporating tokens_samplefrom and apply, and see commit 67b11c6 for my outline code for hd-D

Would be great if you could:

Review the hd-D code: The formula for hd-D is never explicitly specified in McCarthy & Jarvis (2010), but based on McCarthy & Jarvis (2007), I understood that HD-D := sum_over_all_sampsize(ATTR_sampsize * 1/samp_size). I am not 100% sure about this, though.
Advise on how I can construct unit tests for vocd-D: Since vocd-D involves sampling, there will be some variability between how R samples (even with set.seed) and how the online platforms sample. My guess is that we use a large number of samples and specify a tight threshold for how much D can vary?
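As a point of comparison for the review, here is a minimal sketch of HD-D under the hypergeometric formulation it is commonly described with: for each type, take the probability that it appears at least once in a random sample of 42 tokens, multiply by 1/42, and sum over types. This sums over types rather than over sample sizes, so it is an assumption that may not match the formula in the commit, and hdd_sketch is a hypothetical name, not a quanteda function.

```r
# Hypothetical helper (not part of quanteda): HD-D for one document,
# given as a character vector of tokens.
hdd_sketch <- function(toks, sample_size = 42) {
    n_tokens <- length(toks)
    stopifnot(n_tokens >= sample_size)
    freqs <- as.vector(table(toks))  # frequency of each type
    # P(type is drawn at least once in a sample of `sample_size` tokens)
    p_present <- 1 - stats::dhyper(0, m = freqs, n = n_tokens - freqs,
                                   k = sample_size)
    # each type contributes p_present / sample_size to the expected TTR
    sum(p_present / sample_size)
}

all_distinct <- hdd_sketch(paste0("w", 1:50))  # every token is its own type
all_same <- hdd_sketch(rep("w", 50))           # a single type repeated
```

The two toy inputs are the extremes: with every token distinct, the expected TTR of a 42-token sample is 1, and with one repeated type it is 1/42. Since nothing is sampled, this version needs no seed and can be tested against exact values.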

kbenoit commented 5 years ago

For tests, or examples with anything stochastic, use set.seed().
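A minimal illustration of that advice, using base R only: resetting the seed before each stochastic call makes the two results identical within one R session and version. Note that the default sampling algorithm changed in R 3.6.0, so values recorded under an older R version may not be reproduced by a newer one even with the same seed.

```r
toks <- c("a", "b", "c", "d", "e", "f")

set.seed(100)
first <- sample(toks, 3)

set.seed(100)
second <- sample(toks, 3)

# same seed, same call: the two draws match
identical(first, second)
```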

On the HD-D code, I will return to the LD stuff, but if @koheiw and I can agree on the structure of a new function (see https://github.com/quanteda/quanteda/pull/1520#issuecomment-447529304) then this will make writing those functions different (and easier). Let's wait on that issue before I return to this code. However, I will try to take a look at McCarthy & Jarvis (2007) to understand HD-D. I think there is code on the Internet somewhere for this, the vocd software perhaps?

kbenoit commented 5 years ago

Working branch for this is dev-MTLD.

jiongweilua commented 5 years ago

@kbenoit Acknowledged!