Open kbenoit opened 5 years ago
I built a simple function for computing the D of vocd_d in this commit
Some issues I encountered:
tokens_sample only supports sampling documents, as the size cannot be > ndoc(x). Is this an issue we ought to fix in tokens_sample?
The stats::nls function: is this an okay dependency, or must we find an alternative way?
Also see McKee, G., Malvern, D., & Richards, B. (2000). Measuring vocabulary diversity using dedicated software. Literary and Linguistic Computing, 15(3), 323-338.
Think it's the original paper for vocd-D
On the first, you can add this function:
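library("quanteda")
# Sample `size` tokens from within each document
# (tokens_sample() samples whole documents instead).
tokens_samplefrom <- function(x, size, replace = FALSE) {
    attrs <- attributes(x)
    # Index by position: sample(y, ...) would treat a one-token document y
    # as the range 1:y rather than as a single token ID.
    result <- lapply(unclass(x), function(y) {
        y[sample.int(length(y), size = size, replace = replace)]
    })
    attributes(result) <- attrs
    # tokens_recompile() is an internal (unexported) quanteda function
    quanteda:::tokens_recompile(result)
}
toks <- tokens(c("a b c d e f", "q r s t u v w x"))
set.seed(100)
tokens_samplefrom(toks, size = 3)
## tokens from 2 documents.
## text1 :
## [1] "b" "f" "c"
##
## text2 :
## [1] "q" "t" "s"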
nls() is fine because it's in the (always loaded) stats package.
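For reference, here is a minimal base-R sketch of how nls() could be used to estimate D. It assumes the usual vocd curve TTR = (D/N) * (sqrt(1 + 2N/D) - 1), sample sizes of 35 to 50 tokens, and 100 subsamples per size. The names ttr_model and estimate_d are hypothetical, the input is a plain character vector rather than a quanteda tokens object, and this is not the code from the commit.

```r
# Theoretical TTR for sample size N and diversity parameter D (assumed model)
ttr_model <- function(N, D) (D / N) * (sqrt(1 + 2 * N / D) - 1)

# Hypothetical helper: estimate D for one document given as a character vector
estimate_d <- function(toks, sizes = 35:50, n_samples = 100) {
    stopifnot(length(toks) >= max(sizes))
    # Mean TTR over n_samples random subsamples (without replacement) per size
    mean_ttr <- vapply(sizes, function(n) {
        mean(replicate(n_samples, length(unique(sample(toks, n))) / n))
    }, numeric(1))
    # Least-squares fit of the single parameter D
    fit <- stats::nls(ttr ~ ttr_model(N, D),
                      data = data.frame(N = sizes, ttr = mean_ttr),
                      start = list(D = 50))
    unname(coef(fit)["D"])
}

set.seed(100)
# Synthetic "text": 500 tokens drawn from a skewed 200-type vocabulary
vocab <- paste0("w", 1:200)
toks <- sample(vocab, 500, replace = TRUE, prob = 1 / seq_along(vocab))
estimate_d(toks)
```

The starting value D = 50 is an arbitrary guess. Note that nls() is documented as unsuitable for artificial zero-residual data, so the sketch fits sampled rather than exact model values.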
Prof. @kbenoit,
See commit e0f90d0 for my outline code for vocd-D after incorporating tokens_samplefrom and apply, and see commit 67b11c6 for my outline code for hd-D.
Would be great if you could:
Review the hd-D code: The formula for hd-D is never explicitly specified in McCarthy & Jarvis (2011), but based on McCarthy & Jarvis (2007), I understood that HD-D := sum_over_all_sampsize(ATTR_sampsize * 1/samp_size), but I am not 100% sure.
Advise on how I can construct unit tests for vocd-D: Since vocd-D involves sampling, there will be some variability between how R samples (even with set.seed) and how the online platforms sample. My guess is that we try a large number of samples and specify a tight threshold for how much D can vary?
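As a point of comparison for the HD-D question, the sketch below implements one possible reading: each type contributes the hypergeometric probability of appearing at least once in a random sample of fixed size, multiplied by 1/sample size, and the contributions are summed over types. Whether this matches the formula intended in McCarthy & Jarvis (2007) is exactly what is uncertain above. The function name hdd and the default sample size of 42 are assumptions, and the input is a plain character vector.

```r
# Hypothetical helper, not a quanteda function
hdd <- function(toks, sample_size = 42) {
    stopifnot(length(toks) >= sample_size)
    freq <- as.vector(table(toks))   # frequency of each type
    n_tokens <- length(toks)
    # P(type occurs at least once in a random sample of sample_size tokens)
    p_present <- 1 - stats::dhyper(0, m = freq, n = n_tokens - freq,
                                   k = sample_size)
    # Assumed weighting: each type's probability times 1 / sample_size
    sum(p_present / sample_size)
}

# Tiny worked case: 4 tokens, 3 types, samples of 2 tokens
hdd(c("a", "a", "b", "c"), sample_size = 2)
```

In the tiny case, type "a" is present with probability 5/6 and types "b" and "c" with probability 1/2 each, so the sum is (5/6 + 1/2 + 1/2) / 2 = 11/12. Because the computation is deterministic, it could be unit-tested without any seed.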
For tests, or examples with anything stochastic, use set.seed().
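A small base-R sketch of what such a test could look like, assuming a hypothetical stochastic statistic mean_ttr (not part of quanteda). With a fixed seed the result is exactly reproducible within R, while a comparison against an unseeded run, or an external tool, needs a tolerance:

```r
# Hypothetical stochastic statistic: mean TTR of random 35-token subsamples
mean_ttr <- function(toks, n = 35, n_samples = 100) {
    mean(replicate(n_samples, length(unique(sample(toks, n))) / n))
}

toks <- rep(letters, times = 4)   # 104 tokens, 26 types

set.seed(100)
first <- mean_ttr(toks)
set.seed(100)
second <- mean_ttr(toks)
# Same seed, identical result
stopifnot(identical(first, second))

# Different random draws: compare within a relative tolerance instead
unseeded <- mean_ttr(toks)
stopifnot(isTRUE(all.equal(first, unseeded, tolerance = 0.1)))
```

The tolerance of 0.1 is an illustrative choice. In a package test suite the same checks would normally be written as testthat expectations.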
On the HD-D code, I will return to the LD stuff, but if @koheiw and I can agree on the structure of a new function (see https://github.com/quanteda/quanteda/pull/1520#issuecomment-447529304) then this will make writing those functions different (and easier). Let's wait on that issue before I return to this code. However, I will try to take a look at McCarthy & Jarvis (2007) to understand HD-D. I think there is code on the Internet somewhere for this, the vocd software perhaps?
@kbenoit Acknowledged!
These would include:
See McCarthy, Philip M, and Scott Jarvis. 2010. “MTLD, Vocd-D, and HD-D: a Validation Study of Sophisticated Approaches to Lexical Diversity Assessment.” Behavior Research Methods 42(2): 381–92.
Also for testing the implementations
Related to quanteda/quanteda#1508