Something not right with the combinatorics code
Closed: smeylan closed this issue 6 years ago
i copied your lexical_diversity.py version and reverted the combinatorics to the original implementation to avoid the extra install. not sure if these errors are still occurring. maybe these functions are available in math?
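for reference, math.comb does exist as of Python 3.8 and computes with exact integers, so it can't overflow the way a float-based implementation can:

```python
import math

# math.comb (Python 3.8+) uses arbitrary-precision integers, so even
# very large inputs return an exact result instead of overflowing.
print(math.comb(2643, 236))  # a ~345-digit exact integer
```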
btw i incorporated mtld/hdd in the django import script: https://github.com/langcog/childes-db/blob/master/djangoapp/db/childes_db.py#L291
Looked more into this: the offending call is
scipy.special.comb(population, sample)
for large populations (e.g. population = 2643 and sample = 236). Looking at the original McCarthy and Jarvis paper, it's clear that they expected this to be computed on smaller texts. We need to address this because 1) it affects a lot of transcripts (23%, and probably more if you reverted to the naive implementation), and 2) there's an age-related bias in whether the error appears.
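For a concrete repro, plus one overflow-free alternative (a sketch only; the hypergeometric term below is my reading of the HD-D formula, not necessarily the library's exact code):

```python
import math
from scipy.special import comb, gammaln

# Repro: with the default exact=False, comb() works in float64; the true
# value here exceeds the float64 max (~1.8e308), so it overflows to inf.
print(comb(2643, 236))  # inf

# One overflow-free sketch: evaluate the HD-D hypergeometric term in
# log space via gammaln instead of forming the huge combinations directly.
def log_comb(n, k):
    """log C(n, k), numerically stable for large n."""
    return gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)

def p_at_least_once(freq, n_tokens, sample_size):
    """HD-D term: P(a type with count `freq` appears >= 1 time in a sample)."""
    return 1.0 - math.exp(log_comb(n_tokens - freq, sample_size)
                          - log_comb(n_tokens, sample_size))
```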
How about we compute this on the first 1000 (?) tokens in each transcript? Is that easy to add in the Django import script?
only on the first 1000 tokens? would it be ok to just toss out everything else?
also, technically we're not using that specific function for combinatorics, as i mentioned in the previous comment
first 1000 is better than nothing? we could do each 1000-token block and take the median
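something like this, roughly (hd_d here is a placeholder for whatever HDD function the import script actually uses):

```python
import statistics

def blockwise_hdd(tokens, hd_d, block_size=1000):
    # Split into full, non-overlapping blocks; drop the partial tail so
    # every score is computed on the same amount of text.
    blocks = [tokens[i:i + block_size]
              for i in range(0, len(tokens) - block_size + 1, block_size)]
    if not blocks:
        return None  # transcript shorter than one block
    return statistics.median(hd_d(block) for block in blocks)
```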
my point with "probably higher if you reverted to the naive implementation" in my previous comment is that the scipy implementation can handle sample sizes and populations at least as large as the function you're currently using (because numpy has clever handling of large ints), so there's probably an even higher proportion of transcripts with NaN HDD scores in the current cached version
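e.g., assuming the cached run went through the float (exact=False) path:

```python
from scipy.special import comb

# How the NaNs arise, assuming the HDD code takes a ratio of comb() calls
# on the float path: numerator and denominator both overflow to inf,
# and inf / inf is nan.
print(comb(2643 - 5, 236) / comb(2643, 236))  # nan

# exact=True sidesteps the overflow by using Python's arbitrary-precision
# integers instead of float64.
print(comb(2643, 236, exact=True))  # exact integer
```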
actually, it's a 2% error rate with scipy