HDD computation throws errors for long transcripts

langcog / childes-db

A SQL interface for the CHILDES child language corpora


HDD computation throws errors for long transcripts #29

Closed smeylan closed 6 years ago

smeylan commented 6 years ago

Something not right with the combinatorics code

amsan7 commented 6 years ago

i copied your lexical_diversity.py version and reverted combinatorics to the original impl to avoid the extra install. not sure if these errors are still occurring. maybe these functions are available in math?

btw i incorporated mtld/hdd in django import script: https://github.com/langcog/childes-db/blob/master/djangoapp/db/childes_db.py#L291

smeylan commented 6 years ago

Looked more into this: the offending call is scipy.special.comb(population, sample) for large populations (e.g. population = 2643 and sample = 236). Looking at the original McCarthy and Jarvis paper, it's clear that they expect this to be computed on smaller texts. We need to address this because 1) it's affecting a lot of transcripts (23%), probably more if you reverted to the naive implementation, and 2) there's an age-related bias in whether this error appears
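For context on why that call breaks: C(2643, 236) is on the order of 10^343, which is past the largest double (about 1.8 × 10^308), so a floating-point binomial coefficient overflows and the ratio of two such values can come out as NaN. One possible workaround, sketched below, is to evaluate the hypergeometric term in log space with math.lgamma. This is only an illustration and not the code in lexical_diversity.py; the names log_comb, prob_type_absent and hdd are hypothetical, and the 1/sample_size scaling and the default sample of 42 tokens are assumptions about the HD-D formulation in use.

```python
import math
from collections import Counter


def log_comb(n, k):
    """Natural log of C(n, k); -inf when the coefficient is zero."""
    if k < 0 or k > n:
        return float("-inf")
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def prob_type_absent(population, type_count, sample):
    """P(a type with `type_count` tokens is missing from a random sample).

    Hypergeometric P(X = 0) = C(N - K, n) / C(N, n), computed as a
    difference of logs so neither coefficient is ever materialised.
    """
    if sample > population:
        raise ValueError("sample cannot be larger than the population")
    log_p = log_comb(population - type_count, sample) - log_comb(population, sample)
    return math.exp(log_p)  # exp(-inf) is 0.0, so impossible cases are fine


def hdd(tokens, sample_size=42):
    """HD-D sketch: summed per-type probability of appearing in a sample.

    Assumption: contributions are scaled by 1 / sample_size; transcripts
    shorter than `sample_size` raise ValueError via prob_type_absent.
    """
    counts = Counter(tokens)
    n_tokens = len(tokens)
    return sum(
        (1.0 - prob_type_absent(n_tokens, count, sample_size)) / sample_size
        for count in counts.values()
    )
```

With this form the transcript length only enters through lgamma, so a 2643-token transcript is no harder than a 200-token one. Whether scores on very long transcripts are still comparable across ages is a separate question that this does not address.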

How about we compute this on the first 1000 (?) tokens in each transcript? Is that easy to add in the Django import script?

amsan7 commented 6 years ago

only on the first 1000 tokens? would it be ok to just toss out everything else?

also technically we're not using that specific function for combinatorics as i mentioned in the previous comment

smeylan commented 6 years ago

first 1000 is better than nothing? we could do each 1000-token block and take the median
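A rough sketch of the block-and-median idea, in case it helps the discussion. The function name blockwise_median is hypothetical, the metric is passed in as a callable rather than tied to any particular HDD implementation, and two choices here are assumptions the thread has not settled: a trailing partial block is dropped, and a transcript shorter than one block returns None.

```python
from statistics import median


def blockwise_median(tokens, score, block_size=1000):
    """Median of `score` over consecutive full blocks of `block_size` tokens.

    Assumptions: the trailing partial block is ignored, and None is
    returned when the transcript has fewer than `block_size` tokens.
    """
    blocks = [
        tokens[start:start + block_size]
        for start in range(0, len(tokens) - block_size + 1, block_size)
    ]
    if not blocks:
        return None
    return median(score(block) for block in blocks)


def type_token_ratio(block):
    """Stand-in metric for the example; any per-block score would do."""
    return len(set(block)) / len(block)


# 2500 tokens -> two full blocks are scored, the last 500 tokens are dropped
tokens = ["a"] * 1000 + [str(i) for i in range(1000)] + ["b"] * 500
result = blockwise_median(tokens, type_token_ratio)
```

Using only the first 1000 tokens would be the special case of scoring `tokens[:1000]` once, which keeps every transcript at the same effective length but discards the rest.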

my point with "probably higher if you reverted to the naive implementation" from my previous comment is that the scipy implementation can handle sample sizes and populations at least as large as the function you are currently using (b/c numpy has clever handling of large ints), so there's probably an even higher proportion of transcripts with NaN HDD scores in the current cached version

amsan7 commented 6 years ago

actually 2% error with scipy