ppm calculation issues

From Lily Zhu and Rebecca Zhu:

I’m writing to you because some of the numbers on the CHILDES web interface are different from the numbers we calculated using the childesr package. I believe I’ve located the discrepancies between the web interface and my R code, and want to check if I have made incorrect assumptions in my code.

We are interested in how often children and adults produce a certain word within a specific target-child age range. For example, how often do two-year-olds who speak North American English produce the word 'same'? To answer this question, we set the following parameters on the website:

- Collection = "Eng-NA"
- Corpus = "All"
- Speakers = "Target_Child"
- Word = "same"
- Ages to include (years) = 2 - 3
- Bin size (months) = 12

This produced a result of 770 ppm, whereas a script we implemented ourselves using the childesr package gave ~159.23 ppm. I believe this difference is due to two discrepancies:

1. The web interface reports the average of each child's individual ppm. My R code instead computes ppm from the totals pooled across children, because the individual ppms are calculated with different denominators.
2. The web interface does not include children who never produce the word. This may be a problem, because the resulting number does not reflect the production frequency for all children within the desired age range.

So in writing my R code, I seem to have made two assumptions that differ from the web interface. Please see the attached files for more details. Can you tell me why you made those choices, or whether I'm wrong? Our team is not well-versed in the computational linguistics literature, so we would appreciate any insight you might have!
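To make the two suspected discrepancies concrete, here is a minimal Python sketch with made-up counts (not CHILDES data, and not the web interface's or childesr's actual code). The names `children`, `pooled_ppm`, and `mean_child_ppm` are hypothetical. It only shows how the three ways of aggregating can diverge on the same input.

```python
# Hypothetical per-child data: (occurrences of the target word, total tokens produced).
children = {
    "child_a": (2, 1_000),     # small transcript, word is relatively frequent
    "child_b": (10, 100_000),  # large transcript, word is relatively rare
    "child_c": (0, 50_000),    # never produces the word
}


def ppm(count, total):
    """Occurrences per million tokens."""
    return count * 1_000_000 / total


def pooled_ppm(data):
    """Ratio of totals: pool all children's counts and tokens, then convert to ppm."""
    total_count = sum(count for count, _ in data.values())
    total_tokens = sum(tokens for _, tokens in data.values())
    return ppm(total_count, total_tokens)


def mean_child_ppm(data, include_zero_producers=True):
    """Mean of ratios: compute ppm per child, then take an unweighted average."""
    values = [
        ppm(count, tokens)
        for count, tokens in data.values()
        if include_zero_producers or count > 0
    ]
    return sum(values) / len(values)


pooled = pooled_ppm(children)
mean_all = mean_child_ppm(children, include_zero_producers=True)
mean_producers = mean_child_ppm(children, include_zero_producers=False)

print(f"pooled across children:        {pooled:.2f} ppm")
print(f"mean of per-child ppm (all):   {mean_all:.2f} ppm")
print(f"mean of per-child ppm (>0):    {mean_producers:.2f} ppm")
```

With these toy numbers the pooled estimate is about 79.5 ppm, the per-child mean over all three children is 700 ppm, and the per-child mean over producers only is 1050 ppm. The unweighted mean gives a short transcript the same weight as a long one, and dropping non-producers raises it further, which is the same direction as the gap between 770 ppm and ~159.23 ppm reported above.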

langcog / childes-db-shiny

ppm calculation issues #38