langcog / wordbank

open repository of children's vocabulary data
http://wordbank.stanford.edu
GNU General Public License v2.0

Missing data from English (American) WS & WG #300

Closed alvinwmtan closed 8 months ago

alvinwmtan commented 9 months ago

The Thal 16mo dataset is not imported in English (American) WS (missing when pulled from wordbankr).

HenryMehta commented 9 months ago

@alvinwmtan I think I've loaded the data. Could you check and confirm? I am away next week and, although I'll have my laptop with me, I'm trying not to work then.

If you can check today and it hasn't worked, I'll take another look this evening or on Saturday.

alvinwmtan commented 8 months ago

@HenryMehta I still don't see the data? The dataset appears in the datasets table, but not in administration_data. Also, I double-checked and the 13mo Thal data for English (American) WG is also missing.

alvinwmtan commented 8 months ago

Data also appear to be missing from a number of other datasets; in these cases the data are only partially missing. Attached is a CSV from Hiromichi Hagihara detailing the missing data_id values (from a variety of Marchman and Smith datasets): missinIDs_ageM30.csv

HenryMehta commented 8 months ago

@alvinwmtan I am looking at this, starting with Thal 13. It seems to me that the administration is there but the link to the child data is not. I'm trying to confirm that. Could you share the SQL statement you're using to access the data so I can see how you're linking it? Thanks.

HenryMehta commented 8 months ago

@alvinwmtan following the above comment, I've made a change to the upload program and loaded Thal 13 in dev. Does this look better?

HenryMehta commented 8 months ago

@alvinwmtan I have now redone the Thal 16 update. Could you confirm whether these have worked before I progress? Also, could you let me know which specific Marchman and Smith datasets need reviewing? They take a good hour or more to load.

alvinwmtan commented 8 months ago

@HenryMehta I can see the Thal 13 dataset now in dev, but not the Thal 16.

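The datasets with missing IDs are:

  • Marchman (Norming)
  • Marchman (Wisconsin)
  • Marchman (Dallas)
  • Smith (electronic)
  • Smith (paper)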

All of these have some of the data but are missing a bunch of administrations, especially those from 30-month-old children.

HenryMehta commented 8 months ago

@alvinwmtan Can you share the SQL you're using to see the data? I loaded both Thal 13 and Thal 16 the same way.

alvinwmtan commented 8 months ago

@HenryMehta here's the SQL query:

SELECT common_administration.id AS administration_id, data_id, date_of_test, age, comprehension, production, is_norming, child_id, dataset_id, age_min, age_max
FROM common_administration
LEFT JOIN common_instrument
ON common_administration.instrument_id = common_instrument.id
LEFT JOIN common_child
ON common_administration.child_id = common_child.id
WHERE instrument_id IN (8)

For the Thal 13 in WG, the last line instead reads WHERE instrument_id IN (7).

I suspect that the issue has to do with importing participants that are at the boundaries of the instrument age range: the range for English (American) WS is (16, 30), and so it might be that participants whose ages are around 16 or 30mo might somehow be excluded from import or not correctly retrieved. Not sure if that is diagnosable on your end.
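To make the boundary hypothesis concrete, here is a minimal sketch in Python of how an off-by-one range check could drop exactly the 16mo and 30mo participants. The function names and the sample ages are hypothetical; this is not taken from the Wordbank import code, and it only illustrates the kind of bug suspected here.

```python
# Hypothetical range checks; not the actual Wordbank import logic.
def in_range_exclusive(age, age_min, age_max):
    """Strict comparison: children exactly at a boundary are excluded."""
    return age_min < age < age_max


def in_range_inclusive(age, age_min, age_max):
    """Inclusive comparison: children exactly at a boundary are kept."""
    return age_min <= age <= age_max


# Age range for English (American) WS as stated above.
AGE_MIN, AGE_MAX = 16, 30

# Illustrative ages in months (made-up sample, not real data).
ages = [15, 16, 17, 29, 30, 31]

kept_exclusive = [a for a in ages if in_range_exclusive(a, AGE_MIN, AGE_MAX)]
kept_inclusive = [a for a in ages if in_range_inclusive(a, AGE_MIN, AGE_MAX)]

# Ages that only the inclusive check retains.
dropped_at_boundary = sorted(set(kept_inclusive) - set(kept_exclusive))
print(dropped_at_boundary)
```

If something like this were happening, the missing administrations would cluster at exactly the minimum and maximum ages of the instrument, which is one way to check the hypothesis against the list of missing data_id values.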

HenryMehta commented 8 months ago

@alvinwmtan The SQL looks right. I'm concentrating on Thal 16 for now. I do not understand why it would work for Thal 13 but not Thal 16. I'm loading Thal 16 again now. As I type this, though, I wonder if it has something to do with the joins, specifically around the child. If I have time I'll look in more detail once it is loaded. I want to try the join without the child link.
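One way to test the join idea is sketched below in Python with an in-memory SQLite database. The table and column names follow the query above, but the rows are made up and the real schema may differ. The sketch compares a LEFT JOIN to the child table with an inner JOIN, and lists administrations whose child link does not resolve.

```python
import sqlite3

# Toy schema and rows for illustration only; not real Wordbank data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE common_child (id INTEGER PRIMARY KEY);
CREATE TABLE common_administration (
    id INTEGER PRIMARY KEY,
    child_id INTEGER,
    instrument_id INTEGER
);
INSERT INTO common_child (id) VALUES (1);
-- Administration 11 points at a child that does not exist; 12 has no child.
INSERT INTO common_administration (id, child_id, instrument_id)
VALUES (10, 1, 8), (11, 99, 8), (12, NULL, 8);
""")

base = """
SELECT a.id
FROM common_administration AS a
{join} common_child AS c ON a.child_id = c.id
WHERE a.instrument_id IN (8) {extra}
ORDER BY a.id
"""

# LEFT JOIN keeps every administration, even with a broken child link.
left_ids = [r[0] for r in conn.execute(base.format(join="LEFT JOIN", extra=""))]
# Inner JOIN silently drops administrations without a matching child.
inner_ids = [r[0] for r in conn.execute(base.format(join="JOIN", extra=""))]
# Administrations whose child link does not resolve.
orphan_ids = [
    r[0]
    for r in conn.execute(base.format(join="LEFT JOIN", extra="AND c.id IS NULL"))
]

print(left_ids, inner_ids, orphan_ids)
```

With a LEFT JOIN the administrations should still be returned when the child link is broken, so a broken link alone would only explain the missing rows if an inner join is used somewhere else in the retrieval path.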

HenryMehta commented 8 months ago

@alvinwmtan I think the issue is we seem to have the datasets (not necessarily the administrations) loaded multiple times.

In production we have the Thal WS dataset with id 4 and 653 administrations, and again with id 128 and 0 administrations. We also have the Thal WG dataset with id 7 and 645 administrations, and again with id 129 and 0 administrations.

I think the administrations are loaded, but we need to look at the datasets, which seem to have gone wrong.

Could you take a look and tell me if you agree with me?
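For reference, a per-dataset count of this kind can be sketched as follows, in Python against an in-memory SQLite database. The table name common_dataset and its columns are assumptions (only common_administration and dataset_id appear in this thread), and the rows are toy data rather than the production counts quoted above.

```python
import sqlite3

# Toy data for illustration; common_dataset is an assumed table name.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE common_dataset (
    id INTEGER PRIMARY KEY,
    dataset_name TEXT
);
CREATE TABLE common_administration (
    id INTEGER PRIMARY KEY,
    dataset_id INTEGER
);
INSERT INTO common_dataset (id, dataset_name) VALUES (4, 'Thal'), (128, 'Thal');
-- Three administrations attached to dataset 4, none to dataset 128.
INSERT INTO common_administration (dataset_id) VALUES (4), (4), (4);
""")

# LEFT JOIN from datasets so that datasets with no administrations
# still appear, with a count of 0.
rows = conn.execute("""
SELECT d.id, d.dataset_name, COUNT(a.id) AS n_administrations
FROM common_dataset AS d
LEFT JOIN common_administration AS a ON a.dataset_id = d.id
GROUP BY d.id, d.dataset_name
ORDER BY d.id
""").fetchall()

for row in rows:
    print(row)
```

Counting a.id rather than using COUNT(*) matters here: COUNT(*) would report 1 for an empty dataset, because the LEFT JOIN still produces one row for it.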

alvinwmtan commented 8 months ago

I think Thal WG dataset_id 129 has 641 administrations actually (which is why I could see the 13mos).

I think it's correct that there are four datasets labelled Thal: two WS datasets with source (16, 28) and two WG datasets with source (13, 16). I wonder if the issue arises when we have the same dataset_name repeated? Perhaps we should make the dataset_name unique and just use the dataset_origin_name when doing child_id matching.
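A quick way to check for repeated names is sketched below in Python with an in-memory SQLite database. The table name common_dataset, the instrument_id column on it, and the dataset_origin_name values are assumptions made for illustration; the ids and rows are toy data.

```python
import sqlite3

# Toy data for illustration; schema details are assumed, not confirmed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE common_dataset (
    id INTEGER PRIMARY KEY,
    dataset_name TEXT,
    dataset_origin_name TEXT,
    instrument_id INTEGER
);
INSERT INTO common_dataset (id, dataset_name, dataset_origin_name, instrument_id)
VALUES
    (1, 'Thal',  'origin_a', 8),
    (2, 'Thal',  'origin_a', 8),
    (3, 'Thal',  'origin_a', 7),
    (4, 'Thal',  'origin_a', 7),
    (5, 'Other', 'origin_b', 8);
""")

# Names that occur more than once within the same instrument.
duplicates = conn.execute("""
SELECT dataset_name, instrument_id, COUNT(*) AS n_datasets
FROM common_dataset
GROUP BY dataset_name, instrument_id
HAVING COUNT(*) > 1
ORDER BY instrument_id
""").fetchall()

for row in duplicates:
    print(row)
```

If dataset_name were made unique, a query like this should return no rows, and any loader that looks a dataset up by name and instrument would find exactly one match.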

HenryMehta commented 8 months ago

@alvinwmtan Yes, of course there are multiple datasets; I forgot how it works. I think the problem was that I thought there was just one WG and one WS dataset and loaded each based on the file, but there are two of each. I have now reloaded all four. Please let me know if this has worked. It looks like it has to me.

alvinwmtan commented 8 months ago

@HenryMehta Great, I can see all of them now. So the Thal datasets are resolved, and the ones that remain are the other ones:

The datasets with missing IDs are:

  • Marchman (Norming)
  • Marchman (Wisconsin)
  • Marchman (Dallas)
  • Smith (electronic)
  • Smith (paper)

All of these have some of the data but are missing a bunch of administrations, especially those from 30-month-old children.

HenryMehta commented 8 months ago

@alvinwmtan I found an error in the load for Norming. I corrected that and then reloaded the 5 (in dev). Please confirm if this worked. If so, I'll load Thal and these 5 to prod.

alvinwmtan commented 8 months ago

@HenryMehta I think it has worked—I can see them all just fine. Thank you!

HenryMehta commented 8 months ago

@alvinwmtan I've now applied all these to production. Please confirm this is OK and I'll close the issue and move on to Wordbank2.1.

alvinwmtan commented 8 months ago

@HenryMehta looks good on my end, thanks!