langcog / wordbank

open repository of children's vocabulary data
http://wordbank.stanford.edu
GNU General Public License v2.0

Hoff data only pulls from R01 #282

Closed alvinwmtan closed 2 years ago

alvinwmtan commented 2 years ago

There are two sets of CSVs for the Hoff English–Spanish bilingual dataset (HDPM and R01), but only the R01 dataset appears when pulling data. I'm not sure why this is the case, but if the loading script isn't happy about there being multiple files for the same dataset, it's okay to split the two CSV sets into two datasets (but retain the same dataset_origin_name).

HenryMehta commented 2 years ago

@alvinwmtan Are you saying that only R01 data is in the database?

alvinwmtan commented 2 years ago

@HenryMehta I think that's the case, yes.

HenryMehta commented 2 years ago

@alvinwmtan I need some time to think about how to fix this. Basically, the most recently loaded dataset overwrites the previous one because datasets are saved on dataset_origin_id (which allows comparison across forms and languages but seems to prevent it within a form). I have an idea for a fix, but I don't have much time today and I'm away for 3 days next week. I don't want to start and then have to remember where I got to, so I would rather leave this for a couple of weeks until I can tackle it in one go.
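
A minimal sketch of the overwrite described above, assuming the loader stores each dataset under a key built from the form and the dataset origin. The function `load_keyed_by_origin` and the dict-based store are hypothetical stand-ins, not the actual Wordbank loading script; the two records mirror the HDPM and R01 entries for the Hoff data.

```python
# Hypothetical illustration only; this is not the actual Wordbank loading code.
# The two records mirror the HDPM and R01 entries for the Hoff data.
records = [
    {"name": "Hoff", "dataset": "HDPM", "instrument_form": "WS",
     "dataset_origin": "Hoff_English_Mexican_Bilingual"},
    {"name": "Hoff", "dataset": "R01", "instrument_form": "WS",
     "dataset_origin": "Hoff_English_Mexican_Bilingual"},
]


def load_keyed_by_origin(records):
    """Store each record under (form, origin); a repeated key overwrites."""
    store = {}
    for rec in records:
        key = (rec["instrument_form"], rec["dataset_origin"])
        store[key] = rec  # the later record silently replaces the earlier one
    return store


store = load_keyed_by_origin(records)
surviving = [rec["dataset"] for rec in store.values()]
print(surviving)  # ['R01']
```

Because both records share the same form and origin, only the last one loaded survives, which matches the symptom of seeing R01 but not HDPM.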

HenryMehta commented 2 years ago

@alvinwmtan @mikabr I've been looking at this a bit more. Two options and I would like your preference because I don't know the R impacts.

The current Dataset model is as per our original spec and looks like this:


from django.db import models


class Dataset(models.Model):
    # Taken from the "name" field of the json record
    dataset_name = models.CharField(max_length=20)
    contributor = models.TextField(blank=True)
    citation = models.TextField(blank=True)
    # (stored value, display label) pairs for the license field
    licenses = (('CC-BY', 'CC BY 4.0'),
                ('CC-BY-NC', 'CC BY-NC 4.0'))
    license = models.CharField(max_length=15, choices=licenses)

    instrument = models.ForeignKey('Instrument', on_delete=models.CASCADE)
    # DatasetOrigin is assumed to be defined earlier in the same module
    dataset_origin = models.ForeignKey(DatasetOrigin, on_delete=models.CASCADE)

    longitudinal = models.BooleanField(default=False)

We load this information from a json file which has a record that looks like:

{
        "name": "Marchman",
        "dataset": "Dallas Bilingual",
        "instrument_language": "Spanish (Mexican)",
        "instrument_form": "WS",
        "file": "raw_data/Spanish_Mexican_WS/SpanishMexicanWS_Marchman_Dallas.csv",
        "splitcol": false,
        "norming": true,
        "longitudinal": false,
        "date_format": "%Y-%m-%d",
        "contributor": "Donna Jackson-Maldonado, Universidad Autónoma de Querétaro",
        "license": "CC-BY",
        "citation": "Marchman, V. A., Martínez-Sussmann, C., & Dale, P. S. (2004). The language-specific nature of grammatical development: Evidence from bilingual language learners. Developmental Science, 7(2), 212–224.",
        "dataset_origin": "Marchman Dallas Bilingual"
    },

dataset_name in the model is the name field in the json, i.e. Marchman in the example above.

We use the dataset field in the json file to create dataset_origin in the model, but we can also specify dataset_origin explicitly. The dataset_origin is used to link children across different instruments, forms, etc.

In this case it is a different dataset but not a different instrument or form!

We need to save dataset from the json file somewhere other than dataset_origin.

Option 1: Add a field called dataset_dataset (or something else) which stores the dataset field from the json file.
Option 2: Amend dataset_name to be a combination of name and dataset from the json file.

I prefer Option 2, but I don't know how the fields are used in R, so I need your input.
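
The two options can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names, the `dataset_dataset` placeholder field, and the `_` separator are hypothetical, and the records are reduced to the two json fields that matter here.

```python
# Hypothetical sketch of the two options; names and the separator are placeholders.
MAX_NAME_LENGTH = 20  # max_length of dataset_name in the Dataset model above

records = [
    {"name": "Hoff", "dataset": "HDPM"},
    {"name": "Hoff", "dataset": "R01"},
    {"name": "Marchman", "dataset": "Dallas Bilingual"},
]


def option_1(rec):
    """Option 1: keep dataset_name and store the json 'dataset' in its own field."""
    return {"dataset_name": rec["name"], "dataset_dataset": rec["dataset"]}


def option_2(rec, sep="_"):
    """Option 2: fold 'name' and 'dataset' into dataset_name."""
    return {"dataset_name": f"{rec['name']}{sep}{rec['dataset']}"}


# Under either option the two Hoff datasets become distinguishable.
opt1_keys = {(r["dataset_name"], r["dataset_dataset"]) for r in map(option_1, records)}
opt2_names = [option_2(rec)["dataset_name"] for rec in records]

# Combined names may not fit in the current dataset_name column.
too_long = [name for name in opt2_names if len(name) > MAX_NAME_LENGTH]
print(opt2_names)
print(too_long)
```

One thing the sketch surfaces: with this separator, a combined name such as the Marchman one is longer than 20 characters, so Option 2 would probably also need a larger max_length on dataset_name.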

HenryMehta commented 2 years ago

I think the above explanation is clearer if you also have the Hoff json definition:


    {
        "name": "Hoff",
        "dataset": "HDPM",
        "instrument_language": "English (American)",
        "instrument_form": "WS",
        "file": "raw_data/English_American_WS/EnglishWS_Hoff_HDPM.csv",
        "splitcol": false,
        "norming": false,
        "longitudinal": true,
        "contributor": "Hoff, E",
        "license": "CC-BY",
        "citation": "Hoff, E., Core, C., Place, S., Rumiche, R., Señor, M., & Parra, M. (2012). Dual language exposure and early bilingual development. Journal of child language, 39(1), 1-27.",
        "dataset_origin": "Hoff_English_Mexican_Bilingual"
    },
    {
        "name": "Hoff",
        "dataset": "R01",
        "instrument_language": "English (American)",
        "instrument_form": "WS",
        "file": "raw_data/English_American_WS/EnglishWS_Hoff_R01.csv",
        "splitcol": false,
        "norming": false,
        "longitudinal": true,
        "contributor": "Hoff, E",
        "license": "CC-BY",
        "citation": "Hoff, E., Quinn, J. M., & Giguere, D. (2018). What explains the correlation between growth in vocabulary and grammar? New evidence from latent change score analyses of simultaneous bilingual development. Developmental science, 21(2), e12536.",
        "dataset_origin": "Hoff_English_Mexican_Bilingual"
    },

alvinwmtan commented 2 years ago

@HenryMehta I think option 2 is good!

HenryMehta commented 2 years ago

@alvinwmtan I've deployed in dev. Please take a look and tell me if it's ok. If so, I'll implement on Friday (I'm away Tuesday to Thursday).

alvinwmtan commented 2 years ago

@HenryMehta I can see the R01 data, and the HDPM dataset appears in common_datasets but the data don't seem to appear in common_administration? Also, I think reloading didn't flush out the old datasets, so all the datasets/data appear twice.

HenryMehta commented 2 years ago

@alvinwmtan ok - that's a bit of a problem then. It'll take me longer to sort but I'll try and take a look on Friday

mcfrank commented 2 years ago

also note that I just got an email from Erika Hoff:

The HDPM subject numbers are published in Hoff et al. 2012:

Hoff, E., Core, C., Place, S., Rumiche, R., Señor, M., & Parra, M. (2012). Dual language exposure and early bilingual development. Journal of child language, 39(1), 1-27.

The R01 subject numbers are published in this:

Hoff, E., Quinn, J. M., & Giguere, D. (2018). What explains the correlation between growth in vocabulary and grammar? New evidence from latent change score analyses of simultaneous bilingual development. Developmental science, 21(2), e12536.

HenryMehta commented 2 years ago

@alvinwmtan I've had another go at this. I've had to do it in a new database, so when checking please use wordbank2-dev-2 where you previously used wordbank2-dev. I have resorted to option 1.

If this has worked, can you also check whether we have a similar issue with PoulinDubois? Thanks.

alvinwmtan commented 2 years ago

@HenryMehta it works now, thanks! What's the name of the new field that you've added—is it dataset_dataset?

I don't think there's an issue with Poulin-Dubois—there aren't datasets with the same instrument and dataset_origin.
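
A check like the one described above (looking for datasets that share an instrument and dataset_origin) could be sketched as below. The helper `find_origin_collisions` is hypothetical, and the fallback from a missing dataset_origin to the dataset field is an assumption based on the earlier description of how dataset_origin is derived. The sample records are trimmed copies of the json entries quoted earlier in the thread.

```python
from collections import defaultdict

# Trimmed copies of the json records quoted in this thread.
records = [
    {"name": "Marchman", "dataset": "Dallas Bilingual",
     "instrument_language": "Spanish (Mexican)", "instrument_form": "WS",
     "dataset_origin": "Marchman Dallas Bilingual"},
    {"name": "Hoff", "dataset": "HDPM",
     "instrument_language": "English (American)", "instrument_form": "WS",
     "dataset_origin": "Hoff_English_Mexican_Bilingual"},
    {"name": "Hoff", "dataset": "R01",
     "instrument_language": "English (American)", "instrument_form": "WS",
     "dataset_origin": "Hoff_English_Mexican_Bilingual"},
]


def find_origin_collisions(records):
    """Return {(language, form, origin): [dataset, ...]} for keys used more than once."""
    groups = defaultdict(list)
    for rec in records:
        # Assumption: when dataset_origin is not given, it defaults to the dataset field.
        origin = rec.get("dataset_origin", rec["dataset"])
        key = (rec["instrument_language"], rec["instrument_form"], origin)
        groups[key].append(rec["dataset"])
    return {key: names for key, names in groups.items() if len(names) > 1}


collisions = find_origin_collisions(records)
for key, names in collisions.items():
    print(key, names)
```

On these sample records only the Hoff pair is reported; a contributor whose datasets all differ in instrument or origin would produce no entries.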

HenryMehta commented 2 years ago

Good. The new field is dataset_source. I'll work on getting this to production this evening/tomorrow

HenryMehta commented 2 years ago

@alvinwmtan I think I have the data all deployed to production now. The code changes haven't deployed and I'll look at that tomorrow

HenryMehta commented 2 years ago

@alvinwmtan it is now all deployed