Closed alvinwmtan closed 2 years ago
@alvinwmtan Are you saying that only R01 data is in the database?
@HenryMehta I think that's the case, yes.
@alvinwmtan I need some time to think about how to fix this. Basically, the most recently loaded dataset overwrites the previous one because it is saving them on dataset_origin_id (which allows comparison across forms and languages but seems to prevent it within a form). I have a thought about how to fix it, but I don't have much time today and I'm away for 3 days next week. I don't want to start and then have to remember where I got to, so I would rather leave this for a couple of weeks until I can tackle it in one go.
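A minimal sketch of the overwrite described above, in plain Python rather than the actual loader. It assumes (the loading code is not shown in this thread) that a dataset row is looked up by instrument and dataset_origin alone; the `load` helper and the dict standing in for the table are hypothetical.

```python
# Hypothetical stand-in for the Dataset table: one row per lookup key.
# Assumption: the loader identifies a dataset by (instrument, dataset_origin).
table = {}

def load(record):
    """Insert the record, replacing any row that shares its lookup key."""
    instrument = (record["instrument_language"], record["instrument_form"])
    key = (instrument, record["dataset_origin"])
    table[key] = record["dataset"]  # same key as an earlier record: overwrite

hdpm = {
    "dataset": "HDPM",
    "instrument_language": "English (American)",
    "instrument_form": "WS",
    "dataset_origin": "Hoff_English_Mexican_Bilingual",
}
r01 = dict(hdpm, dataset="R01")  # same instrument and origin, different dataset

load(hdpm)
load(r01)
print(table)  # a single row is left, and it holds R01
```

Under that assumption, whichever file is loaded last wins, which would match only R01 showing up.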
@alvinwmtan @mikabr I've been looking at this a bit more. Two options and I would like your preference because I don't know the R impacts.
The current Dataset model is as per our original spec and looks like this:
class Dataset(models.Model):
    dataset_name = models.CharField(max_length=20)
    contributor = models.TextField(blank=True)
    citation = models.TextField(blank=True)
    licenses = (('CC-BY', 'CC BY 4.0'),
                ('CC-BY-NC', 'CC BY-NC 4.0'))
    license = models.CharField(max_length=15, choices=licenses)
    instrument = models.ForeignKey('Instrument', on_delete=models.CASCADE)
    dataset_origin = models.ForeignKey(DatasetOrigin, on_delete=models.CASCADE)
    longitudinal = models.BooleanField(default=False)
We load this information from a json file which has a record that looks like:
{
"name": "Marchman",
"dataset": "Dallas Bilingual",
"instrument_language": "Spanish (Mexican)",
"instrument_form": "WS",
"file": "raw_data/Spanish_Mexican_WS/SpanishMexicanWS_Marchman_Dallas.csv",
"splitcol": false,
"norming": true,
"longitudinal": false,
"date_format": "%Y-%m-%d",
"contributor": "Donna Jackson-Maldonado, Universidad Autónoma de Querétaro",
"license": "CC-BY",
"citation": "Marchman, V. A., Martínez-Sussmann, C., & Dale, P. S. (2004). The language-specific nature of grammatical development: Evidence from bilingual language learners. Developmental Science, 7(2), 212–224.",
"dataset_origin": "Marchman Dallas Bilingual"
},
dataset_name in the model is the name field in the json, i.e. Marchman in the example above.
We use the dataset field within the json file to create dataset_origin in the model. But we can also specify the dataset_origin explicitly. The dataset_origin is used to enable comparison of children across different instruments and forms etc.
In this case it is a different dataset but not a different instrument or form!
We need to save dataset from the json file somewhere other than dataset_origin.
Option 1: Add a field called dataset_dataset, or something else, which stores the dataset field from the json file.
Option 2: Amend dataset_name to be a combination of name and dataset from the json file.
I prefer Option 2, but I don't know how the fields are used in R, so I need your input.
I think the above explanation is clearer if you also have the Hoff json definition:
{
"name": "Hoff",
"dataset": "HDPM",
"instrument_language": "English (American)",
"instrument_form": "WS",
"file": "raw_data/English_American_WS/EnglishWS_Hoff_HDPM.csv",
"splitcol": false,
"norming": false,
"longitudinal": true,
"contributor": "Hoff, E",
"license": "CC-BY",
"citation": "Hoff, E., Core, C., Place, S., Rumiche, R., Señor, M., & Parra, M. (2012). Dual language exposure and early bilingual development. Journal of child language, 39(1), 1-27.",
"dataset_origin": "Hoff_English_Mexican_Bilingual"
},
{
"name": "Hoff",
"dataset": "R01",
"instrument_language": "English (American)",
"instrument_form": "WS",
"file": "raw_data/English_American_WS/EnglishWS_Hoff_R01.csv",
"splitcol": false,
"norming": false,
"longitudinal": true,
"contributor": "Hoff, E",
"license": "CC-BY",
"citation": "Hoff, E., Quinn, J. M., & Giguere, D. (2018). What explains the correlation between growth in vocabulary and grammar? New evidence from latent change score analyses of simultaneous bilingual development. Developmental science, 21(2), e12536.",
"dataset_origin": "Hoff_English_Mexican_Bilingual"
},
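To make the two options concrete, here is a rough sketch applied to the two Hoff records above. It is not the loader code: the function names are hypothetical, dataset_dataset is only the placeholder field name used in this thread, and the space separator in Option 2 is an assumption, since no separator is specified.

```python
# The two Hoff records, reduced to the fields that matter here.
records = [
    {"name": "Hoff", "dataset": "HDPM",
     "dataset_origin": "Hoff_English_Mexican_Bilingual"},
    {"name": "Hoff", "dataset": "R01",
     "dataset_origin": "Hoff_English_Mexican_Bilingual"},
]

def option_1(record):
    # Option 1: leave dataset_name alone and keep "dataset" in a new field.
    return {"dataset_name": record["name"],
            "dataset_dataset": record["dataset"]}

def option_2(record, sep=" "):
    # Option 2: fold "name" and "dataset" into dataset_name.
    # The separator is an assumption made for this sketch.
    return {"dataset_name": f"{record['name']}{sep}{record['dataset']}"}

for rec in records:
    print(option_1(rec), option_2(rec))
```

Either way the two records no longer collapse into one. One thing to watch with Option 2: with a space separator, "Marchman Dallas Bilingual" comes to 25 characters, which is longer than the max_length=20 currently set on dataset_name.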
@HenryMehta I think option 2 is good!
@alvinwmtan I've deployed in dev. Please take a look and tell me if it's OK. If so, I'll implement on Friday (I'm away Tuesday to Thursday).
@HenryMehta I can see the R01 data, and the HDPM dataset appears in common_datasets, but the data don't seem to appear in common_administration? Also, I think reloading didn't flush out the old datasets, so all the datasets/data appear twice.
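A quick way to confirm the doubling is to count repeated rows. The sketch below is generic Python with made-up rows standing in for whatever a query on the datasets table returns; the `duplicated` helper and the (name, dataset) row shape are assumptions, not part of the Wordbank code.

```python
from collections import Counter

# Hypothetical rows, as they might look after a reload that kept the old data.
rows = [
    ("Hoff", "HDPM"), ("Hoff", "R01"),
    ("Hoff", "HDPM"), ("Hoff", "R01"),
]

def duplicated(rows):
    """Return every row that occurs more than once, with its count."""
    return {row: n for row, n in Counter(rows).items() if n > 1}

print(duplicated(rows))
```

An empty result would mean the reload replaced the old rows rather than appending to them.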
@alvinwmtan ok - that's a bit of a problem then. It'll take me longer to sort but I'll try and take a look on Friday
also note that I just got an email from Erika Hoff:
The HDPM subject numbers are published in Hoff et al. 2012:
Hoff, E., Core, C., Place, S., Rumiche, R., Señor, M., & Parra, M. (2012). Dual language exposure and early bilingual development. Journal of child language, 39(1), 1-27.
The R01 subject numbers are published in this:
Hoff, E., Quinn, J. M., & Giguere, D. (2018). What explains the correlation between growth in vocabulary and grammar? New evidence from latent change score analyses of simultaneous bilingual development. Developmental science, 21(2), e12536.
@alvinwmtan I've had another go at this. I've had to do it in a new database, so when checking please use wordbank2-dev-2 where you previously used wordbank2-dev. I have resorted to option 1.
If this has worked, can you also check whether we have a similar issue with PoulinDubois? Thanks.
@HenryMehta it works now, thanks! What's the name of the new field that you've added? Is it dataset_dataset?
I don't think there's an issue with Poulin-Dubois—there aren't datasets with the same instrument and dataset_origin.
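The same check can be run over the whole json file instead of by eye. This is a sketch under assumptions: `find_collisions` is a hypothetical helper, and the records below are cut-down copies of the ones quoted earlier in the thread; in practice they would be read from the json definition file.

```python
from collections import defaultdict

def find_collisions(records):
    """Group datasets by (language, form, dataset_origin); keep groups with more than one."""
    groups = defaultdict(list)
    for rec in records:
        key = (rec["instrument_language"], rec["instrument_form"],
               rec["dataset_origin"])
        groups[key].append(rec["dataset"])
    return {key: names for key, names in groups.items() if len(names) > 1}

# Reduced copies of the records quoted earlier in this thread.
records = [
    {"dataset": "HDPM", "instrument_language": "English (American)",
     "instrument_form": "WS",
     "dataset_origin": "Hoff_English_Mexican_Bilingual"},
    {"dataset": "R01", "instrument_language": "English (American)",
     "instrument_form": "WS",
     "dataset_origin": "Hoff_English_Mexican_Bilingual"},
    {"dataset": "Dallas Bilingual", "instrument_language": "Spanish (Mexican)",
     "instrument_form": "WS",
     "dataset_origin": "Marchman Dallas Bilingual"},
]

print(find_collisions(records))
```

Any group it returns is a set of datasets that share an instrument and a dataset_origin, which is the situation that caused the Hoff overwrite.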
Good. The new field is dataset_source. I'll work on getting this to production this evening/tomorrow
@alvinwmtan I think I have the data all deployed to production now. The code changes haven't deployed and I'll look at that tomorrow
@alvinwmtan it is now all deployed
There are two sets of CSVs for the Hoff English–Spanish bilingual dataset (HDPM and R01), but only the R01 dataset appears when pulling data. I'm not sure why this is the case, but if the loading script isn't happy about there being multiple files for the same dataset, it's okay to split the two CSV sets into two datasets (but retain the same dataset_origin_name).