langcog / wordbank

open repository of children's vocabulary data
http://wordbank.stanford.edu
GNU General Public License v2.0

Separate ethnicity and race fields #263

Closed alvinwmtan closed 2 years ago

alvinwmtan commented 2 years ago

Past Wordbank data contained a single ethnicity field with the values: ethnicities = (('A', 'Asian'), ('B', 'Black'), ('H', 'Hispanic'), ('W', 'White'), ('O', 'Other/Mixed'))

We should update this to more current definitions, which have ethnicity = (('H', 'Hispanic'), ('N', 'Non-Hispanic')) and race = (('A', 'Asian'), ('B', 'Black'), ('W', 'White'), ('O', 'Other/Mixed'))

This involves:

This is known to affect the following languages: English (American), English (British), British Sign Language, Spanish (Mexican) (and probably only these languages, since ethnicity was sparse to begin with). Some helper scripts / API scripts may also need to be updated.

Note: this should also resolve #226, and it anticipates incoming Web-CDI data (which has ethnicity and race as separate fields).
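For illustration, the proposed split could be sketched in Python as below. The two tuples are the definitions quoted above; the function name split_legacy_code is hypothetical, and leaving the unknown half of each pair blank is an assumption, since a legacy code such as 'W' says nothing about Hispanic status.

```python
# New definitions, copied from the proposal above.
ETHNICITIES = (("H", "Hispanic"), ("N", "Non-Hispanic"))
RACES = (("A", "Asian"), ("B", "Black"), ("W", "White"), ("O", "Other/Mixed"))


def split_legacy_code(code):
    """Map one legacy combined code to an (ethnicity, race) pair.

    Hypothetical helper: the half that the legacy code does not
    determine is left blank rather than guessed (an assumption).
    """
    code = (code or "").strip().upper()
    if code == "H":
        # The legacy 'H' carries ethnicity only; race is unknown.
        return ("H", "")
    if code in dict(RACES):
        # The remaining legacy codes carry race only; ethnicity is unknown.
        return ("", code)
    # Blank or unrecognised codes stay blank in both fields.
    return ("", "")
```

Whether a non-'H' legacy code should instead imply 'N' for ethnicity is a data decision that this sketch deliberately does not make.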

HenryMehta commented 2 years ago

@alvinwmtan I think the only way to do this is to create a new field, race. Existing datasets that have a mixed ethnicity/race field will then all need changing: the _data.csv file will need changing, with the single column split in two. If the field actually only holds race, then the _values.csv file will need amending so that it uploads to race instead.

OR

Are you saying that the only possible entries are the ones you've given, so H will only ever be Hispanic, will refer to ethnicity, and will never be used for the new race field? I.e. the possible values for ethnicity and race will be mutually exclusive.

alvinwmtan commented 2 years ago

@HenryMehta yes I think we should have a new field race. Existing datasets have a mixed ethnicity/race field, but the Hoff dataset as well as other incoming datasets will have separate ethnicity and race fields.

HenryMehta commented 2 years ago

ok - does that mean we're going back over existing datasets, changing the data files and reloading them?

alvinwmtan commented 2 years ago

yup, that's right. would you be able to work on this? it should be doable programmatically
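A programmatic split of the kind mentioned here might look like the sketch below. It assumes the _data.csv file has a column literally named ethnicity that holds the legacy codes directly; real files may instead hold raw labels that are mapped through _values.csv, in which case the check would need adapting. The function name and the column names are hypothetical.

```python
import csv
import io

# Codes that belong to the new race field (from the proposed definitions).
RACE_CODES = {"A", "B", "W", "O"}


def split_ethnicity_column(csv_text, column="ethnicity", new_column="race"):
    """Return csv_text with race codes moved from `column` to `new_column`.

    Hypothetical helper: assumes `column` holds legacy codes such as
    'A', 'B', 'H', 'W', 'O', which may not match the real files.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames or [])
    if column not in fieldnames:
        # Nothing to split; return the input unchanged.
        return csv_text
    if new_column not in fieldnames:
        # Place the new column directly after the old one.
        fieldnames.insert(fieldnames.index(column) + 1, new_column)

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        code = (row.get(column) or "").strip()
        if code.upper() in RACE_CODES:
            # A race code stored in the ethnicity column moves across.
            row[new_column] = code
            row[column] = ""
        else:
            # 'H' and blanks stay put; the race cell is left empty.
            row.setdefault(new_column, "")
        writer.writerow(row)
    return out.getvalue()
```

Working on text rather than file paths keeps the sketch easy to try on a copy of a dataset before anything is reloaded.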

HenryMehta commented 2 years ago

I would want to do it by changing and reloading the datasets because if we need to rebuild at some point it means we just load the datasets. I'll pick it up

HenryMehta commented 2 years ago

@alvinwmtan I have created the relevant fields and amended the load files and applied it to Henry Test Language. Before I start playing with datasets and splitting race from ethnicity in real data I would like confirmation this looks right at your end.

You can see the input data within the wordbank2 branch. Could you do whatever you do with R to confirm it is appropriate for that? Once confirmed, I'll apply the data changes to the relevant datasets. Thanks

alvinwmtan commented 2 years ago

@HenryMehta looks good from R, thank you!

alvinwmtan commented 2 years ago

@HenryMehta sorry just to double check: have you applied this to all datasets? if so I'll just quickly check some of the other relevant ones (e.g. English (American))

HenryMehta commented 2 years ago

@alvinwmtan This morning I worked through all the studies with ethnicity that I've been able to find and applied the changes to them

alvinwmtan commented 2 years ago

@HenryMehta ethnicity seems not to be implemented for some datasets:

HenryMehta commented 2 years ago

@alvinwmtan I have run them all again but I think it has been implemented. Some of these datasets are just setting ethnicity values to blank so I didn't bother to add a race field to do the same

alvinwmtan commented 2 years ago

@HenryMehta I think I've figured out the issue—some of the values files had the ethnicity and race the wrong way around, which was resulting in the values not being properly populated. The Hoff ones also didn't have the ethnicity/race coding entered so I put them in. Here are the corrected versions:

English (American) WS: EnglishWS_Edgin_values.csv EnglishWS_Marchman_Dallas_Bilingual_values.csv EnglishWS_Marchman_Dallas_values.csv EnglishWS_Marchman_Norming_values.csv

EnglishWS_Hoff_HDPM_data.csv EnglishWS_Hoff_HDPM_fields.csv EnglishWS_Hoff_HDPM_values.csv

EnglishWS_Hoff_R01_data.csv EnglishWS_Hoff_R01_fields.csv EnglishWS_Hoff_R01_values.csv

Spanish (Mexican) WS: SpanishMexicanWS_Marchman_Dallas_data.csv SpanishMexicanWS_Marchman_Dallas_fields.csv SpanishMexicanWS_Marchman_Dallas_values.csv

SpanishMexicanWS_Hoff_HDPM_data.csv SpanishMexicanWS_Hoff_HDPM_fields.csv SpanishMexicanWS_Hoff_HDPM_values.csv

SpanishMexicanWS_Hoff_R01_data.csv SpanishMexicanWS_Hoff_R01_fields.csv SpanishMexicanWS_Hoff_R01_values.csv
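A check for the kind of swap described in this comment could be sketched as follows. The actual layout of the _values.csv files is not shown in this thread, so the input here is simply assumed to be (field, code) pairs that have already been read from a values file; the function name find_swapped is hypothetical.

```python
# Code sets from the proposed ethnicity and race definitions.
ETHNICITY_CODES = {"H", "N"}
RACE_CODES = {"A", "B", "W", "O"}


def find_swapped(rows):
    """Return the (field, code) pairs whose code belongs to the other field.

    `rows` is an iterable of (field, code) pairs; this shape is an
    assumption, not the real _values.csv format.
    """
    swapped = []
    for field, code in rows:
        if field == "ethnicity" and code in RACE_CODES:
            swapped.append((field, code))
        elif field == "race" and code in ETHNICITY_CODES:
            swapped.append((field, code))
    return swapped
```

An empty result would suggest that the two fields are the right way around, at least for the codes listed in the definitions.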

HenryMehta commented 2 years ago

@alvinwmtan I've copied in these datasets and rerun the data load

alvinwmtan commented 2 years ago

@HenryMehta looks good, thanks!