Closed by alvinwmtan 2 years ago
@alvinwmtan I think the only way to do this is to create a new field, race. Existing datasets will then all need changing if they have a mixed ethnicity/race field; that is, the _data.csv file will need changing, with the single column split in 2. If it actually only holds race, then the _values.csv file will need amending to upload to race instead.
OR
are you saying the only possible entries are as you've given, so H will only ever be Hispanic and refer to ethnicity, and will never be used for a new race field? i.e. the possible values for ethnicity and race will be mutually exclusive?
@HenryMehta yes I think we should have a new field race. Existing datasets have a mixed ethnicity/race field, but the Hoff dataset as well as other incoming datasets will have separate ethnicity and race fields.
ok - does that mean we're going back over existing datasets, changing the data files and reloading them?
yup, that's right. would you be able to work on this? it should be doable programmatically
I would want to do it by changing and reloading the datasets because if we need to rebuild at some point it means we just load the datasets. I'll pick it up
@alvinwmtan I have created the relevant fields and amended the load files and applied it to Henry Test Language. Before I start playing with datasets and splitting race from ethnicity in real data I would like confirmation this looks right at your end.
You can see the input data within the wordbank2 branch. Could you do whatever you do with R to confirm it is appropriate for that? Once confirmed, I'll apply data changes to the relevant datasets. Thanks
@HenryMehta looks good from R, thank you!
@HenryMehta sorry just to double check: have you applied this to all datasets? if so I'll just quickly check some of the other relevant ones (e.g. English (American))
@alvinwmtan I have this morning worked through all those studies I've been able to find with ethnicity and I applied the changes to them
@HenryMehta ethnicity seems to be not implemented for some datasets:
@alvinwmtan I have run them all again but I think it has been implemented. Some of these datasets are just setting ethnicity values to blank so I didn't bother to add a race field to do the same
@HenryMehta I think I've figured out the issue—some of the values files had the ethnicity and race the wrong way around, which was resulting in the values not being properly populated. The Hoff ones also didn't have the ethnicity/race coding entered so I put them in. Here are the corrected versions:
English (American) WS: EnglishWS_Edgin_values.csv EnglishWS_Marchman_Dallas_Bilingual_values.csv EnglishWS_Marchman_Dallas_values.csv EnglishWS_Marchman_Norming_values.csv
EnglishWS_Hoff_HDPM_data.csv EnglishWS_Hoff_HDPM_fields.csv EnglishWS_Hoff_HDPM_values.csv
EnglishWS_Hoff_R01_data.csv EnglishWS_Hoff_R01_fields.csv EnglishWS_Hoff_R01_values.csv
Spanish (Mexican) WS: SpanishMexicanWS_Marchman_Dallas_data.csv SpanishMexicanWS_Marchman_Dallas_fields.csv SpanishMexicanWS_Marchman_Dallas_values.csv
SpanishMexicanWS_Hoff_HDPM_data.csv SpanishMexicanWS_Hoff_HDPM_fields.csv SpanishMexicanWS_Hoff_HDPM_values.csv
SpanishMexicanWS_Hoff_R01_data.csv SpanishMexicanWS_Hoff_R01_fields.csv SpanishMexicanWS_Hoff_R01_values.csv
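A swap like the one described above (ethnicity and race the wrong way around) can be caught with a small sanity check on loaded records. The following is only a sketch: the function name `check_record` and the idea of checking one (ethnicity, race) pair at a time are assumptions for illustration, not part of the Wordbank loader, and the layout of the _values.csv files is not assumed here. The code sets are taken from the definitions in the issue description.

```python
# Allowed codes, taken from the definitions in the issue description.
ETHNICITY_CODES = {"H", "N"}
RACE_CODES = {"A", "B", "W", "O"}


def check_record(ethnicity, race):
    """Return a list of problems for one (ethnicity, race) pair.

    Blank values are treated as NA and are always accepted.
    (Hypothetical helper, not part of the Wordbank code base.)
    """
    problems = []
    if ethnicity and ethnicity not in ETHNICITY_CODES:
        hint = " (looks like a race code)" if ethnicity in RACE_CODES else ""
        problems.append(f"ethnicity={ethnicity!r} is not valid{hint}")
    if race and race not in RACE_CODES:
        hint = " (looks like an ethnicity code)" if race in ETHNICITY_CODES else ""
        problems.append(f"race={race!r} is not valid{hint}")
    return problems


if __name__ == "__main__":
    # A swapped pair is flagged in both columns; a correct pair is not.
    for pair in [("H", "W"), ("W", "H"), ("", "")]:
        print(pair, check_record(*pair))
```

Because the two code sets do not overlap, a value landing in the wrong column is always detectable, which is what makes this kind of check cheap to run after each reload.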
@alvinwmtan I've copied in these datasets and rerun the data load
@HenryMehta looks good, thanks!
Past Wordbank data contained one single ethnicity field with values:
ethnicities = (('A', 'Asian'), ('B', 'Black'), ('H', 'Hispanic'), ('W', 'White'), ('O', 'Other/Mixed'))
We should update this to more current definitions, which have
ethnicity = (('H', 'Hispanic'), ('N', 'Non-Hispanic'))
and
race = (('A', 'Asian'), ('B', 'Black'), ('W', 'White'), ('O', 'Other/Mixed'))
This involves:
- Adding race as a new field
- Moving A, B, W, O values into race
- Setting missing race and ethnicity as NA

This is known to affect the following languages: English (American), English (British), British Sign Language, Spanish (Mexican) (and probably only these languages, since ethnicity was sparse to begin with). Some helper scripts / API scripts may also need to be updated.
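The migration steps above could be sketched roughly as follows. This is a minimal illustration, not the script that was actually used: the column name `ethnicity`, the use of a blank cell for NA, and the helper names `split_code` and `split_column` are all assumptions. Note that a legacy row with a race code gets a blank ethnicity rather than N, since Non-Hispanic cannot be inferred from the old field.

```python
import csv
import io

ETHNICITY_CODES = {"H"}              # 'Hispanic' stays in ethnicity
RACE_CODES = {"A", "B", "W", "O"}    # these codes move to race
NA = ""                              # assumption: NA is stored as a blank cell


def split_code(old_code):
    """Map one legacy mixed code to an (ethnicity, race) pair."""
    code = (old_code or "").strip()
    if code in ETHNICITY_CODES:
        return code, NA
    if code in RACE_CODES:
        return NA, code
    return NA, NA  # blank or unrecognised legacy value


def split_column(csv_text, column="ethnicity"):
    """Return CSV text with the mixed column split into ethnicity and race.

    The ethnicity part is written back to `column`; a new `race` column
    is inserted immediately after it.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames)
    if "race" not in fieldnames:
        fieldnames.insert(fieldnames.index(column) + 1, "race")
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[column], row["race"] = split_code(row[column])
        writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    legacy = "id,ethnicity,age\n1,H,18\n2,W,20\n3,,24\n"
    print(split_column(legacy))
```

Rewriting the data files this way, rather than patching the database, matches the preference stated in the thread: a later rebuild only needs to reload the datasets.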
Note: should also resolve #226, and also anticipates incoming Web-CDI data (which have ethnicity and race as separate fields).