Open kachergis opened 2 years ago
I am generally in favor of this.
There are a few complications, though, I think. Some of these are from the Garrison/Bergelson dataset, where the labels are variants used by specific parent-child dyads (e.g., "paci" is a specific variant used by some parent-child dyad). I think it still makes sense to combine them, but just wanted to flag that this is a specific aspect of the design.
There are also a few that are actually errors: "balls", for instance, I think is just due to an import error in attword_processed; see https://github.com/langcog/peekbank-data-import/issues/19
Thanks, Martin! I knew there would be gotchas and Things I Don't Know :)
Also, is "birdie" possibly a badminton birdie? I don't know
I tend to be a lumper rather than a splitter, but I think others might agree with combining at least some of these (maybe just the plurals? and/or the child vs. adult registers?). At least helpful to have a list, I hope.
mutate(english_stimulus_label = case_when(
  english_stimulus_label == "sippy" ~ "sippycup",
  english_stimulus_label == "birdie" ~ "birdy",
  english_stimulus_label == "bird" ~ "birdy",
  english_stimulus_label == "dog" ~ "doggy",
  english_stimulus_label == "balls" ~ "ball",
  english_stimulus_label == "blocks" ~ "block",
  english_stimulus_label == "shoes" ~ "shoe",
  english_stimulus_label == "socks" ~ "sock",
  english_stimulus_label == "cat" ~ "kitty",
  english_stimulus_label == "kittycat" ~ "kitty",
  english_stimulus_label == "waterbottle" ~ "water",
  english_stimulus_label == "paci" ~ "pacifier",
  TRUE ~ english_stimulus_label
))
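If only some of these end up being combined (say, just the plurals), one option might be to keep the same mapping as grouped lookup vectors so that each group can be turned on or off. This is only a rough base-R sketch: the group names (`plural`, `register`, `variant`), the assignment of labels to groups, and the helper `merge_labels()` are all made up for illustration and are not part of peekbank.

```r
# Hypothetical grouping of the proposed recodes. The group names and the
# assignment of each label to a group are assumptions for illustration only.
recodes <- list(
  plural   = c(balls = "ball", blocks = "block", shoes = "shoe", socks = "sock"),
  register = c(bird = "birdy", birdie = "birdy", dog = "doggy",
               cat = "kitty", kittycat = "kitty"),
  variant  = c(sippy = "sippycup", paci = "pacifier", waterbottle = "water")
)

# Replace labels that appear in the selected groups; leave all others as-is.
# `labels` is assumed to be a character vector.
merge_labels <- function(labels, groups = names(recodes)) {
  # unname() drops the group names so unlist() keeps the original label names
  lookup <- unlist(unname(recodes[groups]))
  hit <- labels %in% names(lookup)
  labels[hit] <- unname(lookup[labels[hit]])
  labels
}

# Combine only the plurals
merge_labels(c("balls", "dog", "paci", "apple"), groups = "plural")
```

With the default `groups`, this should give the same result as the `case_when()` above, e.g. inside `mutate(english_stimulus_label = merge_labels(english_stimulus_label))`.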