langcog / wordbank

open repository of children's vocabulary data
http://wordbank.stanford.edu
GNU General Public License v2.0

French WG item issues? #256

Open kachergis opened 2 years ago

kachergis commented 2 years ago

It looks like the French WG may have some duplicated items.

1: aïe (sounds), aïe (body part), and aïe bobo (body part) are all on the WG (items 36, 147, and 581). (Only aïe bobo (body part) appears on the WS.) The VonHolzen WG data has items 36 and 147; the Bergmann WG data has items 36 and 581, but not 147 -- so it seems 147 and 581 (aïe (body part) and aïe bobo (body part)) should be combined.

2: dent (body part), brosse a dent (household), and dent (household) all appear on the WG (items 152, 229, and 234), while only brosse a dent (household) and dent (body part) appear on the WS (211 and 263). The VonHolzen WG data has all three 'dent' items (152, 229, and 234), while the Bergmann WG data has only 152 and 229, so either some forms truly have three 'dent' items, or VonHolzen has a duplicate within the form. Does anybody have a physical copy? I'm still suspicious of item_234 dent (household). Also, the [French_French_WG].csv file has not only itemID and item columns but also an item_id column, which is unusual.

3: item_746 "finir recevoir" is likely meant to be two items, "finir" and "recevoir", which should both be on the French WG (and in another dataset are right next to each other)
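The checks behind points 1 and 2 can be sketched roughly as follows. This is only an illustration: the sample rows and item IDs are the ones quoted above, the `category` column name is an assumption about the form file, and the function names are hypothetical rather than part of any wordbank tooling. Only the items discussed in this issue are included, not the full forms.

```python
import csv
import io
from collections import defaultdict

# Rows rebuilt from the items named in points 1 and 2.
# "itemID" and "item" are columns mentioned above; "category" is an assumed name.
FORM_CSV = """itemID,item,category
36,aïe,sounds
147,aïe,body part
152,dent,body part
229,brosse a dent,household
234,dent,household
581,aïe bobo,body part
"""

# Item IDs reported for each contributed dataset (only the items discussed here).
DATASET_ITEMS = {
    "VonHolzen": {36, 147, 152, 229, 234},
    "Bergmann": {36, 581, 152, 229},
}


def find_repeated_labels(csv_text):
    """Map each label that occurs on more than one row to its (itemID, category) pairs."""
    by_label = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        label = row["item"].strip().lower()
        by_label[label].append((int(row["itemID"]), row["category"]))
    return {label: rows for label, rows in by_label.items() if len(rows) > 1}


def items_not_shared(dataset_items):
    """For each dataset, list the item IDs that are not present in every dataset."""
    shared = set.intersection(*dataset_items.values())
    return {name: sorted(ids - shared) for name, ids in dataset_items.items()}


repeated = find_repeated_labels(FORM_CSV)
unshared = items_not_shared(DATASET_ITEMS)
print(repeated)
print(unshared)
```

A label that repeats across categories (here aïe and dent) is not necessarily an error, so the output is a list of candidates to check against a physical copy of the form.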

kachergis commented 2 years ago

"finir recevoir" is not the only combined item in FrenchFrenchWG_Bergmann_fields.csv -- see also item_747 "donner aller", item_757 "aller bien avec", and possibly more -- maybe a delimiter issue?
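One way to screen for merged rows like these is to flag multi-word labels whose individual words all exist as standalone items elsewhere. The sketch below assumes such a reference list is available; the `reference` set here is a hypothetical stand-in for the item list of another dataset, and `flag_merged_labels` is an illustrative name, not an existing function.

```python
def flag_merged_labels(labels, reference):
    """Return multi-word labels whose every word is a standalone item in `reference`."""
    flagged = []
    for label in labels:
        words = label.split()
        if len(words) > 1 and all(word in reference for word in words):
            flagged.append(label)
    return flagged


# Labels quoted in this thread.
labels = ["finir recevoir", "donner aller", "aller bien avec", "brosse a dent"]

# Hypothetical reference: standalone items assumed to exist in another dataset.
reference = {"finir", "recevoir", "donner", "aller"}

flagged = flag_merged_labels(labels, reference)
print(flagged)
```

Labels that are not flagged, such as "aller bien avec", still need a manual look, since the reference list may be incomplete and some multi-word labels are legitimate single items.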

kachergis commented 2 years ago

After emailing with Cecile Crimon from Sho's lab, Christina Bergmann, and Katie Von Holzen, I think I know all the problems, but unfortunately I can't fix all of them, at least not without Sho's lab going back through Christina's original contribution. Quick summary (will put more on the GH issue):

  1. "finir recevoir" and "donner aller" are neighboring rows in the original questionnaire_vocabulaire_general_WS.xlsx form we received from Christina, and should definitely be 4 items ("finir", "recevoir", "donner", and "aller").
  2. as suggested by the filename, and confirmed by Cecile, the Bergmann data is not WG data, but is WS data...so we need to reimport.
  3. according to Cecile, the WS form (i.e., what is currently labeled the Bergmann WG data) should not have "genouillère" ("grenouillère", i.e., onesie, does exist)
  4. the WG form (VonHolzen data) does indeed have a "dent" (household) item, which Cecile says doesn't make sense (there is no household sense of "dent") -> will just remove this item, since it doesn't match the WS.