langcog / wordbank

open repository of children's vocabulary data
http://wordbank.stanford.edu
GNU General Public License v2.0

WS-type instruments should not allow word items to have value "understands" #279

Closed by mikabr 8 months ago

mikabr commented 2 years ago

The distinction between WG-type and WS-type instruments is whether they allow for both understands and produces values for words or for only produces values. There are currently 7 datasets that are using instruments coded as WS-type, but allow for understands values for words (according to their corresponding _values.csv file), which violates this assumption:

  language           form      name        dataset      
  <chr>              <chr>     <chr>       <chr>        
1 Dutch              FormThree Bergman     "BRC"        
2 Dutch              WS        Bergman     "BRC"        
3 English (American) WS        Armon-Lotem "Armon-Lotem"
4 Hebrew             WS        Shalev      ""           
5 Hebrew             WS        Armon-Lotem "Armon-Lotem"
6 Korean             WS        Yim         ""           
7 Slovak             WS        Kapalkova   ""  
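For anyone wanting to reproduce this kind of check, here is a minimal sketch in Python. It assumes each dataset's allowed word-item values have already been read from its _values.csv into a set; the record layout, the `find_violations` helper, and the "OtherKorean" and "OtherWG" datasets are made up for illustration and are not the actual Wordbank code or data.

```python
# Toy records, one per dataset. "form_type" is the type its instrument is coded as;
# "values" is the set of word-item values assumed to come from its _values.csv.
datasets = [
    {"language": "Slovak", "form": "WS", "form_type": "WS", "name": "Kapalkova",
     "values": {"understands", "produces"}},
    {"language": "Korean", "form": "WS", "form_type": "WS", "name": "Yim",
     "values": {"understands", "produces"}},
    # Hypothetical dataset on the same instrument that only allows "produces".
    {"language": "Korean", "form": "WS", "form_type": "WS", "name": "OtherKorean",
     "values": {"produces"}},
    # Hypothetical WG dataset: "understands" is expected here, so it is not flagged.
    {"language": "Korean", "form": "WG", "form_type": "WG", "name": "OtherWG",
     "values": {"understands", "produces"}},
]


def find_violations(datasets):
    """Return datasets coded as WS-type whose word items allow 'understands'."""
    return [
        d for d in datasets
        if d["form_type"] == "WS" and "understands" in d["values"]
    ]


violations = find_violations(datasets)
for d in violations:
    print(d["language"], d["form"], d["name"])
```

Only the two WS-type datasets that allow understands are flagged; the WG dataset and the produces-only WS dataset pass.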

Out of the 6 instruments that these 7 datasets are using, 4 of them are only used by datasets that allow understands values, so these instruments should just be reclassified as WG-type:

  language form     
  <chr>    <chr>    
1 Dutch    FormThree
2 Dutch    WS       
3 Hebrew   WS       
4 Slovak   WS 

For the two remaining instruments, English (American) WS and Korean WS, some datasets allow for understands values and some don't. This is trickier but probably means that the datasets that do allow for understands values should be split off into separate instruments from the datasets that don't, specifically Armon-Lotem for English (American) WS and Yim for Korean WS.
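The reclassify-versus-split reasoning above can be sketched roughly as follows, again in Python. This is only an illustration under assumptions: the record layout, the `plan_fixes` helper, the action labels, and the "OtherKorean" dataset are hypothetical, and only instruments already coded as WS-type are assumed to be passed in.

```python
from collections import defaultdict

# Toy records for datasets on WS-coded instruments. "values" is assumed to be
# the set of allowed word-item values taken from each dataset's _values.csv.
ws_datasets = [
    {"language": "Slovak", "form": "WS", "name": "Kapalkova",
     "values": {"understands", "produces"}},
    {"language": "Korean", "form": "WS", "name": "Yim",
     "values": {"understands", "produces"}},
    # Hypothetical produces-only dataset sharing the Korean WS instrument.
    {"language": "Korean", "form": "WS", "name": "OtherKorean",
     "values": {"produces"}},
]


def plan_fixes(datasets):
    """Map each (language, form) instrument to a suggested action."""
    by_instrument = defaultdict(list)
    for d in datasets:
        by_instrument[(d["language"], d["form"])].append(d)

    plan = {}
    for instrument, members in by_instrument.items():
        allows = ["understands" in m["values"] for m in members]
        if all(allows):
            # Every dataset includes comprehension: the instrument is WG-like.
            plan[instrument] = "reclassify as WG-type"
        elif any(allows):
            # Mixed usage: move the comprehension datasets to a separate instrument.
            plan[instrument] = "split off datasets with understands"
        else:
            plan[instrument] = "no change"
    return plan


plan = plan_fixes(ws_datasets)
for instrument, action in plan.items():
    print(instrument, "->", action)
```

Grouping by instrument first is what separates the two cases: an instrument used only by comprehension datasets can be relabeled wholesale, while a mixed one needs its datasets separated.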

alvinwmtan commented 2 years ago

It seems like researchers sometimes use forms in non-standard ways, including allowing understands for forms labelled as WS—maybe we should respect the forms' "original intent"? Otherwise we would potentially need to modify form_type in the future if someone contributes a new dataset that includes understands. This would mean that there is some wiggle room in defining what "original intent" is, though, so perhaps it's worth discussing.

mikabr commented 2 years ago

I'm not sure how to think about original intent, but from a data point of view, we need to distinguish between datasets that include comprehension and ones that don't. So datasets that are in theory WS but include comprehension need to be classified as "WG-like".

mikabr commented 2 years ago

Decision from discussion -- the issue only comes up for a few relatively small datasets, so we'll let it be for now and potentially fix it later if it becomes a bigger issue.

mcfrank commented 2 years ago

decision - we will not fix this right now.