langcog / wordbank

open repository of children's vocabulary data
http://wordbank.stanford.edu
GNU General Public License v2.0

incoming Korean data #293

Closed mcfrank closed 7 months ago

mcfrank commented 1 year ago

korean_cdi_ws.csv korean_cdi_wg.csv item_korean_ws.csv item_korean_wg.csv

See below for correspondence from Eon-Suk Ko and team:

message 1

On Fri, Apr 21, 2023 at 1:35 AM Jun Ho CHAI [junhoc94@gmail.com](mailto:junhoc94@gmail.com) wrote: Hello everyone,

We are excited to share the Korean CDI data collected by our lab members at ChosunBabyLab, Chosun University.

Here, I have attached the CDI data ("korean_cdi") for the Words and Gestures (WG) and Words and Sentences (WS) forms, along with the item data ("item_korean"), which summarizes the items used in the WS and WG forms. I have taken great care to clean the data so that it matches the data structure we pulled from Wordbank.

Let me know if you have any suggestions/questions.

Regards, Jun Ho

message 2

On Fri, Apr 21, 2023 at 7:22 AM Eon-Suk Ko [eonsuk@gmail.com](mailto:eonsuk@gmail.com) wrote: Jun Ho, thanks for cleaning up the data.

Wordbank Team, our contribution contains data from 482 Korean children: 222 infants from the WG form, covering 8- to 17-month-olds, and 260 children from the WS form, covering 18- to 36-month-olds. We currently do not have a particular article that we would like to associate with the data set as a whole.

There is a small glitch in the currently submitted data. We will fix it and re-submit it.

Regards, Eon-Suk Ko

message 3

Okay, so the glitch is not really a glitch on our part. The item list we provided is based on the existing Korean entries in Wordbank, and item_642 in the WS list ("낱말조합여부", translated as "combine") is a section header for grammatical morphemes rather than an actual word.

Jun Ho and I think it does no harm to leave it as it is. So, we are happy to say the data we provided is ready to be loaded.

Regards, Eon-Suk

alvinwmtan commented 1 year ago

The data files have some NA issues:

Additionally, I cannot figure out what the encoding of the input files is; it does not appear to be any of: Macintosh, Windows (ANSI), MS-DOS, Unicode (UTF-7, UTF-8, UTF-16), or Korean (EUC, Mac, Windows). It would be great if we could get the files in UTF-8 from the contributors so that the Korean text renders correctly.
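For anyone hitting a similar encoding problem, here is a minimal sketch of one way to probe a file programmatically. It is not the procedure used in this thread: the candidate list, the Hangul check, and the function names are all assumptions made for illustration. It tries a handful of Python codecs in order and accepts the first one that decodes cleanly and yields Hangul syllables. For files like the ones described above, where none of the usual encodings fit, it would simply return `None`.

```python
from typing import Optional, Sequence

# Candidate codecs to try, in order. This list is an assumption, not an
# exhaustive inventory; "utf-16" is tried last because it can decode almost
# any even-length byte string into unrelated characters.
CANDIDATES: Sequence[str] = ("utf-8-sig", "cp949", "euc_kr", "johab", "utf-16")


def contains_hangul(text: str) -> bool:
    """Return True if the text has at least one precomposed Hangul syllable."""
    return any("\uac00" <= ch <= "\ud7a3" for ch in text)


def guess_encoding(raw: bytes, candidates: Sequence[str] = CANDIDATES) -> Optional[str]:
    """Return the first candidate codec that decodes `raw` and yields Hangul."""
    for enc in candidates:
        try:
            text = raw.decode(enc)
        except UnicodeError:
            continue  # this codec cannot decode the bytes at all
        if contains_hangul(text):
            return enc
    return None


def to_utf8(raw: bytes, candidates: Sequence[str] = CANDIDATES) -> bytes:
    """Re-encode `raw` as UTF-8, or raise if no candidate codec fits."""
    enc = guess_encoding(raw, candidates)
    if enc is None:
        raise ValueError("none of the candidate encodings produced Hangul text")
    return raw.decode(enc).encode("utf-8")
```

A typical use would be to read a CSV in binary mode, pass the bytes to `to_utf8`, and write the result back out. The Hangul check is only a heuristic: it guards against a codec that decodes without error but produces the wrong characters, and it assumes the file actually contains Korean text.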

alvinwmtan commented 1 year ago

Processed files: [Korean_WGComp].csv KoreanWGComp_Chosun_data.csv KoreanWGComp_Chosun_fields.csv KoreanWGComp_Chosun_values.csv KoreanWS_Chosun_data.csv KoreanWS_Chosun_fields.csv KoreanWS_Chosun_values.csv

Note the creation of a new form, Korean WGComp, to accommodate the non-standard comprehension-only WG data. (Policy will also be applied to #252.)

HenryMehta commented 10 months ago

@alvinwmtan may I have the contributor and citation for both datasets, please?

HenryMehta commented 10 months ago

@alvinwmtan data loaded and deployed to dev

alvinwmtan commented 10 months ago

Citation: Jung, J., Chai, J., & Ko, E. (2023, December 8). The Interplay of Family Socioeconomic Status, Parental Engagement, and Maternal Employment on Vocabulary Development in Korean Children. https://doi.org/10.31234/osf.io/depvx

Contributor: Eon-Suk Ko, Chosun University

The contributors would also prefer a CC-BY-NC license.

alvinwmtan commented 9 months ago

@HenryMehta looks good to me from R