langcog / wordbank

open repository of children's vocabulary data
http://wordbank.stanford.edu
GNU General Public License v2.0

Finnish data #232

Closed mcfrank closed 6 months ago

mcfrank commented 2 years ago

Dear Mike,

I just sent you Finnish CDI data via Funet FileSender. I sent four SPSS files. Two of those include data collected using the long form versions of the Finnish CDI, and two files include data collected using the short form versions of the Finnish CDI. Data will be available for one week.

The following files include longitudinal data collected using the long form versions of the CDI at 12, 15, 18 and at 24 months of age (N=35):

Finnish CDI Long form WG Longitudinal at 12 and at 15 months of age
Finnish CDI Long form WS Longitudinal at 18 and at 24 months of age

Please cite: Stolt, S., Haataja, L., Lapinleimu, H. & Lehtonen, L. 2008. Early lexical development of Finnish children – a longitudinal study. First Language, 28(3), 259–279. DOI: 10.1177/0142723708091051

The following files include longitudinal data collected using the short form versions of the Finnish CDI at 12, 15, 18, and at 24 months of age (N=82). At 18 months, data has been collected using both versions (Infant and Toddler versions).

Finnish CDI Short form versions Infant version Longitudinal at 9, 12, 15 and at 18 months of age
Finnish CDI Short form versions Toddler version Longitudinal at 18 and at 24 months of age

Please cite:

Stolt, S. & Vehkavuori, S-M. 2018. Sanaseula. Finnish short form versions of the MacArthur Communicative Development Inventories. Jyväskylä: Niilo Mäki Instituutti.

If there is anything you want to ask, or you need clarification, please do not hesitate to contact me.

Thank you once again for your interest - I am happy that there is an opportunity to share Finnish data. Wishing you all the best, Suvi

transfer_145907_files_db8a4049.zip

HenryMehta commented 2 years ago

@mcfrank I have nothing to open a .sav file with. Can you advise what program I should be using?

mcfrank commented 2 years ago

tagging @jflanaga so he is aware of this thread and can contribute.

jflanaga commented 2 years ago

@HenryMehta .sav is the file format for SPSS. If you don't have SPSS, you can also use R (and I'm sure Python has tools as well). I can give you the scripts I wrote to import the data or I can give you the data in a format you'd prefer. Just let me know what you would like. I'm pretty much finished up with the WG and WS long form files, but I haven't had a chance to look at the short forms yet.

vmarchman commented 2 years ago

@jflanaga That's great that you can help out with the Finnish data! I had started to work with it as well, but I'd be happy to have you take over from here! :-)

I'm sure Henry would appreciate the files to be already in csv format, if you can convert and restructure that would be great.

One issue I ran into is that it is not clear that there is comprehension data for the WG forms. That is, there are only responses of 0 vs. 1, which seem to map onto produced. I have emailed the author (Suvi Stolt) and will let you know when I get clarification.

Henry will also need a gloss file. Here's the one that I created for the WG long form. @jflanaga - in the gloss file, I fixed some of the spellings and inconsistencies, so maybe that would be helpful for the other forms.

I'm also attaching how I re-structured the WG data for Henry (long for subject/administration with age column, and wide for items)

Let me know if I can help further. Thanks again for your help with this!

Virginia

jflanaga commented 2 years ago

@vmarchman Yeah, I couldn't quite figure out what's going on there. The first column had options for understands/says. But then the rest were coded as 0 or 1. However, the variable labels (if that's the right term), combine the age (in months), the measure, and the Finnish word: 15kk ymmärtää sammakko (15 kk = 15 months, ymmärtää = understands, sammakko = frog). There are only 4 words that have been coded for only comprehension:

age_group  gloss      age  measure   definition
c          frog       15   ymmärtää  sammakko
c          pushchair  15   ymmärtää  rattaat
c          cake       15   ymmärtää  kakku
c          butter     15   ymmärtää  voi

I thought that maybe when they realized comprehension was only used four times that they just used 0 and 1 for the rest.

I'm a bit surprised that there are only four words where it's just comprehension, but parents are probably reluctant to say that a child understands a word if she doesn't also say it.

Or maybe it should have been sanoo (says) like the others. That might actually be more likely.

Did you say that you attached something? I didn't see it.

vmarchman commented 2 years ago

@jflanaga

For the WG, all items should have a choice of "understands" or "understands and says". There should be, actually, three options, something like doesn't understand or say (which I would say would be 0), 1 = understands, but does not produce, and 2 = says (with understanding). Or something like that. Sometimes people put understands and understands and says in different variables, which it sounds like they've done here and only those 4 items had their "understands" data ported over. My guess is that there is an exporting error here. I will go back to the authors with this question.

The assumption of the CDIs is that a child can be reported to understand a word, without being able to say it, but all words that a child is reported to say, they must have some meaning, hence, "understands and says".

This is different for the WS form for older children. All of the items only have one choice, i.e., "understands and says" or just "says" - so for those forms 0 (doesn't say) and 1 (says) makes sense.

In the meantime, we can work with the WS forms.

I had attached a gloss file and a csv of the WG long forms which shows the format that they should be in - although this is wrong (as discussed above), but shows you the format (kids and admins long, and items wide).

Finnish_WG_itemsglosses.xlsx
FinnishCDILong form_WG_Longitudinal_at12and15months.csv
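A minimal sketch of the three-level WG coding described above, assuming (hypothetically) that "understands" and "says" arrive as two separate 0/1 variables per item. The helper name `combine_wg` and the sample responses are made up for illustration and are not taken from the Finnish files.

```r
# Hypothetical helper: collapse two 0/1 vectors into the 0/1/2 WG coding.
# 0 = neither, 1 = understands only, 2 = understands and says.
combine_wg <- function(understands, says) {
  stopifnot(length(understands) == length(says))
  # A word that is said is assumed to be understood, so "says" takes priority.
  ifelse(says == 1, 2L, ifelse(understands == 1, 1L, 0L))
}

# Made-up responses for four items from one administration.
understands <- c(0, 1, 1, 0)
says        <- c(0, 0, 1, 1)
wg_value <- combine_wg(understands, says)
print(wg_value)
```

If the export only carried the "says" variable, every item would come out as 0 or 2 under this scheme, which would be one way to spot the missing comprehension data.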

jflanaga commented 2 years ago

@vmarchman @HenryMehta I checked the short forms of WG, and they too are missing comprehension data, so I guess it's a general issue. Maybe there were two SPSS files, one for comprehension, the other for production, and it's only the production data that got included here. All the words that are on the WG form look as if they are in the dataset. But we just need to wait for Professor Stolt.

What's the policy about providing English glosses? Do we stick with the original? There are a couple of examples where an alternative might be better. For instance, the Finnish word apina could refer to either "ape" or "monkey." It's given as "ape." sukkahousut is translated as "pantyhose", but my sense is that "pantyhose" is what is worn by adult women, not by kids (I use the word "tights" or "leggings" for what children wear). The files distinguish between "fish" as a food and "fish" as an animal by "fish" and "fish_x". Do we keep that? And what about multiword expressions? Sometimes they are separated with an underscore (rocking_chair) and sometimes written as a single word (rubberboots, bellybutton, etc.).

For the csv files shown above, the data is in wide form with two header rows. Is that the preferred format?

jflanaga commented 2 years ago

By the way, here's the R script I used to import and parse the WS data. The WG data is much the same. I should probably take the time to make this into a function, but I'm not a programmer and violate principles quite a bit.

library(tidyverse)
library(haven)
library(foreign)

# meta-data -------------------------------------------------------------------------------------------------------------
# use the read.spss() function from foreign to get the variable labels

ws_meta <-  read.spss("data/raw/finnish_cdi/Finnish CDI Long form _ WS _ Longitudinal at 18 months and at 24 months of age_corrected.sav", 
                      to.data.frame=TRUE)

# convert the attributes to a dataframe (metadata are row names)
ws_labels <- as.data.frame(attr(ws_meta, "variable.labels"))

ws_labels_tibble <- rownames_to_column(ws_labels, "variable_name") |> 
  # rename the attr
  rename(`age measure definition` = "attr(ws_meta, \"variable.labels\")") |> 
  #convert to tibble
  as_tibble() |>  
  # remove first three rows (no useful data)
  slice(4:n())

ws_labels_tibble <- ws_labels_tibble |> 
  # there's another VAR variable to remove
  filter(variable_name != "VAR00002") |> 
  # split the columns 
  extract(variable_name, into = c("age_group", "gloss"), regex = "(.*?)_(.*)") |> 
  extract(`age measure definition`, into = c("age", "measure", "definition"), regex = "(.*?)\\s(.*?)\\s(.*)") |> 
  extract(definition, into = c("definition", "junk"), regex = "^((?:(?! (\\(|\\/)).)*)") |> 
  select(-junk)

# trim white space
ws_labels_tibble$gloss <- str_trim(ws_labels_tibble$gloss, "both")
ws_labels_tibble$definition <- str_trim(ws_labels_tibble$definition, "both")

# remove quote marks
ws_labels_tibble$definition <- gsub('[\"]', '', ws_labels_tibble$definition)

# data frame containing just the unique gloss-definition pairings (595 words)
ws_labels_tibble2 <- ws_labels_tibble |> 
  select(gloss, definition) |> 
  distinct()

## data -----------------------------------------------------------------------------------------------------------------------
# read_sav() is a bit easier when interested in the data

ws_wide <- read_sav("data/raw/finnish_cdi/Finnish CDI Long form _ WS _ Longitudinal at 18 months and at 24 months of age_corrected.sav")

ws_long <- ws_wide |> 
  select(-starts_with("VAR")) |> 
  pivot_longer(-c(ID, gender), names_to = c("age_group", "gloss"), names_pattern = '(.*?)_(.*)')

## join up with Finnish words (gloss is the only shared column)
full_data <- left_join(ws_long, ws_labels_tibble2, by = "gloss")

# add age variable
full_data <- full_data |> 
  mutate(age = case_when(age_group == "B" ~ 18,
                         TRUE ~ 24))

## remove label from the value variable (messed up when data goes back to wide format)
attr(full_data$value, "label") <- NULL

## to wide again ------------------------------------------------------------------------------------------------------

# variable names are Finnish words, each row is a separate administration
ws_new_wide <- full_data |> 
  select(-gloss, -age_group) |> 
  pivot_wider(names_from = definition, values_from = value) |> 
  arrange(ID)

jflanaga commented 2 years ago

@mcfrank @vmarchman @HenryMehta

I've checked my parsing of the WG with Virginia's and there are only a couple of issues, mostly involving whether to use British English or American English (Virginia's is the first column, mine is the second). The third is just a misspelling, but I left it here in case Virginia wants to check/modify. Is there a preference? It was the British version in the original.

airplane    aeroplane 
vacuum      hoover    
eyeglassess eyeglasses

One of the expressions appears to be mistranslated in the WG data. The Finnish expression is hei hei and the English gloss is "hey." That's not right. It should be "goodbye". Hei by itself could be translated as "hey/hi", but hei hei is "goodbye". It's also translated that way in the WS data (well, technically it's translated as "bye"). It could also be "bye-bye".

When trying to normalize the data between WS and WG, I saw some differences in how the words were translated:

pieni: tiny or small?

puisto: park or garden (this might be an American versus British distinction, but I think "public garden" in British English is a bit more restricted in its use. Puisto is pretty general)

puhdas: tidy or clean

sisällä: is_inside or inside

alla: below or under

tuolla: down_there or out_there (I would add "over there")

ulos: out or go_out

My own preferences would be: puhdas = both "clean" and "tidy" are acceptable; sisällä = "is inside"; alla = both "below" and "under" are acceptable; tuolla = that's a bit hard, but probably "there" or "over there" (Finnish, however, also has siellä, which I guess could be something like "over yonder"); ulos = probably just "out" (depending upon the context it can mean "go out", but I think that usage is restricted).

Both TV and television are used both within and across datasets, so I'm not sure which to use.

As I mentioned above, there's also some variation in terms of whether a word is written as two words (separated by an underscore) or as one. And there are a couple of words for which I'm not sure about the best translation. And how should we handle glosses for genitalia (pippeli and pimppi)? I've never heard of the gloss they provide for pimppi (wea wea?) but maybe it's British?

vmarchman commented 2 years ago

@mcfrank @jflanaga I heard back from Suvi about the WG data and she is checking into it, but didn't have an answer just yet.

vmarchman commented 2 years ago

@mcfrank @kachergis @jflanaga I'm bringing George into the conversation about how to handle the glosses here. I would think that we would want more American rather than British glosses to conform with other datasets, but not sure how important the other variations are. And, yes, "eyeglassess" is a typo.
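One way the British-to-American preference could be applied in code rather than by hand is a small lookup table. This is only a sketch: the two pairs come from the comparison earlier in this thread, while the names `british_to_american` and `normalize_gloss` are hypothetical.

```r
# Illustrative lookup table: British gloss -> American gloss.
# Only the pairs mentioned in this thread are listed.
british_to_american <- c(aeroplane = "airplane", hoover = "vacuum")

# Hypothetical helper: replace any gloss found in the table, keep the rest.
normalize_gloss <- function(gloss) {
  hit <- gloss %in% names(british_to_american)
  gloss[hit] <- british_to_american[gloss[hit]]
  unname(gloss)
}

glosses <- c("aeroplane", "frog", "hoover", "cake")
normalized <- normalize_gloss(glosses)
print(normalized)
```

Keeping the mapping in a table like this would also document each change, which manual edits to the SPSS file do not.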

jflanaga commented 2 years ago

@vmarchman @HenryMehta @kachergis I went through the WS and WG forms in Finnish (the actual parental reports referred to above) and created a spreadsheet with various information. I mostly kept the glosses that were in the data, and I tried to note places where either there were discrepancies in the data (one person used one gloss, whereas someone else used another), add clarifying notes if I thought a gloss was unclear, etc. I'll put here as a spreadsheet for now as it might be easier to look at and make decisions about how to gloss something. I'm using the words as they appeared in the reports, and not necessarily how they are in the variable labels (there are some differences, which I didn't note).

finnish_wg_report_words.xlsx

finnish-ws-metadata.xlsx

alvinwmtan commented 2 years ago

@jflanaga Thank you for working on the glosses! We've actually been in the process of working on glosses and unilemmas (universal concepts that all languages map to) in a separate document here. I've incorporated all your comments (orange cells), which have been very helpful 😄

jflanaga commented 2 years ago

@alvinwmtan Thanks for that. Would you like for me to add some comments on that document? In places, it’s saying what is more common in Finnish. In a way, that’s a bit irrelevant for this, as the words are those that are actually used on the parental reports. (It’s useful for updating the reports though). In others, I think they’re thinking of something else. I use the word sukkahousut practically every day. But they’re not for adults, they are for kids. See here. I’ve never heard another word to describe them.

alvinwmtan commented 2 years ago

@jflanaga Sure, feel free to add comments if you would like to. You're right in noting that the focus should be on the words actually used in the CDI, and we'll definitely do another pass through the whole thing eventually while updating all the forms, so we'll check to make sure that the right information gets included.

jflanaga commented 2 years ago

@vmarchman @HenryMehta Other than waiting to hear about the comprehension data for WG, I was wondering what the next step is or how we’re handling it. I like updating the glosses online. But how are we putting everything together? Are you using the original SPSS files and then using code to make corrections? I think that’s safer than what I did (edit the SPSS itself, although I did rename it). But then I’m not sure where or when the changes get made.

I’m just a bit uncertain where to go from here. Would you like for me to upload the edited SPSS files I used? There were some misspellings and other things there, but manually changing things isn’t good for documentation. Anyways, I’d be happy to help more but I don’t know what would be best for me to do.

HenryMehta commented 2 years ago

@vmarchman I think it best you confirm everything is sorted to ensure we have the 'right' data and then I can help ensure we've got it formatted correctly to load so I'll leave you to answer Joe's comment above

jflanaga commented 2 years ago

@vmarchman @HenryMehta

I compiled a list of the changes that I made to the original SPSS file (I think I caught them all -- I exported the Variable View from the two SPSS files and then did a diff). The changes fell into three general categories. The most common were misspellings of English words. There were also cases where the value labels appeared to be copied from the row above, so the wrong Finnish word was used. There were also cases where the translation is wrong (or at least inconsistent with how it was translated elsewhere in the file). And then there were a few cases where a Finnish word was used instead of English (e.g., e_poika instead of e_boy). I tried to provide an explanation for the issue as well as how I corrected it. There was also one additional change I made because I couldn't figure out a regular expression to get what I wanted. It just involved a change that wouldn't affect the Finnish word. I didn't include that example in the spreadsheet.

With the exception of changing TV to television, these changes don't reflect anything like the choice of British or American English, or making sure that the same glosses for the same words are used in both WS and WG. I assume that those could be handled with the online spreadsheet that @alvinwmtan linked to above.

There's one other issue concerning the sex of one of the participants. In the article linked in the OP, it says that there are 18 boys, 17 girls in the dataset. However, the two SPSS files differ in this way: the long form WG reports 18 boys and 17 girls, the long form WS 17 boys and 18 girls. The discrepancy is with ID 29. Given the numbers in the paper, I'm inclined to believe that WG is correct (0/boy), but that's just assuming that the number in the article is correct. (It's possible it was taken from something other than the SPSS files, but if it was based on the first SPSS file they looked at, then we wouldn't know whether WG is correct or WS is correct).

I've also included the output from SPSS for the two files

Diffs: finnish-spss-ws-original.csv

finnish-spss-ws-corrected.csv

Corrections:

corrections-finnish-ws.xlsx

jflanaga commented 2 years ago

@HenryMehta I was wondering if eventually you wanted the data in a format such as that attached. The data is a wide format (which is how I understood you'd want it). Here's an abbreviated view:

ID `B_ai ai` `B_bää-ä bää-ä` `B_hau hau` `B_kvaak kvaak`   B_murrr   B_ohhoh 
 1 0 [ei]          1 [kyllä]   1 [kyllä]       1 [kyllä] 0 [ei]    1 [kyllä]
 2 1 [kyllä]       1 [kyllä]   1 [kyllä]       1 [kyllä] 1 [kyllä] 0 [ei]   
 3 0 [ei]          1 [kyllä]   1 [kyllä]       0 [ei]    1 [kyllä] 1 [kyllä]
 4 0 [ei]          0 [ei]      0 [ei]          0 [ei]    0 [ei]    0 [ei]   
 5 0 [ei]          1 [kyllä]   1 [kyllä]       0 [ei]    0 [ei]    0 [ei]   
 6 1 [kyllä]       1 [kyllä]   1 [kyllä]       1 [kyllä] 0 [ei]    0 [ei]   
 7 0 [ei]          1 [kyllä]   1 [kyllä]       0 [ei]    1 [kyllä] 0 [ei]   
 8 1 [kyllä]       0 [ei]      1 [kyllä]       1 [kyllä] 0 [ei]    1 [kyllä]
 9 1 [kyllä]       0 [ei]      1 [kyllä]       1 [kyllä] 0 [ei]    1 [kyllä]
10 0 [ei]          0 [ei]      1 [kyllä]       0 [ei]    0 [ei]    0 [ei]  

This is the view from RStudio, so the brackets won't be there. It's just a CSV file. But it gives you an idea of the format.

I'm as confident right now as I can be that this is correct. I tried to match up the totals with those in the published article. They are in the right ballpark, but they don't match up exactly. I'd have to spend more time seeing what exactly they did in the article to see if I can figure out why. But I compared the row sums on the data where the English words are the variables (that's more or less the original form of the dataset) with the final version here, and those numbers are the same. Still, it's always possible that I messed something up.

It might be that you'd prefer having separate files for each of the age groups. I can do that as well, but I'm sure that's something you can handle as well.

finnish_ws_wide_data.csv

HenryMehta commented 2 years ago

@jflanaga Am I right in thinking we've got multiple administrations for the same child on a single line? If so, I really need each administration on a separate line but using the same child id. I also need the child's age and sex (if available) on that line.

Something like:

study_id,sex,month,brr-brr,ga-ga,grr,jao,ku-ku,kukuriku,kva-kva,mijau,muu,njam-njam,tu-tu,vau-vau,buba,guska,jare,koka,konj,koza,krava,lane,lav,leptir,lisica,macka,magarac,majmun,medo,mis,ovca,patka,pas,pcela,pijetao,pile,prase,ptica,purica,riba,slon,sova,tele,tigar,vjeverica,vrabac,vuk,zeko,zaba,zivotinja,auto,autobus,avion,bicikl,brod,kamion,motor,tramvaj,vlak,balon,igracka,knjiga,kocka,lopatica,lopta,lutka,olovka,banana,bombon,caj,cokolada,cokolino,grasak,griz,hrana,jabuka,jaje,jogurt,juha,kakao,kasica,keks,kolac,kruh,krumpir,meso,mlijeko,mrkva,naranca,piletina,riba_2,sir,sladoled,sok,spinat,tijesto,voda,cipele,carape,cizme,gacice,gumb,haljina,hlace,jakna,kapa,kaput,majica,papuce,pelena,pidzama,rukavice,suknja,sal,vesta/dzemper,zatvarac,celo,glava,guza,jezik,koljeno,kosa,lice,noga,nokti,nos,obrazi,oko,palac,prst,pupak,ruka,trbuh,uho,usta,zub,dnevna_soba,fotelja,garaza,hladnjak,kada,kauc,kolijevka,krevet,krevetic,kuhinja,kupaonica,ladica,ogradica(vrtic_za_igranje),ormar,pecnica,prozor,stepenice,stol,stolica,stednjak,televizor,tuta,umivaonik,vrata,biljka,boca,bocica_(s_dudom),casa,cekic,cesalj,cetka,cetkica_(za_zube),deka,jastuk,kljucevi,kutija,lijek,metla,naocale,novci,novcanik,noz,papir,radio,rucnik,sapun,sat,slika,smece,svjetiljka,svjetlo,salica,skare,tanjur,telefon,usisavac,vilica,zdjelica,zvono,zlica,bazen,crkva,cvijece,dom,drvo,dvoriste,kamen,kisa,kuca,livada,lopata,ljuljacka,mjesec,nebo,park,plaza,posao,proslava,snijeg,sunce,skola,trgovina,van(i)_2,voda_2,vrt,zooloski_(vrt),zvijezda,baka,beba,brat,dijete,djeca,djecak,djed,djevojcica,doktor,ime_osobe_koja_cuva_dijete,kum_kuma,ljudi,mama,sestra,stric,stricek,svoje_ime,tata,teta,ujak,cekaj,da,dobar_dan,dorucak,hocu,hvala,ku-ku_2,kupanje,laku_noc,molim,ne,nemoj,pa-pa,ringe-raja,rucak,spavanje,psst-tiho,vecera,bok_bog,baciti,bjezati,brisati,crtati,cistiti,citati,dati,dirati,dobiti,donijeti,gledati,gristi,gurati,hodati,ici,igrati_se,jahati,jesti,kupiti,ljuljati_se,otvoriti,pasti,paziti,pisati,piti,pjevati,plakati,
plesati,plivati,poljubiti,pomoci,prati,prskati,puhati,razbiti,reci,skociti,smijati_se,spavati,stati,staviti,skakljati,sutnuti,trcati,trgati,tuci,udariti_se,uzeti,vidjeti,voljeti,voziti,vuci,zagrliti,zatvoriti,zuriti,dan,danas,jutro,noc,poslije,sad(a),sutra,veceras,bolestan,boli,brz(o),crvena,cist(o),divno,dobro,fin(o),gladan,gotovo,hladno,lijep(o),krasno,mal(o),mekan(o),mokro,mrak,njezno,plava,polako,pospan(o),prazno,prljav(o),ruzno,spava,sretan,star(o),strasno,suh(o),tesko,toplo,umoran,velik(o),vruce,zima,zlocest(o),zedan,ja,meni,moj,njegov,njezin,ono,ovo,tebi,ti,to,tvoj,gdje,kad(a),kako,sto,tko,zasto,dolje,gore,ispod,ispred,iza,na,tamo,tu,u,unutra,van(i),drugi,jos,malo,nema,nista,puno,sve,vise
1,2,13,2,2,1,2,2,1,2,1,2,1,2,2,2,1,0,1,1,0,1,0,0,0,0,2,0,1,2,1,0,2,1,0,1,2,1,2,0,2,1,0,0,0,0,1,1,2,0,0,2,1,1,1,0,0,0,1,0,1,1,1,1,0,2,2,2,2,2,2,2,1,0,1,1,2,2,1,1,1,0,2,2,1,1,1,1,1,1,1,1,1,0,2,0,0,2,1,1,0,1,1,0,0,2,2,0,1,1,1,1,1,1,1,0,1,0,1,2,1,0,1,0,2,1,2,0,2,0,1,2,2,1,2,2,1,0,0,0,1,1,2,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,2,0,2,1,2,2,1,0,1,1,1,2,1,2,2,1,1,1,1,0,2,1,2,1,0,2,2,1,1,1,1,1,1,1,0,1,1,1,2,0,1,1,1,1,0,0,1,1,0,0,1,0,1,0,0,1,0,1,1,0,1,1,2,0,1,0,2,2,0,1,1,1,2,1,1,0,1,1,2,0,0,1,1,2,2,2,1,2,0,0,2,1,2,1,0,1,2,1,2,0,0,2,1,0,1,2,1,1,0,1,0,2,1,1,1,1,1,1,1,2,1,1,1,0,1,2,2,2,0,1,1,1,1,0,1,0,1,0,1,1,1,1,1,2,1,1,1,0,1,0,0,1,1,1,1,1,0,1,1,0,1,0,0,1,0,0,0,0,0,1,1,0,1,0,1,1,1,1,1,1,0,1,0,1,1,0,0,1,0,1,1,0,1,0,0,0,0,1,1,0,1,1,1,1,1,2,1,2,0,0,2,2,1,2,2,2,2,1,1,2,0,0,1,1,1,0,1,1,1,1,1,1,2,1,2,1,2,1,2,1,1
2,1,16,2,2,0,1,1,1,1,1,2,2,2,2,0,0,0,1,1,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,1,0,1,0,0,1,0,0,1,0,1,1,1,1,1,1,1,0,1,1,1,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,1,0,0,1,1,0,0,0,0,0,0,0,1,1,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,1,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,2,0,0,0,1,0,2,2,0,0,2,2,2,0,1,0,0,0,1,1,0,1,0,0,1,1,1,1,1,1,1,0,0,1,1,0,1,1,1,1,1,0,1,1,1,0,0,0,1,0,1,0,1,0,1,1,1,1,1,0,1,0,1,0,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

jflanaga commented 2 years ago

@HenryMehta @vmarchman Ok, something like this then (and it will just be 1s and 0s in the csv: this is just the view in RStudio so it's easier to see the data).

ID    gender   age   `ai ai` `bää-ä bää-ä` `hau hau` `kvaak kvaak`   murrr
   <dbl> <dbl+lbl> <dbl> <dbl+lbl>     <dbl+lbl> <dbl+lbl>     <dbl+lbl> <dbl+l>
 1  1 [girl]    18 0 [ei]        1 [kyllä] 1 [kyllä]     1 [kyllä] 0 [ei] 
 1  1 [girl]    24 0 [ei]        1 [kyllä] 1 [kyllä]     1 [kyllä] 0 [ei] 
 2  1 [girl]    18 1 [kyllä]     1 [kyllä] 1 [kyllä]     1 [kyllä] 1 [kyl…
 2  1 [girl]    24 1 [kyllä]     1 [kyllä] 1 [kyllä]     1 [kyllä] 1 [kyl

There are just two things I'm not sure how you (maybe Virginia here) want to handle. First, the files say 18 and 24 (and the article says 1;6 and 2;0), but there isn't a DOB. The book says that in such cases you went with the age that is reported in the data. I assume we'll do that here as well? There's also the issue about ID 29. In WS, it's coded as "girl" but in WG, it's "boy".

HenryMehta commented 2 years ago

If we have age in months at time of administration, we don't need DOB so we'll use that as you say.

With sex, when we have a discrepancy, the code will create 2 child accounts, one male and one female, even when the id is the same.
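Before loading, discrepancies like the one for ID 29 could be listed by joining the two forms on the child ID. The sketch below uses made-up rows standing in for the WG and WS long-form files; the column names `ID` and `gender` follow the tables shown in this thread, and 0 = boy follows the WG coding mentioned above.

```r
# Made-up ID/sex tables standing in for the WG and WS long-form files
# (0 = boy, 1 = girl).
wg <- data.frame(ID = c(28, 29, 30), gender = c(1, 0, 0))
ws <- data.frame(ID = c(28, 29, 30), gender = c(1, 1, 0))

# Join on ID and keep only the children whose sex differs between forms.
both <- merge(wg, ws, by = "ID", suffixes = c("_wg", "_ws"))
mismatch <- both[both$gender_wg != both$gender_ws, ]
print(mismatch)
```

Each row of `mismatch` is a child who would end up with two accounts under the loading rule described above.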

jflanaga commented 2 years ago

Here it is then. I suppose it can get corrected later if we get updated information.

finnish_ws_wide_data.csv

HenryMehta commented 2 years ago

ok - so based on the files above, I believe this should be the file containing the words and their English glosses (I've not added any uni_lemmas)

[finnish_ws].csv

HenryMehta commented 2 years ago

I think this should be the field file

finish_ws_suvi_fields.csv

HenryMehta commented 2 years ago

And I think this is the values file

FinishWS_Suvi_values.csv

Would anyone like to review before I try loading them?

jflanaga commented 2 years ago

@HenryMehta The finnish_ws.csv file might have some issues, depending upon what you want. I pulled the words directly from the parental reports (rather than from the data we have). The only place where it might be an issue is with the "sounds/animal sounds" category. There's some Finnish clarifying text in them (e.g., drnn drnn (auton ääni), where auton ääni means the sound of a car). I don't know if you want to keep the Finnish. Also, the category variable is repeated twice (once lower-case, once upper). And I left the categories in Finnish (again, they are taken directly from the form) because I didn't know whether the default is to use the original categories in the original language, whether they should be translated, or whether there is a standardized set of categories.

jflanaga commented 2 years ago

I also didn't add the Finnish clarifying text in the data I uploaded. So in the finnish_ws_wide_data.csv, it's just drnn drnn, not drnn drnn (auton ääni). And I think that @alvinwmtan is handling the glosses as well. Based on the online spreadsheet, it sounds as if there's a native Finnish speaker checking the translations, so I'd go with what they decide.

HenryMehta commented 2 years ago

I need the field column in the _field.csv file to match with the item column in [finish_ws].csv to match with column title in _data.csv.

If they don't match, the upload will not work
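The matching requirement above can be checked before an upload attempt with set comparisons. This is a rough sketch with made-up names standing in for the three files' columns; the non-word column names (`ID`, `gender`, `age`) are an assumption based on the data view shown earlier in this thread.

```r
# Hypothetical stand-ins for the key columns of the three files.
field_names <- c("ai ai", "hau hau", "drnn drnn (auton ääni)")  # field column
item_names  <- c("ai ai", "hau hau", "drnn drnn (auton ääni)")  # item column
data_cols   <- c("ID", "gender", "age", "ai ai", "hau hau", "drnn drnn")

# Columns in the data file that are not word items (assumed names).
id_cols   <- c("ID", "gender", "age")
word_cols <- setdiff(data_cols, id_cols)

# Names present in one file but missing from another.
missing_from_data  <- setdiff(field_names, word_cols)
missing_from_field <- setdiff(word_cols, field_names)
items_match_fields <- setequal(field_names, item_names)

print(missing_from_data)
print(missing_from_field)
```

In this made-up case the clarifying text on the sound-effect item is what breaks the match, which mirrors the issue raised about drnn drnn.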

jflanaga commented 2 years ago

Yeah, I'm sure they need to match. I didn't think of the spreadsheet with the words/categories/etc. as being finished yet (there's still Finnish in there). I can make whatever changes people want (e.g., add the Finnish clarifying text to the data), but I'm not sure whether that is wanted. I probably should have clarified that the one spreadsheet with the metadata wasn't entirely ready yet.

HenryMehta commented 2 years ago

ok

jflanaga commented 2 years ago

@HenryMehta The values for the sex/gender category aren't correct in the file you sent. In this data, 0 = male, 1 = female. The value you have for a word is NA for 0. In the SPSS file, 0 is coded as "ei" (no) and 1 as "kyllä" (yes). The variable label is about "produces", so maybe 0 should be "does not produce"? Or was there a general decision to treat them as NAs (I guess it doesn't really matter with this kind of data, since there's no ambiguity between a child not producing/saying a word versus data on that word not being available, unlike with, say, covid cases, where there's a crucial distinction between 0 and NA).

jflanaga commented 2 years ago

@HenryMehta @vmarchman @alvinwmtan

I'm not sure if you need/want this, but here's a spreadsheet containing the categories that are found in the Finnish WS, the categories from the 1993 WS (based upon Fenson et al., Variability in Early Communicative Development) and the categories that are currently in wordbank for American English. The Finnish version of the WS seems like it was an adaptation of the 1993 WS. The category names are very similar, except for some differences between English and Finnish (the Finnish category is just Quantifiers, not Quantifiers and Articles) and some terminology based upon tradition (the Finnish form uses the term "particle" for "connecting words"). The only category I don't know how to classify is LUONTO JA LÄHIYMPÄRISTÖ, which can be roughly translated as Nature and Surroundings. I think that's two separate categories in the English forms, so I didn't put anything for it.

finnish-ws-categories-correspondences-translations.xlsx

vmarchman commented 2 years ago

Not sure where we are at, but yes, the scores for "produces" should be 1 = yes and 0 = no. The score should be the sum of all "yes" responses.
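As a sketch of that scoring rule, the production score per administration is a row sum over the word columns. The data frame below is made up; its layout (one row per administration, one 0/1 column per Finnish word) follows the wide format discussed in this thread, and how missing responses should be treated is an assumption.

```r
# Made-up wide data: one row per administration, one 0/1 column per word.
ws <- data.frame(
  ID  = c(1, 1, 2),
  age = c(18, 24, 18),
  `ai ai`   = c(0, 1, 1),
  `hau hau` = c(1, 1, 0),
  murrr     = c(0, 1, NA),
  check.names = FALSE
)

word_cols <- setdiff(names(ws), c("ID", "age"))
# Count "yes" (1) responses per administration.
# Assumption: NA responses are ignored in the sum.
ws$production <- rowSums(ws[word_cols] == 1, na.rm = TRUE)
print(ws[c("ID", "age", "production")])
```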

jflanaga commented 2 years ago

@vmarchman @HenryMehta @alvinwmtan The weekend is about to start here so I'll just wrap some things on my end. Here is what I think we still have to decide.

  1. The original data had English words as variable names and Finnish as the variable labels. I reshaped the data so that Finnish words are now the variables with possible values of 0 or 1. Each observation is a separate administration, so there are two rows per child (one at 15 months, the other at 24). I also corrected some mistakes with the variable labels (some Finnish words were associated with the wrong English one). The English glosses are now separate from the data (see the point about glosses and unilemmas below).
  2. Some of the words in the Sound Effects category have clarifying text on the actual parental report (e.g., drnn drnn (auton ääni)). I don't think the SPSS file had it, but if it did, I didn't include the clarifying text; I only put the word in the data files. So I guess the question is whether the clarifying text should be there or not. If so, I need to add it. If not, it needs to be removed from the file that contains the word, gloss, category, and unilemma information.
  3. I'm not sure what the status of the glosses and unilemmas is. I think that @alvinwmtan is handling it. There were several different issues here, but they looked like they were being addressed.
  4. The values for the sex/gender category should be 0 = male, 1 = female. Right now, they are the other way around.
  5. Right now, the value for a particular word variable is either 0 or 1. 1 corresponds to produces, but 0 is currently NA. This might be a general principle with the database. As far as I can tell, the options for a word are produces, "", and NA (this is when you use get_instrument_data() from wordbankr). I don't know what the difference is between "" and NA.
  6. I don't know what we want to do about the categories. I put the original Finnish categories and a couple of alternatives on the spreadsheet finnish-ws-categories-correspondences-translations.xlsx above. I imagine that this could be handled with the glosses and unilemma, as it involves Finnish-English translations.
  7. We're still waiting on whether comprehension data is available for Words and Gestures (both the short and the long forms)
  8. One of the participants (29) is coded as "male" in Words and Gestures but as "female" in Words and Sentences. I don't know whether we should ask which is correct, leave the child coded as both, or remove the child.
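
For the sex coding points above, a check along these lines could flag children whose coding differs between files. This is only a sketch: the `inconsistent_sex` helper and the sample records are hypothetical, and it assumes the 0 = male, 1 = female coding described in this thread.

```python
from collections import defaultdict

# Coding as described in this thread: 0 = male, 1 = female.
SEX_LABELS = {0: "male", 1: "female"}


def inconsistent_sex(records):
    """Return {child_id: sorted labels} for children coded with more than one sex."""
    seen = defaultdict(set)
    for child_id, sex_code in records:
        seen[child_id].add(SEX_LABELS[sex_code])
    return {cid: sorted(labels) for cid, labels in seen.items() if len(labels) > 1}


# Hypothetical (child_id, sex_code) pairs pooled from the WG and WS files.
records = [(28, 1), (28, 1), (29, 0), (29, 1), (30, 0)]
print(inconsistent_sex(records))
```

Running the check on the pooled files before import would surface cases like participant 29 without anyone having to edit the data by hand.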

I know it is not good practice to manually edit files, but I was either too lazy or not good enough to make all the corrections in code. So I cleaned up the file. Here is the file I worked with, after making the corrections. A spreadsheet identifying the corrections is in the zip file, as is the R script used to parse the data.

finnish-cdi-ws-long-form-corrected.zip

alvinwmtan commented 2 years ago

@HenryMehta @jflanaga Thanks for working on this; apologies for not replying sooner as I was away for the weekend.

It might be helpful for me to provide some higher-level context for @jflanaga. Data handling for Wordbank can largely be divided into two halves: 1. form specification and 2. data import.

  1. For new forms (e.g. Finnish), we draw up a form specification which contains things like the item, gloss, unilemma, category, and so on (i.e. form-level information). This helps us to know which items are included in the form, and what they mean. This is where the Google Spreadsheet comes in, and where things like correcting misspellings, mistranslations etc. actually matter.
  2. For the actual importing of the data itself, we are mostly concerned with putting all the data (i.e. administration- and child-level information) in a format that is consistent and readable. Internally, this will be joined with the form specification, so the actual labels here don't matter too much as long as they can be matched with the form specification.
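
To make the split between the two halves concrete, here is a minimal sketch of joining imported data rows to a form specification. It is not the actual Wordbank import code: the `join_with_spec` helper, the `itemID` values, and the sample rows are all hypothetical.

```python
def join_with_spec(data_rows, form_spec):
    """Attach form-level fields to each data row, matching on itemID."""
    spec_by_id = {entry["itemID"]: entry for entry in form_spec}
    joined, unmatched = [], []
    for row in data_rows:
        entry = spec_by_id.get(row["itemID"])
        if entry is None:
            unmatched.append(row["itemID"])  # label has no counterpart in the spec
        else:
            joined.append({**row, **entry})
    return joined, unmatched


# Hypothetical form specification (form-level information).
form_spec = [
    {"itemID": "item_1", "item": "koira", "gloss": "dog"},
    {"itemID": "item_2", "item": "kissa", "gloss": "cat"},
]
# Hypothetical imported data (administration- and child-level information).
data_rows = [
    {"child_id": 1, "itemID": "item_1", "value": "produces"},
    {"child_id": 1, "itemID": "item_3", "value": ""},
]
joined, unmatched = join_with_spec(data_rows, form_spec)
print(joined)
print(unmatched)
```

The `unmatched` list is the useful part here: the labels in the data file only need to be matchable, so any label that fails to match is worth reporting before import.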

Additionally, "" and NA are different in that "" means the child doesn't say/understand a given word, whereas NA means that we don't have data for this cell (e.g. ambiguous marking on the form, or missing parts of the form).
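
A small sketch of that distinction, assuming the raw file uses 1, 0, and a missing cell (the `to_wordbank_value` helper is hypothetical, and Python's `None` stands in for NA):

```python
def to_wordbank_value(raw):
    """Map a raw 0/1 code to "produces", "" (does not produce), or None (NA)."""
    if raw is None:
        return None  # no data for this cell
    if raw == 1:
        return "produces"
    if raw == 0:
        return ""  # child does not say the word
    raise ValueError(f"unexpected raw code: {raw!r}")


print([to_wordbank_value(v) for v in (1, 0, None)])
```

Raising on anything other than 0, 1, or a missing cell is a deliberate choice in this sketch, so that unexpected codes are noticed rather than silently turned into NA.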

alvinwmtan commented 2 years ago

On specific issues with Finnish, child 29 should be male to match with the demographic information in the provided paper. I've changed this in the data file, but perhaps we should double-check with the source to make sure it's correct (@vmarchman)?

Otherwise, I've tidied the WS long and short forms, and reordered the variables according to the Google Spreadsheet order. The files are here: [FinnishWS].csv [FinnishWSShort].csv FinnishWS_Stolt_data.csv FinnishWS_Stolt_fields.csv FinnishWS_Stolt_values.csv FinnishWSShort_Stolt_data.csv FinnishWSShort_Stolt_fields.csv FinnishWSShort_Stolt_values.csv

I've also tidied the WG data, but we should wait to have more clarity on the comprehension/production issue. It's perhaps also worth noting that the four items which contain ymmärtää (understands) in their variable label only occur for the 15mo data, whereas the parallel item for 12mo instead contains sanoo (says), and there is no parallel item in 15mo that contains sanoo. Note also that the original paper suggests that both receptive and expressive vocabularies were collected for these children.

HenryMehta commented 2 years ago

@alvinwmtan The definition files don't contain a choices column, which I need, although I note that only produces is given in the values files. So I'm going to proceed on the assumption that all data are produces or doesn't produce.

HenryMehta commented 2 years ago

@alvinwmtan @jflanaga This is now deployed to the Wordbank2 environment

Joe, this is an upgrade that is underway so will not show in Wordbank.stanford.edu just yet

alvinwmtan commented 2 years ago

@HenryMehta Yup, "produces;doesn't produce" is correct. Thank you!

vmarchman commented 2 years ago

Ok - I don't know how much this is going to set us back, but Suvi Stolt has confirmed these data are production only (comprehension data are available for the WG forms, but they are not in these files). To be honest, I'm not sure how much more tweaking she has done with the files, so I think it's best to start over with these new files.

@alvinwmtan after Monday, can you do your conversion to get these in the right format, i.e., 1 participant per instrument per line (rather than multiple administrations per line)?
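
A minimal sketch of that conversion, assuming the wide files name their columns `<item>_<age>` (that naming pattern, the `wide_to_long` helper, and the sample row are assumptions for illustration, not the layout of the actual files):

```python
def wide_to_long(wide_row, ages, id_key="child_id"):
    """Split one wide row into one row per administration.

    Assumes item columns are named "<item>_<age>", e.g. "koira_18".
    """
    long_rows = []
    for age in ages:
        suffix = f"_{age}"
        items = {
            key[: -len(suffix)]: value
            for key, value in wide_row.items()
            if key.endswith(suffix)
        }
        long_rows.append({id_key: wide_row[id_key], "age": age, **items})
    return long_rows


# Hypothetical wide row: one child, two administrations (18 and 24 months).
wide = {"child_id": 1, "koira_18": 0, "kissa_18": 1, "koira_24": 1, "kissa_24": 1}
for long_row in wide_to_long(wide, ages=[18, 24]):
    print(long_row)
```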

Finnish Long form versions of the CDI WS at 18 months and at 24 months of age_ Expressive words.csv Finnish Long form versions of the CDI WG at 12 and at 15 months of age _ Expressive words.csv Finnish Short form versions of the CDI Toddler version at 18 and at 24 months _ Expressive words.csv Finnnish Short form versions of the CDI Infant version at 9, 12, 15, 18 months _ Expressive words.csv

alvinwmtan commented 2 years ago

@HenryMehta Hi Henry, some updates on Finnish:

  1. Please remove the old Finnish data; the updated data files are here: FinnishWS_Stolt_data.csv FinnishWS_Stolt_fields.csv FinnishWS_Stolt_values.csv

  2. Please also remove the FinnishWSShort instrument entirely; we have decided not to put up short forms for now and will return to them at another point.

  3. I'll put the FinnishWG, FinnishWGShort, and FinnishWSShort into a new issue, which we'll leave until after Wordbank 2

HenryMehta commented 2 years ago

@alvinwmtan

  1. Loaded
  2. I've deleted everything except the table instrument_finnish_ws_short, although I have deleted the data in it. I've left the table itself because removing it would cause Django issues, since we've had other database changes since it was created.
  3. ok

alvinwmtan commented 2 years ago

FinnishWS_Stolt_fields.csv

@HenryMehta apologies—I made a mistake in the fields file, which caused an error when importing the item "clean". This one should be correct.

HenryMehta commented 2 years ago

FinnishWS_Stolt_fields.csv @alvinwmtan I had to make some changes, as per the attached file, but this is now deployed.

alvinwmtan commented 2 years ago

@HenryMehta looks good from R

vmarchman commented 1 year ago

@HenryMehta @mcfrank I can convert these SPSS files to CSV.

Hmmm. I'm looking at these files now. The longitudinal data are all in the same data set, but repeat administrations are stored in wide format rather than as separate cases in long format. I think you want them in a slightly different format, so it will take me a bit to convert them, but I'll do it asap.

We will also need confirmation of how sex is coded. I'll email Suvi.



alvinwmtan commented 1 year ago

child_id is incorrectly entered in this dataset; needs a relook

alvinwmtan commented 1 year ago

[FinnishWS].csv Also corrected the category prepositions -> locations

HenryMehta commented 9 months ago

@alvinwmtan I'm not sure what I'm meant to be doing on this file. The FinnishWS file above doesn't have an item column in it. These are the normal headers: itemID,category,item,definition,choices,type,gloss,uni_lemma
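
A quick check like the following could confirm which expected columns a fields file lacks before reloading. It is only a sketch: the `missing_columns` helper and the sample file contents are hypothetical, and the expected header list is the one quoted above.

```python
import csv
import io

# The normal headers quoted above.
EXPECTED_HEADERS = [
    "itemID", "category", "item", "definition",
    "choices", "type", "gloss", "uni_lemma",
]


def missing_columns(csv_text):
    """Return the expected headers that are absent from the CSV's first row."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return [name for name in EXPECTED_HEADERS if name not in header]


# Hypothetical file contents with the item column left out.
sample = (
    "itemID,category,definition,choices,type,gloss,uni_lemma\n"
    "item_1,animals,koira,produces,word,dog,dog\n"
)
print(missing_columns(sample))
```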

And is that all I'm meant to do? Save the file and reload the Finnish WS data?