langcog / wordbank

open repository of children's vocabulary data
http://wordbank.stanford.edu
GNU General Public License v2.0

Instrument data but no administrative data #235

Closed jflanaga closed 2 years ago

jflanaga commented 2 years ago

Sorry, I'm probably doing something wrong, but I couldn't find this in the documentation. For a particular search, I'm getting instrument data that has no corresponding administrative data (for at least the word_endings_nouns type of the American English Words and Sentences form). Here's a reproducible example:

library(wordbankr)
library(tidyverse)

items_eng_ws <- get_item_data(language = "English (American)", form = "WS")

word_endings_ids <- items_eng_ws %>%
  filter(type == "word_endings_nouns") %>%
  pull(item_id)

word_endings <- get_instrument_data(language = "English (American)", 
                                    form = "WS",
                                    items = all_of(word_endings_ids), 
                                    administrations = TRUE, 
                                    iteminfo = TRUE)

word_endings2 <- get_instrument_data(language = "English (American)", 
                                     form = "WS", 
                                     items = all_of(word_endings_ids),
                                     iteminfo = TRUE)

missing_admin_data <- anti_join(word_endings2, word_endings, by = "data_id")

Just to make sure that the data weren't there, I searched for one of the missing data_id values in the full administrative data:

admins <- get_administration_data()
admins %>% filter(data_id == "131896")

I got an empty tibble.

Is this expected behaviour? I didn't see anything in the documentation about administrative data being missing in this manner.

mcfrank commented 2 years ago

Hi Joseph, thanks for the clearly documented explanation. That is not expected behavior; it looks to me like some data are being orphaned from their admin data on import.

I checked the rows in the raw data vs. rows in the database for the Eng WS datasets and surprisingly a number of them are different, including Marchman Dallas, Marchman Wisconsin, and both Smith datasets. @HenryMehta it looks like some admin data are getting cut off of their records, do you think you could look into why? @mikabr maybe you remember something about this?

jflanaga commented 2 years ago

Hi, sorry, I thought I would check to see if there was some kind of update on this. I took a look at some of the raw data in the repo for the book (data/psychometrics/eng_ws_raw_data.Rds). I then got data using wordbankr in two different ways (both with the admin data and without it):

df <- get_instrument_data(language = "English (American)", 
                          form = "WS",
                          administrations = TRUE, 
                          iteminfo = TRUE)

df2 <- get_instrument_data(language = "English (American)", 
                          form = "WS",
                          iteminfo = TRUE)

df2 has 4,659,262 observations, whereas both df and eng_ws (from the .Rds) have 4,399,440, a difference of 259,822. (There appear to be 326 missing administrations in the smaller datasets.)
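As a quick sanity check on those counts, the arithmetic can be sketched in base R. The numbers are the ones reported above; the variable names are only illustrative. The difference divides evenly by the 326 missing administrations, which would be consistent with whole administrations being dropped (each contributing the same number of item rows) rather than scattered rows, though that reading is an inference and not something verified against the database.

```r
# Row counts reported in the comment above.
rows_without_admins <- 4659262  # df2 (administrations not requested)
rows_with_admins    <- 4399440  # df and eng_ws
missing_admins      <- 326      # administrations absent from the smaller data

# Rows that disappear when administration data is requested.
missing_rows <- rows_without_admins - rows_with_admins

# If whole administrations are dropped, this should be a whole number.
rows_per_admin <- missing_rows / missing_admins

print(missing_rows)
print(rows_per_admin)
```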

The data_id values in eng_ws aren't found in any of the datasets I got from using wordbankr. However, I selected the columns found in both eng_ws and the df dataframe and found the two were identical. So I imagine that the raw data used for the book (or at least the data in the data/psychometrics directory) were obtained with the administrations = TRUE option of get_instrument_data(). I thought I would let you know about the latter issue.

mcfrank commented 2 years ago

Hi Joseph, sorry - some of the blocker here is that we are redoing the database and there are some revisions around dataset names so this has been up in the air.

I believe you have identified a bug where there is no administration data getting created for some datasets. @HenryMehta could you look into this? If you search the ID that Joseph put in above, you should be able to find a missing source...

jflanaga commented 2 years ago

No worries. I was just curious about the status, and I wanted to let you know that if you were working with W&S from that .Rds in current work, you’d be missing data. If need be, I could give the IDs of all missing admins.

mcfrank commented 2 years ago

One hypothesis is that this is about filtering out administrations outside the instrument's appropriate age range: maybe one call does that but the other doesn't?

jflanaga commented 2 years ago

Yeah, I suspect that is it. I haven't had a chance yet to do much, but I was at least able to locate the missing data referenced above with the following:

admins <- get_administration_data(filter_age = FALSE)
admins |> 
  filter(data_id == "131896")

The age in 131896 is 31, whereas the maximum age with the default filter_age = TRUE is 30.

The issue is that when you specify the option to get administrative data in get_instrument_data(), there's no way to override the default filter_age = TRUE in get_administration_data(). It's not a major issue, as you should be able to do a join to get the needed information once you know what's going on. I should have looked at the internals more closely.
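The mechanism and the join workaround described above can be sketched on toy data. This is a minimal illustration in base R so it runs without the database: the data frames and their values are invented, max_age = 30 is taken from the thread, and treating the administrations = TRUE path as an inner join against age-filtered admin data is the hypothesis discussed here, not something confirmed against the package source.

```r
# Toy stand-in for get_instrument_data() output (invented values):
# one row per administration x item.
instrument <- data.frame(
  data_id = c(1, 1, 2, 2, 3, 3),
  item_id = rep(c("item_1", "item_2"), times = 3),
  value   = c("produces", "", "produces", "produces", "", "produces"),
  stringsAsFactors = FALSE
)

# Toy stand-in for get_administration_data(filter_age = FALSE):
# one row per administration.
admins_all <- data.frame(data_id = c(1, 2, 3), age = c(18, 30, 31))

# Assumed age ceiling for the form (30, per the thread).
max_age <- 30
admins_filtered <- admins_all[admins_all$age <= max_age, ]

# Inner join against the filtered admins: rows for data_id 3 are lost.
joined_filtered <- merge(instrument, admins_filtered, by = "data_id")

# Workaround: left join against the unfiltered admins keeps every row.
joined_all <- merge(instrument, admins_all, by = "data_id", all.x = TRUE)

# Administrations present in the instrument data but lost by the filter.
orphaned_ids <- setdiff(instrument$data_id, joined_filtered$data_id)

print(nrow(joined_filtered))
print(nrow(joined_all))
print(orphaned_ids)
```

With the real data, the equivalent would presumably be to call get_instrument_data() without administrations = TRUE and join its result by data_id to get_administration_data(filter_age = FALSE).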