Closed jflanaga closed 2 years ago
Hi Joseph, thanks for the clearly documented explanation. That is not expected behavior; it looks to me like some data are being orphaned from their admin data on import.
I checked the rows in the raw data vs. rows in the database for the Eng WS datasets and surprisingly a number of them are different, including Marchman Dallas, Marchman Wisconsin, and both Smith datasets. @HenryMehta it looks like some admin data are getting cut off of their records, do you think you could look into why? @mikabr maybe you remember something about this?
On Thu, Dec 16, 2021 at 1:34 AM Joseph Flanagan @.***> wrote:
Sorry, I'm probably doing something wrong, but I couldn't find it in the documentation. For a particular search, I'm getting instrument data that doesn't contain corresponding administrative data (for at least the word_endings_nouns type of the American English Words and Sentences form). Here's a reproducible example:
library(wordbankr)
library(tidyverse)
items_eng_ws <- get_item_data(language = "English (American)", form = "WS")
word_endings_ids <- items_eng_ws %>%
  filter(type == "word_endings_nouns") %>%
  pull(item_id)
word_endings <- get_instrument_data(language = "English (American)",
                                    form = "WS",
                                    items = all_of(word_endings_ids),
                                    administrations = TRUE,
                                    iteminfo = TRUE)
word_endings2 <- get_instrument_data(language = "English (American)",
                                     form = "WS",
                                     items = all_of(word_endings_ids),
                                     iteminfo = TRUE)
missing_admin_data <- anti_join(word_endings2, word_endings, by = "data_id")
Just to make sure that the data wasn't there, I searched for one of the missing data_id values in the full administrative data:
admins <- get_administration_data()
admins %>% filter(data_id == "131896")
I got an empty tibble.
Is this expected behaviour? I didn't see anything in the documentation about administrative data missing in this manner.
Hi,
Sorry, I thought I would check whether there was any update on this. I took a look at some of the raw data in the repo for the book (data/psychometrics/eng_ws_raw_data.Rds). I then got data using wordbankr in two different ways (with and without the admin data):
df <- get_instrument_data(language = "English (American)",
                          form = "WS",
                          administrations = TRUE,
                          iteminfo = TRUE)
df2 <- get_instrument_data(language = "English (American)",
                           form = "WS",
                           iteminfo = TRUE)
df2 has 4,659,262 observations, whereas both df and eng_ws (from the .Rds) have 4,399,440, a difference of 259,822. (There appear to be 326 administrations missing from the smaller datasets.)
The data_ids in eng_ws aren't found in any of the datasets I got from wordbankr. However, I selected the columns shared by eng_ws and df and found the two were identical. So I imagine that the raw data used for the book (or at least the data in the data/psychometrics directory) was retrieved with the administrations = TRUE option of get_instrument_data(). I thought I would let you know about the latter issue.
Hi Joseph, sorry - some of the blocker here is that we are redoing the database and there are some revisions around dataset names so this has been up in the air.
I believe you have identified a bug where there is no administration data getting created for some datasets. @HenryMehta could you look into this? If you search the ID that Joseph put in above, you should be able to find a missing source...
No worries. I was just curious about the status, and I wanted to let you know that if you were working with W&S from that .Rds in current work, you’d be missing data. If need be, I could give the IDs of all missing admins.
one hypothesis is that this is about filtering administrations outside of the instrument's appropriate age range, and one call does that but the other doesn't?
Yeah, I suspect that is it. I haven't had a chance yet to do much, but I was at least able to locate the missing data referenced above with the following:
admins <- get_administration_data(filter_age = FALSE)
admins |>
filter(data_id == "131896")
The age in 131896 is 31, whereas the maximum age with the default filter_age = TRUE is 30.
The issue is that when you request administrative data through get_instrument_data(), there's no option to override the default filter_age = TRUE in get_administration_data(). It's not a major issue, as you should be able to do a join to get the needed information once you know what's going on. I should have looked at the internals more closely.
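For anyone who runs into the same thing, here is a rough sketch of the join I had in mind. It assumes the wordbankr version used in this thread, where get_administration_data() accepts filter_age and both results share a data_id column. The helper name attach_all_admins is made up for illustration and is not part of wordbankr.

```r
# Hypothetical helper (not part of wordbankr): fetch instrument data without
# the built-in administration join, then attach administration data that has
# NOT been filtered to the instrument's age range.
attach_all_admins <- function(language, form, items = NULL) {
  # Instrument data only; leaving `administrations` at its default avoids the
  # internal call that drops out-of-range administrations.
  instrument <- wordbankr::get_instrument_data(language = language,
                                               form = form,
                                               items = items)

  # Administration data with the age filter switched off, as in the call above.
  admins <- wordbankr::get_administration_data(language = language,
                                               form = form,
                                               filter_age = FALSE)

  # Keep every instrument row and add administration columns where they exist.
  # Assumption: data_id is the only column the two results share.
  dplyr::left_join(instrument, admins, by = "data_id")
}

# Example call (needs a connection to the Wordbank database):
# word_endings_all <- attach_all_admins("English (American)", "WS",
#                                       items = word_endings_ids)
```

If the join works as intended, rows such as data_id 131896 (age 31) should come back with their administration columns filled in rather than being dropped.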