Status: Open. bbolker opened this issue 2 years ago.
Good question!
I likely got the data from https://www.udacity.com/api/nodes/4576183932/supplemental_media/chopstick-effectivenesscsv/download or some clone of that. I don't currently have access to that Elsevier site either to check.
The source for that course appears to be here: https://github.com/udacity/data-analyst/tree/master/projects/chopstick_length and the file was committed by @ShengKungYi.
The paper reiterates in multiple places that efficiency is "quantity of peanuts picked" and even (helpfully!?) notes in the footnotes of Table 1 "**The greater the score, the better."
My first guess was that they actually ran 2 or 3 trials and averaged them but forgot to mention it. They did repeat the food-pinching force component 3 times. However, multiplying the values by any reasonable number of trials does not create anything close to integers.
On the other hand, multiplying the mean data from Table 1 by the number of subjects (31) does create near-integer data.
```r
library(dplyr)
d <- read.csv("https://raw.githubusercontent.com/seananderson/glmm-course/master/data/raw/chopstick-effectiveness.csv")
group_by(d, Chopstick.Length) %>%
  summarise(mean = mean(Food.Pinching.Efficiency)) %>%
  mutate(total = mean * 31) %>%
  as.data.frame()
#>   Chopstick.Length     mean  total
#> 1              180 24.93516 772.99
#> 2              210 25.48387 790.00
#> 3              240 26.32290 816.01
#> 4              270 24.32387 754.04
#> 5              300 24.96806 774.01
#> 6              330 23.99968 743.99
```
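To make the multiplier comparison concrete, here is a minimal sketch that works only from the six rounded means printed above, so it does not need the CSV. The helper `max_dist_to_integer` is a hypothetical name introduced for illustration, and the set of candidate trial counts (2 to 5) is an assumption about what "reasonable" means here.

```r
# Means by chopstick length, copied from the summary printed above
# (rounded to 5 decimals, so a little rounding error is expected).
means <- c(`180` = 24.93516, `210` = 25.48387, `240` = 26.32290,
           `270` = 24.32387, `300` = 24.96806, `330` = 23.99968)

# Largest distance from the nearest integer after multiplying by k.
# (Hypothetical helper, not from any package.)
max_dist_to_integer <- function(x, k) {
  scaled <- x * k
  max(abs(scaled - round(scaled)))
}

# Compare a few assumed trial counts with the number of subjects (31).
multipliers <- c(2, 3, 4, 5, 31)
dist_by_multiplier <- sapply(multipliers,
                             function(k) max_dist_to_integer(means, k))
names(dist_by_multiplier) <- multipliers
round(dist_by_multiplier, 3)
```

With these rounded means, multiplying by 31 stays within roughly 0.04 of an integer for every length, while each of the smaller multipliers misses by about 0.29 or more for at least one length. The same helper could be applied to `d$Food.Pinching.Efficiency` to check the individual values.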
That makes me wonder if maybe the data in the .csv file were simulated for the purpose of an example at some point. If so, it's moderately impressive that they match the Table 1 means exactly, but I guess with enough patience or code...
The data do look a bit cleaner and less right-skewed than I would have expected:
```r
library(dplyr)
library(ggplot2)
d <- read.csv("https://raw.githubusercontent.com/seananderson/glmm-course/master/data/raw/chopstick-effectiveness.csv")
ggplot(d, aes(Food.Pinching.Efficiency, 1)) +
  geom_point(position = position_jitter(width = 0, height = 0.1)) +
  facet_wrap(vars(Chopstick.Length)) +
  ylim(0.8, 1.2)
```
But an ANOVA matches the paper's reported results too:
```r
d$Chopstick.Length <- as.factor(d$Chopstick.Length)
d$Individual <- as.factor(d$Individual)
m <- aov(Food.Pinching.Efficiency ~ Chopstick.Length + Error(Individual), data = d)
summary(m)
#>
#> Error: Individual
#>           Df Sum Sq Mean Sq F value Pr(>F)
#> Residuals 30   2278   75.92
#>
#> Error: Within
#>                   Df Sum Sq Mean Sq F value   Pr(>F)
#> Chopstick.Length   5  106.9  21.372   5.051 0.000262 ***
#> Residuals        150  634.6   4.231
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
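As a quick sanity check on that table, the F value and p-value can be recomputed by hand from the printed mean squares and degrees of freedom. This is only a sketch using the rounded numbers shown in the output above, so agreement is expected to about three significant digits; the variable names are made up for illustration.

```r
# Design implied by the data: 6 chopstick lengths, 31 individuals.
n_lengths  <- 6
n_subjects <- 31

# Degrees of freedom for a one-way repeated-measures layout.
df_length   <- n_lengths - 1                       # 5
df_subjects <- n_subjects - 1                      # 30
df_residual <- (n_lengths - 1) * (n_subjects - 1)  # 150

# Mean squares copied from the "Error: Within" stratum printed above
# (already rounded by summary()).
ms_length   <- 21.372
ms_residual <- 4.231

# F is the ratio of mean squares; the p-value is the upper tail of F(5, 150).
f_value <- ms_length / ms_residual
p_value <- pf(f_value, df_length, df_residual, lower.tail = FALSE)

c(F = f_value, p = p_value)
```

The degrees of freedom line up with 186 rows (6 x 31), and the recomputed F should be close to the 5.051 reported by `summary(m)`.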
Maybe they didn't actually enforce an exact 60 second limit and then rescaled the counts to be per 60 seconds? That's about all I'm left with. But then why would multiplying the means by 31 result in integers?
Should we post an issue at the udacity repo? Or try to contact https://github.com/ShengKungYi (https://www.linkedin.com/in/sheng-kung-yi)?
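To illustrate why the rescaling idea is hard to square with integer-like totals, here is a purely hypothetical simulation. None of the numbers below come from the paper or the CSV: the Poisson mean of 25, the 55 to 65 second trial lengths, and the helper `is_whole` are all assumptions chosen only to show the mechanism.

```r
# Hypothetical simulation only: nothing here is taken from the paper
# or from the chopstick CSV file.
set.seed(1)
n_subjects <- 31
counts  <- rpois(n_subjects, lambda = 25)          # integer counts per trial
seconds <- runif(n_subjects, min = 55, max = 65)   # assumed real trial lengths
per_60s <- counts / seconds * 60                   # rescaled to a 60 s basis

# TRUE where a value is (numerically) a whole number.
is_whole <- function(x, tol = 1e-8) abs(x - round(x)) < tol

mean(is_whole(counts))    # share of whole numbers before rescaling
mean(is_whole(per_60s))   # share of whole numbers after rescaling
sum(per_60s)              # analogue of mean * 31 in the summary above
```

Under these assumptions the rescaled values are generally not whole numbers, and there is no obvious reason for their sum to land near an integer either, which is the part that stays puzzling.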
Hi! Well, this is a blast from the past. Unfortunately, I have no knowledge of how the data was collected or how the assignment was constructed. If I remember correctly, I was simply centralizing materials for the course content onto GitHub; I had no part in the assignment's creation. I also don't have any recollection of who would know more about the dataset and assignment. Sorry that I'm a bit of a dead end on this!
I'm wondering what the chopstick data really represent. The original reference defines food-pinching efficiency as:
This would suggest count data to me?? Table 5 in the original paper gives mean values by chopstick length that are consistent with the data here, but the values given in this data set are continuous rather than count data. I don't have access to the paper on the Elsevier web site, so I can't see if there is supplemental material there (although I kind of doubt it in a paper from 1991?). The data sets themselves seem to be floating around in a variety of places:
credits https://bmdatablog.files.wordpress.com/2016/04/chopsticks.pdf , which refers to https://www.udacity.com/api/nodes/4576183932/supplemental_media/chopstick-effectivenesscsv/download; I think these are the same data referred to here: https://towardsdatascience.com/chopstick-length-analysis-2c4c7e9b6136
(The jrModelling data set only has 2 of the 4 chopstick lengths.)
There are only 171 unique values out of 186 rows, so I guess it's possible these data are rescaled versions of integer counts?