seananderson / glmm-course

Workshop exercises on regression, GLMs, mixed-effects models, and GLMMs in R

Provenance of chopstick data #19

Open bbolker opened 2 years ago

bbolker commented 2 years ago

I'm wondering what the chopstick data really represent. The original reference defines food-pinching efficiency as:

Food-pinching efficiency. The subject would sit on an adjustable seat, and was required to pick up peanuts from a dish (150 mm diameter) in front of the subject (450 mm) to a cup (200 mm high and 70 mm diameter) under the mouth for 1 min. During pinching, the experimenter counted the numbers of peanuts in the cup. The reason for using peanuts was that it was difficult to pick them up and hence was more representative as a measure of the effect of the length of the chopsticks on food pinching efficiency. Fig. 3 demonstrates the workplace layout and task of the food-pinching.

That suggests count data to me. Table 5 in the original paper gives mean values by chopstick length that are consistent with the data here, but the values in this data set are continuous rather than counts. I don't have access to the paper on the Elsevier web site, so I can't see whether there is supplemental material there (although I kind of doubt it for a paper from 1991). The data sets themselves seem to be floating around in a variety of places:

remotes::install_github("jr-packages/jrModelling")
?jrModelling::chopsticks

The help page credits https://bmdatablog.files.wordpress.com/2016/04/chopsticks.pdf, which refers to https://www.udacity.com/api/nodes/4576183932/supplemental_media/chopstick-effectivenesscsv/download; I think these are the same data referred to here: https://towardsdatascience.com/chopstick-length-analysis-2c4c7e9b6136

(The jrModelling data set only has 2 of the 6 chopstick lengths.)

There are only 171 unique values out of 186 rows, so I guess it's possible these data are rescaled versions of integer counts?
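One rough way to probe the rescaling idea is to tally how many distinct values a column has, how many of them are whole numbers, and how many decimal places are needed to represent them. The sketch below is only that: `summarise_values()` is a hypothetical helper name, and the toy vector is made up purely to show the shape of the output.

```r
# Hypothetical helper: summarise how "count-like" a numeric vector looks.
summarise_values <- function(x, max_digits = 6) {
  # Smallest number of decimal places that reproduces every value
  decimals <- NA_integer_
  for (dgt in 0:max_digits) {
    if (all(abs(round(x, dgt) - x) < 1e-9)) {
      decimals <- dgt
      break
    }
  }
  list(
    n         = length(x),                        # number of rows
    n_unique  = length(unique(x)),                # distinct values
    n_integer = sum(abs(x - round(x)) < 1e-9),    # values that are whole numbers
    decimals  = decimals
  )
}

# Toy values (made up, NOT the chopstick data)
res <- summarise_values(c(24.5, 25.25, 25.25, 26))
str(res)
```

Applied to `d$Food.Pinching.Efficiency` from the course CSV (column name as used elsewhere in this thread), `n_unique` should reproduce the 171-of-186 figure, assuming the file has not changed.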

seananderson commented 2 years ago

Good question!

I likely got the data from https://www.udacity.com/api/nodes/4576183932/supplemental_media/chopstick-effectivenesscsv/download or some clone of that. I don't currently have access to that Elsevier site either to check.

The source for that course appears to be here: https://github.com/udacity/data-analyst/tree/master/projects/chopstick_length and the file was committed by @ShengKungYi

The paper reiterates in multiple places that efficiency is "quantity of peanuts picked" and even (helpfully!?) notes in the footnotes of Table 1 "**The greater the score, the better."

My first guess was that they actually ran 2 or 3 trials and averaged them but forgot to mention it. They did repeat the food-pinching force component 3 times. However, multiplying the values by any reasonable number of trials does not create anything close to integers.
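For reference, the multiplication check can be sketched as below. This is a minimal illustration rather than the exact code I ran: `max_integer_gap()` is a hypothetical helper, and the toy vector is made up to show what averaged counts would look like if the guess were right.

```r
# Hypothetical helper: for each candidate number of trials k, report the
# largest distance between k * x and its nearest integer.
max_integer_gap <- function(x, k_max = 10) {
  k <- seq_len(k_max)
  gap <- vapply(k, function(ki) max(abs(ki * x - round(ki * x))), numeric(1))
  data.frame(k = k, max_gap = gap)
}

# Toy illustration (made-up values, NOT the chopstick data): averages of
# three integer counts become integers again when multiplied by 3.
toy <- c(70, 71, 74, 80) / 3
res <- max_integer_gap(toy, k_max = 4)
print(res)
```

For the toy values, `max_gap` drops to essentially zero at `k = 3` and stays near 1/3 elsewhere. Running the same helper on `d$Food.Pinching.Efficiency` would show whether any small `k` behaves that way for the real file.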

On the other hand, multiplying the mean data from Table 1 by the number of subjects (31) does create near-integer data.

library(dplyr)
# Read the course copy of the data
d <- read.csv("https://raw.githubusercontent.com/seananderson/glmm-course/master/data/raw/chopstick-effectiveness.csv")
# Mean efficiency per chopstick length, then scaled by the 31 subjects
group_by(d, Chopstick.Length) %>%
  summarise(mean = mean(Food.Pinching.Efficiency)) %>%
  mutate(total = mean * 31) %>%
  as.data.frame()
#>   Chopstick.Length     mean  total
#> 1              180 24.93516 772.99
#> 2              210 25.48387 790.00
#> 3              240 26.32290 816.01
#> 4              270 24.32387 754.04
#> 5              300 24.96806 774.01
#> 6              330 23.99968 743.99

That makes me wonder if maybe the .csv file was data simulated for the purpose of an example at some point. If so, it's moderately impressive that it matches the Table 1 means exactly, but I guess with enough patience or code...

The data do look a bit cleaner and less right-skewed than I would have expected:

library(dplyr)
library(ggplot2)
d <- read.csv("https://raw.githubusercontent.com/seananderson/glmm-course/master/data/raw/chopstick-effectiveness.csv")
ggplot(d, aes(Food.Pinching.Efficiency, 1)) + 
  geom_point(position = position_jitter(width = 0, height = 0.1)) +
  facet_wrap(vars(Chopstick.Length)) +
  ylim(0.8, 1.2)

But an ANOVA matches the paper's reported results too:

d$Chopstick.Length <- as.factor(d$Chopstick.Length)
d$Individual <- as.factor(d$Individual)
m <- aov(Food.Pinching.Efficiency ~ Chopstick.Length + Error(Individual), data = d)
summary(m)
#> 
#> Error: Individual
#>           Df Sum Sq Mean Sq F value Pr(>F)
#> Residuals 30   2278   75.92               
#> 
#> Error: Within
#>                   Df Sum Sq Mean Sq F value   Pr(>F)    
#> Chopstick.Length   5  106.9  21.372   5.051 0.000262 ***
#> Residuals        150  634.6   4.231                     
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Maybe they didn't actually enforce an exact 60-second limit and then rescaled the counts to be per 60 seconds? That's about all I'm left with. But then why would multiplying the means by 31 result in integers?
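To make the rescaling idea concrete, here is a toy calculation. Everything in it is an assumption for illustration: the counts and durations are made up (nothing in the paper reports trial durations), and `rescale_to_minute()` is a hypothetical helper name.

```r
# Hypothetical illustration of the "rescaled to 60 seconds" idea.
# Counts and durations below are made up, NOT taken from the paper.
rescale_to_minute <- function(count, seconds) count / seconds * 60

count   <- c(24L, 25L, 26L)   # integer peanuts picked
seconds <- c(58, 60, 63)      # assumed actual trial durations
rate    <- rescale_to_minute(count, seconds)
print(round(rate, 2))
```

Integer counts turn into non-integer rates whenever the duration differs from 60 seconds, which is at least consistent with continuous-looking values. It does not explain the near-integer totals, though.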

bbolker commented 2 years ago

Should we post an issue at the udacity repo? Or try to contact https://github.com/ShengKungYi (https://www.linkedin.com/in/sheng-kung-yi)?

ShengKungYi commented 2 years ago

Hi! Well, this is a blast from the past. Unfortunately, I have no knowledge of how the data was collected or how the assignment was constructed. If I remember correctly, I was simply centralizing materials for the course content onto GitHub; I had no part in the assignment's creation. I also don't have any recollection of who would know more about the dataset and assignment. Sorry that I'm a bit of a dead end on this!