manybabies / mb1-analysis-public

ManyBabies1 analysis code for public sharing
MIT License

Contrast coding not explicitly given #4

Closed palday closed 5 years ago

palday commented 5 years ago

I could not find a specification of the contrast coding anywhere in the R code or the accompanying text. For most factorial designs in psycholinguistics, effect/sum/deviation coding would be more appropriate than the default treatment coding. But in either case, you need to specify the contrast coding in the manuscript because the coefficients are not interpretable without it. Also, you should explicitly convert categorical variables to factors and not just leave them as characters because this can have weird knock-on effects elsewhere.
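
A minimal base R sketch of the factor point, assuming hypothetical toy data (the column names `subid` and `cond` and all values are made up for illustration and are not taken from the MB1 dataset):

```r
# Hypothetical toy data; names and values are illustrative only.
d <- data.frame(
  subid = c(101, 102, 103, 104),
  cond = c("x", "y", "x", "y"),
  stringsAsFactors = FALSE
)

# Left numeric, an ID enters the design matrix as a single slope column.
n_cols_numeric <- ncol(model.matrix(~ subid, data = d))

# As a factor, it gets one indicator column per non-reference level.
n_cols_factor <- ncol(model.matrix(~ factor(subid), data = d))

# Contrasts cannot be assigned to a character vector, so this attempt errors.
chr_attempt <- try(contrasts(d$cond) <- contr.sum(2), silent = TRUE)
failed_on_character <- inherits(chr_attempt, "try-error")

# After an explicit conversion the same assignment works.
d$cond <- factor(d$cond)
contrasts(d$cond) <- contr.sum(2)
```

With four distinct IDs, the numeric version yields two design-matrix columns (intercept plus slope) while the factor version yields four, which is one way a numeric-looking ID can silently change what the model estimates.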

The default contr.sum in R has an opaque naming scheme, but the car package has an alternative with a more meaningful naming scheme, contr.Sum.
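
The naming issue can be seen in base R alone. The sketch below assumes a hypothetical three-level factor `f`; car is deliberately not loaded, so it only shows what the built-in contrast functions do:

```r
f <- factor(c("a", "b", "c"))  # hypothetical three-level factor

# Treatment contrasts label each column with the level it compares to the baseline.
trt_names <- colnames(contr.treatment(levels(f)))

# Sum contrasts carry no column names, even when given the level labels.
sum_names <- colnames(contr.sum(levels(f)))

# As a result, the model terms fall back to numeric suffixes (f1, f2).
term_names <- colnames(model.matrix(~ f, contrasts.arg = list(f = "contr.sum")))
```

Here `trt_names` holds the level labels `"b"` and `"c"`, whereas `sum_names` is `NULL` and the sum-coded terms come out as `f1` and `f2`, so the output alone does not say which level each coefficient refers to.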

Here's a brief example from my own tinkering:

library(car)
# the sort-decreasing step makes contr.Sum/contr.sum more consistent with other 
# contrasts in R and contrasts in Julia
contrasts(d_lmer$trial_type) <- contr.Sum(sort(levels(d_lmer$trial_type),decreasing = TRUE))
contrasts(d_lmer$nae) <- contr.Sum(sort(levels(d_lmer$nae),decreasing = TRUE))
contrasts(d_lmer$method) <- contr.Sum(sort(levels(d_lmer$method),decreasing = TRUE))
mcfrank commented 5 years ago

Hi Philip,

Thanks very much for your comments. I agree completely about adding explicit coding information to the ms, and we will do that.

On your other two points:

  1. I honestly have never understood why people prefer to move away from dummy coding most of the time. Maybe I'm a dummy. Seriously, I find the additive interpretation of coefficients from dummy coding much more straightforward. Can you point me towards some arguments, or a recommendation why a particular alternate coding scheme could be useful here?
  2. Why convert strings to factors? stringsAsFactors=TRUE (https://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/) is a big discussion point. My impression of the tidyverse workflow is to avoid casting strings to factors whenever possible.

Mike

palday commented 5 years ago
  1. It matters for the interpretation of interactions, see e.g. Dale Barr's comments or my (admittedly tedious) worked example for a linguistic manipulation.

  2. When combining and manipulating the data as tables, strings are fine and possibly preferred. But when treating them as variables in a model, then factors are definitely preferred (at least for regression models), because that's the representation the model will actually need. And you can't set contrasts on characters, only on factors. (Although you can somewhat get around that with options(contrasts=...) or by using the contrasts= argument in modelling functions that support it.) Treating known categorical variables as factors also has certain debugging advantages for variables whose string representation may be mistakenly purely numeric (e.g. participant ID can be numeric). Sure, you can force these things to be strings instead, but that's not actually representing the structure of your data (a finite set of labels rather than a selection of potentially infinite strings).
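
A small worked example may help show why the coding matters once an interaction is in the model. This is only a sketch: the 2 x 2 design, the factor names `A` and `B`, and the cell means are hypothetical, and plain `lm()` with its `contrasts=` argument stands in for the mixed models used in the actual analysis.

```r
# Hypothetical 2 x 2 design with one hand-picked mean per cell.
d <- data.frame(
  A = factor(c("a1", "a2", "a1", "a2")),
  B = factor(c("b1", "b1", "b2", "b2")),
  y = c(10, 12, 20, 30)
)

# Treatment (dummy) coding: the A coefficient is the a2 - a1 difference
# at the reference level of B only (b1), i.e. a simple effect.
fit_trt <- lm(y ~ A * B, data = d,
              contrasts = list(A = "contr.treatment", B = "contr.treatment"))

# Sum coding: the A coefficient is half the a1 - a2 difference
# averaged over both levels of B, i.e. a main effect.
fit_sum <- lm(y ~ A * B, data = d,
              contrasts = list(A = "contr.sum", B = "contr.sum"))

coef_trt <- coef(fit_trt)  # (Intercept) 10, Aa2 2, Bb2 10, Aa2:Bb2 8
coef_sum <- coef(fit_sum)  # (Intercept) 18, A1 -3, B1 -7, A1:B1 2
```

Both fits reproduce the same four cell means, but the coefficient for `A` is 2 under treatment coding (12 - 10, at `b1` only) and -3 under sum coding (half of 15 - 21, averaged over `B`). Neither is wrong. They answer different questions, which is why the coding needs to be stated alongside the coefficients.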

mcfrank commented 5 years ago

OK, thanks for this - right now we are using dummy coding and I think we'll stick with that, but I have processed your example and noted this for future work!