Open hok-lee opened 2 years ago
Thanks for your question @hok-lee and for providing a nicely reproducible example.
I believe you may have missed one important aspect of interpreting the coefficients of models with two factors. When you do this contrast:
results(dds_age_treatment, contrast = c("age", "30", "20"))
You are comparing the difference between ages 30 and 20 for the reference level of treatment, i.e. for treatment A only.
To define these contrasts with numeric vectors, you should create numeric vectors for every combination of both factors you want to account for:
matrix_model <- model.matrix(design(dds_age_treatment), colData(dds_age_treatment))
age_20_treatment_a <- colMeans(matrix_model[dds_age_treatment$age == "20" & dds$treatment == "A", ])
age_30_treatment_a <- colMeans(matrix_model[dds_age_treatment$age == "30" & dds$treatment == "A", ])
results(dds_age_treatment, contrast = age_30_treatment_a - age_20_treatment_a)
The same logic then applies to other comparisons.
The comparison you were doing:
results(dds_age_treatment, contrast = array_for_contrast_30 - array_for_contrast_20 )
Is for the average difference between 30 and 20 across all the levels of treatment. There's similar question recently on stackoverflow that may help illustrate this (see the second diagram, which illustrates the equivalent of your comparison between the average age difference across both levels of treatment).
I'll leave this issue open, as I may add a note to the materials to highlight this for future readers.
Thank you for the detailed explanation. I understand the issue now, it makes a lot more sense with the subsetting of the model matrix on both the age and the treatment condition. I appreciate you taking the time to explain it to me!
Hi @tavareshugo , I have a quick question about interpreting the coefficient.
""" The key point to remember about designs with interaction terms is that, unlike for a design ~genotype + condition, where the condition effect represents the overall effect controlling for differences due to genotype, by adding genotype:condition, the main condition effect only represents the effect of condition for the reference level of genotype (I, or whichever level was defined by the user as the reference level).
The interaction terms genotypeII.conditionB and genotypeIII.conditionB give the difference between the condition effect for a given genotype and the condition effect for the reference genotype. """
This is from Deseq2 vignettes
. With interaction terms, the difference between different conditions only represents for the reference level of genotype. Without the interaction terms, it is the overall difference between the different conditions.
The design, design(dds_age_treatment) <- ~age + treatment
, doesn't have an interaction term, however, the contrast, results(dds_age_treatment, contrast = c("age", "30", "20"))
, is for difference between 30
and 20
for treatment A (reference level).
This contradicts what the vignettes says... I'm a bit confused and am making a wild guess that this is because the design is an imbalanced design. (6 samples are age10 and 7 samples are age30)
> model.matrix(design(dds_age_treatment), colData(dds_age_treatment))
(Intercept) age10 age30 treatmentB
sample1 1 1 0 0
sample2 1 0 1 0
sample3 1 0 0 0
sample4 1 0 1 0
sample5 1 0 1 0
sample6 1 0 1 0
sample7 1 1 0 0
sample8 1 0 0 0
sample9 1 0 0 0
sample10 1 0 1 0
sample11 1 0 0 1
sample12 1 0 1 1
sample13 1 1 0 1
sample14 1 0 0 1
sample15 1 0 0 1
sample16 1 0 1 1
sample17 1 1 0 1
sample18 1 1 0 1
sample19 1 1 0 1
sample20 1 0 0 1
Yes, it can be a bit confusing and I might have made it more confusing with my earlier solution.
Anyway, let's take a step back and consider what the terms in the model mean.
Your model is additive only (no interaction): ~ age + treatment
You have the following levels in each variable:
age
with 3 levels: 20 (ref), 10 and 30treatment
with 2 levels: A (ref) and BSo, your model will have the following coefficients:
Intercept
= the average for the reference group A/20age10
= the difference between age10 and age20age30
= the difference between age30 and age20treatmentB
= the difference between treatmentB and treatmentANow, because there is no interaction, the coefficient age10
is assumed to be the same for both treatment groups. I.e. regardless of whether you're in treatmentA or treatmentB, the difference age10 - age20
is assumed to be the same. The same goes for age30
.
Equally, the coefficient treatmentB
is also assumed to be the same, regardless of age.
If that assumption doesn't seem right to you, then you should fit a model with an interaction. At which point you will have more coefficients and therefore more possible contrasts.
Thank you for taking the time to explain it to me. I should probably review your tutorial materials as well as my linear regression book again. Thanks again and have a great weekend.
First of all, your tutorial is great. It gave me great insights about the inner working of DESeq2.
However, when I tried to follow the tutorial but did not get the result expected when I used the model matrix. I wrote some sample codes (R markdown and the resulting HTML) to show the disagreement. https://github.com/hok-lee/error_tutorial_deseq_contrast.git
I am quite puzzled. I hope you could help. Thank you!