tavareshugo / tutorial_DESeq2_contrasts

53 stars 14 forks source link

Error in using contrasts and model matrix as shown in the tutorial #6

Open hok-lee opened 2 years ago

hok-lee commented 2 years ago

First of all, your tutorial is great. It gave me great insights about the inner working of DESeq2.

However, when I tried to follow the tutorial but did not get the result expected when I used the model matrix. I wrote some sample codes (R markdown and the resulting HTML) to show the disagreement. https://github.com/hok-lee/error_tutorial_deseq_contrast.git

I am quite puzzled. I hope you could help. Thank you!

tavareshugo commented 2 years ago

Thanks for your question @hok-lee and for providing a nicely reproducible example.

I believe you may have missed one important aspect of interpreting the coefficients of models with two factors. When you do this contrast:

results(dds_age_treatment, contrast = c("age", "30", "20"))

You are comparing the difference between ages 30 and 20 for the reference level of treatment, i.e. for treatment A only.

To define these contrasts with numeric vectors, you should create numeric vectors for every combination of both factors you want to account for:

matrix_model <- model.matrix(design(dds_age_treatment), colData(dds_age_treatment))
age_20_treatment_a <- colMeans(matrix_model[dds_age_treatment$age == "20" & dds$treatment == "A", ])
age_30_treatment_a <- colMeans(matrix_model[dds_age_treatment$age == "30" & dds$treatment == "A", ])

results(dds_age_treatment, contrast = age_30_treatment_a - age_20_treatment_a)

The same logic then applies to other comparisons.

The comparison you were doing:

results(dds_age_treatment, contrast = array_for_contrast_30 - array_for_contrast_20 )

Is for the average difference between 30 and 20 across all the levels of treatment. There's similar question recently on stackoverflow that may help illustrate this (see the second diagram, which illustrates the equivalent of your comparison between the average age difference across both levels of treatment).

I'll leave this issue open, as I may add a note to the materials to highlight this for future readers.

hok-lee commented 2 years ago

Thank you for the detailed explanation. I understand the issue now, it makes a lot more sense with the subsetting of the model matrix on both the age and the treatment condition. I appreciate you taking the time to explain it to me!

joowkim commented 4 months ago

Hi @tavareshugo , I have a quick question about interpreting the coefficient.

""" The key point to remember about designs with interaction terms is that, unlike for a design ~genotype + condition, where the condition effect represents the overall effect controlling for differences due to genotype, by adding genotype:condition, the main condition effect only represents the effect of condition for the reference level of genotype (I, or whichever level was defined by the user as the reference level).

The interaction terms genotypeII.conditionB and genotypeIII.conditionB give the difference between the condition effect for a given genotype and the condition effect for the reference genotype. """

This is from Deseq2 vignettes. With interaction terms, the difference between different conditions only represents for the reference level of genotype. Without the interaction terms, it is the overall difference between the different conditions.

The design, design(dds_age_treatment) <- ~age + treatment, doesn't have an interaction term, however, the contrast, results(dds_age_treatment, contrast = c("age", "30", "20")), is for difference between 30 and 20 for treatment A (reference level).

This contradicts what the vignettes says... I'm a bit confused and am making a wild guess that this is because the design is an imbalanced design. (6 samples are age10 and 7 samples are age30)

> model.matrix(design(dds_age_treatment), colData(dds_age_treatment))
         (Intercept) age10 age30 treatmentB
sample1            1     1     0          0
sample2            1     0     1          0
sample3            1     0     0          0
sample4            1     0     1          0
sample5            1     0     1          0
sample6            1     0     1          0
sample7            1     1     0          0
sample8            1     0     0          0
sample9            1     0     0          0
sample10           1     0     1          0
sample11           1     0     0          1
sample12           1     0     1          1
sample13           1     1     0          1
sample14           1     0     0          1
sample15           1     0     0          1
sample16           1     0     1          1
sample17           1     1     0          1
sample18           1     1     0          1
sample19           1     1     0          1
sample20           1     0     0          1
tavareshugo commented 4 months ago

Yes, it can be a bit confusing and I might have made it more confusing with my earlier solution.

Anyway, let's take a step back and consider what the terms in the model mean. Your model is additive only (no interaction): ~ age + treatment

You have the following levels in each variable:

So, your model will have the following coefficients:

Now, because there is no interaction, the coefficient age10 is assumed to be the same for both treatment groups. I.e. regardless of whether you're in treatmentA or treatmentB, the difference age10 - age20 is assumed to be the same. The same goes for age30.

Equally, the coefficient treatmentB is also assumed to be the same, regardless of age.

If that assumption doesn't seem right to you, then you should fit a model with an interaction. At which point you will have more coefficients and therefore more possible contrasts.

joowkim commented 4 months ago

Thank you for taking the time to explain it to me. I should probably review your tutorial materials as well as my linear regression book again. Thanks again and have a great weekend.