tavareshugo / tutorial_DESeq2_contrasts

56 stars 14 forks source link

when I calculate coefficient vectors for each group the results are not the same as when the coefficients estimated by DEseq #9

Closed pariaaliour closed 8 months ago

pariaaliour commented 8 months ago

Dear @tavareshugo, I am currently conducting single-nuclei RNA sequencing and have been following your tutorial closely. However, I've encountered a discrepancy in my results when comparing the extraction of coefficients from the model versus calculating the coefficient vector for each group. Here is the design model I've been using: design <- ~ sex + PMI + batchlib + group_id + region + group_id:region

#pre-filtering
keep <- rowSums(counts(dds)>=10)>= 0.9*37   
dds <- dds[keep,]
#assign ref level
dds$group_id <- relevel(dds$group_id, ref="con")
dds$region <- relevel(dds$region, ref = "ocu")
#running DESeq
dds <- DESeq(dds, fitType='local')
resultsNames(dds)
 [1] "Intercept"             "sex_2_vs_1"            "PMI"                   "batchlib_2_vs_1"      
 [5] "batchlib_3_vs_1"       "batchlib_4_vs_1"       "batchlib_5_vs_1"       "batchlib_6_vs_1"      
 [9] "batchlib_7_vs_1"       "group_id_als_vs_con"   "region_med_vs_ocu"     "region_sc_vs_ocu"     
[13] "region_sl_vs_ocu"      "group_idals.regionmed" "group_idals.regionsc"  "group_idals.regionsl" 

med_als <- colMeans(mod_mat[dds$region == "med" & dds$group_id == "als", ])
med_con <- colMeans(mod_mat[dds$region == "med" & dds$group_id == "con", ])

res1 <- results(dds, contrast = med_als - med_con)
res2 <- results(dds, contrast = list("region_med_vs_ocu"))

res1 and res2 are not the same. I am not sure what point I did not consider to get consistent results, with res1 I have way more number of DEGs compared to res2. Another question; if I am interested in the effect of "als" in "ocu" region and I extract "group_id_als_vs_con" considering that I put the reference level for region as "ocu" I expect to get the effect of "als" in ocu. However, I am not sure if this gives me the effect in batch 1 as it is the ref level. I need it to regress batch. Thank you so much for your input! Paria

tavareshugo commented 8 months ago

Thanks for getting in touch @pariaaliour.

Whenever you get those discrepancies, my suggestion is to look at the content of the coefficient vectors you created, to understand which of the coefficients are contributing to the final contrast.

Before I comment on what may be causing the differences, I note some issues:

Because I don't have access to your colData it's harder for me to troubleshoot. If you can share the colData(dds) that might make it easier. In any case, from looking at your code I think the difference is:

pariaaliour commented 8 months ago

Thanks for your swift response! -yes, it was a typo and I meant med_als and med-con -I did consider it numeric. I just saw that it affect gene expression and I included it. I can convert it to numeric and determine a range. -yes, I did Pseudo bulk analysis before theses DESeq running.

tavareshugo commented 8 months ago

Hi @pariaaliour , thanks for sharing your coldata. I'm sorry, I think I got myself muddled in my previous answer and not only over-complicated but was also wrong in my explanation there.

As you are not assuming any interactions between batch and region, then I think you were right in using the simpler contrast = list("region_med_vs_ocu").

Your question is also making me question this whole approach, which is probably not suitable for these more complex designs where you have a partially crossed imbalanced design (i.e. the number of replicates in each batch is different across "region" and "group_id").

Again, apologies for misleading you here. I will keep this issue open, and make sure to add a warning to the markdown for future users.

pariaaliour commented 8 months ago

It's alright, I believe I've realized that when we do not define an interaction term for two variables, the reference level may not hold significant meaning. However, I still have my previous question: why am I not obtaining the same results (res1 and res2) following two different approaches? I attempted to remove all the variables, and my design model was simply: design <- ~ group_id + region + group_id:region. Surprisingly, with this design, my res1 and res2 were the same. So, I am left wondering: is the first approach (resulting in res1) valid when we don't have any variable to regress out?

Thank you, Paria

tavareshugo commented 8 months ago

The reason the results are different is because your batch variable is partially crossed (and imbalanced) with the other groups. That is, replicates don't occur in all batches and further there is a different number of replicates in different batches. And my approach in the tutorial is flawed in these situations - which I have to thank you again for making me realise this! If you had the same number of batches and replicates in all groups, then it would have been fine.

With your design, the med_als - med_con includes a weight for some of the batch coefficients. But, as you said earlier, the batch effect has been regressed out and we do not want to account for it in the test between group_id. So, in this case, you should use the simpler approach with default DESeq contrast.

In the future, I will have to think about how to explain this better in this tutorial. For now I have added a warning to the tutorial, to make readers aware of this unresolved issue.

pariaaliour commented 8 months ago

Thanks for clarifying this! "With your design, the med_als - med_con includes a weight for some of the batch coefficients. But, as you said earlier, the batch effect has been regressed out and we do not want to account for it in the test between group_id." So, in this case, you should use the simpler approach with default DESeq contrast. With this, I was wondering do you think just extracting the coefficient would work? or what do you mean by simpler design. Also, did you mean it's better not to account for batch effect as I did a Seurat integration not any other batch correction. Many thanks, Paria

tavareshugo commented 8 months ago

What I meant earlier by "simpler approach" was to use the syntax contrast = list("region_med_vs_ocu") to extract the results between those two regions, not my numeric coefficients approach (as it doesn't work for your partially crossed design).

Whether to include batch in your model or not, I cannot comment, as I don't know much about your experiment. I would recommend that you find a local bioinformatician to discuss these more specific questions.

If you saw strong batch separation before Seurat integration, then it is quite possible that batch will also affect the pseudo-bulk analysis, as it uses the raw counts. You could run a PCA on the pseudo-bulk data (from your dds object), just as you would do for a regular RNA-seq and see if your samples separate by batch in the first few PCs. If they do, then it's probably a good idea to include it in the design.

On the other hand, if batch effect is mild, the results should be somewhat similar, whichever design you use.

Finally, whichever approach you use, make sure to check the results afterwards by looking at the normalised counts for the differentially expressed genes to confirm that the results make sense with the original count data.