when I calculate coefficient vectors for each group the results are not the same as when the coefficients estimated by DEseq

pariaaliour commented 8 months ago

Dear @tavareshugo, I am currently conducting single-nuclei RNA sequencing and have been following your tutorial closely. However, I've encountered a discrepancy in my results when comparing the extraction of coefficients from the model versus calculating the coefficient vector for each group. Here is the design model I've been using: design <- ~ sex + PMI + batchlib + group_id + region + group_id:region

#pre-filtering
keep <- rowSums(counts(dds)>=10)>= 0.9*37   
dds <- dds[keep,]
#assign ref level
dds$group_id <- relevel(dds$group_id, ref="con")
dds$region <- relevel(dds$region, ref = "ocu")
#running DESeq
dds <- DESeq(dds, fitType='local')
resultsNames(dds)
 [1] "Intercept"             "sex_2_vs_1"            "PMI"                   "batchlib_2_vs_1"      
 [5] "batchlib_3_vs_1"       "batchlib_4_vs_1"       "batchlib_5_vs_1"       "batchlib_6_vs_1"      
 [9] "batchlib_7_vs_1"       "group_id_als_vs_con"   "region_med_vs_ocu"     "region_sc_vs_ocu"     
[13] "region_sl_vs_ocu"      "group_idals.regionmed" "group_idals.regionsc"  "group_idals.regionsl" 

med_als <- colMeans(mod_mat[dds$region == "med" & dds$group_id == "als", ])
med_con <- colMeans(mod_mat[dds$region == "med" & dds$group_id == "con", ])

res1 <- results(dds, contrast = med_als - med_con)
res2 <- results(dds, contrast = list("region_med_vs_ocu"))

res1 and res2 are not the same. I am not sure what point I did not consider to get consistent results, with res1 I have way more number of DEGs compared to res2. Another question; if I am interested in the effect of "als" in "ocu" region and I extract "group_id_als_vs_con" considering that I put the reference level for region as "ocu" I expect to get the effect of "als" in ocu. However, I am not sure if this gives me the effect in batch 1 as it is the ref level. I need it to regress batch. Thank you so much for your input! Paria

tavareshugo commented 8 months ago

Thanks for getting in touch @pariaaliour.

Whenever you get those discrepancies, my suggestion is to look at the content of the coefficient vectors you created, to understand which of the coefficients are contributing to the final contrast.

Before I comment on what may be causing the differences, I note some issues:

In the code you shared you define med_als and med_con but then the contrast is between med_con and ocu_con. I assume this was a typo in copy/pasting your code.
The variable PMI seems to be numeric -- is this intentional? I would caution using numeric variables with DESeq2, unless you really expect linear trends between PMI and gene expression.
You say that you are analysing single-nuclei data. Those data require several other considerations, in particular you should not do the analysis directly on the single nuclei, but rather on pseudo-bulks using biological replicates. I just wanted to check that you have done that ahead of your DESeq analysis.
I would also like to ask if your design is balanced? I.e. do you have the same number of replicates in each group (batch, region, group, sex), or different numbers of replicates?

Because I don't have access to your colData it's harder for me to troubleshoot. If you can share the colData(dds) that might make it easier. In any case, from looking at your code I think the difference is:

contrast = list("region_med_vs_ocu") gives you the comparison of "med" vs "ocu" regions for the reference group in your dataset, which in your case I believe is: sex 1, PMI = 0, batch 1, group con.
contrast = med_con - ocu_con is also comparing the "med" vs "ocu" regions for the "con" group, but averaging the effect across batches and sexes (i.e. individuals from all batches and both sexes are considered) as well as the average value of PMI (attention here - as PMI seems to be a numeric variable, your contrast needs to be set for some value of PMI. Using the colMeans() strategy you will therefore estimate the contrast for the average value of PMI, which may or may not be what you want) edit: sorry, I don't think the PMI variable matters much here, as there's no interaction between it and the other variables; all the other factors are essentially an intercept offset to the regression on PMI.

pariaaliour commented 8 months ago

Thanks for your swift response! -yes, it was a typo and I meant med_als and med-con -I did consider it numeric. I just saw that it affect gene expression and I included it. I can convert it to numeric and determine a range. -yes, I did Pseudo bulk analysis before theses DESeq running.

my design is not balanced. I can share my colData(dds). coldata.csv
Based on what you mentioned when I extract the interested coefficient it outputs the effect in the reference level of batch, sex, etc. I think this is not what I want. I need it to consider everything unless it's going to be a limited sample size for each comparison. What I thought is that putting sex, batch, etc. would regress their effect out.
could you please let me know which way is more reliable to follow? and how to do it in my case? Many thanks, Paria

tavareshugo commented 8 months ago

Hi @pariaaliour , thanks for sharing your coldata. I'm sorry, I think I got myself muddled in my previous answer and not only over-complicated but was also wrong in my explanation there.

As you are not assuming any interactions between batch and region, then I think you were right in using the simpler contrast = list("region_med_vs_ocu").

Your question is also making me question this whole approach, which is probably not suitable for these more complex designs where you have a partially crossed imbalanced design (i.e. the number of replicates in each batch is different across "region" and "group_id").

Again, apologies for misleading you here. I will keep this issue open, and make sure to add a warning to the markdown for future users.

pariaaliour commented 8 months ago

It's alright, I believe I've realized that when we do not define an interaction term for two variables, the reference level may not hold significant meaning. However, I still have my previous question: why am I not obtaining the same results (res1 and res2) following two different approaches? I attempted to remove all the variables, and my design model was simply: design <- ~ group_id + region + group_id:region. Surprisingly, with this design, my res1 and res2 were the same. So, I am left wondering: is the first approach (resulting in res1) valid when we don't have any variable to regress out?

Thank you, Paria

tavareshugo commented 8 months ago

The reason the results are different is because your batch variable is partially crossed (and imbalanced) with the other groups. That is, replicates don't occur in all batches and further there is a different number of replicates in different batches. And my approach in the tutorial is flawed in these situations - which I have to thank you again for making me realise this! If you had the same number of batches and replicates in all groups, then it would have been fine.

With your design, the med_als - med_con includes a weight for some of the batch coefficients. But, as you said earlier, the batch effect has been regressed out and we do not want to account for it in the test between group_id. So, in this case, you should use the simpler approach with default DESeq contrast.

In the future, I will have to think about how to explain this better in this tutorial. For now I have added a warning to the tutorial, to make readers aware of this unresolved issue.

pariaaliour commented 8 months ago

Thanks for clarifying this! "With your design, the med_als - med_con includes a weight for some of the batch coefficients. But, as you said earlier, the batch effect has been regressed out and we do not want to account for it in the test between group_id." So, in this case, you should use the simpler approach with default DESeq contrast. With this, I was wondering do you think just extracting the coefficient would work? or what do you mean by simpler design. Also, did you mean it's better not to account for batch effect as I did a Seurat integration not any other batch correction. Many thanks, Paria

tavareshugo commented 8 months ago

What I meant earlier by "simpler approach" was to use the syntax contrast = list("region_med_vs_ocu") to extract the results between those two regions, not my numeric coefficients approach (as it doesn't work for your partially crossed design).

Whether to include batch in your model or not, I cannot comment, as I don't know much about your experiment. I would recommend that you find a local bioinformatician to discuss these more specific questions.

If you saw strong batch separation before Seurat integration, then it is quite possible that batch will also affect the pseudo-bulk analysis, as it uses the raw counts. You could run a PCA on the pseudo-bulk data (from your dds object), just as you would do for a regular RNA-seq and see if your samples separate by batch in the first few PCs. If they do, then it's probably a good idea to include it in the design.

On the other hand, if batch effect is mild, the results should be somewhat similar, whichever design you use.

Finally, whichever approach you use, make sure to check the results afterwards by looking at the normalised counts for the differentially expressed genes to confirm that the results make sense with the original count data.

tavareshugo / tutorial_DESeq2_contrasts

when I calculate coefficient vectors for each group the results are not the same as when the coefficients estimated by DEseq #9