owkin / PyDESeq2

A Python implementation of the DESeq2 pipeline for bulk RNA-seq DEA.
https://pydeseq2.readthedocs.io/en/latest/
MIT License
573 stars 60 forks source link

[BUG] The vst() function does not work when the algorithm switches to a mean-based dispersion trend. #263

Closed dguin closed 3 months ago

dguin commented 5 months ago

Describe the bug Calling the vst() leads to the fit_type variable being hard-coded to the chosen fit_type. Consequently when the vst_transform is called it breaks with: File ".venv/lib/python3.10/site-packages/pydeseq2/dds.py", line 411, in vst_transform a0, a1 = self.uns["trend_coeffs"] KeyError: 'trend_coeffs'

Expected behavior When switching to mean based, update fit_type to 'mean' in line 918. self.fit_type = 'mean'

BorisMuzellec commented 5 months ago

Hi @dguin, good catch, indeed with the recent versions there now are two attributes redundantly encoding the trend curve fit type, self.fit_type (set in vst) and self.trend_fit_type. We should harmonize them.

Before I make any changes, one thing I'm wondering about is whether using one fit type for VST and another in the DEA (e.g. "mean" and then "parametric", or vice-versa) is a common thing, or if we can consider that the fit types should be the same in both?

dguin commented 5 months ago

Hi again @BorisMuzellec, I looked at the R documentation and code and it appears that the R version of DESeq2 re-fits dispersion if fit-type is set to "parametric" and dispersions have not been calculated (https://rdrr.io/bioc/DESeq2/man/varianceStabilizingTransformation.html):

in case dispersions have not yet been estimated for object, this parameter is passed on to estimateDispersions (options described there). So it does appear that the R code at least allows for using one fit type in the VST and another in DEA. Additionally the readme mentions:

Note that neither rlog transformation nor the VST are used by the differential expression estimation in DESeq, which always occurs on the raw count data, through generalized linear modeling which incorporates knowledge of the variance-mean dependence. The rlog transformation and VST are offered as separate functionality which can be used for visualization, clustering or other machine learning tasks. See the transformation section of the vignette for more details. So it appears that they are meant to be slightly decoupled.

Happy to make the changes and open a PR if you'd like!!