owkin / PyDESeq2

A Python implementation of the DESeq2 pipeline for bulk RNA-seq DEA.
https://pydeseq2.readthedocs.io/en/latest/
MIT License

Feature request: test for continuous variables and create design matrix from formula #309

Open MaximilianNuber opened 1 week ago

MaximilianNuber commented 1 week ago

Dear all, thank you for the great package; it's a staple in my workflows.

Recently, I had to analyze continuous covariates in my RNA-seq data. I know continuous variables can be included in the model, but I would like to obtain p-values for their log fold changes, and the DeseqStats class only takes a contrast, e.g. from a categorical factor. Testing continuous variables would be amazing. (I checked and tested whether this is possible, but could not find anything. Please correct me if I am wrong.)

A minor feature request: add the option to create the design from a model formula (patsy-style) in DeseqDataSet. If I could set the reference level of several categorical factors in my data and then create the design matrix from a formula (rather than setting the ref_level of only one design factor), I could fit a model once and then extract DeseqStats results for several variables.

Thank you for any help and best regards, Max

BorisMuzellec commented 1 week ago

Hi @MaximilianNuber, you can estimate p-values for continuous covariates using the contrast argument of DeseqStats. The syntax is contrast = ["my_continuous_covariate", "", ""] (a list containing the name of your continuous covariate followed by two empty strings, cf. https://pydeseq2.readthedocs.io/en/latest/api/docstrings/pydeseq2.ds.DeseqStats.html#pydeseq2.ds.DeseqStats).
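A minimal sketch of the contrast syntax described above. It assumes a PyDESeq2 version that accepts the three-element contrast, and a `dds` object that is a DeseqDataSet already fitted with `deseq2()` with the covariate included in its design. The helper names `continuous_contrast` and `stats_for_covariate` are hypothetical and not part of the package.

```python
def continuous_contrast(covariate):
    """Build the contrast list for a continuous covariate.

    Hypothetical helper: returns the covariate name followed by
    two empty strings, as described in the reply above.
    """
    if not covariate:
        raise ValueError("covariate name must be a non-empty string")
    return [covariate, "", ""]


def stats_for_covariate(dds, covariate):
    """Compute statistics for one continuous covariate.

    Assumes `dds` is a DeseqDataSet on which `deseq2()` has been run
    and whose design includes `covariate`.
    """
    # Imported here so the contrast helper works without PyDESeq2 installed.
    from pydeseq2.ds import DeseqStats

    stat_res = DeseqStats(dds, contrast=continuous_contrast(covariate))
    stat_res.summary()  # runs the tests and fills stat_res.results_df
    return stat_res.results_df
```

For a covariate named "age", for example, `stats_for_covariate(dds, "age")` would return the results table for that covariate, including its log fold changes and p-values.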

Supporting more general designs using patsy / formulaic is WIP, started in #181. It might take a while though, as it will require many changes in the source code of the `DeseqStats` class.
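To illustrate what the requested formula-based design would produce, here is a plain-Python sketch of a treatment-coded design matrix with a reference level per categorical factor plus a continuous covariate. This is not PyDESeq2's implementation or the patsy / formulaic API; the function `treatment_design`, the column naming, and the sample data are assumptions made up for illustration.

```python
def treatment_design(rows, categorical, continuous=()):
    """Build a treatment-coded design matrix by hand.

    rows:        list of dicts, one per sample (hypothetical metadata)
    categorical: dict mapping factor name -> reference level
    continuous:  names of numeric covariates, used as-is
    """
    # Non-reference levels of each factor, in sorted order.
    levels = {
        factor: sorted({row[factor] for row in rows} - {ref})
        for factor, ref in categorical.items()
    }

    columns = ["intercept"]
    for factor, lvls in levels.items():
        columns += [f"{factor}[T.{lvl}]" for lvl in lvls]
    columns += list(continuous)

    matrix = []
    for row in rows:
        line = [1.0]  # intercept
        for factor, lvls in levels.items():
            # Indicator columns; the reference level is all zeros.
            line += [1.0 if row[factor] == lvl else 0.0 for lvl in lvls]
        line += [float(row[name]) for name in continuous]
        matrix.append(line)
    return columns, matrix


# Hypothetical sample metadata with two categorical factors and one covariate.
samples = [
    {"condition": "control", "batch": "a", "age": 31},
    {"condition": "treated", "batch": "b", "age": 45},
    {"condition": "treated", "batch": "a", "age": 52},
]
columns, matrix = treatment_design(
    samples, {"condition": "control", "batch": "a"}, ["age"]
)
```

With one column per non-reference level and one per continuous covariate, a single fitted model exposes a coefficient for each variable, which is what would allow extracting statistics for several variables from one fit.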