owkin / PyDESeq2

A Python implementation of the DESeq2 pipeline for bulk RNA-seq DEA.
https://pydeseq2.readthedocs.io/en/latest/
MIT License
584 stars 62 forks source link

BUG Fix min replicates #218

Closed BorisMuzellec closed 10 months ago

BorisMuzellec commented 10 months ago

This PR fixes some deprecated code which survived from the time when PyDESeq2 only supported single-factor designs, which caused the number of sample replicates to be evaluated from the last column of the design matrix only, instead of taking all factors into account.

This caused Cooks filtering to give diverging results compared to DESeq2. As an example, the test case introduced in this PR

counts_df =  load_example_data(
        modality="raw_counts",
        dataset="synthetic",
        debug=False,
    )

metadata =  load_example_data(
        modality="metadata",
        dataset="synthetic",
        debug=False,
    )

counts_df.loc["sample1", "gene1"] = 2000
counts_df.loc["sample11", "gene7"] = 1000
metadata.loc["sample1", "condition"] = "C"

dds = DeseqDataSet(
     counts=counts_df, metadata=metadata, design_factors=["group", "condition"]
)
dds.deseq2()

res = DeseqStats(dds, contrast=["condition", "B", "A"])
res.summary()

would previously return a NaN p-value for "gene1", whereas DESeq2 wouldn't.

What does your PR implement? Be specific.

This PR

It then corrects deprecated single-factor code by

Finally, it implements a new test case in which two outlier counts and a dummy "C" condition with a single replicate are introduced.