BUG Fix min replicates

This PR fixes outdated code left over from the time when PyDESeq2 only supported single-factor designs. That code evaluated the number of sample replicates from the last column of the design matrix only, instead of taking all factors into account.
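
To make the difference concrete, here is a minimal sketch on a toy dummy-coded design matrix. The column layout and values are made up for illustration and are not taken from PyDESeq2. Counting replicates from the last column alone lumps together samples that belong to different factor combinations, whereas counting over full rows does not.

```python
from collections import Counter

# Hypothetical dummy-coded design rows: (intercept, group_Y, condition_B).
# Two groups crossed with two conditions; the (Y, B) cell has a single sample.
design_rows = [
    (1, 0, 0), (1, 0, 0), (1, 0, 0),  # group X, condition A
    (1, 1, 0), (1, 1, 0), (1, 1, 0),  # group Y, condition A
    (1, 0, 1), (1, 0, 1), (1, 0, 1),  # group X, condition B
    (1, 1, 1),                        # group Y, condition B (lone replicate)
]

# Single-factor view: only the last column is inspected.
last_column_counts = Counter(row[-1] for row in design_rows)

# Multi-factor view: every distinct row is its own cell.
full_row_counts = Counter(design_rows)

print("smallest cell, last column only:", min(last_column_counts.values()))
print("smallest cell, all columns:", min(full_row_counts.values()))
```

In this toy case the last-column view reports at least 4 replicates everywhere, while the full-row view correctly finds a cell with a single replicate.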

This caused Cook's filtering to give results that diverged from DESeq2's. As an example, the test case introduced in this PR

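from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
from pydeseq2.utils import load_example_data

# Load the synthetic example dataset shipped with PyDESeq2
counts_df = load_example_data(
    modality="raw_counts",
    dataset="synthetic",
    debug=False,
)

metadata = load_example_data(
    modality="metadata",
    dataset="synthetic",
    debug=False,
)

# Introduce two outlier counts and a dummy "C" condition with a single replicate
counts_df.loc["sample1", "gene1"] = 2000
counts_df.loc["sample11", "gene7"] = 1000
metadata.loc["sample1", "condition"] = "C"

# Fit a two-factor design
dds = DeseqDataSet(
    counts=counts_df, metadata=metadata, design_factors=["group", "condition"]
)
dds.deseq2()

# Test condition B against condition A
res = DeseqStats(dds, contrast=["condition", "B", "A"])
res.summary()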

would previously return a NaN p-value for "gene1", whereas DESeq2 wouldn't.

What does your PR implement? Be specific.

This PR implements a function that checks how many sample replicates each factor combination has (utils.n_or_more_replicates, cf. DESeq2's nOrMoreInCell).

It then corrects the outdated single-factor code by:

- using n_or_more_replicates in dds._replace_outliers and ds._cooks_filtering to take all factors into account, instead of only looking at the last column of the design matrix;
- likewise, using n_or_more_replicates and grouping samples by unique row combinations in utils.robust_method_of_moments_disp, which is used to calculate Cook's distances (cf. the corresponding function in DESeq2).

Finally, it implements a new test case in which two outlier counts and a dummy "C" condition with a single replicate are introduced.
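
As a rough illustration of the two ideas above (a per-sample replicate check and grouping samples by unique design rows), here is a minimal standard-library sketch. The function names `n_or_more_replicates_sketch` and `group_samples_by_cell` are hypothetical, the rows use plain labels rather than a real design matrix, and this is not the code added by the PR.

```python
from collections import Counter, defaultdict


def n_or_more_replicates_sketch(design_rows, min_replicates):
    """Return, per sample, whether its factor combination has enough replicates."""
    keys = [tuple(row) for row in design_rows]
    cell_sizes = Counter(keys)
    return [cell_sizes[key] >= min_replicates for key in keys]


def group_samples_by_cell(design_rows):
    """Map each unique design row to the indices of the samples sharing it."""
    cells = defaultdict(list)
    for sample_idx, row in enumerate(design_rows):
        cells[tuple(row)].append(sample_idx)
    return dict(cells)


# Toy (group, condition) rows; the last sample mimics a lone "C" replicate.
design_rows = [
    ("X", "A"), ("X", "A"), ("X", "A"),
    ("Y", "B"), ("Y", "B"), ("Y", "B"),
    ("X", "C"),
]

mask = n_or_more_replicates_sketch(design_rows, min_replicates=3)
cells = group_samples_by_cell(design_rows)

print(mask)
print(cells)
```

With a pandas design matrix, the same sketch could be fed `design_matrix.itertuples(index=False)`; the lone ("X", "C") sample is flagged as lacking replicates without affecting the other cells.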

owkin / PyDESeq2

BUG Fix min replicates #218