[BUG] DE Testing on negative binomial examples

ilan-gold commented 8 months ago

Describe the bug I am doing some testing for a project and creating data from negative binomial distributions. Two things are happening.

When I increase the number of samples, the example breaks
The p-value for gene2 is extraordinarily low on the model that fits given that condition tested contains identical points. The fold change also looks off.

To Reproduce

import os
import pandas as pd
from pydeseq2.dds import DeseqDataSet
from pydeseq2.default_inference import DefaultInference
from pydeseq2.ds import DeseqStats

import anndata as ad
import numpy as np
import pandas as pd
import pytest
from pydeseq2.utils import load_example_data

def test_adata_minimal(n_obs):
    n_donors = n_obs // 4
    obs = pd.DataFrame(
        {
            "condition": ["A", "B"] * (n_obs // 2),
            "donor": sum(([f"D{i}"] * n_donors for i in range(n_obs // n_donors)), []),
        },
    ) # condition splits the data into two, donor contains mixtures of the condition
    var = pd.DataFrame(index=["gene1", "gene2"])

    rng = np.random.default_rng(9)  # make tests deterministic
    group1 = rng.negative_binomial(20, 0.1, n_obs // 2)  # large mean
    group2 = rng.negative_binomial(5, 0.5, n_obs // 2)  # small mean

    condition_data = np.empty((n_obs,), dtype=group1.dtype)
    condition_data[0::2] = group1
    condition_data[1::2] = group2

    # identical for `A` and `B` but `D1` vs `D0` will be diff. expressed (same for `D3` vs `D0`, `D1` vs `D2`, `D2` vs `D3`)
    donor_data = np.empty((n_obs,), dtype=group1.dtype)
    donor_data[0:n_donors] = group2[:n_donors]
    donor_data[n_donors : (2 * n_donors)] = group1[n_donors:]
    donor_data[(2 * n_donors) : (3 * n_donors)] = group2[:n_donors]
    donor_data[(3 * n_donors) :] = group1[n_donors:]

    X = np.vstack([condition_data, donor_data]).T
    return ad.AnnData(X=X, obs=obs, var=var)

inference = DefaultInference(n_cpus=os.cpu_count() - 1)
covars = ['condition']

adata = test_adata_minimal(80) # using 80 obs, everything works
dds = DeseqDataSet(adata=adata, design_factors=covars, refit_cooks=True, inference=inference)
dds.deseq2()
# gene2 should not have such a low p-value because `donor_data` above in the data-generation function contains identical data split across `A` and `B` of `condition` - the log fold change is also very high
stat_res = DeseqStats(dds, contrast=['condition', 'A', 'B'])
stat_res.summary()
# this is very low?
np.log2(adata[adata.obs['condition'] == 'A'].X[:, 1].mean()) - np.log2(adata[adata.obs['condition'] == 'B'].X[:, 1].mean())

adata = test_adata_minimal(400) # using 400 does not
dds = DeseqDataSet(adata=adata, design_factors=covars, refit_cooks=True, inference=inference)
dds.deseq2()

Expected behavior I would expect gene2 to not have a significant p-value, the log fold change to be smaller, and the 400 sample point example to fit.

Screenshots For the one that fit:

Log2 fold change & Wald test p-value: condition A vs B
        baseMean  log2FoldChange     lfcSE      stat        pvalue          padj
gene1  64.372750        2.888815  0.306117  9.436968  3.837211e-21  3.837211e-21
gene2  61.676662       -2.662254  0.277285 -9.601164  7.904677e-22  1.580935e-21

Desktop (please complete the following information):

OS: Ventura 13.16.1
Version 0.4.7

Additional context Apologies if there is a bug in my code! It's very possble!

BorisMuzellec commented 8 months ago

Hi @ilan-gold (also ping-ing @Zethson), thanks for reporting this behaviour. I've been able to reproduce it as well.

Case with 80 samples:

I'm not sure what exactly is happening here. I tried it in DESeq2 with the same result:

library(DESeq2)
counts = t(as.matrix(read.csv("debug_counts.csv" , row.names="X")))
metadata = as.matrix(read.csv("debug_metadata.csv", row.names = "X"))
metadata = as.data.frame(metadata)
dds = DESeqDataSetFromMatrix(countData = counts,
                             colData = metadata,
                             design = ~ condition)
dds = DESeq(dds, fitType="glmGamPoi")
res = results(dds, alpha = 0.05)

and I get

log2 fold change (MLE): condition B vs A 
Wald test p-value: condition B vs A 
DataFrame with 2 rows and 6 columns
       baseMean log2FoldChange     lfcSE      stat      pvalue        padj
      <numeric>      <numeric> <numeric> <numeric>   <numeric>   <numeric>
gene1   64.3728       -2.88811  0.306839  -9.41246 4.84656e-21 4.84656e-21
gene2   61.6767        2.66219  0.276395   9.63183 5.86779e-22 1.17356e-21

I'm not sure what's going on, except maybe that this test case is too difficult / unsuited for (py)DESeq2? In particular, one of the assumptions is that each gene's count data roughly follows a negative binomial distribution, but gene1 (and gene2) is actually a mixture of two very different negative binomials, with few samples each:

gene1_small

Case with 400 samples:

Here there's one or more zeros per gene, so it's not possible to take means of logs. We have to use alternative ways to fit size factors. PyDESeq2 only supports iterative size factors.

With iterative size factors DESeq2 stalls

> dds = DESeq(dds, fitType="glmGamPoi", sfType="iterate")
estimating size factors
Erreur dans estimateSizeFactorsIterate(object) : 
  iterative size factor normalization did not converge

(it also does with the default option for size factors)

> dds = DESeq(dds, fitType="glmGamPoi")
estimating size factors
Erreur dans estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc,  : 
  every gene contains at least one zero, cannot compute log geometric means

so I can't compare behaviours.

There is a bug in PyDESeq2 due to incompatible shapes which lies at the intersection of using mean dispersion trend curves on a single gene when refitting. I was able to fix it and to finish running the pipeline. Here's what I get:

Log2 fold change & Wald test p-value: condition A vs B
         baseMean  log2FoldChange     lfcSE       stat        pvalue  \
gene1  236.510015        4.520883  0.226170  19.988841  6.887740e-89   
gene2   19.173095       -0.007695  0.019387  -0.396925  6.914227e-01   

               padj  
gene1  1.377548e-88  
gene2  6.914227e-01

which looks like what we were expecting.

I'll open a PR ASAP, I just need to implement a proper test case to ensure this doesn't happen again.

BorisMuzellec commented 8 months ago

Update: out of the 3 size factor options in DESeq2 ("ratio", "poscounts", "iterate"), only "poscounts " runs without failure on the 400 samples test case.

library(DESeq2)
counts = t(as.matrix(read.csv("big_debug_counts.csv" , row.names="X")))
metadata = as.matrix(read.csv("big_debug_metadata.csv", row.names = "X"))
metadata = as.data.frame(metadata)
dds = DESeqDataSetFromMatrix(countData = counts,
                             colData = metadata,
                             design = ~ condition)
dds = DESeq(dds, fitType="glmGamPoi",  sfType="poscounts")
res = results(dds, alpha = 0.05)

Here's what I get then:

> res
log2 fold change (MLE): condition B vs A 
Wald test p-value: condition B vs A 
DataFrame with 2 rows and 6 columns
       baseMean log2FoldChange     lfcSE      stat      pvalue        padj
      <numeric>      <numeric> <numeric> <numeric>   <numeric>   <numeric>
gene1   61.9988       -2.73178  0.138032  -19.7909 3.56746e-87 3.56746e-87
gene2   59.7801        2.62376  0.126620   20.7216 2.21029e-95 4.42057e-95

It's still not capturing the differential expression pattern. (NB: this is DESeq2, not PyDESeq2)

ilan-gold commented 8 months ago

Hi @BorisMuzellec this all makes sense. I just wanted to clarify something for my own understanding

In particular, one of the assumptions is that each gene's count data roughly follows a negative binomial distribution, but gene1 (and gene2) is actually a mixture of two very different negative binomials

But isn't this totally normal? You have two groups drawn from different negative binomial distributions and the test should be able to pick this difference up?

BorisMuzellec commented 8 months ago

You have two groups drawn from different negative binomial distributions and the test should be able to pick this difference up?

What the model does is, for each gene, to fit a negative binomial distribution with a single dispersion parameter, but a mean (means actually) that depend on the design factors. More precisely:

The first step is to estimate dispersions (1 dispersion per gene), using an initial estimate of the means:

(This step is actually further decomposed into smaller ones, but we're still using a single dispersion per gene to model its distribution. Also note that the initial mean estimate does take the design into account.)

The second is to fit LFC, fixing dispersions. This is where the design starts taking a more active role in the model:

My intuition was that in the first case (80 samples) the model was having a hard time fitting dispersions, but after further inspection it turns out that in the second case (400 samples) pyDESeq2 (after the fix) finds the same dispersion values as in the first, but LFC more in line with what we would expect.

Not sure what's going, perhaps 80 is just too few samples...

ilan-gold commented 7 months ago

So it seems like conceptually it was not far off - that being said, fixing the dispersion parameter and letting the other vary is not helping....maybe it is just too few samples

owkin / PyDESeq2

[BUG] DE Testing on negative binomial examples #249