Closed ilan-gold closed 7 months ago
Hi @ilan-gold (also ping-ing @Zethson), thanks for reporting this behaviour. I've been able to reproduce it as well.
I'm not sure what exactly is happening here. I tried it in DESeq2 with the same result:
library(DESeq2)
counts = t(as.matrix(read.csv("debug_counts.csv" , row.names="X")))
metadata = as.matrix(read.csv("debug_metadata.csv", row.names = "X"))
metadata = as.data.frame(metadata)
dds = DESeqDataSetFromMatrix(countData = counts,
colData = metadata,
design = ~ condition)
dds = DESeq(dds, fitType="glmGamPoi")
res = results(dds, alpha = 0.05)
and I get
log2 fold change (MLE): condition B vs A
Wald test p-value: condition B vs A
DataFrame with 2 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
gene1 64.3728 -2.88811 0.306839 -9.41246 4.84656e-21 4.84656e-21
gene2 61.6767 2.66219 0.276395 9.63183 5.86779e-22 1.17356e-21
I'm not sure what's going on, except maybe that this test case is too difficult / unsuited for (py)DESeq2? In particular, one of the assumptions is that each gene's count data roughly follows a negative binomial distribution, but gene1 (and gene2) is actually a mixture of two very different negative binomials, with few samples each:
Here there's one or more zeros per gene, so it's not possible to take means of logs. We have to use alternative ways to fit size factors. PyDESeq2 only supports iterative size factors.
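To make the zero problem concrete, here is a minimal NumPy sketch of median-of-ratios size factors. It is my own simplified version, not the DESeq2 or PyDESeq2 implementation, and `median_of_ratios` is a hypothetical helper name. A single zero sends a gene's log geometric mean to -inf, so when every gene has a zero there is nothing left to normalize against.

```python
import numpy as np

def median_of_ratios(counts):
    """Simplified median-of-ratios size factors.

    counts: array-like of raw counts, shape (n_samples, n_genes).
    """
    counts = np.asarray(counts, dtype=float)
    with np.errstate(divide="ignore"):
        log_counts = np.log(counts)           # log(0) evaluates to -inf
    # Mean of logs per gene: -inf as soon as one sample has a zero count.
    log_geo_means = log_counts.mean(axis=0)
    usable = np.isfinite(log_geo_means)       # genes without any zero
    if not usable.any():
        raise ValueError("every gene contains at least one zero")
    # Per-sample median of log ratios to the per-gene geometric mean.
    log_ratios = log_counts[:, usable] - log_geo_means[usable]
    return np.exp(np.median(log_ratios, axis=1))
```

Calling it on a matrix such as `[[0, 5], [3, 0]]` raises the `ValueError`, which mirrors the "every gene contains at least one zero" failure mode.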
With iterative size factors, DESeq2 fails to converge:
> dds = DESeq(dds, fitType="glmGamPoi", sfType="iterate")
estimating size factors
Error in estimateSizeFactorsIterate(object) :
iterative size factor normalization did not converge
(it also fails with the default size factor option)
> dds = DESeq(dds, fitType="glmGamPoi")
estimating size factors
Error in estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc, :
every gene contains at least one zero, cannot compute log geometric means
so I can't compare behaviours.
There is a bug in PyDESeq2 caused by incompatible array shapes. It is triggered by the combination of using a mean dispersion trend curve and refitting a single gene. I was able to fix it and finish running the pipeline. Here's what I get:
Log2 fold change & Wald test p-value: condition A vs B
baseMean log2FoldChange lfcSE stat pvalue \
gene1 236.510015 4.520883 0.226170 19.988841 6.887740e-89
gene2 19.173095 -0.007695 0.019387 -0.396925 6.914227e-01
padj
gene1 1.377548e-88
gene2 6.914227e-01
which looks like what we were expecting.
I'll open a PR ASAP, I just need to implement a proper test case to ensure this doesn't happen again.
Update: out of the 3 size factor options in DESeq2 ("ratio", "poscounts", "iterate"), only "poscounts" runs without failure on the 400-sample test case.
library(DESeq2)
counts = t(as.matrix(read.csv("big_debug_counts.csv" , row.names="X")))
metadata = as.matrix(read.csv("big_debug_metadata.csv", row.names = "X"))
metadata = as.data.frame(metadata)
dds = DESeqDataSetFromMatrix(countData = counts,
colData = metadata,
design = ~ condition)
dds = DESeq(dds, fitType="glmGamPoi", sfType="poscounts")
res = results(dds, alpha = 0.05)
Here's what I get then:
> res
log2 fold change (MLE): condition B vs A
Wald test p-value: condition B vs A
DataFrame with 2 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
gene1 61.9988 -2.73178 0.138032 -19.7909 3.56746e-87 3.56746e-87
gene2 59.7801 2.62376 0.126620 20.7216 2.21029e-95 4.42057e-95
It's still not capturing the differential expression pattern. (NB: this is DESeq2, not PyDESeq2)
Hi @BorisMuzellec, this all makes sense. I just wanted to clarify something for my own understanding:
In particular, one of the assumptions is that each gene's count data roughly follows a negative binomial distribution, but gene1 (and gene2) is actually a mixture of two very different negative binomials
But isn't this totally normal? You have two groups drawn from different negative binomial distributions and the test should be able to pick this difference up?
You have two groups drawn from different negative binomial distributions and the test should be able to pick this difference up?
What the model does, for each gene, is fit a negative binomial distribution with a single dispersion parameter but a mean (several means, actually) that depends on the design factors. More precisely:
(This step is actually further decomposed into smaller ones, but we're still using a single dispersion per gene to model its distribution. Also note that the initial mean estimate does take the design into account.)
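As a rough illustration of that model (a sketch under assumptions, not PyDESeq2 code), the snippet below draws counts for one gene whose mean depends on the condition while the dispersion is shared, using the parameterization variance = mu + alpha * mu^2. The helper `sample_nb`, the means 20 and 200, and alpha = 0.5 are arbitrary choices for the example. It then computes a method-of-moments dispersion within each condition and on the pooled samples, to show why the mean estimate has to take the design into account.

```python
import numpy as np

def sample_nb(mu, alpha, size, rng):
    """Draw negative binomial counts with mean mu and dispersion alpha."""
    n = 1.0 / alpha        # NumPy's "number of successes" parameter
    p = n / (n + mu)       # success probability that yields mean mu
    return rng.negative_binomial(n, p, size=size)

rng = np.random.default_rng(0)
alpha = 0.5                                   # one dispersion for the gene
mu_by_condition = {"A": 20.0, "B": 200.0}     # design-dependent means (illustrative)
counts = {
    cond: sample_nb(mu, alpha, 100_000, rng)
    for cond, mu in mu_by_condition.items()
}

def moments_dispersion(x):
    """Method-of-moments dispersion: (var - mean) / mean**2."""
    m = x.mean()
    return (x.var() - m) / m**2

alpha_within = {cond: moments_dispersion(x) for cond, x in counts.items()}
# Ignoring the design (one mean for all samples) inflates the dispersion.
alpha_pooled = moments_dispersion(np.concatenate(list(counts.values())))

print("within-condition:", alpha_within)
print("pooled:", alpha_pooled)
```

Within each condition the estimate should land near the shared alpha, whereas the pooled estimate is much larger because the difference in means gets absorbed into the variance.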
My intuition was that in the first case (80 samples) the model was having a hard time fitting dispersions, but on further inspection it turns out that in the second case (400 samples) PyDESeq2 (after the fix) finds the same dispersion values as in the first, yet LFCs more in line with what we would expect.
Not sure what's going on; perhaps 80 is just too few samples...
So it seems like conceptually it was not far off. That being said, fixing the dispersion parameter and letting the others vary is not helping... maybe it is just too few samples.
Describe the bug I am doing some testing for a project and creating data from negative binomial distributions. Two things are happening.
The p-value for gene2 is extraordinarily low on the model that fits, given that the condition tested contains identical points. The fold change also looks off.
To Reproduce
Expected behavior I would expect gene2 to not have a significant p-value, the log fold change to be smaller, and the 400 sample point example to fit.
Screenshots For the one that fit:
Desktop (please complete the following information):
0.4.7
Additional context Apologies if there is a bug in my code! It's very possible!