theislab / diffxpy

Differential expression analysis for single-cell RNA-seq data.
https://diffxpy.rtfd.io
BSD 3-Clause "New" or "Revised" License
193 stars 23 forks source link

Exaggerated estimates and low standard deviation #191

Open Hrovatin opened 3 years ago

Hrovatin commented 3 years ago

Sometimes I get unexpectedly high estimates with very small standard deviation, see below. These estimates are not in accordance to expected lFC, but are significant.

(Image: fitted log2fc is equal to coef mle, colour scale is viridis - yellow is large, violet is small) image Significant estimates with very high absolute lFC all have sd at single number, may be some sort of machine/statistical method precision? Is this the same for you @cdedonno ?


# All significant genes with very high abs lFC as shown on the image above have sd at same number.

summaries['16 d'].query('qval< 0.1 & abs(log2fc) > 10').coef_sd.unique()
array([2.22275875e-162])
cdedonno commented 3 years ago

Hi @Hrovatin! Yeah, as we discussed last friday, this seems to be indeed a very similar issue to the one I presented. You also have these stack of genes sitting at very high (or very low) lFC values. Have you checked if these genes are lowly expressed?

davidsebfischer commented 3 years ago

If you want to look into whether this is related to inversion of the fisher information matrix for the wald test, compare -model._hessian (https://github.com/theislab/batchglm/blob/31b905b99b6baa7c94b82550d6a74f00d81966ea/batchglm/train/numpy/base_glm/estimator.py#L544) and model._fisher_inv (https://github.com/theislab/batchglm/blob/31b905b99b6baa7c94b82550d6a74f00d81966ea/batchglm/train/numpy/base_glm/estimator.py#L548)

davidsebfischer commented 3 years ago

There could also be numerical differences between the single coefficient wald test https://github.com/theislab/diffxpy/blob/c94e60a51a818363c57b7f112417dcb9bb96fe37/diffxpy/stats/stats.py#L181 and the generalisation to more parameters https://github.com/theislab/diffxpy/blob/c94e60a51a818363c57b7f112417dcb9bb96fe37/diffxpy/stats/stats.py#L221.

Hrovatin commented 3 years ago

Regarding weird lfc being at mainly unexpressed genes (based on n_cells expressing a gene): This is not the case always for me. The above example: image Another example, but here I may have too many coefficients so the model is not fit well. image Another more realistic example, before and after filtering out genes with sd at 2e-162 image image

N cells for genes with qval<0.05 and abs(lfc)>20: gene
Gm15462        33
Sypl2          38
Adgrg6         40
Gm12195        40
Cdc34b        123
Igf2bp1       309
AC163633.2     35
dtype: int64
Mean expr in gene-expr cells for genes with qval<0.05 and abs(lfc)>20: gene
Gm15462       1.058740
Sypl2         0.947558
Adgrg6        1.208749
Gm12195       1.105990
Cdc34b        0.919376
Igf2bp1       1.024707
AC163633.2    0.956821
dtype: float64

As you can see I have left a couple significant genes with high abs(lfc) that could not be filtered out based on mean expression in cells expressing the gene or based on n_cells as such high threshold would remove majority of genes (now I use threshold of gene being expressed in at least 30 cells and I have ~15000 genes left).

EDIT: Never mind the dotplot that was here before - I plotted it without taking into account the confounders.

Hrovatin commented 3 years ago

@davidsebfischer How should I compare the Hessian and Fisher info matrices?

davidsebfischer commented 3 years ago

I d just collect a couple of genes that look weird and than look for signs of potential numeric instability in the matrices! Small values, large values etc.

Hrovatin commented 3 years ago

Now I managed to get to a point where sd filtering (with what seems to be min possible sd value (across different tests)) produces ok results (I also improved design and increased n_cells gene filtering (now approx ~1% of population should express the gene)). So from my perspective we can close the issue, although sd filtering could be added to diffxpy if this is a common issue and min sd stays the same across platforms.

Hrovatin commented 3 years ago

This occurs for me when I have only categorical covariates but usually not if I have continuous covariates.