Open Hrovatin opened 3 years ago
Hi @Hrovatin! Yeah, as we discussed last friday, this seems to be indeed a very similar issue to the one I presented. You also have these stack of genes sitting at very high (or very low) lFC values. Have you checked if these genes are lowly expressed?
If you want to look into whether this is related to inversion of the fisher information matrix for the wald test, compare -model._hessian
(https://github.com/theislab/batchglm/blob/31b905b99b6baa7c94b82550d6a74f00d81966ea/batchglm/train/numpy/base_glm/estimator.py#L544) and model._fisher_inv
(https://github.com/theislab/batchglm/blob/31b905b99b6baa7c94b82550d6a74f00d81966ea/batchglm/train/numpy/base_glm/estimator.py#L548)
There could also be numerical differences between the single coefficient wald test https://github.com/theislab/diffxpy/blob/c94e60a51a818363c57b7f112417dcb9bb96fe37/diffxpy/stats/stats.py#L181 and the generalisation to more parameters https://github.com/theislab/diffxpy/blob/c94e60a51a818363c57b7f112417dcb9bb96fe37/diffxpy/stats/stats.py#L221.
Regarding weird lfc being at mainly unexpressed genes (based on n_cells expressing a gene): This is not the case always for me. The above example: Another example, but here I may have too many coefficients so the model is not fit well. Another more realistic example, before and after filtering out genes with sd at 2e-162
N cells for genes with qval<0.05 and abs(lfc)>20: gene
Gm15462 33
Sypl2 38
Adgrg6 40
Gm12195 40
Cdc34b 123
Igf2bp1 309
AC163633.2 35
dtype: int64
Mean expr in gene-expr cells for genes with qval<0.05 and abs(lfc)>20: gene
Gm15462 1.058740
Sypl2 0.947558
Adgrg6 1.208749
Gm12195 1.105990
Cdc34b 0.919376
Igf2bp1 1.024707
AC163633.2 0.956821
dtype: float64
As you can see I have left a couple significant genes with high abs(lfc) that could not be filtered out based on mean expression in cells expressing the gene or based on n_cells as such high threshold would remove majority of genes (now I use threshold of gene being expressed in at least 30 cells and I have ~15000 genes left).
EDIT: Never mind the dotplot that was here before - I plotted it without taking into account the confounders.
@davidsebfischer How should I compare the Hessian and Fisher info matrices?
I d just collect a couple of genes that look weird and than look for signs of potential numeric instability in the matrices! Small values, large values etc.
Now I managed to get to a point where sd filtering (with what seems to be min possible sd value (across different tests)) produces ok results (I also improved design and increased n_cells gene filtering (now approx ~1% of population should express the gene)). So from my perspective we can close the issue, although sd filtering could be added to diffxpy if this is a common issue and min sd stays the same across platforms.
This occurs for me when I have only categorical covariates but usually not if I have continuous covariates.
Sometimes I get unexpectedly high estimates with very small standard deviation, see below. These estimates are not in accordance to expected lFC, but are significant.
(Image: fitted log2fc is equal to coef mle, colour scale is viridis - yellow is large, violet is small) Significant estimates with very high absolute lFC all have sd at single number, may be some sort of machine/statistical method precision? Is this the same for you @cdedonno ?