theislab / diffxpy

Differential expression analysis for single-cell RNA-seq data.
https://diffxpy.rtfd.io
BSD 3-Clause "New" or "Revised" License
191 stars 23 forks source link

ZINB noise model #131

Open grst opened 4 years ago

grst commented 4 years ago

Van den Berge et al, Genome Biol 2019 argue that using a Zero-inflated negative binomial (ZINB) model significantly boosts the performance of differential gene expression analysis on single-cell data.

Using their package, zingeR or zinbWAVE it is possible to compute per-cell weights, which can be factored in most linear models. Once #117 is fixed, this would also be possible with diffxpy.

However, zingeR and zinbWAVE are quite slow, and it could probably be sped up significantly, or at least massively parallelized when implemented with tensorflow. Ideally, this would be available right from diffxpy as an additional noise model.

There's already a tensorflow implementation of a ZINB model available in the scqtl package. Also there's a ZINB model implemented in numpy within the statsmodels package from which you could draw some inspiration ;)

Is that something you would consider implementing in a future version of diffxpy?

Best, Gregor

CC @abyssum

davidsebfischer commented 4 years ago

Hi Gregor!

  1. Weights are going to come, hopefully this month. We are switching the tf1 backend to tf2 and adding a numpy/dask backend, the weighting will be easier to introduce there than in tf1. The workflow that you propose will be possible then.

  2. To the best of my knowledge, the consensus in the field right now is that zero-inflation was a technical noise effect that only occured in protocols that did not use UMIs, probably related to biased PCR amplification. This is why we do not have zero-inflated models right now. Thank you for the reference, I will read what their opinion is on that topic but I think that these models should be treated with care. However, we can in principle include them. Building the noise function in tensorflow or numpy is relatively easy, one would have to decide on a way to parameterise the mixture weigh pi though and then build a parameter estimation around that. Are you intersted in having a single scalar for pi per cell or per gene? My intuition is also that this should be faster and more accurate than a sequential model fit in which the first round just estimates observation weights.

Best, David

grst commented 4 years ago

Hi David,

thanks for your feedback!

Weights are going to come, hopefully this month. We are switching the tf1 backend to tf2 and adding a numpy/dask backend, the weighting will be easier to introduce there than in tf1. The workflow that you propose will be possible then.

That are great news!

Are you interested in having a single scalar for pi per cell or per gene?

As far as I can tell, in the study mentioned above, they use one weight per cell. However, I do not know if that really makes more sense than having a per-gene weight.

Cheers, Gregor