kieranrcampbell / ouija

Descriptive probabilistic marker gene approach to single-cell pseudotime inference
http://kieranrcampbell.github.io/ouija

Inference without modeling dropout? #9

Open mochar opened 5 years ago

mochar commented 5 years ago

Hi Kieran, thanks for the cool package. I am interested in learning more about Bayesian methods, so your other work seems interesting as well!

Recently there has been talk about how UMI count data in scRNA-seq is not zero-inflated. Instead, it is recommended to model UMI counts using a negative binomial (or even a Poisson) distribution. (I can share some papers if you'd like.)

For this reason, I was wondering whether there is a way to omit the explicit modeling of zero counts. I would also like to hear your thoughts on applying the aforementioned distributions directly to the gene counts instead of to the log-transformed CPM data.

Thanks!

kieranrcampbell commented 5 years ago

Hi @mochar

This is an excellent question. Ouija dates from the dark days of single-cell analysis when I / we would log-transform the data and model the log counts with Gaussians, rather than directly modelling the raw counts with e.g. negative binomials.

From a modelling perspective, if you log the data and use a Gaussian you probably do want to model a zero-inflated component, since the Gaussian is misspecified: being continuous, it puts essentially no probability mass at exactly zero, whereas the negative binomial actually does have probability mass there. However, if we didn't log the data and used a negative binomial then, as Valentine Svensson and others have recently pointed out, you probably wouldn't want to include inflation at zero.
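To make the point about mass at zero concrete, here is a minimal sketch (not part of Ouija). It assumes a negative binomial parametrized by its mean `mu` and a dispersion `phi`, so that the variance is `mu + mu**2 / phi`. The function names are made up for illustration.

```python
import math


def nb_zero_prob(mu: float, phi: float) -> float:
    """P(Y = 0) for a negative binomial with mean mu and variance mu + mu**2 / phi.

    Illustrative helper; the (mu, phi) parametrization is an assumption.
    """
    return (phi / (phi + mu)) ** phi


def poisson_zero_prob(mu: float) -> float:
    """P(Y = 0) for a Poisson with mean mu."""
    return math.exp(-mu)


if __name__ == "__main__":
    # Low-expression genes already produce many zeros without any inflation term.
    for mu in (0.1, 1.0, 5.0):
        print(f"mean={mu}: NB(phi=1) P(0)={nb_zero_prob(mu, 1.0):.3f}, "
              f"Poisson P(0)={poisson_zero_prob(mu):.3f}")
```

With the same mean, the overdispersed negative binomial never puts less mass on zero than the Poisson, which is one way to see why a separate zero-inflation component may be unnecessary for raw counts.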

In terms of results, I suspect for Ouija it would make little difference, but if it's something you would find useful we could create a negative binomial variant. I think the modification would be fairly trivial (remove the zero inflation, change the likelihood to NB with the mean exponentiated). The mean-variance parametrization might be a little tricky however.
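A rough sketch of what that modification could look like at the likelihood level, written in plain Python rather than in Ouija's own code. It assumes the mean/dispersion form of the negative binomial (variance `mu + mu**2 / phi`), with the mean obtained by exponentiating a linear predictor `log_mu`. This is the same form Stan exposes as `neg_binomial_2_log`, but whether Ouija would use that function is an assumption. `nb_logpmf` and `nb_loglik` are hypothetical names.

```python
import math
from typing import Sequence


def nb_logpmf(y: int, log_mu: float, phi: float) -> float:
    """Log-probability of count y under an NB with mean exp(log_mu) and dispersion phi."""
    mu = math.exp(log_mu)  # the "mean exponentiated" step
    return (
        math.lgamma(y + phi) - math.lgamma(phi) - math.lgamma(y + 1)
        + phi * math.log(phi / (phi + mu))
        + y * math.log(mu / (phi + mu))
    )


def nb_loglik(counts: Sequence[int], log_mus: Sequence[float], phi: float) -> float:
    """Sum of NB log-probabilities over cells; there is no zero-inflation term."""
    return sum(nb_logpmf(y, m, phi) for y, m in zip(counts, log_mus))


if __name__ == "__main__":
    counts = [0, 3, 1, 0, 7]                       # made-up raw counts for one gene
    log_mus = [-1.0, 1.0, 0.0, -0.5, 2.0]          # made-up per-cell log means
    print(nb_loglik(counts, log_mus, phi=2.0))
```

The dispersion `phi` is where the mean-variance question comes in: it could be shared across genes, made gene-specific, or tied to the mean, and this sketch leaves it as a single free parameter.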

Thanks

Kieran

mochar commented 5 years ago

Thank you for the quick and helpful response @kieranrcampbell (and apologies for my slow response!).

I share your suspicion that in the end it would make little difference. I applied slalom to logcounts using a Hurdle noise model, and to scTransform-corrected counts with a Poisson model. Both options led to the same set of factors with similar cell loadings. I also used the Gaussian noise model on scTransform's Pearson residuals (which are not really normally distributed), and the results were again mostly the same.

Nonetheless, for the sake of using the same noise model when applying both tools, I would actually appreciate an NB and/or Poisson implementation. If you have the time for this, I would be glad to test it out and share the results.

Thanks again!