oscarclivio / AutoZI_reproducibility

Reproducibility notebooks for the AutoZI paper, or "Detecting Zero-Inflated Genes in Single-Cell Transcriptomics Data"
https://www.biorxiv.org/content/10.1101/794875v2

Zero-inflated, now what? #1

Closed Maarten-vd-Sande closed 5 years ago

Maarten-vd-Sande commented 5 years ago

I just read your pre-print, and the method looks very promising :tada:! I am thinking about how this could be a nice addition to our analysis, but I am unsure how to interpret its results.

Do I understand correctly that the method outputs either zero-inflated, or not zero-inflated, and not e.g. the percentage of zero-inflation?

If so, how do I adjust my analysis based on whether it is or isn't zero-inflated? Should I iteratively remove zeros until the method no longer reports zero-inflation?

oscarclivio commented 5 years ago

Hey Maarten!

> I just read your pre-print, and the method looks very promising!

Great to hear that, thank you very much!

> Do I understand correctly that the method outputs either zero-inflated, or not zero-inflated, and not e.g. the percentage of zero-inflation?

More precisely, for each gene the method outputs the parameters alpha_g and beta_g of the posterior distribution q(delta_g) of the random variable delta_g, which is the coefficient of the NB component in the NB-ZINB mixture. In our analysis, we focused on the decision rule q(delta_g < 0.5) > 0.5 for ZINB, where q(delta_g < 0.5) is the posterior probability that the gene is zero-inflated (with "delta_g < 0.5" as the hypothesis for zero-inflation). This rule classifies each gene as either "zero-inflated" or "not zero-inflated". However, you are free to work directly with q(delta_g < 0.5), or to set your own threshold depending on your tolerance for false positives versus false negatives. ROC curves can help you choose this threshold.

Another metric is the expectation of delta_g w.r.t. q(delta_g), equal to alpha_g / (alpha_g + beta_g), which also lies between 0 and 1. It may be easier to interpret than q(delta_g < 0.5): since delta_g is the coefficient of NB in the mixture model and 1 - delta_g is the coefficient of ZINB, delta_g can be read as the percentage of absence of zero-inflation and 1 - delta_g as the percentage of zero-inflation.
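As a rough illustration of the two summaries described above, here is a minimal sketch that assumes q(delta_g) is a Beta(alpha_g, beta_g) distribution, which is what the mean alpha_g / (alpha_g + beta_g) suggests. The function names, arguments, and the grid-based integration are illustrative choices for this reply and are not part of the AutoZI code; with SciPy available, `scipy.stats.beta.cdf` would be a more accurate way to get the same probability.

```python
import math


def beta_posterior_summary(alpha, beta, threshold=0.5, n_grid=20000):
    """Summarise an assumed Beta(alpha, beta) posterior over delta_g.

    Returns (E[delta_g], q(delta_g < threshold)). The probability is
    approximated on a midpoint grid, which is adequate for smooth
    densities (alpha, beta >= 1) and only rough otherwise.
    """
    # Posterior mean of delta_g, i.e. the expected weight of the NB component.
    mean_delta = alpha / (alpha + beta)

    # Midpoints of n_grid equal bins on (0, 1); endpoints are never hit.
    step = 1.0 / n_grid
    xs = [(i + 0.5) * step for i in range(n_grid)]

    # Unnormalised log-density of Beta(alpha, beta) at each midpoint.
    log_w = [(alpha - 1) * math.log(x) + (beta - 1) * math.log(1 - x) for x in xs]

    # Subtract the maximum before exponentiating to avoid underflow.
    top = max(log_w)
    weights = [math.exp(v - top) for v in log_w]

    # Self-normalised mass below the threshold approximates the Beta CDF.
    total = sum(weights)
    below = sum(w for x, w in zip(xs, weights) if x < threshold)
    return mean_delta, below / total


def classify_gene(alpha, beta, cutoff=0.5):
    """Apply the decision rule q(delta_g < 0.5) > cutoff to one gene."""
    mean_delta, prob_zi = beta_posterior_summary(alpha, beta)
    return {
        "mean_delta": mean_delta,          # read as "share of NB"
        "prob_zero_inflated": prob_zi,     # q(delta_g < 0.5)
        "zero_inflated": prob_zi > cutoff,
    }


if __name__ == "__main__":
    # Made-up (alpha_g, beta_g) pairs, purely for demonstration.
    for name, (a, b) in {"gene_A": (2.0, 8.0), "gene_B": (8.0, 2.0)}.items():
        print(name, classify_gene(a, b))
```

Raising `cutoff` makes the rule more conservative about calling a gene zero-inflated, which is the trade-off between false positives and false negatives mentioned above.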

> If so, how do I adjust my analysis based on whether it is or isn't zero-inflated? Should I iteratively remove zeros until the method no longer reports zero-inflation?

I would not suggest removing zeros, or more generally trying to engineer the output of the model, as zero-inflation may reveal technical patterns but also genuine biological ones, such as transcriptional burst kinetics. Both types of patterns are present in our analysis, and we have not perfectly disentangled them yet. Also, the method reports zero-inflation for individual genes, but it is fitted on datasets with several genes. How would you handle cells that have a zero for one gene reported as ZI but a positive count for another gene reported as ZI? You would make one gene less ZI but the other more ZI.

> I am thinking about how this could be a nice addition to our analysis, but I am unsure how to interpret its results.

Feel free to contact me at oclivio {{at}} live . fr if you would like to explain your analysis in more detail in a private setting!

Maarten-vd-Sande commented 5 years ago

Hi @oscarclivio, thanks for your fast reply, very useful. I think I misunderstood the meaning of zero-inflation. I thought it meant that there are 'too many' zeros in the data; however, it seems to mean that you'd need e.g. a zero-inflated negative binomial (or a beta-Poisson distribution) to fit these values properly.

For now you've answered my questions, but I know how to reach you :smile:

Thanks!

oscarclivio commented 5 years ago

Hi @Maarten-vd-Sande,

Thanks for your own quick reply!

Exactly, zero-inflation refers more to a component of a statistical model than to a drawback of your data that must be suppressed. Feel free to let me know (here again or by email) if you have any other questions!