scverse / scvi-tools

Deep probabilistic analysis of single-cell and spatial omics data
http://scvi-tools.org/
BSD 3-Clause "New" or "Revised" License

Non-zero entries marked as dropouts #177

Closed pedrofale closed 6 years ago

pedrofale commented 6 years ago

Hey guys, when I run scVI I get a lot of non-zero entries with probability of dropout > 0.5. This doesn't seem right. How can it be prevented? I did not find any protection against this in the code.

Edouard360 commented 6 years ago

Hi @pedrofale,

There is a serious bug in the code. While trying to reproduce the paper results with the PyTorch implementation, I ran into the same issue. In https://github.com/YosefLab/scVI/blob/master/docs/notebooks/scVI-reproducibility.ipynb, Figure 13.b), instead of ranging from 0.1 to 0.4, the zero probability from the negative binomial is in the 0.05 to 0.1 range, which is worrying.

Besides, something related to the uncertainties might be wrong, as suggested by the DE plot in Figure 12, especially the null distribution, which is very spiky.

I very recently realized there was no dropout in the decoder part of the scVI model in the TensorFlow implementation with the standard default parameters, i.e. if n_layers = 1. (ref: https://github.com/romain-lopez/scVI-reproducibility/blob/2e6c3f9eec953ab0b7c24691911ad02a6010bd8a/scVI.py#L410)

This is now corrected, but there might still be a big difference in the model that I am not seeing. The only change we made, in theory, was the ordering of the layers in the fully connected blocks of the neural network architecture, from:

  • Dropout, Dense, Batch Norm, ReLU

to:

  • Dense, Batch Norm, ReLU, Dropout
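For concreteness, here is a minimal PyTorch sketch of the two orderings (the layer sizes and dropout rate below are placeholders for illustration, not the actual scVI hyperparameters):

```python
import torch.nn as nn

n_in, n_hidden, dropout_rate = 128, 128, 0.1  # placeholder sizes

# Previous ordering: dropout applied to the block's input
old_block = nn.Sequential(
    nn.Dropout(p=dropout_rate),
    nn.Linear(n_in, n_hidden),
    nn.BatchNorm1d(n_hidden),
    nn.ReLU(),
)

# Corrected ordering: dropout applied to the block's output
new_block = nn.Sequential(
    nn.Linear(n_in, n_hidden),
    nn.BatchNorm1d(n_hidden),
    nn.ReLU(),
    nn.Dropout(p=dropout_rate),
)
```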

It might be something as big as that difference I hadn't spotted, or something related to PyTorch's and TensorFlow's different optimization/regularization schemes (AdamOptimizer, BatchNorm implementation, etc.).

I am actively looking into it, but if you are familiar with the previous TensorFlow implementation and might have an idea about the origin of the problem, I would be glad to have your input! 🙂

Meanwhile, there is still the reconstruction_loss='nb' parameter available at model instantiation - though of course this is not a solution to the problem.

pedrofale commented 6 years ago

Good to know it’s not happening just to me. I actually ran the TensorFlow implementation and the issue is still there. I don’t think it is due to the optimization procedure.

Why not just add a line of code saying that if the (n,p) entry in the observation matrix is non-zero, the dropout probability should be forced to zero?

Edouard360 commented 6 years ago

Ah, that is interesting... thanks for pointing that out.

[screenshot: https://user-images.githubusercontent.com/15527397/44940240-a571a200-ad40-11e8-857f-11f0575eaebf.png]

Actually, we already optimize for the dropout probability to be zero in the case of non-zero entries.

(f_h(z)_j gets decreased for any non-zero entry)
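To make that concrete, here is a minimal sketch of a ZINB log-likelihood in that spirit (my own illustrative code, not the scVI implementation; the parameterization via pi_logits = f_h(z) is an assumption). For non-zero entries only the NB branch contributes, weighted by (1 - pi), so maximizing the likelihood pushes the dropout logit down for those entries:

```python
import torch
import torch.nn.functional as F

def zinb_log_likelihood(x, mu, theta, pi_logits, eps=1e-8):
    """Per-entry log-likelihood of a zero-inflated negative binomial.

    x         : observed counts, shape (n_cells, n_genes)
    mu        : NB mean from the decoder (positive)
    theta     : NB inverse dispersion (positive)
    pi_logits : dropout logits, e.g. the output of f_h(z)
    """
    log_theta_mu = torch.log(theta + mu + eps)
    # log NB(x; mu, theta)
    log_nb = (
        theta * (torch.log(theta + eps) - log_theta_mu)
        + x * (torch.log(mu + eps) - log_theta_mu)
        + torch.lgamma(x + theta)
        - torch.lgamma(theta)
        - torch.lgamma(x + 1.0)
    )
    # x > 0: only the NB branch, weighted by (1 - pi);
    # maximizing this term decreases the dropout probability for non-zero entries
    ll_nonzero = F.logsigmoid(-pi_logits) + log_nb
    # x = 0: mixture of a technical zero (pi) and an NB zero (1 - pi)
    log_nb_zero = theta * (torch.log(theta + eps) - log_theta_mu)
    ll_zero = torch.logsumexp(
        torch.stack([F.logsigmoid(pi_logits), F.logsigmoid(-pi_logits) + log_nb_zero]),
        dim=0,
    )
    return torch.where(x > 0, ll_nonzero, ll_zero)
```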

pedrofale commented 6 years ago

Yes, but notice that the loss depends on the latent variables, and the dropout probability is given by a mapping from z (which is shared with the NB term). My guess is that the z's that produce a low NB loss are somehow not compatible with the ones that produce a low ZI loss, and the f_h mapping only controls the dropout probability gene-wise, which is limiting.

Not sure if I am making myself clear, but I am not sure that just setting the loss as you did (and rightfully so) is enough to prevent these false dropouts from appearing.

Edouard360 commented 6 years ago

I see. It makes me think of Figure 13 in the preprint, which shows how the zero probability from the ZI component correlates with the QC metrics. Perhaps we should include this kind of information from the raw metadata in the decoding.
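Just to sketch what that could look like (purely illustrative; qc is a hypothetical per-cell covariate matrix and the layer sizes are placeholders, none of this is an existing scVI interface), the dropout decoder could be conditioned on the QC metrics by concatenating them to the latent code:

```python
import torch
import torch.nn as nn

n_latent, n_qc, n_genes = 10, 3, 2000  # placeholder dimensions

# Hypothetical dropout decoder f_h that sees QC covariates alongside z
f_h = nn.Sequential(
    nn.Linear(n_latent + n_qc, 128),
    nn.ReLU(),
    nn.Linear(128, n_genes),  # per-gene dropout logits
)

z = torch.randn(64, n_latent)  # latent codes for a minibatch of cells
qc = torch.randn(64, n_qc)     # e.g. standardized library size, % mito, genes detected
dropout_prob = torch.sigmoid(f_h(torch.cat([z, qc], dim=-1)))
```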

pedrofale commented 6 years ago

My thoughts exactly. Let me know if you end up changing something regarding this problem.

jeff-regier commented 6 years ago

Pedro and Eddie, I'm not sure you're talking about the same issue. Perhaps because "dropout" has two meanings, both related to scVI? As I understand it, Pedro is concerned about the number of biological dropout events that scVI detects (that is, where the observed gene expression level is zero but the real expression level isn't zero). I think Eddie is talking about a bug he just fixed, where we were applying dropout regularization to the wrong layer of our neural network. Is that right?

The bug Eddie fixed only affected our pytorch implementation (i.e. this repo). Pedro, you say the tensorflow code had the same issue? Why does the zero-inflation seem too high?

Thanks for reporting the issue!

pedrofale commented 6 years ago

Yes, I am talking about biological dropouts. I do not know what the source of the problem is, but when you fit the TensorFlow scVI model to the Zeisel data, for example, and look at the posterior probability that a given entry is a dropout (i.e., Bernoulli probability > 0.5), there are non-zero entries for which this probability is indeed > 0.5.
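To illustrate the check being described, here is a toy sketch (encoder_mean and f_h below are random stand-ins for whatever a trained model exposes, not actual scVI calls):

```python
import torch
import torch.nn as nn

# Toy stand-ins for a trained model's pieces (not the actual scVI API)
n_cells, n_genes, n_latent = 64, 2000, 10
counts = torch.randint(0, 20, (n_cells, n_genes)).float()  # toy count matrix
encoder_mean = nn.Linear(n_genes, n_latent)                 # stand-in for E_q[z | x]
f_h = nn.Linear(n_latent, n_genes)                          # stand-in for the dropout decoder

with torch.no_grad():
    z_mean = encoder_mean(counts)              # "posterior mean" of z
    dropout_prob = torch.sigmoid(f_h(z_mean))  # per-entry dropout probability

# Fraction of non-zero entries flagged as dropouts (probability > 0.5)
suspicious = (counts > 0) & (dropout_prob > 0.5)
print(suspicious.float().mean().item())
```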

jeff-regier commented 6 years ago

That doesn't seem wrong to me, necessarily. I don't think that value can be interpreted as a posterior probability (or an approximate posterior probability). Is that what you're doing? I agree that if the observed expression level isn't zero, then, in the posterior, the probability of a biological dropout event having occurred is zero.

pedrofale commented 6 years ago

Yes that’s what I am doing. How do you suggest obtaining the posterior probability of dropout then?

jeff-regier commented 6 years ago

We integrate h (the dropout event) out of the model analytically before performing variational inference. Our procedure approximates p(z, l | x), where z is the latent representation for a cell's biology (excludes batch effects), l is the library size for a cell, and x is the data (i.e., the observed expression levels).

It's a bit tricky to approximate p(h | x). Is that the probability you want? That might require a different approximate inference procedure.
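For intuition, here is what the entry-wise posterior over the dropout indicator looks like if you condition on z as well as x, written out from the ZINB form with π = f_h(z) (a sketch, not a formula from the manuscript):

```latex
p(h_{ng} = 1 \mid x_{ng}, z_n) =
\begin{cases}
0, & x_{ng} > 0, \\[4pt]
\dfrac{\pi_{ng}}{\pi_{ng} + (1 - \pi_{ng})\,\mathrm{NB}(0 \mid \mu_{ng}, \theta_g)}, & x_{ng} = 0,
\end{cases}
\qquad \pi_{ng} = f_h(z_n)_g.
```

In this view, passing a posterior z through f_h yields π_{ng}, i.e. p(h = 1 | z), rather than p(h = 1 | x), which would require marginalizing over q(z | x).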

pedrofale commented 6 years ago

Yes I understand that. But while it is integrated out, the neural network f_h that decodes from z to h is optimized, and its output is the dropout probability, according to the manuscript. So after optimization, passing the posterior z through f_h should give the posterior dropout probability for each point. Right?

jeff-regier commented 6 years ago

No, I don't think the procedure you describe approximates the posterior dropout probabilities p(h | x). The procedure you describe computes something more like the probability, if you could somehow observe that same cell twice, that the gene would be a dropout the second time, conditioning on your first observation. We may need to revisit what we wrote about this topic in the manuscript. It's a pretty subtle (and important) issue.

pedrofale commented 6 years ago

I see. What you say makes sense. But then the way the loss is computed kind of contradicts that, because the dropout probability term used is exactly the value obtained as I described.

jeff-regier commented 6 years ago

Which loss do you mean, when you say "the way the loss is computed"? Not the loss that is optimized at training time, right? Do you mean the loss in one of the tables/figures in the paper on bioRxiv?

jeff-regier commented 6 years ago

@romain-lopez Any idea what loss Pedro might be referring to? Let's close this issue if it isn't something we can fix or clarify in the manuscript.

romain-lopez commented 6 years ago

Hi, what appears in the loss is a conditional probability p(x | z) for a fixed z, not a whole posterior distribution. Also,

The procedure you describe computes something more like the probability, if you could somehow observe that same cell twice, that the gene would be a dropout the second time, conditioning on your first observation.

This looks accurate to me.