pedrofale opened this issue 5 years ago
Hi @pedrofale
Thanks for your message.
In the model there are variational parameters for the probability of each cell being assigned to each clone (gamma in the original paper). There's a question of what these should be initialized to -- the two extremes are either (i) you set them to 1 / C for C clones, or (ii) you assign each cell based on the model probabilities at the initial values of all other parameters. Case (i) corresponds to initial_shrink = 0, and case (ii) is, I think, the asymptotic limit as initial_shrink -> infinity, though in practice it holds for any initial_shrink greater than about 10.
In the "original" clonealign, (ii) was always the case. I didn't realize that this was in fact a free parameter, however, so you could envisage a case where such a hard initial assignment actually takes the model into a local optimum that is far from the global optimum. That said, I don't think I've seen a case where run_clonealign has picked a small initial_shrink, implying the initial hard assignment was a good proxy.
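To make the initial_shrink behaviour concrete, here is a minimal sketch (illustrative Python, not clonealign's actual implementation) of initializing gamma by sharpening each cell's clone log-probabilities: shrink = 0 recovers the uniform 1 / C initialization, and a large shrink approaches the hard assignment.

```python
import numpy as np

def init_gamma(log_prob, initial_shrink):
    """Initialize per-cell clone assignment probabilities (gamma).

    log_prob: (cells, clones) array of log model probabilities at the
    initial values of all other parameters. initial_shrink = 0 gives the
    uniform 1 / C initialization; large values approach a hard (one-hot)
    assignment to the most probable clone.
    """
    z = initial_shrink * log_prob
    z = z - z.max(axis=1, keepdims=True)          # numerical stability
    gamma = np.exp(z)
    return gamma / gamma.sum(axis=1, keepdims=True)

log_prob = np.log(np.array([[0.7, 0.2, 0.1]]))
print(init_gamma(log_prob, 0))    # uniform over the 3 clones
print(init_gamma(log_prob, 20))   # nearly one-hot on the first clone
```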
In terms of evaluating the fit, choosing the parameter initialization that maximizes the ELBO over multiple runs seemed the best way to go. When you say "ELBO is not a good proxy for best assignments" what are you basing assignment quality on? I've found it hard to come up with an alternative objective measure (you could imagine e.g. post hoc correlations of genome-transcriptome). Any suggestions are most welcome.
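The multiple-restart selection itself is just "keep the run with the highest ELBO"; as a sketch (the runs structure here is hypothetical, not clonealign's return value):

```python
import numpy as np

# Hypothetical structure: each restart yields (fitted_params, elbo).
def select_best_run(runs):
    """Return the fitted parameters of the restart with the highest ELBO."""
    elbos = np.array([elbo for _, elbo in runs])
    return runs[int(np.argmax(elbos))][0]
```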
I can't update the paper in the sense that it correctly describes the method we used for the original analysis, but I might add a pdf to the github explaining the new model updates.
Kieran
Thank you very much for the great answer!
Yes, I meant the post hoc correlation. In particular, I am experimenting with the SA501 data from your paper and when I set initial_shrinks=0 I get a better ELBO than with a large value -- but in the former case I end up with almost all cells being assigned to clone B! This may be of course due to some bad pre-processing of the data on my side. I will continue to look into it.
I was thinking that it's strange to replace a Negative Binomial model with a Multinomial instead of a Dirichlet-Multinomial, as it may lead to the model being less robust to the overdispersion in the data. I suppose some filtering can help -- but then it's hard to evaluate how much CNV signal we lose. Would it be possible to provide the NB likelihood as a choice in the API?
Thank you again for your time!
> Yes, I meant the post hoc correlation. In particular, I am experimenting with the SA501 data from your paper and when I set initial_shrinks=0 I get a better ELBO than with a large value -- but in the former case I end up with almost all cells being assigned to clone B! This may be of course due to some bad pre-processing of the data on my side. I will continue to look into it.
Interesting. It worries me a little that the model might be intrinsically unidentifiable, leading to behaviour like this. How are you selecting your genes for input to clonealign? This ended up being a bit of a dark art, and I'm considering adding a function to do it.
> I was thinking that it's strange to replace a Negative Binomial model with a Multinomial instead of a Dirichlet-Multinomial, as it may lead to the model being less robust to the overdispersion in the data. I suppose some filtering can help -- but then it's hard to evaluate how much CNV signal we lose. Would it be possible to provide the NB likelihood as a choice in the API?
I'll give you my thinking on this, but I'm happy for discussion / alternatives. In RNA-seq, in a sense, the "true" data-generating mechanism is multinomial: you sample S reads (based on your sequencing budget), and each gene is sampled in proportion to its fractional prevalence (say p_g), so the expected counts are S * p_g. So in the original model we essentially had

likelihood = NegBin(mean = S * p_g, dispersion = RBF function of mean)

where you can justify a functional form of p_g that is clone-dependent (main equation in the paper).
This is all fine up to here -- except, why use a negative binomial likelihood for a multinomial generative model (other than that everyone else does it?), unless we really believe the data are so overdispersed that the multinomial doesn't hold? The side benefit is that this is now much faster, since we got rid of that RBF function (long story). So my solution was to (i) go to a multinomial likelihood and (ii) do some post-hoc filtering to remove highly outlying genes. And now this is fast enough that we can run it multiple times over different parameter settings and pick the version with the highest ELBO.
In terms of dirichlet multinomial, I'm not sure how you would directly translate p_g from above (which is easy to derive and therefore easy to model with a multinomial) into some form of prior counts?
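For illustration, here is a toy Python sketch of the two likelihoods side by side (using scipy; the fixed dispersion phi here is an arbitrary toy value, whereas clonealign tied dispersion to the mean via the RBF function):

```python
import numpy as np
from scipy.stats import multinomial, nbinom

# Toy data: S reads distributed over 3 genes with proportions p_g.
rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])
S = 1000
counts = rng.multinomial(S, p)

# Multinomial log-likelihood: expected counts are S * p_g.
ll_multinomial = multinomial.logpmf(counts, n=S, p=p)

# Negative binomial alternative with mean mu = S * p_g and dispersion phi.
# Parameterization: size n = phi, success prob = phi / (phi + mu),
# which gives E[counts] = mu.
phi = 10.0
mu = S * p
ll_nb = nbinom.logpmf(counts, n=phi, p=phi / (phi + mu)).sum()
```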
Also -- for SA501, do you get the best post-hoc correlations when the cells are assigned as per the scWGS prevalences (A = 80% ish, B = 15% ish, C = 5% ish)? Maybe the ELBO estimate is so noisy that it's not a good selection criterion.
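To be explicit about the post-hoc correlation, a minimal sketch (illustrative Python, not a clonealign function): correlate the mean expression of the cells assigned to each clone with that clone's copy-number profile.

```python
import numpy as np

def posthoc_correlation(expr, cnv, assignments):
    """Per-clone genome-transcriptome correlation.

    expr: (cells, genes) expression matrix.
    cnv: (genes, clones) copy-number matrix.
    assignments: length-cells array of assigned clone indices.
    """
    cors = {}
    for c in range(cnv.shape[1]):
        cells = assignments == c
        if cells.sum() == 0:
            continue
        mean_expr = expr[cells].mean(axis=0)
        cors[c] = np.corrcoef(mean_expr, cnv[:, c])[0, 1]
    return cors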
For filtering, I am going through the steps outlined in run_clonealign.Rmd
from your clonealign-pipeline
repository, so keeping the top 50% variable genes in expression which also vary in CNV across clones.
I totally understand your reasoning regarding the choice of Multinomial, thanks for the nice description. I just would like to see if in practice this is a safe assumption due to the possibly large overdispersion that it does not acommodate, and the results I am getting on the SA501 data seem to indicate otherwise -- if I'm using it correctly.
Which brings me to your final point: when it goes "well", the percentages I obtain are instead (with a variance of about 10%) 60, 30 and 10. It does not look very good, which is why I haven't ruled out some bug on my side! (I can only continue experimenting next week though as I am going on vacation at this very moment :-) ). If you have the possibility to try it I would be very happy to know what results you got, and with which parameters.
Thanks again!
Okay thanks. I'll have a play around this week. This is an important point. I think the issue is the original clonealign, which worked well, basically just jumped into a local optima, so now we've opened the pandora's box of mutliple restarts, how you choose the best one is hard.
As a side note -- 60, 30, 10, is actually okay -- since (i) the assignments are uncertain, and (ii) the prevalences from the scWGS are uncertain, and (ii) the RNA and DNA are taken from different passages so we don't expect them to align exactly (and actually the RNA comes from the passage before the DNA where clone A should have less prevalence). If you plot the expression of XIST by clone, this is a really good diagnostic to tell it's working (for SA501) since clone A has lost an X copy.
I'll work on automating gene selection as there are a few edge cases that are annoying.
Played around with this today. On SA501, you're right in that there's not much correlation between ELBO and correlation as a measure of which is better, but intuitively you could argue for either. Also the initial_shrink = 0 never gets selected as far as I can tell, so I think I'll change the default to 5,10, 20.
The main difference from what you described is that I never saw clone B get the majority assignment? I always had A at much higher than B or C, and the plots of Xist show it looks okay. So I think overall the multinomial is still good.
I've made a gist of how I ran clonealign -- but I realize it includes quite a lot of pre-processing, so I'm going to wrap it up into a function. The processed data can be found here.
On my system this gives prevalences (A,B,C) of (0.633672117, 0.192664605, 0.172337605) and Xist looks like
and if you subset that to high confidence clones it looks better.
Thank you for this nice reference. I will try to reproduce it next week when I am back. The fact that in my runs the best ELBO was achieved at initial_shrinks = 0 may also be due to a bad ELBO estimate, as it was only better by a small bit, so I will also try increasing the number of MC samples.
Nevertheless, the main difference to yours may definitely be in the pre-processing. I did some very basic retaining of the top 50% HVGs as quantified by their log pseudocounts variance. I will follow the same steps as you and report back next week!
Played around with this a little more: the single most important filtering step seems to be retaining only autosomes for analysis (since e.g. loss of X is probably inactive copy)
Hi Kieran,
Did you assess the performance of the new version of the package versus the previous one? I have noticed some things:
initial_shrinks
parameter;In my experiments, a high value of
initial_shrinks
seems to give more stability. Does this make sense? What does this parameter actually do? It's hard to get it from the code.It would be nice to have some sort of idea about how this new model works. Maybe update the paper? Thanks!