vplagnol / ExomeDepth

ExomeDepth R package for the detection of copy number variants in exomes and gene panels using high throughput DNA sequencing data.

Confused by bestSd in get_loglike_matrix function #8

Closed: evolvedmicrobe closed this issue 7 years ago

evolvedmicrobe commented 7 years ago

I had a question regarding how you are modeling the effect of duplications and deletions on expected copy number that I could not resolve from your paper; I was hoping you could help.

In the code, the get_loglike_matrix function calculates the expected log likelihood for the default copy number of 2 in a straightforward way.

https://github.com/evolvedmicrobe/ExomeDepth/blob/master/src/CNV_estimate.cpp#L75-L77

However, I do not understand how you are changing this estimate to account for copy numbers of 1 and 3, and in particular the role of the bestSd variable. Based on the paper and my understanding of the code, it seems that if a target is deleted, the expectation changes from X / (X + Y) to 0.5X / (0.5X + Y). This seems like a straightforward way to change the expected proportion of reads in the test set under the deletion hypothesis.
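To make sure I'm reading this correctly, here is a minimal sketch of that expectation change as I understand it (the names X and Y and the copy-number scaling c/2 are my reading of the paper, not the actual ExomeDepth code):

```cpp
#include <iostream>

// Expected proportion of reads in the test sample for a given copy number,
// assuming X = expected test reads under CN 2 and Y = reference reads.
// The copy-number ratio c/2 scales only the test contribution.
double expected_proportion(double X, double Y, int copy_number) {
    double scaled = (copy_number / 2.0) * X;
    return scaled / (scaled + Y);
}

int main() {
    double X = 100.0, Y = 300.0;
    std::cout << "CN1 (deletion):    " << expected_proportion(X, Y, 1) << "\n"  // 0.5X / (0.5X + Y)
              << "CN2 (diploid):     " << expected_proportion(X, Y, 2) << "\n"  // X / (X + Y)
              << "CN3 (duplication): " << expected_proportion(X, Y, 3) << "\n"; // 1.5X / (1.5X + Y)
    return 0;
}
```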

However, this change to the expectation seems to occur in only one place, and there is a second variable, bestSd, that is not updated.

https://github.com/evolvedmicrobe/ExomeDepth/blob/master/src/CNV_estimate.cpp#L73

Since the a1 and a2 variables used to calculate the likelihood are derived from the expected proportion and overdispersion parameters, shouldn't they also be updated to reflect the difference in the expected proportion? Any help understanding this would be appreciated.
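For reference, my mental model of how a1 and a2 could be derived from an expected proportion p and a standard deviation s is the usual method-of-moments parameterisation of the beta distribution. This is a sketch of my reading, not necessarily what CNV_estimate.cpp actually does:

```cpp
#include <iostream>
#include <stdexcept>

// Method-of-moments beta shape parameters from the mean p and sd s of the
// success-probability parameter of a beta-binomial.
// For Beta(a1, a2): mean = a1 / (a1 + a2), var = p(1 - p) / (a1 + a2 + 1).
struct BetaParams { double a1, a2; };

BetaParams beta_from_mean_sd(double p, double s) {
    double nu = p * (1.0 - p) / (s * s) - 1.0; // concentration a1 + a2
    if (nu <= 0.0)
        throw std::invalid_argument("sd too large for the given mean");
    return { p * nu, (1.0 - p) * nu };
}

int main() {
    BetaParams b = beta_from_mean_sd(0.25, 0.02);
    std::cout << "a1 = " << b.a1 << ", a2 = " << b.a2 << "\n";
    return 0;
}
```

Under this reading, when the mean shifts to 0.5X / (0.5X + Y) but s stays fixed at bestSd, a1 and a2 do change through p, yet the spread of the beta is held at the value estimated under the diploid hypothesis, which is what prompted my question.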

vplagnol commented 7 years ago

Thanks for the note, Nigel. The code goes back quite a way, so it's hard to recall exactly why I made these choices at the time, and I can certainly be convinced that they could be improved.

My rationale was that in the presence of a deletion/duplication the expectation clearly changes, that's obvious. Now, the beta-binomial conditional on the "p" parameter is binomial, and it is the standard deviation on that "p" parameter that makes it a beta-binomial. My view at the time, I suppose, was that there was no clear parameterisation of that standard deviation, so I might as well keep it constant. This is not an obvious choice, and I could see some scaling making sense there, but is it obvious which one? One could parameterise the sd with a factor of 0.5 or 1.5 depending on deletion/duplication, but I am also not sure that is the wisest choice.
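To make the two options concrete, here is a hedged sketch of "constant sd" versus "scaled sd" for a deletion. The variable bestSd, the helper functions, and the 0.5 scaling factor are illustrative stand-ins, not the package's actual code:

```cpp
#include <iostream>

// Expected proportion for a given copy number (same assumptions as above).
double expected_proportion(double X, double Y, int copy_number) {
    double scaled = (copy_number / 2.0) * X;
    return scaled / (scaled + Y);
}

// Beta concentration nu = a1 + a2 implied by mean p and sd s
// (method of moments; a1 = p * nu, a2 = (1 - p) * nu).
double concentration(double p, double s) {
    return p * (1.0 - p) / (s * s) - 1.0;
}

int main() {
    double X = 100.0, Y = 300.0, bestSd = 0.02;
    double p1 = expected_proportion(X, Y, 1); // deletion mean

    // Option A (as discussed above): keep the sd constant across states.
    std::cout << "constant sd: nu = " << concentration(p1, bestSd) << "\n";

    // Option B (possible alternative): scale the sd by the copy-number
    // ratio, 0.5 for a deletion.
    std::cout << "scaled sd:   nu = " << concentration(p1, 0.5 * bestSd) << "\n";
    return 0;
}
```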

evolvedmicrobe commented 7 years ago

Hi Vincent,

Ah, okay, that makes sense. Agreed that it isn't obvious what the reparameterization of the beta-binomial should be when the mean changes, but at least I now understand the choice that was made. Thank you!

So you know the context: we were benchmarking a few different methods and yours was the highest-performing, so we were trying to understand the model better. Thanks for the quick reply!

Warm wishes, Nigel

vplagnol commented 7 years ago

Thanks, Nigel, always nice to get good feedback. This correlation-based approach really works best with large facilities and very automated processing to reduce sample-to-sample variability. That's where ExomeDepth can do a good job.