lima1 / PureCN

Copy number calling and variant classification using targeted short read sequencing
https://bioconductor.org/packages/devel/bioc/html/PureCN.html
Artistic License 2.0

Single standard deviation vs. segment-centric standard deviation for logR likelihood #361

Closed: tinyheero closed this issue 3 months ago

tinyheero commented 3 months ago

Hi @lima1,

This is more of a question than an issue.

I was re-reading the PureCN paper and reviewing the equation of the original paper (https://scfbm.biomedcentral.com/articles/10.1186/s13029-016-0060-z):

$$ r_{i} \sim N \Bigg(\log_{2} \frac{pC_{i} + (1-p)2}{p\Big(\sum_{j}l_{j}C_{j}\Big) / \sum_{j}l_{j} + (1-p)2}, \sigma_{r_i} \Bigg) $$

The standard deviation ($\sigma_{r_i}$) of the normal distribution is set to be:

> the average standard deviation of log-ratios in a segment

I am just curious what the rationale is for using a single standard deviation value (learned across all segments) rather than a segment-specific standard deviation?
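To make the question concrete, here is a minimal Python sketch of the log-ratio likelihood from the equation above, evaluated once with a single shared standard deviation and once with segment-specific ones. This is not PureCN code. It reads $p$ as purity, $C_i$ as the tumor copy number of segment $i$ and $l_j$ as the segment length, and it assumes that every log-ratio in a segment is an independent draw and that the number of log-ratios can stand in for the segment length. The function names and the toy data are hypothetical.

```python
import math
import statistics


def expected_log_ratio(p, c_i, seg_lengths, seg_copy_numbers):
    """Mean of the normal distribution in the equation above."""
    total_length = sum(seg_lengths)
    # Length-weighted average tumor copy number across all segments.
    mean_tumor_cn = sum(l * c for l, c in zip(seg_lengths, seg_copy_numbers)) / total_length
    return math.log2((p * c_i + (1 - p) * 2) / (p * mean_tumor_cn + (1 - p) * 2))


def normal_logpdf(x, mu, sigma):
    """Log density of N(mu, sigma) at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def log_likelihood(segments, p, copy_numbers, sigmas):
    """Sum the log density over all log-ratios, using one sigma per segment.

    Assumption: the number of log-ratios in a segment is used as its length l_j.
    """
    lengths = [len(seg) for seg in segments]
    total = 0.0
    for seg, c_i, sigma in zip(segments, copy_numbers, sigmas):
        mu = expected_log_ratio(p, c_i, lengths, copy_numbers)
        total += sum(normal_logpdf(r, mu, sigma) for r in seg)
    return total


# Toy data, made up for illustration only.
segments = [[0.25, 0.31, 0.22, 0.35, 0.28], [-0.10, -0.02, -0.15, -0.05]]
copy_numbers = [3, 2]
purity = 0.6

per_segment_sd = [statistics.stdev(seg) for seg in segments]
shared_sd = statistics.mean(per_segment_sd)  # "average standard deviation"

ll_shared = log_likelihood(segments, purity, copy_numbers, [shared_sd] * len(segments))
ll_per_segment = log_likelihood(segments, purity, copy_numbers, per_segment_sd)
print(f"shared sigma:      {ll_shared:.3f}")
print(f"per-segment sigma: {ll_per_segment:.3f}")
```

The only difference between the two calls is the list of sigmas passed in, which is the choice I am asking about.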

lima1 commented 3 months ago

Hi @tinyheero. Since version 1.10, the standard deviation used in this equation is also optimized during the simulated annealing, but it is still a single value for all segments. I'm open to suggestions on how to improve this. The per-segment standard deviations will converge to a very similar value with increasing segment size, so I feel it's probably fine. There is some outlier filtering happening, so small segments whose log-ratio standard deviation differs from the global one for technical reasons still should not have a dramatic impact.
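The convergence argument can be illustrated with a small simulation. This is a sketch, not PureCN code: it assumes all segments share one true noise level and draws log-ratios as independent normal values. The function name, the true standard deviation of 0.3 and the segment sizes are made up for illustration.

```python
import random
import statistics


def spread_of_segment_sd(segment_size, true_sd=0.3, n_segments=200, seed=0):
    """Simulate segments of a given size and summarize their SD estimates.

    Returns (mean of the per-segment SD estimates, spread of those estimates).
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_segments):
        log_ratios = [rng.gauss(0.0, true_sd) for _ in range(segment_size)]
        estimates.append(statistics.stdev(log_ratios))
    return statistics.mean(estimates), statistics.stdev(estimates)


for size in (5, 50, 500):
    mean_sd, spread = spread_of_segment_sd(size)
    print(f"segment size {size:>3}: mean SD estimate {mean_sd:.3f}, spread {spread:.3f}")
```

Under these assumptions the spread of the per-segment estimates shrinks as segments get larger, so a segment-specific value would mostly differ from the global one on small segments, where it is also estimated least reliably.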

But that part of the code has been unchanged for many years, so I can't fully remember every decision and benchmark that led to it.

tinyheero commented 3 months ago

Thanks @lima1 for your reply.

I agree that it probably doesn't make much difference.