lima1 / PureCN

Copy number calling and variant classification using targeted short read sequencing
https://bioconductor.org/packages/devel/bioc/html/PureCN.html
Artistic License 2.0

Multi region curation #362

Closed jberg1999 closed 2 months ago

jberg1999 commented 3 months ago

Hi Markus,

Thank you for creating such a useful and transparent copy number caller!

I am working on a dataset with multiple regions per patient and have run PureCN on each sample. In most cases there is pretty strong agreement between regions of the same tumor, but in other cases many breakpoints are conserved while the copy numbers of the shared segments are ploidy shifted (for example, balanced diploid vs. balanced tetraploid). Do you think it is advisable to do manual curation across regions of the same tumor? If so, do you have any recommendations for how to approach this? I was thinking of mapping the solutions of each region to their equivalent solutions in the other regions (either by ploidy or by the actual segmentation) and then determining which set of similar solutions has the highest overall score across all regions. Do you think this makes sense?
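
A minimal sketch of the matching idea described above, not anything PureCN provides. It assumes each region's candidate solutions have already been exported as hypothetical `(ploidy, score)` pairs where a higher score is better, groups candidates by rounded ploidy, and picks the ploidy class with the highest summed score among classes that have a candidate in every region. The function name and `bin_width` parameter are illustrative only.

```python
from collections import defaultdict


def pick_joint_ploidy(regions, bin_width=1.0):
    """Pick one ploidy class shared by all regions of a tumor.

    regions: dict mapping region name -> list of (ploidy, score) candidates.
    Assumption: a higher score means a better fit.
    Returns a dict region -> (ploidy, score), or None if no class is shared.
    """
    # For every ploidy class, remember the best-scoring candidate per region.
    best_per_class = defaultdict(dict)
    for region, solutions in regions.items():
        for ploidy, score in solutions:
            ploidy_class = round(ploidy / bin_width)
            previous = best_per_class[ploidy_class].get(region)
            if previous is None or score > previous[1]:
                best_per_class[ploidy_class][region] = (ploidy, score)

    # Only classes with a candidate in every region can be compared fairly.
    shared = {
        c: picks for c, picks in best_per_class.items()
        if len(picks) == len(regions)
    }
    if not shared:
        return None

    # Choose the class with the highest total score across regions.
    best_class = max(
        shared, key=lambda c: sum(score for _, score in shared[c].values())
    )
    return shared[best_class]


candidates = {
    "R1": [(2.1, -100.0), (3.9, -105.0)],
    "R2": [(4.0, -90.0), (2.0, -120.0)],
}
joint = pick_joint_ploidy(candidates)
```

In this made-up example the near-tetraploid class wins for both regions even though R1 on its own ranks the near-diploid solution first.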

lima1 commented 3 months ago

Hi @jberg1999, unfortunately the joint segmentation of multiple samples is currently not available because the copynumber package was removed from Bioconductor. I will get back to this soon, though, since GATK supports joint segmentation. ASCAT has a function for this too, but it would take a bit more work to get a wrapper done. Joint segmentation will reduce the artifactual differences between regions and thus hopefully reduce the need for manual curation.

Having multiple regions makes the curation easier. Not sure the score helps, but visually checking which solution is more likely should be easy (with a bit of experience) for most samples. Feel free to post the B-allele frequency plot for a few pairs you are not sure about.

jberg1999 commented 3 months ago

Thank you very much for your quick response! For now I will stick with PureCN's segmentation and just focus on the curation aspect. I am always wary of manual curation because it can easily lead to bias based on the curator's own preconceived notions of what the results should look like. Ideally, I would like to break curation down into some common-sense rules that I can then apply systematically to each sample or group of samples so that I can be very transparent about the process.

I have been looking at #310 and some of the other issues, and it seems that for any individual sample we switch away from a high ploidy solution based on the following:

  1. The high ploidy solution does not have a sufficiently large balanced 1/1 segment at a lower coverage
  2. The high ploidy solution does not contain a balance of states between 2 and 5.
  3. The high ploidy solution is not mostly 2/2

Are there any other things that you look for in deciding between higher and lower ploidy solutions? I am not sure what would exclude a low ploidy solution other than extensive LOH. I might show a couple of tricky ones but I would have to ask my collaborators first.
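
As a rough sketch, the three rules above could be encoded along these lines. The input format (segments as hypothetical `(length, major, minor)` tuples), the function name, and every threshold are assumptions made for illustration, and rule 2 is read here as "several total copy number states between 2 and 5 each cover a meaningful fraction of the genome", which is only one possible interpretation.

```python
def high_ploidy_support(segments, min_frac_11=0.05,
                        min_frac_state=0.05, min_states=3):
    """Check which of the three rules support a high ploidy solution.

    segments: list of (length, major, minor) allele-specific copy numbers.
    All thresholds are hypothetical and would need tuning per dataset.
    """
    total_length = sum(length for length, _, _ in segments)

    frac_11 = 0.0          # genome fraction in balanced 1/1
    frac_22 = 0.0          # genome fraction in balanced 2/2
    frac_by_total = {}     # genome fraction per total copy number
    for length, major, minor in segments:
        frac = length / total_length
        if (major, minor) == (1, 1):
            frac_11 += frac
        if (major, minor) == (2, 2):
            frac_22 += frac
        total_cn = major + minor
        frac_by_total[total_cn] = frac_by_total.get(total_cn, 0.0) + frac

    # Rule 2: count total copy number states 2..5 with enough coverage.
    populated = [cn for cn in range(2, 6)
                 if frac_by_total.get(cn, 0.0) >= min_frac_state]

    checks = {
        "has_balanced_1_1": frac_11 >= min_frac_11,
        "states_2_to_5_balanced": len(populated) >= min_states,
        "mostly_2_2": frac_22 > 0.5,
    }
    # Keep the high ploidy solution if at least one rule supports it.
    checks["keep_high_ploidy"] = any(checks.values())
    return checks


supported = high_ploidy_support([(50, 2, 2), (30, 2, 1), (20, 1, 1)])
unsupported = high_ploidy_support([(80, 3, 1), (20, 4, 0)])
```

Returning the individual checks rather than a single verdict keeps the reason for each curation decision visible, which is the transparency I am after.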

lima1 commented 3 months ago

That's a good starting point - if you are lucky and have multiple balanced segments with differing log ratio, it's usually pretty clear what is correct.

With a bit of experience, most samples are quite obvious even without those balanced segments. It can become more difficult (i.e. ambiguous) in cancer types with lots of sub-clonal alterations (see the ABSOLUTE paper, NSCLC is probably the worst here).

Without too many sub-clonal alterations, low ploidy solutions usually just have a single dominating log-ratio peak around 0 and then small peaks for the gains and losses. High ploidy solutions have multiple peaks generated after the genome duplications and consecutive gains and losses.
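
To make the peak pattern concrete, here is a small sketch that builds a length-weighted histogram of segment log ratios and reports its local maxima. It is not part of PureCN; the input format (hypothetical `(length, log_ratio)` tuples), the bin width, and the minimum peak size are all assumptions chosen for illustration.

```python
def log_ratio_peaks(segments, bin_width=0.1, min_frac=0.05):
    """Find peaks in a length-weighted histogram of segment log ratios.

    segments: list of (length, log_ratio) tuples.
    Returns a list of (bin_center, genome_fraction) for each local maximum
    that covers at least min_frac of the total segment length.
    """
    total_length = sum(length for length, _ in segments)

    # Accumulate the genome fraction falling into each log-ratio bin.
    hist = {}
    for length, log_ratio in segments:
        bin_index = round(log_ratio / bin_width)
        hist[bin_index] = hist.get(bin_index, 0.0) + length / total_length

    # A bin is a peak if it is large enough and not exceeded by a neighbor.
    peaks = []
    for bin_index, frac in sorted(hist.items()):
        left = hist.get(bin_index - 1, 0.0)
        right = hist.get(bin_index + 1, 0.0)
        if frac >= min_frac and frac > left and frac >= right:
            peaks.append((bin_index * bin_width, frac))
    return peaks


# Made-up profile: one dominant peak at 0 plus small gain and loss peaks.
peaks = log_ratio_peaks([(80, 0.0), (10, 0.4), (10, -0.5)])
dominant_fraction = max(frac for _, frac in peaks)
```

A single peak near 0 holding most of the genome, as in this made-up profile, matches the low ploidy pattern; several peaks of comparable size would point toward a high ploidy solution.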

If you need to curate lots of samples, like more than 15%, you might be able to tweak your PureCN setup to get better results out-of-the-box. In my benchmarks, I hit a plateau at a bit over 90% accuracy for ploidy. It's a whack-a-mole thing where every fix then introduces another issue. Improving the sub-clonal calling could help a bit, and I started working on it a while ago but got distracted by other things.

It probably wouldn't be too difficult to build a deep learning model on the B-allele frequency plot, but there are no plans for that right now.

jberg1999 commented 2 months ago

Hi Markus,

Thank you for your help with this! I was able to curate my samples reasonably well based on this discussion. For now I think we can close this issue. If something else related to curation comes up, I may reopen it.