lima1 / PureCN

Copy number calling and variant classification using targeted short read sequencing
https://bioconductor.org/packages/devel/bioc/html/PureCN.html
Artistic License 2.0

Question about different version. #353

Closed Shenglai closed 7 months ago

Shenglai commented 7 months ago

My name is Shenglai Li, and I am reaching out from GDC regarding an issue we encountered while investigating the possibility of upgrading the version of PureCN for our new release.

We have been using PureCN version 2.2.0 for our production environment, and it has been functioning well. However, upon testing the upgraded version (2.6.4 and the latest), we encountered failures when using a specific capture kit (Sureselect v5). Unfortunately, due to data privacy concerns, we are unable to share the data with you for debugging purposes.

I have attached the logs from both versions (purecn.2.2.0.log and purecn.2.6.4.log) for your reference.

purecn.2.2.0.log purecn.2.6.4.log

I apologize that the UUIDs and file names in the two logs do not match, but I am fairly sure there are a few cases that failed on the new version and passed on the older version. I also tried the latest version in addition to 2.6.4, and I think it fails as well.

Given the circumstances, we would greatly appreciate your insight on which parts of the code we should investigate independently. Any guidance or suggestions you can provide would be immensely helpful in resolving this issue.

lima1 commented 7 months ago

It should not crash like that, but it looks like no variants are passing filters. Might be related to #320. I would turn off the base quality filter and make sure this is dealt with upstream.

lima1 commented 7 months ago

Also, the population allele frequency check labels only 7 variants as germline. Make sure that germline variants are not filtered out (especially for tumor/normal pairs).

Shenglai commented 7 months ago

I think we ran it with tumor-only samples. The upstream was handled by the GATK4 Mutect2 pipeline (4.2.4). I will definitely try turning off the base quality filter. FYI, the upstream is running https://github.com/NCI-GDC/gatk4_mutect2_cwl/blob/master/subworkflows/gatk4.2.4.1_mutect2_workflow.cwl. I believe it runs only the Mutect2 best practices filtering (filtering alignment artifacts, etc.).

lima1 commented 7 months ago

I think 4.2.4 should not suffer from the BQ issue. Can you check that the exact same sample works with old PureCN?

Shenglai commented 7 months ago

I'm still in the process of checking the sample that works with the old PureCN. (Sorry for the delay. Our system is under migration.) However, the one that I posted with only 7 variants labeled as germline fails on both versions.

purecn.2.2.0.log purecn.2.8.1.log

Is there anything I can do to get it to pass in PureCN? For this particular sample, there are unfortunately no variants labeled as germline or panel_of_normals in the Mutect2 VCF.
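For anyone wanting to repeat the check described above, here is a minimal sketch of how the FILTER labels in a VCF could be tallied. It assumes a standard tab-delimited VCF (plain text or gzip/bgzip compressed) in which Mutect2 filtering has written labels such as `germline` and `panel_of_normals` to the FILTER column. The function name `count_filter_labels` and the file name in the usage comment are hypothetical and not part of PureCN or GATK.

```python
import gzip
from collections import Counter


def count_filter_labels(vcf_path):
    """Tally every label found in the FILTER column (7th column) of a VCF.

    Hypothetical helper for illustration; it reads the file as plain text
    and does not depend on PureCN, GATK, or any VCF library.
    """
    # bgzip output is gzip-compatible, so gzip.open can read .vcf.gz files.
    opener = gzip.open if vcf_path.endswith(".gz") else open
    counts = Counter()
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and header lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 7:
                continue  # skip malformed or empty lines
            # A record can carry several semicolon-separated filter labels.
            for label in fields[6].split(";"):
                counts[label] += 1
    return counts


# Example usage (the file name is an assumption):
# counts = count_filter_labels("sample.mutect2.vcf.gz")
# print(counts.get("germline", 0), counts.get("panel_of_normals", 0))
```

If both counts are zero, the VCF carries no records flagged as germline or panel of normals in the FILTER column, which matches the situation described in this comment.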

lima1 commented 7 months ago

I assume that's a tumor/normal pair that was not run in Mutect2 with the flags to genotype germline? Then there is nothing we can do; PureCN needs germline calls. The runtime is not much longer and the filtering is trivial, so it wouldn't be a big change for your T/N pipeline. Alternatively, you can merge somatic and germline VCFs, but that's more of a headache. You can also skip any somatic variants, but this then obviously won't give you the subclonality. In samples without many CNVs, somatic mutations also provide some purity signal (like in microsatellite high CRC).

Shenglai commented 7 months ago

Unfortunately it's tumor only that's run in Mutect2. :(

lima1 commented 7 months ago

You'll figure it out (maybe from reading the GATK logs) :-) Probably some germline filtering going on somewhere.

Shenglai commented 7 months ago

Thanks for your input. I don't think I have the initial issue after all. At least for now, I cannot find any jobs that fail on the later version but pass on the old version. I will close the issue. Thanks again for the help!