hartwigmedical / hmftools

Various algorithms for analysing genomics data
GNU General Public License v3.0
189 stars 58 forks source link

Purple PoissonDistribution error occurs at negative copyNumber values for regions with somatic SNVs #589

Closed tmfreeman400 closed 2 months ago

tmfreeman400 commented 2 months ago

Thank you for your tool. I have found Purple to give accurate results for most samples I have tested it on.

I have a sample that I am able to run Purple on successfully without using somatic SNVs, although purple is unable to estimate purity, probably due to the lack of somatic CNVs in the sample. To improve the purity estimation I am providing a -somatic_vcf to Purple.

There are a small number of variants that, when included in the somatic_vcf file, trigger this error:

java -jar $SCRIPTPATH/purple_test.jar -reference $CONTROLSAMPLE -tumor $TESTSAMPLE -output_dir $OUTPATH/purple/$TESTSAMPLE -amber $OUTPATH/amber/$TESTSAMPLE -cobalt $OUTPATH/cobalt/$TESTSAMPLE -gc_profile $REFPATH/GC_profile.1000bp.38.cnp -ref_genome $REFFASTA -ref_genome_version V38 -ensembl_data_dir $REFPATH/ensembl_data -circos /opt/conda/envs/hmftools_docker_env/bin/circos -somatic_vcf //sage/tumour/tumour.sage.vcf.gz 00:34:04.340 [INFO ] Purple version 4.0.2 00:34:04.356 [INFO ] output directory: /purple/tumour/ 00:34:04.369 [INFO ] reference(normal) tumor(tumour) 00:34:04.405 [INFO ] using ref genome: V38 00:34:05.400 [INFO ] reading GC Profiles from GC_profile.1000bp.38.cnp 00:34:06.828 [INFO ] reading Amber QC from /amber/tumour/tumour.amber.qc 00:34:06.833 [INFO ] reading Amber BAFs from /amber/tumour/tumour.amber.baf.tsv.gz 00:34:07.418 [INFO ] reading Amber PCFs from /amber/tumour/tumour.amber.baf.pcf 00:34:07.465 [INFO ] Amber average tumor depth(47) ambiguous BAF(0.565) 00:34:07.482 [INFO ] reading Cobalt tumor segments from /cobalt/tumour/tumour.cobalt.ratio.pcf 00:34:07.488 [INFO ] reading Cobalt ratios from /cobalt/tumour/tumour.cobalt.ratio.tsv.gz 00:34:10.001 [INFO ] reading Cobalt reference segments from /cobalt/tumour/normal.cobalt.ratio.pcf 00:34:13.188 [INFO ] loaded 722624 somatic variants from /sage/tumour/sage.vcf.gz 00:34:13.188 [INFO ] sample gender is female 00:34:13.188 [INFO ] applying segmentation 00:34:13.189 [INFO ] merging reference and tumor ratio break points 00:34:21.640 [INFO ] purple output directory: /purple/tumour/ 00:34:21.705 [INFO ] fitting purity 00:34:40.753 [INFO ] maxDiploidProportion(0.982) diploidCandidates(93) purityRange(0.140 - 1.000) hasTumor(false) 00:34:40.755 [INFO ] calculating copy number 00:34:40.844 [INFO ] modelling somatic peaks 00:34:46.627 [ERROR] failed processing sample(tumour): org.apache.commons.math3.exception.NotStrictlyPositiveException: mean (-�) org.apache.commons.math3.exception.NotStrictlyPositiveException: mean (-�) at org.apache.commons.math3.distribution.PoissonDistribution.(PoissonDistribution.java:126) at org.apache.commons.math3.distribution.PoissonDistribution.(PoissonDistribution.java:103) at org.apache.commons.math3.distribution.PoissonDistribution.(PoissonDistribution.java:80) at com.hartwig.hmftools.purple.somatic.SomaticPurityEnrichment.isBiallelic(SomaticPurityEnrichment.java:108) at com.hartwig.hmftools.purple.somatic.SomaticPurityEnrichment.applyPurityAdjustment(SomaticPurityEnrichment.java:76) at com.hartwig.hmftools.purple.somatic.SomaticPurityEnrichment.processVariant(SomaticPurityEnrichment.java:62) at com.hartwig.hmftools.purple.somatic.SomaticVariantCache.lambda$purityEnrich$0(SomaticVariantCache.java:109) at java.base/java.util.ArrayList.forEach(ArrayList.java:1597) at com.hartwig.hmftools.purple.somatic.SomaticVariantCache.purityEnrich(SomaticVariantCache.java:109) at com.hartwig.hmftools.purple.PurpleApplication.performFit(PurpleApplication.java:316) at com.hartwig.hmftools.purple.PurpleApplication.run(PurpleApplication.java:157) at com.hartwig.hmftools.purple.PurpleApplication.main(PurpleApplication.java:618)

These variants appear to support a copy number of at least 2 and their presence triggers the error: chr10 79390527 . T G 31.6 PASS F GT:DP:AD:AF 0/1:20:11,9:0.45 0/1:52:22,30:0.5769 chr10 79396998 . G A 33.17 PASS F GT:DP:AD:AF 0/1:25:14,11:0.44 0/1:56:25,31:0.5536 ...

If I remove these variants from the somatic_vcf file, then there are no problems. Since these variants don't look particularly unusual compared to other variants in the file, except for the fact that they occur in a specific genomic region, I looked at the purple.cnv.somatic.tsv file (when not including these somatic variants) to check if there was something unusual about that region, and it showed that the problematic variants precisely corresponded to a region with a negative copyNumber which I believe is triggering the org.apache.commons.math3.exception.NotStrictlyPositiveException: mean (-�) error when it tries to input a negative value into a Poisson distribution calculation that is carried out when somatic variants occur within the negative copyNumber region. In contrast, the vast majority of regions that were unaffected had positive copyNumber values:

chromosome start end copyNumber bafCount observedBAF baf segmentStartSupport segmentEndSupport method depthWindowCount gcContent minStart maxStart minorAlleleCopyNumber majorAlleleCopyNumber chr10 1 40640101 1.9836 16333 0.5559 0.5829 TELOMERE CENTROMERE BAF_WEIGHTED 35604 0.4101 1 1 0.8274 1.1562 chr10 40640102 46776500 3.7373 1850 0.5534 0.5351 CENTROMERE NONE BAF_WEIGHTED 3076 0.4420 40640102 40640102 1.7373 2.0000 chr10 46776501 79385000 2.0181 11900 0.5558 0.6482 NONE NONE BAF_WEIGHTED 29043 0.4040 46561001 46992001 0.7100 1.3080 chr10 79385001 79425000 -5.6108 37 0.6679 1.1608 NONE NONE BAF_WEIGHTED 38 0.5482 79385001 79385001 0.9024 -6.5132 chr10 79425001 86157000 1.1087 3132 0.5625 0.8971 NONE NONE BAF_WEIGHTED 5947 0.3960 79425001 79425001 0.1141 0.9946 chr10 86157001 86166000 2.2283 9 0.7381 3.1958 NONE NONE BAF_WEIGHTED 9 0.5060 86157001 86157001 0.0000 2.2283 chr10 86166001 126441000 2.4090 14638 0.5560 0.5823 NONE NONE BAF_WEIGHTED 37644 0.4211 86166001 86166001 1.0062 1.4028 chr10 126441001 133797422 1.3976 3233 0.5605 0.7252 NONE TELOMERE BAF_WEIGHTED 6259 0.4705 126434001 126441001 0.3841 1.0135

I know from reading other posts that the decision to include negative copyNumber values in Purple's output is intentional. Would it be possible to include a fix to Purple so that where these occur, it does not cause this error when somatic variants overlap it? This would be valuable for ensuring that Purple consistently runs without errors in pipelines when using -somatic_vcf.

Please let me know if there is anything else I can do to help with troubleshooting this. Many thanks, Tim

p-priestley commented 2 months ago

Thanks for the clear bug description. I am surprised we have not seen this issue ourselves. We will fix this for the next release

I think that your data looks a little odd though. It is very unusual to fit such a negative copy number so it is likely indicative of some bigger problem. Is there anything unusual about your samples? Also it is unexpected to me that the minor allele copy number is greater than 0 but the overal copy number is negative. If it is possible to share the segment.tsv file with me (or at least say chr10) at p.priestley@hartwigmedicalfoundation.nl I could try to diagnose.

As you mention we have deliberately not forced negative implied copy numbers to be 0 so that we don't cover up potential issues like this.

tmfreeman400 commented 2 months ago

Thank you very much - I greatly appreciate that.

This data is from Oxford Nanopore long read sequencing and a very high gamma value was used in cobalt to improve smoothing (-pcf_gamma 1000) due to the low numbers of reads per genomic bin that are characteristic of ONT.

We also have Illumina data for the same sample, which has a ccube purity estimate of 96% and did not have any large copy number changes in the Illumina data. It is an AML sample.

For context, running Purple on most ONT samples gives very close purity values compared to Illumina/ccube (usually within 5%), so this isn't a common feature of ONT data but might be related.

I've just emailed you separately also.

charlesshale commented 2 months ago

Poisson calc crash is fixed in the next release (v4.1)