lculibrk / Ploidetect

Tumour purity, ploidy, and copy number variation from whole-genome sequence data

Ploidetect v1.4.2 unusual MAF values across genome #15

Open tmfreeman400 opened 4 months ago

tmfreeman400 commented 4 months ago

Dear Ploidetect developers,

I am currently benchmarking Ploidetect v1.4.2 on matched tumour-normal hg38 ONT data in order to check how it compares against other CNV callers for potential use in an ONT clinical pipeline. I am using the snakemake pipeline --use-singularity method detailed at https://github.com/lculibrk/Ploidetect-pipeline and would like to ask a couple of questions to help with troubleshooting the results I am getting.

I already know the CNV status of the samples I am using for benchmarking from sequencing with both short read and long read sequencing pipelines, and I get very different results with Ploidetect that are not supported by the (real) MAF and coverage values, so I would like to check what might be causing this.

For example, I have a sample with no CNVs on most chromosomes. These chromosomes should show MAF values of roughly 0.5 and CN=2 calls. However, Ploidetect consistently gives higher CN values across them (mostly CN=3), and more than half of the individual bins in cna.txt have MAF values closer to 0.55-0.70.

The sample in question has a tumour coverage of roughly 40x rather than the recommended 80x, so I have already tried a larger window_threshold of 200000 in the config/defaults.yaml file to account for the tumour genome having more noise, but this did not appear to make a difference. For comparison, HATCHet2 calculates BAF values of roughly 0.42-0.48 for these regions, which corresponds to what is seen in the underlying data for the heterozygous germline SNPs. The mapping quality and basecalling quality of the data are good, and small and structural variant calling is accurate, so the ONT qual threshold of 10 in the defaults.yaml file seems fine. I can't see any other parameters in the config files that look like they would resolve the above issue.

To help with this issue, could you please let me know:

1) Does Ploidetect have known issues with calling CNVs accurately for samples in which most chromosomes do not have large CNVs, which could explain these results? Are there any plans to resolve these issues with future versions of Ploidetect?

2) Are there any other parameters I can change in Ploidetect that could reduce the false positive CNV calls?

In addition, I have noticed that the cna_plots.pdf output produced appears to be missing chromosome labels, genomic position information and MAF values on the corresponding plots - I am attaching the pdf to this message. Is this a known bug? I don't need these plots myself, but I thought I should flag this to you as the developer of Ploidetect so you are aware of it. cna_plots.pdf

Thank you very much for your assistance, Tim

lculibrk commented 4 months ago

Hi @tmfreeman400, I'm sorry you're having trouble!

Does Ploidetect have known issues with calling CNVs accurately for samples in which most chromosomes do not have large CNVs, which could explain these results? Are there any plans to resolve these issues with future versions of Ploidetect?

Yes, this is an issue with ploidy calling. Basically, it's difficult to determine ploidy when the genome is quiet (as in your sample) or the data are noisy (as with 40x ONT). This can be resolved or at least ameliorated by creating a tab-separated text file with your expected purity/ploidy values like so:

tp    ploidy
1   2

and giving this to the -m parameter of ploidetect_copynumber.R. This forces the CNV caller to run with these parameters fixed. The caller depends heavily on the purity/ploidy values, because it uses them to work out the expected read depths of CNVs. I believe this addresses your second question.
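For anyone scripting this step, here is a minimal sketch of writing that file from Python. The function name and output filename are hypothetical; only the `tp` and `ploidy` column names and the example values 1 and 2 come from the comment above.

```python
import csv
import tempfile
from pathlib import Path


def write_purity_ploidy(path, purity, ploidy):
    """Write a two-column, tab-separated purity/ploidy file.

    The header names ("tp", "ploidy") are copied from the example in the
    thread; the function itself is an illustrative helper, not part of
    Ploidetect.
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["tp", "ploidy"])
        writer.writerow([purity, ploidy])
    return Path(path)


if __name__ == "__main__":
    # Write into a temporary directory so the demo leaves nothing behind.
    with tempfile.TemporaryDirectory() as tmp:
        out = write_purity_ploidy(Path(tmp) / "purity_ploidy.tsv", 1, 2)
        print(out.read_text(), end="")
```

The resulting file is what would be passed to `-m`. Writing real tab characters matters here, since a file aligned with spaces is not tab-separated.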

I'm not aware of how HATCHet2 counts allele frequencies; Ploidetect's method of counting AFs, however, is fairly basic.
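One possible contributor to the gap between the two tools' numbers is the choice of reporting convention. This is offered as an assumption, not as a description of how either tool actually counts alleles. If one tool reports the larger of the two allele fractions per SNP and the other reports the smaller, then sampling noise alone pushes the first above 0.5 and the second below it, even at a balanced heterozygous site. The sketch below computes the exact expectation under a simple binomial model; the function name and the model are illustrative.

```python
from math import comb


def expected_folded_fractions(depth, p=0.5):
    """Return (E[minor fraction], E[major fraction]) at one heterozygous SNP.

    Assumes the alt-allele read count is Binomial(depth, p). "Major" is
    max(alt, ref) / depth and "minor" is min(alt, ref) / depth, so the
    two expectations always sum to 1.
    """
    major = 0.0
    for k in range(depth + 1):
        prob = comb(depth, k) * p**k * (1 - p) ** (depth - k)
        major += prob * max(k, depth - k) / depth
    return 1.0 - major, major


if __name__ == "__main__":
    for depth in (40, 80):
        minor, major = expected_folded_fractions(depth)
        print(f"depth={depth}: minor={minor:.3f} major={major:.3f}")
```

Under this model the upward bias shrinks as depth grows, which is consistent with lower-coverage data looking noisier. Averaging over many SNPs per bin would change the numbers, so this shows only the direction of the effect.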

Based on the cna_plots.pdf, there is certainly an issue with the plotting functions that I fear in this case might be a symptom of other issues with CNV calling, but I can't be sure. Could you attach the cna_condensed.txt as well?

tmfreeman400 commented 4 months ago

Hi @lculibrk, thank you for this very fast response! I'll test it with the purity/ploidy values pre-specified in the ploidetect_copynumber.R command.

I am attaching the cna_condensed.txt output to this message to help with investigating the plotting function: cna_condensed.txt

lculibrk commented 4 months ago

Hi @tmfreeman400,

Do the chromosome names of your bams have "chr" prefixes? If so, this is definitely a bug and I'll get to fixing that as soon as possible.

tmfreeman400 commented 4 months ago

Yes, the chromosome names have "chr" prefixes in the BAMs. E.g. "chr1"
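For readers who want to check this on their own data, here is a minimal sketch that inspects the `@SQ` lines of a SAM/BAM header (for example, the text printed by `samtools view -H sample.bam`). The function names and the sample header are illustrative and are not part of Ploidetect.

```python
def contig_names(sam_header_text):
    """Extract contig names from the SN: fields of @SQ header lines."""
    names = []
    for line in sam_header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        for field in line.split("\t")[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
    return names


def uses_chr_prefix(names):
    """Return True if any contig name starts with "chr".

    `any` is used rather than `all` because some references mix
    "chr"-prefixed chromosomes with differently named extra contigs.
    """
    return any(name.startswith("chr") for name in names)


if __name__ == "__main__":
    header = (
        "@HD\tVN:1.6\tSO:coordinate\n"
        "@SQ\tSN:chr1\tLN:248956422\n"
        "@SQ\tSN:chr2\tLN:242193529\n"
    )
    names = contig_names(header)
    print(names, uses_chr_prefix(names))
```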

lculibrk commented 4 months ago

Thank you for the information. The plotting bug is purely visual and should not affect the results. It will be fixed.