lima1 / PureCN

Copy number calling and variant classification using targeted short read sequencing
https://bioconductor.org/packages/devel/bioc/html/PureCN.html
Artistic License 2.0
127 stars 32 forks

Effect of --max-copy-number parameter on purity/ploidy and copy number estimates #321

Closed ddrichel closed 10 months ago

ddrichel commented 1 year ago

I am running simulations for PureCN benchmarking (whole-exome, BAM files generated with synggen 1.5, 1 tumor and 2 normals for the NormalDB, loess normalization, simulated purity = 0.8, expected ploidy ≈ 2.8), and I was confused by the effect of --max-copy-number, specifically why the copy numbers are off by roughly a factor of 2 when --max-copy-number is too high. I believe I have figured it out, so this is mostly a question whether my understanding is correct and, if it is, a suggestion to improve the part of the documentation that I found confusing:

"Defaults are well calibrated and should produce close to ideal results for most samples. A few common cases where changing defaults makes sense: High purity and high quality: For cancer types with a high expected purity, such as ovarian cancer, AND when quality is expected to be very good (high coverage, young samples), --max-copy-number 8. (PureCN reports copy numbers greater than this value, but will stop fitting SNP allelic fractions to the exact allele-specific copy number because this will get impossible very quickly with high copy numbers - and computationally expensive.)"

The experimental setup is:

```sh
for maxCN in 4 5 6 7 8 9 10 11 12 13 14 15 20 50 100
do
    Rscript $PURECN/PureCN.R \
        --out $OUT/${SAMPLEID}_maxCN${maxCN}_with-vcf \
        --tumor $OUT/tumor/${SAMPLEID}_coverage_loess.txt.gz \
        --sampleid $OUT/${SAMPLEID}_loess_maxCN${maxCN} \
        --normaldb $OUT/normalDB_experiment2_loess_hg38.rds \
        --intervals $OUT/S07604624_Covered_human_all_v6_plus_UTR.liftover.to.hg38_chr.cleaned.sorted.intervals_no_off_targets.txt \
        --genome hg38 --force --max-copy-number $maxCN \
        --vcf synggen_NV_T_1_tc08_v02_R.vcf.gz
done
```

My log2 ratio spectrum from *segmentation.pdf:

[Image: log2_ratios_maxCN12]

Plot of best estimates for ploidy/purity for different maxCN:

[Image: synggen_NV_T_1_tc08_v02_R_maxCNXXX_with-vcf_purity_ploidy]

Plot of CN in the simulated truth vs. PureCN output at --max-copy-number 12; segments in the truth set and the PureCN output are matched with `bedtools intersect -f 0.5` (reciprocal overlap 0.5):

[Image: CN_truth_vs_pureCN_maxCN12]

It seems that at maxCN=12, high raw log2 ratios that are unreliable due to noise lead the purity/ploidy search to the wrong optimal solution, scaling the CN in the output by ~2x. This intuitively makes sense, as scaling the whole spectrum of log2 ratios by 2x should provide the next-best fit, at lower estimated purity. At --max-copy-number 13 and 14, the newly included high but less noisy log2 ratios happen to outweigh the noise, leading to correct global purity/ploidy estimates again.
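For intuition, the degeneracy can be written down with the standard expected log2-ratio model (a sketch, not PureCN code): for purity p, segment copy number C, tumor ploidy P, and a diploid normal, the expected ratio is log2((pC + 2(1-p)) / (pP + 2(1-p))). Doubling every copy number (and the ploidy) while lowering the purity from 0.8 to 2/3 reproduces the simulated log2 ratios exactly, so the log2-ratio fit alone cannot distinguish the two solutions:

```python
import math

def log2_ratio(c, purity, ploidy):
    """Expected tumor/normal log2 ratio of a segment with total copy
    number c, for a given purity and tumor ploidy (diploid normal)."""
    tumor = purity * c + 2 * (1 - purity)
    reference = purity * ploidy + 2 * (1 - purity)
    return math.log2(tumor / reference)

# Simulated truth: purity 0.8, ploidy 2.8
truth = [log2_ratio(c, 0.8, 2.8) for c in (1, 2, 3, 4)]
# Degenerate alternative: all copy numbers doubled, purity 2/3, ploidy 5.6
doubled = [log2_ratio(2 * c, 2 / 3, 5.6) for c in (1, 2, 3, 4)]

for a, b in zip(truth, doubled):
    print(f"{a:+.4f} {b:+.4f}")  # the two columns are identical
```

Only the SNP allelic fractions (and the priors on purity/ploidy) can break this tie, which is presumably why the noise level of the high-log2-ratio segments matters so much.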

Thanks in advance

Dmitriy

lima1 commented 1 year ago

Hi Dmitry, thanks a lot for your effort here! I indeed struggled a bit with the default for --max-copy-number. Non-focal gains with more than 7 copies are very rare (they do happen though!). 8 is indeed special in that it is even, which means it can explain segments with balanced ~0.5 allelic-fraction SNPs. That in turn means you give the search space an additional high-ploidy solution to explain the data, assuming many SNPs are still balanced. In high-quality, high-purity samples, you might still want to include the 8 because such gains do happen, but be aware that it increases the search space more than you might think.
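A quick sketch of why even copy numbers are special (a toy model, not PureCN code): under the usual expected allelic-fraction formula for a heterozygous germline SNP, any state with M = C/2 alt copies is balanced at exactly 0.5 regardless of purity, so a 4+4 state at copy number 8 mimics a normal 1+1 diploid region:

```python
def snp_af(m, c, purity):
    """Expected allelic fraction of a heterozygous germline SNP on a
    segment with total copy number c, of which m copies carry the alt
    allele, mixed with diploid normal at the given purity."""
    return (purity * m + (1 - purity)) / (purity * c + 2 * (1 - purity))

# A 4+4 state at copy number 8 is balanced at 0.5 for any purity,
# just like a normal heterozygous 1+1 region; an unbalanced state is not.
for purity in (0.5, 0.8, 0.95):
    print(purity, snp_af(1, 2, purity), snp_af(4, 8, purity), snp_af(3, 8, purity))
```

This is why allowing copy number 8 opens up a whole additional high-ploidy solution rather than just one more copy-number state.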

You are right, I should make this more clear in the --help section.

I should say the math that penalizes more complex solutions could be improved. It's kind of a mess because we have the SNP and log-ratio fitting and then also support for sub-clonal gains. It's on my TODO list to clean this up, and I assume that's partly what you are seeing: it is too cheap for the model to pick a highly complex solution.

ddrichel commented 1 year ago

Hi Markus, thanks, it seems that I am on the right track. It may well be that the math can be improved, but based on my benchmarking results, PureCN already works well, although the simulated data has its limitations and can certainly be improved, too.

I was also confused by the following:

  1. Description of DB in --out-vcf:

`INFO=`

It seems that "dbSNP membership" and similar wording is used in other places in the documentation and comments instead of "membership in germline databases". In my case, DB is determined from POPAF, which in turn is annotated from gnomAD. It took me a while to understand that POPAF has priority over dbSNP, so the description was misleading regardless of whether dbSNP annotations were available.
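For reference, Mutect2 encodes POPAF as the negative log10 of the population allele frequency, so a POPAF-based DB flag boils down to a simple threshold. A minimal sketch (the 0.001 cutoff here is an assumption for illustration, not necessarily PureCN's actual default):

```python
# Hypothetical cutoff for calling a variant "common in germline databases";
# check PureCN's documentation for the actual default.
MIN_POP_AF = 0.001

def db_flag_from_popaf(popaf):
    """Mutect2's POPAF is -log10(population allele frequency), so
    POPAF = 2.0 corresponds to a gnomAD AF of 0.01."""
    return 10 ** (-popaf) >= MIN_POP_AF

print(db_flag_from_popaf(2.0))  # common variant (AF 0.01) -> True
print(db_flag_from_popaf(6.0))  # rare variant (AF 1e-6) -> False
```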

  2. In setPriorVcf.R:

    #' [..] Last two values are optional, if vcf contains a flag Cosmic.CNT, it
    #' will set the prior probability for variants with CNT > 2 to the first of
    #' those values in case of no matched normal available (0.995 default).  Final
    #' value is for the case that variant is in both dbSNP and COSMIC > 2.

Shouldn't "> 2" by default be "> 6", as in the new default value of --min-cosmic-cnt?
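To illustrate the docstring being discussed, here is a hypothetical sketch of the prior logic it describes. Only the 0.995 value and the CNT cutoff come from the quoted docs and the --min-cosmic-cnt default; the other probabilities are placeholders, not PureCN's actual values:

```python
# Hypothetical sketch; not PureCN's code or its real default probabilities.
MIN_COSMIC_CNT = 6  # the newer --min-cosmic-cnt default

def prior_somatic(in_db, cosmic_cnt, matched_normal=False):
    """Prior probability that a variant is somatic."""
    if not matched_normal and cosmic_cnt > MIN_COSMIC_CNT:
        # COSMIC hotspot; a lower "final" value applies if the variant
        # is also in a germline database (placeholder 0.5).
        return 0.5 if in_db else 0.995
    if in_db:
        return 0.0005  # placeholder: likely germline
    return 0.5         # placeholder: no information either way

print(prior_somatic(False, 10))  # 0.995
print(prior_somatic(True, 0))    # 0.0005
```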

I will create a pull request with proposed changes to the documentation, please have a look whether I got it right.

lima1 commented 8 months ago

Hi @ddrichel, since you have done such a nice simulation benchmark, I thought I'd reach out to you first before figuring out how to do it myself. I am looking to get some simulated male samples with low-level gains of chrX genes (AR, for example). Did you use BAMSurgeon (ah, I see you used synggen) or something similar, and can you share a few scripts by any chance? No worries, obviously, if not!

ddrichel commented 8 months ago

Hi Markus, I will be happy to help as far as I can. This was work I did as an independent contractor for a client, but chances are good that I will get permission to share it with you privately. I will reach out to you via e-mail.

In summary, we generated WES FASTQ files with simulated CNVs using synggen. This part is more or less straightforward. The interesting part, in my opinion, is the benchmarking procedure itself, as it is not clear from the outset how to match test and truth CNVs: the matching needs to be flexible with respect to spatial coordinates, the matches are not necessarily 1:1, and so on.
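To sketch what such a matching step can look like (an illustration, not the scripts from the benchmark): a reciprocal-overlap criterion, as with `bedtools intersect -f 0.5 -r`, deliberately allows one truth segment to match several called segments, so the result is a set of pairs rather than a 1:1 assignment:

```python
def reciprocal_overlap(a, b, min_frac=0.5):
    """True if intervals a=(start, end) and b=(start, end) overlap by at
    least min_frac of *both* lengths (like bedtools intersect -f 0.5 -r)."""
    ov = min(a[1], b[1]) - max(a[0], b[0])
    if ov <= 0:
        return False
    return ov >= min_frac * (a[1] - a[0]) and ov >= min_frac * (b[1] - b[0])

def match_segments(truth, calls, min_frac=0.5):
    """All truth/call pairs passing reciprocal overlap; deliberately not
    1:1, since one truth segment may be split into several calls."""
    return [(t, c) for t in truth for c in calls
            if reciprocal_overlap(t, c, min_frac)]

truth = [(100, 200), (500, 900)]
calls = [(120, 210), (500, 700), (650, 900)]
# The second truth segment matches two overlapping calls.
print(match_segments(truth, calls))
```

Deciding how to score such many-to-one matches (e.g., length-weighted agreement of the copy numbers) is then the non-trivial part of the benchmark.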