Optimum length of regions and positioning of SNPs

waemm commented 5 years ago

Hi,

I just had two questions about what the minimum region length that PureCN can use? I see in filterIntervals you set a default minimum of 5bp - does this imply any size region (as long as it is in the baits) can potentially be used to improve the CNV calling? Is there an optimal size range for this?

Secondly, for the SNPs, is it more beneficial to cover extra SNPs within the regions you would expect to see CNVs in cancer or is a random shotgun approach more beneficial?

Thanks for a great piece of software!

lima1 commented 5 years ago

Hi Warren,

Good questions, that should probably be clearer in the documentation.

1) If possible, I would recommend using the actual baits file (the locations of the baits used to capture the target regions). These regions are uniform in size (unless they overlap) and the GC content of the baits should match the GC content of the captured read pileups better. Baits are usually >50bp, so this 5bp cutoff should not filter anything. If for some reason you only have access to the locations of the targeted exons, IntervalFile.R from version 1.14.0 can, and by default will, resize small targets (default to 10bp) by padding left and right. I haven't done tests to see what the best size is, but there shouldn't be that many small exons. You could manually open a BAM file in IGV and check if the read pileups are worth including.

2) Both. Are you designing a new panel? If you want to design your own SNP backbone, I would recommend a 1Mb tiling (or smaller if you can afford). So find all the gaps between the targets and fill it with SNPs that have a population allele frequency of 35 to 65% and filter by outlier GC content (or you can restrict your SNPs to the ones used by vendors of SNP arrays). Then regions of relevant LOH or frequent homozygous deletions can hugely benefit from additional SNP probes. Amplifications are easy to find with >=6 probes even in noisy samples and allele-specific copy number is unreliable for those anyways (no need for het. SNP information, coverage provides sufficient information).

Hope that helps, Markus

waemm commented 5 years ago

Thanks Markus, that was very helpful. At this stage just trying to understand how best to tweak PureCN and what would give the largest positive impact! I am busy running some test data from an old previously published study but I am finding only around 30% of overlap between tumor-only PureCN output vs the tumor-normal analysis. I have some questions about this as well, not sure if I should ask here or if you prefer I can open another issue to discuss.

lima1 commented 5 years ago

Tumor-only and matched should provide very, very similar results, so 30% overlap (purity/ploidy concordance? Amplifications/Deletions?) indicates a problem in one of the setups. Feel free to provide more detail (maybe provide two log-files of a sample with large difference in the two runs).

The recommended setup is listed in the Quick vignette using with the PureCN normalization/segmentation. CNVkit is fine, but it tends to over-segment, which leads to longer runtime and lower sensitivity to detect LOH.

PureCN with the PSCBS segmentation might bring some additional improvements:

Install patched PSCBS: if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install("lima1/PSCBS", ref="add_dnacopy_weighting") BiocManager::install("lima1/PureCN")

Then run PureCN.R with --funsegmentation PSCBS. For CNVkit, you can try --funsegmentation Hclust to reduce over-segmentation.

waemm commented 5 years ago

Hi Markus, thanks again for a really helpful reply! So PSCBS can only be run on a clean PureCN run from the BAM files? I will definitely keep this in mind if you think it could improve results. At present I'm using the output of CNVkit for both PON and tumor sample.

After doing some digging the reported purities of the tumor samples are all >96%. I assume this is too high for accurate estimation? It would be good to know what the upper limit for accurate quantification is for PureCN.

Its also worth noting that the average coverage of the WES here is below 100x, roughly 70-90x, done using an old sure select kit. Not really ideal by any means but the only data we could get access to for testing purposes. Let me know if you think this could adequately explain the discrepancies.

Purity/Ploidy estimate results from PureCN are below: ` "Sampleid","Purity","Ploidy","Sex","Contamination","Flagged","Failed","Curated","Comment" "001_tumor-cnvkit",0.29,1.99987578528143,"F",0,TRUE,FALSE,FALSE,"LOW PURITY" "002_tumor-cnvkit",0.28,2.16736129039155,"M",0,TRUE,FALSE,FALSE,"LOW PURITY" "003_tumor-cnvkit",0.28,2.22624090692454,"M",0.0488154356752795,TRUE,FALSE,FALSE,"LOW PURITY;POTENTIAL SAMPLE CONTAMINATION"

"005_tumor-cnvkit",0.69,2.01703292845763,"M",0,TRUE,FALSE,FALSE,"NON-ABERRANT"

"006_tumor-cnvkit",0.19,2.08624896694472,"M",0,TRUE,FALSE,FALSE,"LOW PURITY" `

lima1 commented 5 years ago

PureCN supports cell lines for tumor-only, but you need to tell her. See the Quick vignette. There should be a detailed log file. I would also need a screenshot of the BAF plot. Looks like they don’t have a lot of CNAs. Are these maybe heme or pediatric samples?

waemm commented 5 years ago

Yes, good guess, they are B lymphocytes from chronic lymphocytic leukemia.

As an additional question, I have been looking into tuning the vcf info DB tag a bit more carefully, to potentially select a much smaller number of potential somatic variants, as far as I understand this should give most common SNPs a low prior and those that are potentially somatic a prior at 0.5. Would this help improve the results from PureCN or would you expect it to manage without too much extra help?

Thanks again for your help Markus.

Log file attached for both the original run and the run with cell line parameters.

CLL001_tumor.log [CLL001_tumor_cell_line_test.log] (https://github.com/lima1/PureCN/files/3272208/CLL001_tumor_cell_line_test.log)

lima1 commented 5 years ago

Are you sure you ran CNVkit with a normal database for coverage normalization? The coverage normalization looks pretty bad.

For >95% purity, the germline vs somatic classification unfortunately doesn't work anyways (except for sub-clonal somatic vs germline), so I doubt that changes a lot. You can provide population allele frequencies as well and provide a cutoff, by default it's a POP_AF info field.

For tumor-only, I'd also recommend to generate the mapping_bias.rds file. This makes sure that SNPs in poor quality regions are automatically ignored.

lima1 commented 5 years ago

Also, you can easily install the latest stable version 1.14.0 with BiocManager::install("lima1/PureCN", ref="RELEASE_3_9").

Once things look better, you can run PureCN.R with --postoptimize (and --parallel if necessary) to get slightly more accurate purity.

waemm commented 5 years ago

Thanks for your comments Markus. Yes, we used a background normal file with CNVkit and are currently using a mapping bias file. My feeling is the data is probably quite poor quality, but will see what we can do to improve results. Thanks again for your help!

lima1 commented 5 years ago

How many normals do you have? Even just a handful can dramatically improve the normalization over a single normal control. The BAF plot you attached looks like there is no normal used at all. You sure the normals are from the same capture kit?

lima1 commented 5 years ago

The most import QC metric are duplication rates. There is pretty much a linear relationship of noise and duplication, so check that first. PureCN's normalization was designed for challenging clinical samples.

waemm commented 5 years ago

Hi Markus, thanks, you were correct, we were not adding the background properly. We have 5 normal samples at present. We had run the background but were sourcing the wrong files. After rerunning with correct files the the BAF plot is below using cell line settings.

Just out of interest, do you estimate the rates from the plots? Are you referring to the number of black lines deviating from 0.5 ?

lima1 commented 5 years ago

Duplication rates are fraction of reads marked as PCR duplicates. You can get this from Picard. PureCN's Coverage.R also reports very basic QC metrics including this rate. The log2-ratio standard deviation is essentially the PureCN estimated noise. You can find it in the log file or in the output RDS file (ret$input$log.ratio.sdev).

Looks better, but still noisy. You can try re-running PureCN.R with --funsegmentation Hclust to reduce over-segmentation. There might be also CNVkit settings to tweak.

Unless the duplication rates are very high (>75%), my guess is that PureCN normalization with PSCBS (see above) might work well in your case. Our defaults are more calibrated towards challenging clinical samples.

lima1 commented 5 years ago

And if you are not sure that the purity is > 0.9, you can reduce min.purity to whatever you feel comfortable. --modelhomozygous is the important flag. This will keep homozygous variants, which could be actually heterozygous in normal (LOH). (The lower purity, the more likely we will sequence reference alleles from the normal contamination)

waemm commented 5 years ago

Great, thanks again. The above was run with HClust but will give it a go with PSCBS. I have checked the duplication rates and they're all sub 10%. Like you say, this data is not ideal for testing PureCN, in future we will be getting clinical samples so hopefully will go much smoother!

lima1 commented 5 years ago

Sub 10% is great and expected for heme. Make sure to get a few normals as similar as possible to the tumors. My guess that there are still differences between tumors and your normal controls, but yeah, try following the Quick vignette with PSCBS. And make sure that min.purity is correct (for example checking allelic fractions of hotspots). There are clear drops and gains in coverage, but these could be sub-clonal.

lima1 / PureCN

Optimum length of regions and positioning of SNPs #89