how to improve detection of short CNV ?

lima1 / PureCN

Copy number calling and variant classification using targeted short read sequencing

https://bioconductor.org/packages/devel/bioc/html/PureCN.html

Artistic License 2.0

127 stars 32 forks source link

how to improve detection of short CNV ? #351

Closed lmanchon closed 7 months ago

lmanchon commented 7 months ago

--Hi,

i can't detect short CNV. Is there a way to improve the detection ? I use these parameters:

Rscript $PURECN/PureCN.R --out $OUTPUTS/$lib --tumor $OUTPUTS/${lib}"_recalibrated_coverage_loess.txt.gz" --sampleid $lib --vcf $OUTPUT_GATK_CNV/${lib}"_Mutect2.vcf.gz" --fun-segmentation PSCBS --max-segments 350 --sex=diploid --cores=16 --min-base-quality 20 --min-af 0.01 --normaldb $OUT_REF/normalDB_TWIST_hg38.rds --mapping-bias-file $OUT_REF/mapping_bias_TWIST_hg38.rds --intervals $OUT_REF/DIGI_BED_hg38_intervals.txt --genome hg38 --model betabin --minpurity 0.05 --max-copy-number 8 --rds=rda --max-non-clonal 0.3 --post-optimize --seed 123 --force

Thank you --

lima1 commented 7 months ago

Can you post the complete output of this example? It’s a signal to noise issue. The minimum size segments are 3-4 intervals.

Note that min-purity smaller than the default isn’t recommended because allele-specific copy number determination cannot be done in such low purity samples. For amplification calling in low purity, there is a function independent of purity and ploidy.

lmanchon commented 7 months ago

--Hi,

see log file in attachment. by the way, when i use Hclust segmentation with segment file from CNVkit i find the expected short CNV.

purecn.log

lima1 commented 7 months ago

INFO [2024-02-19 11:47:31] AT/GC dropout: 1.03 (tumor), 1.04 (normal), 0.97 (coverage log-ratio).

This is a bit of a red flag. The GC bias of the log-ratio should be 1.0. I would try without the explicit GC-normalization. So rebuild the normal database using the coverage.txt.gz file, NOT coverage_loess.txt.gz. Then provide the coverage.txt.gz tumor file in PureCN.R (in other words, ignore the loess.txt.gz file everywhere).

Another red flag is:

INFO [2024-02-19 11:47:36] 30.2% of targets contain variants.

This is usually about 15% and likely the reason you get

INFO [2024-02-19 11:47:35] Initial testing for significant sample cross-contamination: maybe

If you are sure there is no contamination (and you get this with many samples), then this likely happens because the input VCF contains many artifacts at SNP positions.

The base quality score filtering in recent Mutect 2 is kind of a mess. So you might have to do your own artifact filtering upstream of PureCN (let me know if you can find solutions here).

lmanchon commented 7 months ago

--hi, ok, first i'm going to try providing the coverage.txt.gz tumor file and check the results. I'll keep you informed thank you --

lima1 commented 7 months ago

Make sure to follow the GATK Mutect2 best practices closely in case these SNP calls are artifacts, not contamination.

lmanchon commented 7 months ago

--Hi,

i have tested with the coverage.txt.gz tumor file and nothing change, the _dnacopy.seg file gives me 3 segments for chr20: 20 66836 26436231 486 -0.0736794059153809 1 20 26436232 30038348 23 -0.132815177542588 1 20 30038349 64199195 766 -0.11534490327593 1

whereas CNVkit gives me that: chr20 60501 1407634 1.95 DEL 0.10076 chr20 1408134 1706980 1.25 DEL 0.027027 chr20 1707480 26256532 1.99 DEL 0.395292 chr20 28256402 30102426 2.11 DUP 0.142411 chr20 30103546 31598003 2.28 DUP 0.278834 chr20 31598503 58854332 1.93 DEL 0.205939 chr20 58854332 58894059 2.64 DUP 0.04945 chr20 58895588 64198679 1.93 DEL 0.152635

the region: chr20 30103546 31598003 2.28 DUP 0.278834 is one of the most important region because i have already detected and confirmed by ddPCR. I think CNVkit segmentation is more sensible, i have used hmm-tumor segmentation in CNVkit.

what do you think ? thanx.

lima1 commented 7 months ago

Can you post new log file and screenshot of b-allele frequency plot? If the SNPs look bad, they could hurt the segmentation.

lmanchon commented 7 months ago

see attachment here.

15_IC_1000.log 15_IC_1000_chromosomes.pdf

lima1 commented 7 months ago

You can see that this sample is clearly contaminated with another sample. The SNPs at the top are homozygous SNPs where the contaminated sample provided some reference alleles. The bottom band of SNPs are reference homozygous in the sample, but contaminated sample provided alt alleles. The coverage is high, so not artifacts.

This results in quite a lot of noise, which reduces sensitivity.

I don't think that's a good benchmark sample.

CNVkit in general is more sensitive, but tends to over-segment a bit with default parameters, which can hurt purity/ploidy and LOH calling.

lima1 commented 7 months ago

Also note that PureCN is a somatic caller, it's not designed to call germline (indeed it will tend to ignore regions that are variable in the normal database).

lima1 commented 7 months ago

Feel free re-open if something comes up.

lmanchon commented 7 months ago

no worries. thank you --