lima1 / PureCN

Copy number calling and variant classification using targeted short read sequencing
https://bioconductor.org/packages/devel/bioc/html/PureCN.html
Artistic License 2.0
127 stars 32 forks source link

How to set up sex of PON #324

Closed lmanchon closed 11 months ago

lmanchon commented 1 year ago

Describe the issue build PON from multiples samples, all samples are males

To Reproduce Rscript $PURECN/Coverage.R --out-dir $OUTPUTS/normals --bam normals.list --intervals $OUT_REF/hg38_intervals.txt --cores 12 --force ls -a $OUTPUTS/normals/*_loess.txt.gz | cat > PON_normal_coverages.list Rscript $PURECN/NormalDB.R --out-dir $OUT_REF --coverage-files PON_normal_coverages.list --genome hg38 --assay PON --force

How to set up sex male if all my normal samples are males ? I know that in cnvkit tool i can set --male-reference for male when i build reference (PON).

thank you--

lima1 commented 1 year ago

It's not working automatically? Internally it will create sex-specific PoNs. Not sure I ever tested this thoroughly, but I expect it to work.

lmanchon commented 1 year ago

it's strange i have no result on X and Y, PureCN detect anomalies on autosomal chromosomes but not on X and Y, do i set some parameter to recover anomalies on chrX and chr Y ?

lima1 commented 1 year ago

Ah, got it. Yes, the likelihood model only supports diploid genomes, so it ignores X/Y for males.

Keeping the X/Y segmentation is a fair feature request though. Not sure I find time for it soon.

lima1 commented 1 year ago

You might be able to force the use of X/Y by providing sex="diploid". Not sure there are any side effects though. But might be enough to get the segmentation and merge it then back to the regular run.

lmanchon commented 1 year ago

is NormalDB.R script accept the parameter sex="diploid" ?

lima1 commented 1 year ago

PureCN.R.

lmanchon commented 1 year ago

oh yes of course, sorry, I confused normalDb with PureCN

lmanchon commented 1 year ago

and i have another question because i obtain too much noisy results: is there a parameter to adjust (such as --drop-low-coverage (per bin) in cnvkit) to reduce the noisy background ?

lmanchon commented 1 year ago

my command was: Rscript $PURECN/PureCN.R --out $OUTPUTS/$lib --tumor $OUTPUTS/${lib}"_recalibrated_coverage_loess.txt.gz" --sampleid $lib --vcf $OUTPUT_GATK_CNV/${lib}"_Mutect2.vcf.gz" --stats-file $OUTPUT_GATK_CNV/${lib}"_Mutect2.vcf.gz.stats" --fun-segmentation PSCBS --sex=diploid --cores=16 --normaldb $OUT_REF/normalDB_hg38.rds --mapping-bias-file $OUT_REF/mapping_bias_TWIST_hg38.rds --intervals $OUT_REF/BED_hg38_intervals.txt --genome hg38 --model betabin --post-optimize --seed 123

lima1 commented 1 year ago

Can you post the log file? It will tell you exactly what it used for filtering low quality bins.

The --stats-file is only for Mutect1. Also have a look at #320 if you have many MBQ 20 variants.

lmanchon commented 1 year ago

PSX-AD-001.log here attachment log file

lima1 commented 1 year ago

Looks fine, but you indeed suffer the same issue with MBQ 20. Install the version from the issue_320 branch: BiocManager::install("lima1/PureCN", ref = "issue_320")

And then re-run PureCN.R with --min-base-quality 20

lmanchon commented 1 year ago

ok i have installed the version from issue_320

Rscript $PURECN/PureCN.R -v 2.7.11

then: Rscript $PURECN/PureCN.R --out $OUTPUTS/$lib --tumor $OUTPUTS/${lib}"_recalibrated_coverage_loess.txt.gz" --sampleid $lib --vcf $OUTPUT_GATK_CNV/${lib}"_Mutect2.vcf.gz" --fun-segmentation PSCBS --sex=diploid --cores=16 --min-base-quality 20 --normaldb $OUT_REF/normalDB_TWIST_hg38.rds --mapping-bias-file $OUT_REF/mapping_bias_TWIST_hg38.rds --intervals $OUT_REF/DIGI_BED_hg38_intervals.txt --genome hg38 --model betabin --post-optimize --seed 123

and same output with the same number of CNVs.

lima1 commented 1 year ago

You should have now a much higher number of SNPs in the log file.

INFO [2023-09-19 16:42:29] 2.0% of targets contain variants.

That should be now > 10%.

That number should have dropped significantly:

INFO [2023-09-19 16:42:28] Removing 14110 low quality variants with non-offset BQ < 25.

lmanchon commented 1 year ago

ok, yes i understand. I have used PureCN and CNVkit, the 2 tools give me approximatively the same number of variants ~ 140 but when i use GATK, i obtain only 40 CNVs, GATK seems to be more stringent in calling variants, and it's not easy to deduce the best results.

lima1 commented 1 year ago

You can use the GATK segmentation with PureCN via --fun-segmentation GATK. That should get you close to GATK, but with off-target read support.

PSCBS has a few unique features to combine off-target and on-target reads a bit smarter in the segmentation though. For low purity, I recommend PSCBS, for higher purity (like usually > 30%), GATK is probably cleaner.

It's a balance of sensitivity and specificity. CNVkit is tweaked for sensitivity and expected to give more segments.

lmanchon commented 1 year ago

yes. maybe by adding another caller I could make a difference: XHMM, Codex2 ?