lima1 / PureCN

Copy number calling and variant classification using targeted short read sequencing
https://bioconductor.org/packages/devel/bioc/html/PureCN.html
Artistic License 2.0
127 stars 32 forks source link

PureCN.R returns many warnings. Possible issue with Panel of Normals. #332

Closed alextrouerntrend closed 11 months ago

alextrouerntrend commented 1 year ago

Hi, I am running PureCN on a set of variants combined from mutect2 and varscan2 results and using a panel of normals derived from 1000G exon capture samples. Running PureCN has produced a set of warnings, some of which I do not see in the existing issues. Here are the warnings that came up across the 18 samples I am running:

WARN AD field misses ref counts. WARN Did not find base quality scores, will use global error rate of 0.0010 instead. WARN High GC-bias in normalized tumor vs normal log2 ratio. WARN --intervals does not contain gene symbols. Not generating gene-level calls. WARN Low number of off-target intervals. You might want to exclude them (182161 on-target, 17889 off-target, ratio 0.09). WARN vcf.file has no DB info field for membership in germline databases. Found and used valid population allele frequency > 0.001000 instead.

I can see that the VarScan2 vcf formatting is the cause of the first “AD missing” error. The reporting of ref and alt alleles are split into two different INFO tags, instead of being reported under AD as a vector. I think I can find a way to modify vcf inputs to contain DB info field. Are base quality scores necessary and if so, is there a recommended procedure for including base quality scores in the VCF? Although PureCN was able to return results for Purity and Ploidy, I am concerned about the high GC-bias in the normalized tumor vs normal log2 ratio and the low number of off-target intervals. I believe that the 1000G samples that I used for the panel of normals use the same exon capture kit as my samples, however I’m aware of a ~10-fold difference in the coverage. 1000G samples have ~30x while my samples have ~275x coverage. Could this discrepancy be contributing to these patterns? I’m concerned that some upstream aspect of the analysis is causing these warnings, and I would be grateful for suggestions on troubleshooting.

To Reproduce

## Commands ###
#Generate an interval file from a BED file containing baits coordinates
Rscript $PURECN/IntervalFile.R --in-file $BAITS \
        --fasta $REF_GEN --out-file $OUT_REF/baits_hg38_intervals.txt \
        --off-target --genome hg38 \
        --export $OUT_REF/baits_optimized_grch38.bed \
        --mappability $MAP

#Calculate and GC-normalize coverage from a BAM file
for SAMPLEID in $( ls ${SAMPLEDIR}/*.bam ); do
        SAMPNAME=$( echo $SAMPLEID | xargs -n 1 basename )
        mkdir -p $COV/${SAMPNAME%.bam}
        Rscript $PURECN/Coverage.R --outdir $COV/${SAMPNAME%.bam} \
                --bam ${SAMPLEID} \
                --intervals $OUT_REF/baits_hg38_intervals.txt \
                --cores 24
done
# Create a normals list from 1000G exome alignment bams
ls ./exome_alignment/*bam > normals.list

# Calculate and GC-normalize coverage from a list of BAM files
mkdir -p $COV/normals
Rscript $PURECN/Coverage.R --out-dir $COV/normals \
        --bam normals.list \
        --intervals $OUT_REF/baits_hg38_intervals.txt \
        --cores 24

ls -a $COV/normals/*_loess.txt.gz | cat > example_normal_coverages.list

#create tabix index for merged file
module load samtools
tabix $NORMAL_PANEL

# When normal panel VCF is available (highly recommended for
# unmatched samples)
Rscript $PURECN/NormalDB.R --out-dir $OUT_REF \
        --coverage-files example_normal_coverages.list \
        --normal-panel $NORMAL_PANEL \
        --genome hg38 \
        --assay agilent_v6

# Production pipeline run

mkdir -p $OUT/$SAMPLEID

Rscript $PURECN/PureCN.R --out $OUT/$SAMPLEID \
        --tumor $COV/${SAMPLEID}_accepted_hits/${SAMPLEID}_accepted_hits_coverage_loess.txt.gz \
        --sampleid $SAMPLEID \
        --vcf $TUMOR_VCF/${SAMPLEID}_anno.vcf \
        --fun-segmentation PSCBS \
        --normaldb $OUT_REF/normalDB_agilent_v6_hg38.rds \
        --mapping-bias-file $OUT_REF/mapping_bias_agilent_v6_hg38.rds \
        --intervals $OUT_REF/baits_hg38_intervals.txt \
        --snp-blacklist /blue/scripps-bas/clients/janiszewska/mseq/grch38/blacklist/hg38-blacklist.v2.bed \
        --genome hg38 \
        --model betabin \
        --force --post-optimize --seed 123 \
        --popaf-info-field ANNO.1000G \
        --min-af 0.01

Expected behavior The GC-bias and off-target interval warnings are unexpected, I also do not expect poor goodness of fit warnings.

Log file INFO [2023-06-13 12:30:47] ------------------------------------------------------------ INFO [2023-06-13 12:30:47] PureCN 2.4.0 INFO [2023-06-13 12:30:47] ------------------------------------------------------------ INFO [2023-06-13 12:30:47] Arguments: -tumor.coverage.file /blue/scripps-bas/clients/janiszewska/mseq/grch38/purecn/coverage/OS11_accepted_hits/OS11_accepted_hits_coverage_loess.txt.gz -log.ratio -seg.file -vcf.file /blue/scripps-bas/clients/janiszewska/mseq/grch38/filtering/OS11_final.vcf -genome hg38 -sex ? -args.setPriorVcf 6 -args.setMappingBiasVcf /blue/scripps-bas/clients/janiszewska/mseq/grch38/purecn/reference_files/mapping_bias_agilent_v6_hg38.rds -args.filterIntervals 100,0.05 -args.segmentation 0.005,NULL, -sampleid OS11 -min.ploidy 1.4 -max.ploidy 6 -max.non.clonal 0.2 -max.homozygous.loss 0.05,1e+07 -log.ratio.calibration 0.1 -model.homozygous FALSE -error 0.001 -interval.file /blue/scripps-bas/clients/janiszewska/mseq/grch38/purecn/reference_files/baits_hg38_intervals.txt -min.logr.sdev 0.15 -max.segments 300 -plot.cnv TRUE -vcf.field.prefix PureCN. -cosmic.vcf.file -DB.info.flag DB -POPAF.info.field POP_AF -Cosmic.CNT.info.field Cosmic.CNT -model betabin -post.optimize TRUE -BPPARAM -log.file /blue/scripps-bas/clients/janiszewska/mseq/grch38/purecn/results/OS11/OS11.log -normal.coverage.file -normalDB -args.filterVcf -fun.segmentation -test.num.copy -test.purity -speedup.heuristics INFO [2023-06-13 12:30:47] Loading coverage files... INFO [2023-06-13 12:30:50] Mean target coverages: 179X (tumor) 172X (normal). INFO [2023-06-13 12:30:51] Mean coverages: chrX: 153.34, chrY: 64.43, chr1-22: 146.87. INFO [2023-06-13 12:30:51] Mean coverages: chrX: 110.83, chrY: 64.35, chr1-22: 140.48. INFO [2023-06-13 12:30:54] Removing 738 intervals with missing log.ratio. INFO [2023-06-13 12:30:54] Removing 7 low/high GC targets. INFO [2023-06-13 12:30:54] Removing 66853 intervals excluded in normalDB. INFO [2023-06-13 12:30:54] normalDB provided. Setting minimum coverage for segmentation to 0.0015X. INFO [2023-06-13 12:30:54] Removing 1064 low count (< 100 total reads) intervals. INFO [2023-06-13 12:30:54] Removing 8406 low coverage (< 0.0015X) intervals. WARN [2023-06-13 12:30:54] Low number of off-target intervals. You might want to exclude them (182227 on-target, 18450 off-target, ratio 0.09). INFO [2023-06-13 12:30:54] Removing 17 intervals on non-standard chromosomes not listed in centromeres (chr1_KI270706v1_random,chr1_KI270711v1_random,chr1_KI270766v1_alt,chr4_GL000008v2_random,chr7_KI270803v1_alt,chr8_KI270821v1_alt,chr14_GL000009v2_random,chr14_KI270846v1_alt,chr15_KI270850v1_alt,chr17_KI270857v1_alt,chr17_KI270909v1_alt,chr19_KI270938v1_alt,chr22_KI270879v1_alt,chr22_KI270928v1_alt,chrUn_KI270742v1). INFO [2023-06-13 12:30:54] Using 200660 intervals (182223 on-target, 18437 off-target). INFO [2023-06-13 12:30:55] Ratio of mean on-target vs. off-target read counts: 1.10 INFO [2023-06-13 12:30:55] Mean off-target bin size: 80566 INFO [2023-06-13 12:30:55] AT/GC dropout: 1.06 (tumor), 1.00 (normal), 1.08 (coverage log-ratio). WARN [2023-06-13 12:30:55] High GC-bias in normalized tumor vs normal log2 ratio. INFO [2023-06-13 12:30:55] Loading VCF... INFO [2023-06-13 12:30:55] Found 24688 variants in VCF file. FATAL [2023-06-13 12:30:55] vcf.file has no %s or %s info field for membership in germline

FATAL [2023-06-13 12:30:55] databases.DBPOP_AF

FATAL [2023-06-13 12:30:55]

FATAL [2023-06-13 12:30:55] This is most likely a user error due to invalid input data or

FATAL [2023-06-13 12:30:55] parameters (PureCN 2.4.0).

INFO [2023-09-07 15:26:14] ------------------------------------------------------------ INFO [2023-09-07 15:26:14] PureCN 2.4.0 INFO [2023-09-07 15:26:14] ------------------------------------------------------------ INFO [2023-09-07 15:26:14] Arguments: -tumor.coverage.file /blue/scripps-bas/clients/janiszewska/mseq/grch38/purecn/coverage/OS11_accepted_hits/OS11_accepted_hits_coverage_loess.txt.gz -log.ratio -seg.file -vcf.file /blue/scripps-bas/clients/janiszewska/mseq/grch38/filtering/OS11_final.vcf -genome hg38 -sex ? -args.setPriorVcf 6 -args.setMappingBiasVcf /blue/scripps-bas/clients/janiszewska/mseq/grch38/purecn/reference_files/mapping_bias_agilent_v6_hg38.rds -args.filterIntervals 100,0.05 -args.segmentation 0.005,NULL, -sampleid OS11 -min.ploidy 1.4 -max.ploidy 6 -max.non.clonal 0.2 -max.homozygous.loss 0.05,1e+07 -log.ratio.calibration 0.1 -model.homozygous FALSE -error 0.001 -interval.file /blue/scripps-bas/clients/janiszewska/mseq/grch38/purecn/reference_files/baits_hg38_intervals.txt -min.logr.sdev 0.15 -max.segments 300 -plot.cnv TRUE -vcf.field.prefix PureCN. -cosmic.vcf.file -DB.info.flag DB -POPAF.info.field ANNO.1000G -Cosmic.CNT.info.field Cosmic.CNT -model betabin -post.optimize TRUE -BPPARAM -log.file /blue/scripps-bas/clients/janiszewska/mseq/grch38/purecn/results/OS11/OS11.log -normal.coverage.file -normalDB -args.filterVcf -fun.segmentation -test.num.copy -test.purity -speedup.heuristics INFO [2023-09-07 15:26:14] Loading coverage files... INFO [2023-09-07 15:26:17] Mean target coverages: 179X (tumor) 172X (normal). INFO [2023-09-07 15:26:18] Mean coverages: chrX: 153.34, chrY: 64.43, chr1-22: 146.87. INFO [2023-09-07 15:26:18] Mean coverages: chrX: 110.83, chrY: 64.35, chr1-22: 140.48. INFO [2023-09-07 15:26:21] Removing 738 intervals with missing log.ratio. INFO [2023-09-07 15:26:21] Removing 7 low/high GC targets. INFO [2023-09-07 15:26:21] Removing 66853 intervals excluded in normalDB. INFO [2023-09-07 15:26:21] normalDB provided. Setting minimum coverage for segmentation to 0.0015X. INFO [2023-09-07 15:26:21] Removing 1064 low count (< 100 total reads) intervals. INFO [2023-09-07 15:26:21] Removing 8406 low coverage (< 0.0015X) intervals. WARN [2023-09-07 15:26:21] Low number of off-target intervals. You might want to exclude them (182227 on-target, 18450 off-target, ratio 0.09). INFO [2023-09-07 15:26:21] Removing 17 intervals on non-standard chromosomes not listed in centromeres (chr1_KI270706v1_random,chr1_KI270711v1_random,chr1_KI270766v1_alt,chr4_GL000008v2_random,chr7_KI270803v1_alt,chr8_KI270821v1_alt,chr14_GL000009v2_random,chr14_KI270846v1_alt,chr15_KI270850v1_alt,chr17_KI270857v1_alt,chr17_KI270909v1_alt,chr19_KI270938v1_alt,chr22_KI270879v1_alt,chr22_KI270928v1_alt,chrUn_KI270742v1). INFO [2023-09-07 15:26:21] Using 200660 intervals (182223 on-target, 18437 off-target). INFO [2023-09-07 15:26:21] Ratio of mean on-target vs. off-target read counts: 1.10 INFO [2023-09-07 15:26:21] Mean off-target bin size: 80566 INFO [2023-09-07 15:26:22] AT/GC dropout: 1.06 (tumor), 1.00 (normal), 1.08 (coverage log-ratio). WARN [2023-09-07 15:26:22] High GC-bias in normalized tumor vs normal log2 ratio. INFO [2023-09-07 15:26:22] Loading VCF... INFO [2023-09-07 15:26:22] Found 24688 variants in VCF file. FATAL [2023-09-07 15:26:22] vcf.file has no %s or %s info field for membership in germline

FATAL [2023-09-07 15:26:22] databases.DBANNO.1000G

FATAL [2023-09-07 15:26:22]

FATAL [2023-09-07 15:26:22] This is most likely a user error due to invalid input data or

FATAL [2023-09-07 15:26:22] parameters (PureCN 2.4.0).

INFO [2023-09-07 15:32:18] ------------------------------------------------------------ INFO [2023-09-07 15:32:18] PureCN 2.4.0 INFO [2023-09-07 15:32:18] ------------------------------------------------------------ INFO [2023-09-07 15:32:18] Arguments: -tumor.coverage.file /blue/scripps-bas/clients/janiszewska/mseq/grch38/purecn/coverage/OS11_accepted_hits/OS11_accepted_hits_coverage_loess.txt.gz -log.ratio -seg.file -vcf.file /blue/scripps-bas/clients/janiszewska/mseq/grch38/purecn/vcf_anno/bcftools/OS11_anno.vcf -genome hg38 -sex ? -args.setPriorVcf 6 -args.setMappingBiasVcf /blue/scripps-bas/clients/janiszewska/mseq/grch38/purecn/reference_files/mapping_bias_agilent_v6_hg38.rds -args.filterIntervals 100,0.05 -args.segmentation 0.005,NULL, -sampleid OS11 -min.ploidy 1.4 -max.ploidy 6 -max.non.clonal 0.2 -max.homozygous.loss 0.05,1e+07 -log.ratio.calibration 0.1 -model.homozygous FALSE -error 0.001 -interval.file /blue/scripps-bas/clients/janiszewska/mseq/grch38/purecn/reference_files/baits_hg38_intervals.txt -min.logr.sdev 0.15 -max.segments 300 -plot.cnv TRUE -vcf.field.prefix PureCN. -cosmic.vcf.file -DB.info.flag DB -POPAF.info.field ANNO.1000G -Cosmic.CNT.info.field Cosmic.CNT -model betabin -post.optimize TRUE -BPPARAM -log.file /blue/scripps-bas/clients/janiszewska/mseq/grch38/purecn/results/OS11/OS11.log -normal.coverage.file -normalDB -args.filterVcf -fun.segmentation -test.num.copy -test.purity -speedup.heuristics INFO [2023-09-07 15:32:18] Loading coverage files... INFO [2023-09-07 15:32:22] Mean target coverages: 179X (tumor) 172X (normal). INFO [2023-09-07 15:32:22] Mean coverages: chrX: 153.34, chrY: 64.43, chr1-22: 146.87. INFO [2023-09-07 15:32:22] Mean coverages: chrX: 110.83, chrY: 64.35, chr1-22: 140.48. INFO [2023-09-07 15:32:25] Removing 738 intervals with missing log.ratio. INFO [2023-09-07 15:32:25] Removing 7 low/high GC targets. INFO [2023-09-07 15:32:25] Removing 66853 intervals excluded in normalDB. INFO [2023-09-07 15:32:25] normalDB provided. Setting minimum coverage for segmentation to 0.0015X. INFO [2023-09-07 15:32:25] Removing 1064 low count (< 100 total reads) intervals. INFO [2023-09-07 15:32:25] Removing 8406 low coverage (< 0.0015X) intervals. WARN [2023-09-07 15:32:26] Low number of off-target intervals. You might want to exclude them (182227 on-target, 18450 off-target, ratio 0.09). INFO [2023-09-07 15:32:26] Removing 17 intervals on non-standard chromosomes not listed in centromeres (chr1_KI270706v1_random,chr1_KI270711v1_random,chr1_KI270766v1_alt,chr4_GL000008v2_random,chr7_KI270803v1_alt,chr8_KI270821v1_alt,chr14_GL000009v2_random,chr14_KI270846v1_alt,chr15_KI270850v1_alt,chr17_KI270857v1_alt,chr17_KI270909v1_alt,chr19_KI270938v1_alt,chr22_KI270879v1_alt,chr22_KI270928v1_alt,chrUn_KI270742v1). INFO [2023-09-07 15:32:26] Using 200660 intervals (182223 on-target, 18437 off-target). INFO [2023-09-07 15:32:26] Ratio of mean on-target vs. off-target read counts: 1.10 INFO [2023-09-07 15:32:26] Mean off-target bin size: 80566 INFO [2023-09-07 15:32:26] AT/GC dropout: 1.06 (tumor), 1.00 (normal), 1.08 (coverage log-ratio). WARN [2023-09-07 15:32:26] High GC-bias in normalized tumor vs normal log2 ratio. INFO [2023-09-07 15:32:26] Loading VCF... INFO [2023-09-07 15:32:27] Found 94798 variants in VCF file. INFO [2023-09-07 15:32:27] Maximum of POPAP INFO is > 1, assuming -log10 scaled values WARN [2023-09-07 15:32:27] vcf.file has no DB info field for membership in germline databases. Found and used valid population allele frequency > 0.001000 instead. INFO [2023-09-07 15:32:27] 68221 (72.0%) variants annotated as likely germline (DB INFO flag). WARN [2023-09-07 15:32:27] AD field misses ref counts. WARN [2023-09-07 15:32:28] Did not find base quality scores, will use global error rate of 0.0010 instead. INFO [2023-09-07 15:32:29] Sample1 is tumor in VCF file. INFO [2023-09-07 15:32:29] 735 homozygous and 152 heterozygous variants on chrX. INFO [2023-09-07 15:32:29] Sex from VCF: M (Fisher's p-value: < 0.0001, odds-ratio: 8.35). INFO [2023-09-07 15:32:29] Removing 0 low quality variants with non-offset BQ < 25. INFO [2023-09-07 15:32:29] Global base quality score of 29 INFO [2023-09-07 15:32:29] Minimum number of supporting reads ranges from 4 to 9, depending on coverage and BQS. INFO [2023-09-07 15:32:33] Initial testing for significant sample cross-contamination: maybe INFO [2023-09-07 15:32:33] Removing 29139 variants with AF < 0.030 or AF >= 0.970 or insufficient supporting reads or depth < 15. INFO [2023-09-07 15:32:33] Removing 0 blacklisted variants. INFO [2023-09-07 15:32:33] Total size of targeted genomic region: 41.37Mb (55.55Mb with 50bp padding). INFO [2023-09-07 15:32:33] 15.0% of targets contain variants. INFO [2023-09-07 15:32:33] Removing 31771 variants outside intervals. INFO [2023-09-07 15:32:33] Setting somatic prior probabilities for dbSNP hits to 0.000500 or to 0.500000 otherwise. INFO [2023-09-07 15:32:33] Loading mapping bias file mapping_bias_agilent_v6_hg38.rds... INFO [2023-09-07 15:33:39] Found 26569405 variants in mapping bias file. INFO [2023-09-07 15:33:56] Imputing mapping bias for 4146 variants... INFO [2023-09-07 15:40:00] Excluding 9131 novel or poor quality variants from segmentation. INFO [2023-09-07 15:40:00] Excluding 4146 variants not in pool of normals from segmentation. INFO [2023-09-07 15:40:00] Sample sex: M INFO [2023-09-07 15:40:00] Segmenting data... FATAL [2023-09-07 15:40:00] segmentationPSCBS requires the PSCBS package.

FATAL [2023-09-07 15:40:00]

FATAL [2023-09-07 15:40:00] This is most likely a user error due to invalid input data or

FATAL [2023-09-07 15:40:00] parameters (PureCN 2.4.0).

INFO [2023-09-08 11:56:44] ------------------------------------------------------------ INFO [2023-09-08 11:56:44] PureCN 2.4.0 INFO [2023-09-08 11:56:44] ------------------------------------------------------------ INFO [2023-09-08 11:56:44] Arguments: -tumor.coverage.file /blue/scripps-bas/clients/janiszewska/mseq/grch38/purecn/coverage/OS11_accepted_hits/OS11_accepted_hits_coverage_loess.txt.gz -log.ratio -seg.file -vcf.file /blue/scripps-bas/clients/janiszewska/mseq/grch38/purecn/vcf_anno/bcftools/OS11_anno.vcf -genome hg38 -sex ? -args.setPriorVcf 6 -args.setMappingBiasVcf /blue/scripps-bas/clients/janiszewska/mseq/grch38/purecn/reference_files/mapping_bias_agilent_v6_hg38.rds -args.filterIntervals 100,0.05 -args.segmentation 0.005,NULL, -sampleid OS11 -min.ploidy 1.4 -max.ploidy 6 -max.non.clonal 0.2 -max.homozygous.loss 0.05,1e+07 -log.ratio.calibration 0.1 -model.homozygous FALSE -error 0.001 -interval.file /blue/scripps-bas/clients/janiszewska/mseq/grch38/purecn/reference_files/baits_hg38_intervals.txt -min.logr.sdev 0.15 -max.segments 300 -plot.cnv TRUE -vcf.field.prefix PureCN. -cosmic.vcf.file -DB.info.flag DB -POPAF.info.field ANNO.1000G -Cosmic.CNT.info.field Cosmic.CNT -model betabin -post.optimize TRUE -BPPARAM -log.file /blue/scripps-bas/clients/janiszewska/mseq/grch38/purecn/results/OS11/OS11.log -normal.coverage.file -normalDB -args.filterVcf -fun.segmentation -test.num.copy -test.purity -speedup.heuristics INFO [2023-09-08 11:56:44] Loading coverage files... INFO [2023-09-08 11:56:47] Mean target coverages: 179X (tumor) 172X (normal). INFO [2023-09-08 11:56:47] Mean coverages: chrX: 153.34, chrY: 64.43, chr1-22: 146.87. INFO [2023-09-08 11:56:47] Mean coverages: chrX: 110.83, chrY: 64.35, chr1-22: 140.48. INFO [2023-09-08 11:56:50] Removing 738 intervals with missing log.ratio. INFO [2023-09-08 11:56:50] Removing 7 low/high GC targets. INFO [2023-09-08 11:56:51] Removing 66853 intervals excluded in normalDB. INFO [2023-09-08 11:56:51] normalDB provided. Setting minimum coverage for segmentation to 0.0015X. INFO [2023-09-08 11:56:51] Removing 1064 low count (< 100 total reads) intervals. INFO [2023-09-08 11:56:51] Removing 8406 low coverage (< 0.0015X) intervals. WARN [2023-09-08 11:56:51] Low number of off-target intervals. You might want to exclude them (182227 on-target, 18450 off-target, ratio 0.09). INFO [2023-09-08 11:56:51] Removing 17 intervals on non-standard chromosomes not listed in centromeres (chr1_KI270706v1_random,chr1_KI270711v1_random,chr1_KI270766v1_alt,chr4_GL000008v2_random,chr7_KI270803v1_alt,chr8_KI270821v1_alt,chr14_GL000009v2_random,chr14_KI270846v1_alt,chr15_KI270850v1_alt,chr17_KI270857v1_alt,chr17_KI270909v1_alt,chr19_KI270938v1_alt,chr22_KI270879v1_alt,chr22_KI270928v1_alt,chrUn_KI270742v1). INFO [2023-09-08 11:56:51] Using 200660 intervals (182223 on-target, 18437 off-target). INFO [2023-09-08 11:56:51] Ratio of mean on-target vs. off-target read counts: 1.10 INFO [2023-09-08 11:56:51] Mean off-target bin size: 80566 INFO [2023-09-08 11:56:51] AT/GC dropout: 1.06 (tumor), 1.00 (normal), 1.08 (coverage log-ratio). WARN [2023-09-08 11:56:51] High GC-bias in normalized tumor vs normal log2 ratio. INFO [2023-09-08 11:56:51] Loading VCF... INFO [2023-09-08 11:56:52] Found 94798 variants in VCF file. INFO [2023-09-08 11:56:52] Maximum of POPAP INFO is > 1, assuming -log10 scaled values WARN [2023-09-08 11:56:52] vcf.file has no DB info field for membership in germline databases. Found and used valid population allele frequency > 0.001000 instead. INFO [2023-09-08 11:56:52] 68221 (72.0%) variants annotated as likely germline (DB INFO flag). WARN [2023-09-08 11:56:52] AD field misses ref counts. WARN [2023-09-08 11:56:54] Did not find base quality scores, will use global error rate of 0.0010 instead. INFO [2023-09-08 11:56:54] Sample1 is tumor in VCF file. INFO [2023-09-08 11:56:54] 735 homozygous and 152 heterozygous variants on chrX. INFO [2023-09-08 11:56:54] Sex from VCF: M (Fisher's p-value: < 0.0001, odds-ratio: 8.35). INFO [2023-09-08 11:56:54] Removing 0 low quality variants with non-offset BQ < 25. INFO [2023-09-08 11:56:54] Global base quality score of 29 INFO [2023-09-08 11:56:54] Minimum number of supporting reads ranges from 4 to 9, depending on coverage and BQS. INFO [2023-09-08 11:56:58] Initial testing for significant sample cross-contamination: maybe INFO [2023-09-08 11:56:58] Removing 27993 variants with AF < 0.010 or AF >= 0.990 or insufficient supporting reads or depth < 15. INFO [2023-09-08 11:56:58] Removing 0 blacklisted variants. INFO [2023-09-08 11:56:58] Total size of targeted genomic region: 41.37Mb (55.55Mb with 50bp padding). INFO [2023-09-08 11:56:58] 15.3% of targets contain variants. INFO [2023-09-08 11:56:59] Removing 32129 variants outside intervals. INFO [2023-09-08 11:56:59] Setting somatic prior probabilities for dbSNP hits to 0.000500 or to 0.500000 otherwise. INFO [2023-09-08 11:56:59] Loading mapping bias file mapping_bias_agilent_v6_hg38.rds... INFO [2023-09-08 11:58:03] Found 26569405 variants in mapping bias file. INFO [2023-09-08 11:58:19] Imputing mapping bias for 4474 variants... INFO [2023-09-08 12:04:41] Excluding 9571 novel or poor quality variants from segmentation. INFO [2023-09-08 12:04:41] Excluding 4474 variants not in pool of normals from segmentation. INFO [2023-09-08 12:04:41] Sample sex: M INFO [2023-09-08 12:04:41] Segmenting data... INFO [2023-09-08 12:04:41] Interval weights found, will use weighted PSCBS. INFO [2023-09-08 12:04:41] MAPD of 17967 allelic fractions: 0.05 (0.06 adjusted). INFO [2023-09-08 12:04:42] Setting undo.SD parameter to 1.250000. INFO [2023-09-08 12:04:54] Setting prune.hclust.h parameter to 0.200000. INFO [2023-09-08 12:04:54] Found 82 segments with median size of 9.67Mb. INFO [2023-09-08 12:04:55] Removing 5 variants outside segments. INFO [2023-09-08 12:04:55] Using 34671 variants. INFO [2023-09-08 12:04:56] Mean standard deviation of log-ratios: 0.76 (MAPD: 0.57) INFO [2023-09-08 12:04:56] Mean standard deviation of on-target log-ratios only: 0.78 (MAPD: 0.60) INFO [2023-09-08 12:04:56] Mean standard deviation of off-target log-ratios only: 0.39 (MAPD: 0.25) INFO [2023-09-08 12:04:56] 2D-grid search of purity and ploidy... INFO [2023-09-08 12:07:57] Local optima: 0.48/2.4, 0.35/2, 0.42/1.9, 0.29/2.1, 0.55/3, 0.78/4.2, 0.52/2, 0.95/5.8, 0.78/2 INFO [2023-09-08 12:07:57] Testing local optimum 1/9 at purity 0.48 and total ploidy 2.40... INFO [2023-09-08 12:08:10] Testing local optimum 2/9 at purity 0.35 and total ploidy 2.00... INFO [2023-09-08 12:08:25] Testing local optimum 3/9 at purity 0.42 and total ploidy 1.90... INFO [2023-09-08 12:08:34] Testing local optimum 4/9 at purity 0.29 and total ploidy 2.10... INFO [2023-09-08 12:08:50] Testing local optimum 5/9 at purity 0.55 and total ploidy 3.00... INFO [2023-09-08 12:09:09] Testing local optimum 6/9 at purity 0.78 and total ploidy 4.20... INFO [2023-09-08 12:09:25] Testing local optimum 7/9 at purity 0.52 and total ploidy 2.00... INFO [2023-09-08 12:09:36] Testing local optimum 8/9 at purity 0.95 and total ploidy 5.80... INFO [2023-09-08 12:09:38] Recalibrating log-ratios... INFO [2023-09-08 12:09:38] Testing local optimum 8/9 at purity 0.95 and total ploidy 5.80... INFO [2023-09-08 12:09:50] Recalibrating log-ratios... INFO [2023-09-08 12:09:50] Testing local optimum 8/9 at purity 0.95 and total ploidy 5.80... INFO [2023-09-08 12:10:05] Recalibrating log-ratios... INFO [2023-09-08 12:10:05] Testing local optimum 8/9 at purity 0.95 and total ploidy 5.80... INFO [2023-09-08 12:10:22] Testing local optimum 9/9 at purity 0.78 and total ploidy 2.00... INFO [2023-09-08 12:10:33] Skipping 2 solutions that converged to the same optima. INFO [2023-09-08 12:10:33] Fitting variants with betabin model for local optimum 1/9... INFO [2023-09-08 12:10:33] Fitting variants for purity 0.49, tumor ploidy 2.93 and contamination 0.01. INFO [2023-09-08 12:12:05] Poor goodness-of-fit (0.422). Skipping post-optimization. INFO [2023-09-08 12:12:05] Optimized purity: 0.49 INFO [2023-09-08 12:12:05] Fitting variants with betabin model for local optimum 2/9... INFO [2023-09-08 12:12:05] Fitting variants for purity 0.34, tumor ploidy 2.22 and contamination 0.01. INFO [2023-09-08 12:13:39] Poor goodness-of-fit (0.424). Skipping post-optimization. INFO [2023-09-08 12:13:39] Optimized purity: 0.34 INFO [2023-09-08 12:13:39] Fitting variants with betabin model for local optimum 3/9... INFO [2023-09-08 12:13:39] Fitting variants for purity 0.45, tumor ploidy 1.96 and contamination 0.01. INFO [2023-09-08 12:15:10] Poor goodness-of-fit (0.427). Skipping post-optimization. INFO [2023-09-08 12:15:10] Optimized purity: 0.45 INFO [2023-09-08 12:15:10] Fitting variants with betabin model for local optimum 4/9... INFO [2023-09-08 12:15:10] Fitting variants for purity 0.40, tumor ploidy 3.24 and contamination 0.01. INFO [2023-09-08 12:16:47] Poor goodness-of-fit (0.470). Skipping post-optimization. INFO [2023-09-08 12:16:47] Optimized purity: 0.40 INFO [2023-09-08 12:16:47] Fitting variants with betabin model for local optimum 5/9... INFO [2023-09-08 12:16:47] Fitting variants for purity 0.61, tumor ploidy 3.99 and contamination 0.01. INFO [2023-09-08 12:18:24] Poor goodness-of-fit (0.496). Skipping post-optimization. INFO [2023-09-08 12:18:24] Optimized purity: 0.61 INFO [2023-09-08 12:18:24] Fitting variants with betabin model for local optimum 6/9... INFO [2023-09-08 12:18:24] Fitting variants for purity 0.86, tumor ploidy 5.00 and contamination 0.01. INFO [2023-09-08 12:19:58] Poor goodness-of-fit (0.549). Skipping post-optimization. INFO [2023-09-08 12:19:58] Optimized purity: 0.86 INFO [2023-09-08 12:19:58] Fitting variants with betabin model for local optimum 9/9... INFO [2023-09-08 12:19:58] Fitting variants for purity 0.59, tumor ploidy 2.90 and contamination 0.01. INFO [2023-09-08 12:21:33] Poor goodness-of-fit (0.350). Skipping post-optimization. INFO [2023-09-08 12:21:33] Optimized purity: 0.59 INFO [2023-09-08 12:21:33] Initial guess of contamination rate: 0.050 INFO [2023-09-08 12:21:33] Optimizing contamination rate of optimum 1/5... INFO [2023-09-08 12:23:08] Optimized contamination rate: 0.056 INFO [2023-09-08 12:23:08] Initial guess of contamination rate: 0.051 INFO [2023-09-08 12:23:08] Optimizing contamination rate of optimum 2/5... INFO [2023-09-08 12:24:43] Optimized contamination rate: 0.057 INFO [2023-09-08 12:24:43] Initial guess of contamination rate: 0.033 INFO [2023-09-08 12:24:43] Optimizing contamination rate of optimum 3/5... INFO [2023-09-08 12:26:19] Optimized contamination rate: 0.042 INFO [2023-09-08 12:26:19] Initial guess of contamination rate: 0.048 INFO [2023-09-08 12:26:19] Optimizing contamination rate of optimum 4/5... INFO [2023-09-08 12:27:53] Optimized contamination rate: 0.054 INFO [2023-09-08 12:27:53] Initial guess of contamination rate: 0.017 INFO [2023-09-08 12:27:53] Optimizing contamination rate of optimum 5/5... INFO [2023-09-08 12:29:29] Optimized contamination rate: 0.040 INFO [2023-09-08 12:29:29] Done. INFO [2023-09-08 12:29:29] ------------------------------------------------------------ INFO [2023-09-08 12:29:34] Generating output files... WARN [2023-09-08 12:29:48] --intervals does not contain gene symbols. Not generating gene-level calls. B-allele frequency plot

ballelefreqplot

Session Info R version 4.2.3 (2023-03-15) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 22.04.2 LTS

Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached): [1] compiler_4.2.3

I would also like to attach the aggregate results. As you can see, there are no failures, but many issues flagged. <html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40">

Sampleid | Purity | Ploidy | Sex | Contamination | Flagged | Failed | Curated | Comment -- | -- | -- | -- | -- | -- | -- | -- | -- OS11 | 0.45 | 1.964282 | M | 0.055606 | TRUE | FALSE | FALSE | POOR GOF (41.2%);NOISY LOG-RATIO;HIGH AT- OR GC-DROPOUT;POTENTIAL SAMPLE CONTAMINATION OS12 | 0.59 | 2.02072 | M | 0.078143 | TRUE | FALSE | FALSE | POOR GOF (57.4%);NOISY LOG-RATIO;HIGH AT- OR GC-DROPOUT;POTENTIAL SAMPLE CONTAMINATION OS13 | 0.59 | 1.964874 | M | 0.081329 | TRUE | FALSE | FALSE | POOR GOF (57.1%);NOISY LOG-RATIO;HIGH AT- OR GC-DROPOUT;POTENTIAL SAMPLE CONTAMINATION OS14 | 0.59 | 2.090946 | M | 0.081519 | TRUE | FALSE | FALSE | POOR GOF (55.5%);NOISY LOG-RATIO;HIGH AT- OR GC-DROPOUT;POTENTIAL SAMPLE CONTAMINATION OS15 | 0.82 | 2.536787 | M | 0.027567 | TRUE | FALSE | FALSE | POOR GOF (35.4%);NOISY LOG-RATIO;HIGH AT- OR GC-DROPOUT;POTENTIAL SAMPLE CONTAMINATION OS16 | 0.32 | 2.180455 | M | 0.082852 | TRUE | FALSE | FALSE | POOR GOF (68%);NOISY LOG-RATIO;HIGH AT- OR GC-DROPOUT;POTENTIAL SAMPLE CONTAMINATION OS31 | 0.42 | 1.997302 | M | 0.08345 | TRUE | FALSE | FALSE | POOR GOF (66.5%);NOISY LOG-RATIO;HIGH AT- OR GC-DROPOUT;POTENTIAL SAMPLE CONTAMINATION OS32 | 0.56 | 1.988732 | M | 0.085572 | TRUE | FALSE | FALSE | POOR GOF (63.3%);NOISY LOG-RATIO;HIGH AT- OR GC-DROPOUT;POTENTIAL SAMPLE CONTAMINATION OS33 | 0.86 | 1.990658 | M | 0.058915 | TRUE | FALSE | FALSE | POOR GOF (39%);NOISY LOG-RATIO;HIGH AT- OR GC-DROPOUT;POTENTIAL SAMPLE CONTAMINATION OS34 | 0.66 | 1.989603 | M | 0.088028 | TRUE | FALSE | FALSE | POOR GOF (58.8%);NOISY LOG-RATIO;HIGH AT- OR GC-DROPOUT;POTENTIAL SAMPLE CONTAMINATION OS35 | 0.87 | 1.997795 | M | 0.054135 | TRUE | FALSE | FALSE | POOR GOF (37.2%);NOISY LOG-RATIO;HIGH AT- OR GC-DROPOUT;POTENTIAL SAMPLE CONTAMINATION OS37 | 0.69 | 1.998294 | M | 0.084497 | TRUE | FALSE | FALSE | POOR GOF (58.6%);NOISY LOG-RATIO;HIGH AT- OR GC-DROPOUT;POTENTIAL SAMPLE CONTAMINATION OS71 | 0.44 | 1.980068 | M | 0.085305 | TRUE | FALSE | FALSE | POOR GOF (63.9%);NOISY LOG-RATIO;HIGH AT- OR GC-DROPOUT;POTENTIAL SAMPLE CONTAMINATION OS72 | 0.4 | 1.982376 | M | 0.080135 | TRUE | FALSE | FALSE | EXCESSIVE LOSSES;POOR GOF (64.5%);NOISY LOG-RATIO;HIGH AT- OR GC-DROPOUT;POTENTIAL SAMPLE CONTAMINATION OS73 | 0.29 | 1.978832 | M | 0.083472 | TRUE | FALSE | FALSE | LOW PURITY;POOR GOF (63.3%);NOISY LOG-RATIO;HIGH AT- OR GC-DROPOUT;POTENTIAL SAMPLE CONTAMINATION OS74 | 0.72 | 1.999794 | M | 0.065739 | TRUE | FALSE | FALSE | POOR GOF (52.1%);NOISY LOG-RATIO;HIGH AT- OR GC-DROPOUT;POTENTIAL SAMPLE CONTAMINATION OS75 | 0.56 | 1.999169 | M | 0.080766 | TRUE | FALSE | FALSE | POOR GOF (59.5%);NOISY LOG-RATIO;HIGH AT- OR GC-DROPOUT;POTENTIAL SAMPLE CONTAMINATION OS76 | 0.35 | 1.973285 | M | 0.089372 | TRUE | FALSE | FALSE | EXCESSIVE LOSSES;POOR GOF (63.8%);NOISY LOG-RATIO;HIGH AT- OR GC-DROPOUT;POTENTIAL SAMPLE CONTAMINATION

lima1 commented 1 year ago

Hi Alex, yes, looks like an issue with the PoN. Even a minor assay difference can result in poor results (like Agilent v4 vs v6). If you are not sure if it's exactly the same assay, I would generate per base coverage and display all samples in IGV. Most likely you see it's a different assay. The log ratio indeed should show no GC bias. Very low coverage in the normals should not cause this.

Base quality scores are not important if you do proper artifact filtering upstream. It's essentially just for determining if enough supporting reads are available to warrant inclusion of a variant in the fitting.

I'll fix the error message formatting.

Off-target is not necessarily helpful with whole-exome. If it's cleaner than on-target, it's fine to include. It just adds complexity, in doubt keep it simple, especially if < 10% of coverage information comes from off-target. Off-target is critical for smaller gene panel to fill the gaps between genes to get ploidy right. With WES, you don't have that issue.

The purity is clearly wrong for the plot you posted. Most likely because the coverage is super noisy, but there might be also an issue with the variant filtering.

alextrouerntrend commented 1 year ago

Thank you for your comments, I will address these as suggested and see if there is a change in the quality of the results.