nithishak commented 2 years ago

Describe the issue When I run PureCN on 10 tumor samples and use their paired 10 germline samples to create a reference, I get different cellularity and ploidy results from Sequenza. Moreover, I can only run the script if I exclude --off-targets parameter in Step 1 below. I am hoping to get some advice on whether my workflow is correct. I followed the steps from the vignette.

To Reproduce

Step 1 - Generating interval files

Rscript $PURECN/IntervalFile.R --in-file $BAIT_COORDINATES \
        --fasta $REF_GENOME_b37 --out-file $OUT_REF/baits_hg19_intervals.txt \
        --off-target --genome hg19 \
        --export $OUT_REF/baits_optimized_hg19.bed  \
        --mappability $MAPPABILITY_FILE_B37

Step 2 - Run mutect on 10 germline samples, create genomics db and use CreateSomaticPanelOfNormals

> $PURECN_G/G_sample_map

    count=1
        while IFS='' read -r line
        do
        if [ ! -z $GERMLINE_PFX ] && [ ! -f $PURECN_G/${GERMLINE_PFX}.vcf.gz ] && [ $count -le 10 ]; then
                count=$((count+1))
                #Run mutect on 10 germline samples
                gatk Mutect2 \
                -R $REF_GENOME_b37 \
                -I $SAVEROOT/$TUMOR_PFX/recal/bqsr_${GERMLINE_PFX}*.bam \
                --max-mnp-distance 0 \
                --genotype-germline-sites true \
                --genotype-pon-sites true \
                --interval-padding 75 \
                --L $BEDFILE_0B \
                -O $PURECN_G/${GERMLINE_PFX}.vcf.gz

                #Create a map file of all germline vcfs
                echo -e "$GERMLINE_PFX\t$PURECN_G/${GERMLINE_PFX}.vcf.gz" >> $PURECN_G/G_sample_map
        fi

       done<$LIST_OF_SAMPLES

        #Create a Genomics DB from the normal Mutect2 calls 
        gatk GenomicsDBImport \
        -R $REF_GENOME_b37 \
        -L $BEDFILE_0B \
        --genomicsdb-workspace-path $PURECN_G/pon_db \
        --sample-name-map $PURECN_G/G_sample_map

        #Combine the normal calls using CreateSomaticPanelOfNormals
        gatk CreateSomaticPanelOfNormals \
        -R $REF_GENOME_b37 \
        --germline-resource $GERMLINE_RESOURCE \
        -V gendb://$PURECN_G/pon_db \
        -O $PURECN_G/pon.vcf.gz

Step 3 - Run Mutect on unmatched tumor samples

while IFS='' read -r line
        do
                if [ ! -z $TUMOR_PFX ]  && [ ! -f $PURECN_T/${TUMOR_PFX}.vcf.gz ]; then
                        #Run Mutect2 for each tumor sample
                        gatk Mutect2 \
                        -R $REF_GENOME_b37 \
                        -I $SAVEROOT/$TUMOR_PFX/recal/bqsr_${TUMOR_PFX}*.bam \
                        -pon $PURECN_G/pon.vcf.gz \
                        --germline-resource $GERMLINE_RESOURCE\
                        --genotype-germline-sites true \
                        --genotype-pon-sites true \
                        --interval-padding 75 \
                        -L $BEDFILE_0B \
                        -O $PURECN_T/${TUMOR_PFX}.vcf.gz

                fi

       done<$LIST_OF_SAMPLES

Step 4 - calculate gc normalized coverages for germlines

    count=1
        while IFS='' read -r line
        do
        if [ ! -z $GERMLINE_PFX ] && [ $count -le 10 ] && [ ! -f $PURECN_G/${GERMLINE_PFX}_coverage_loess.txt.gz ] ; then
                count=$((count+1))

                Rscript $PURECN/Coverage.R \
                --out-dir $PURECN_G \
                --bam $SAVEROOT/$TUMOR_PFX/recal/bqsr_${GERMLINE_PFX}*.bam \
                --intervals $OUT_REF/baits_hg19_intervals.txt
        fi

       done<$LIST_OF_SAMPLES

Step 5 - calculate gc normalized coverages for tumors

    while IFS='' read -r line
        do
        if [ ! -z $TUMOR_PFX ] && [ ! -f $PURECN_T/${TUMOR_PFX}_coverage_loess.txt.gz ]; then

                Rscript $PURECN/Coverage.R \
                --out-dir $PURECN_T \
                --bam $SAVEROOT/$TUMOR_PFX/recal/bqsr_${TUMOR_PFX}*.bam \
                --intervals $OUT_REF/baits_hg19_intervals.txt
        fi

        done<$LIST_OF_SAMPLES

Step 6 - create normal db

    ls -a $PURECN_G/*_loess.txt.gz | cat > $PURECN_G/normal_coverages.list
        ulimit -n 5000
        #For a Mutect2/GATK4 normal panel GenomicsDB (beta)
        Rscript $PURECN/NormalDB.R --out-dir $PURECN_G \
        --coverage-files $PURECN_G/normal_coverages.list \
        --normal-panel $PURECN_G/pon_db \
        --genome hg19

Step 7 - calculate tumor cellularity and ploidy

while IFS='' read -r line
        do

                if [ ! -z $TUMOR_PFX ] && ! grep $TUMOR_PFX $PURECN_RESULTS/purecn_unpaired_tumor_cellularity_ploidy.txt; then

                        #For each tumor bam files
                        # Production pipeline run (dont need stats file if Mutect2 was run)
                        Rscript $PURECN/PureCN.R --out $PURECN_T \
                        --tumor $PURECN_T/bqsr_${TUMOR_PFX}_T_coverage_loess.txt.gz \
                        --sampleid $TUMOR_PFX \
                        --vcf $PURECN_T/${TUMOR_PFX}.vcf.gz \
                        --fun-segmentation PSCBS \
                        --normaldb $PURECN_G/normalDB_hg19.rds \
                        --mapping-bias-file $PURECN_G/mapping_bias_hg19.rds \
                        --intervals $OUT_REF/baits_hg19_intervals.txt \
                        --genome hg19 \
                        --model betabin \
                        --force --post-optimize --seed 123

Expected behavior I thought I might get the same cellularity and ploidy results as I did from Sequenza but the results are significantly different. I might need to alter the workflow as the “HIGH AT- OR GC-DROPOUT” flags for PureCN results don’t reflect what is seen in lab QC results (samples do not show high dropout). Also, the off target% for our targeted capture assay is around 30% and so I expect the --off-targets to be included in Step 1 but it gives an error as shown below.

Log file If I include --off-targets parameter in Step 1, I get this error in Step 6.

INFO [2022-07-12 03:14:00] Removing 15793 intervals with low coverage in normalDB.
FATAL [2022-07-12 03:14:00] No intervals left after coverage checks.
FATAL [2022-07-12 03:14:00]
FATAL [2022-07-12 03:14:00] This is most likely a user error due to invalid input data or
FATAL [2022-07-12 03:14:00] parameters (PureCN 2.2.0).
Error: No intervals left after coverage checks.
This is most likely a user error due to invalid input data or
parameters (PureCN 2.2.0).

Here is some information from the logs of one of the tumor sample from Step 7.

Execution halted
INFO [2022-07-24 19:10:31] Mean target coverages: 170X (tumor) 159X (normal).
INFO [2022-07-24 19:10:34] No off-target intervals. If this is hybrid-capture data, consider adding them.
INFO [2022-07-24 19:10:34] AT/GC dropout: 1.07 (tumor), 1.03 (normal), 1.05 (coverage log-ratio). 
INFO [2022-07-24 19:10:40] Initial testing for significant sample cross-contamination: maybe.
INFO [2022-07-24 19:10:41] 14.3% of targets contain variants.
INFO [2022-07-24 19:11:12] Mean standard deviation of log-ratios: 0.33 (MAPD: 0.23).
INFO [2022-07-24 19:13:13] Tumor/normal noise ratio: 11.108
WARN [2022-07-24 19:13:13] Extensive noise in tumor compared to normals.

lima1 commented 2 years ago

Hi Nithisha,

yes, something looks wrong here. But unfortunately nothing obvious.

Regarding off-target, simply load an example coverage file with --offtarget in R (PureCN::readCoverageFile). If there are no or not many reads, but you think there should be any, make sure the BAM file is ok and not filtered for on-target only. All PureCN is doing is filling the gaps between baits in IntervalFile.R with additional intervals. Coverage.R is then adding any reads in those gaps. Try both the coverage.txt.gz and the corresponding coverage_loess.txt.gz.

GC dropout: INFO [2022-07-24 19:10:34] AT/GC dropout: 1.07 (tumor), 1.03 (normal), 1.05 (coverage log-ratio). Might be indeed pointing to an issue. The important one is the log-ratio one, that one should be very close to 1.0, meaning that tumor and normal have identical GC bias.

Double check that there are no obvious tumor/normal swaps (unlikely, but worth checking). Then try using it without explicit GC-normalization. So instead of using coverage_loess.txt.gz, use the original coverage.txt.gz in both NormalDB.R and PureCN.R. That hopefully brings the log-ratio GC-bias to 1 (the other 2 are likely higher, but that's ok, the important part is that the log-ratio has no bias anymore).

INFO [2022-07-24 19:11:12] Mean standard deviation of log-ratios: 0.33 (MAPD: 0.23). That hopefully also gets down to something closer to 0.15 if those are young FFPE or fresh frozen samples.

If this did not fix it, can you post the B-allele frequency/coverage plots of both PureCN and Sequenza?

nithishak commented 2 years ago

Hello Dr. Riester,

Thank you. From your comments, I can see that my first error might have been that I am using a bam that has been filtered for on-target only.

I have run my workflow again and confirmed through using PureCN::readCoverageFile, that when I use --off-target, I see a lot of 0s for 23,851 positions. This is probably why my Step 7 also fails. When I do not use --off-target, I do not see as many 0s for 8056 positions. This runs to completion and produces consistent cellularity and ploidy estimates.

I would like to edit my workflow and use the unfiltered bam to see how the results differ but had a few follow-up questions.

The bam that is unfiltered has duplicate reads marked but not removed by Picard, is that alright?
In terms of making changes to the workflow, in Step 2 where I run Mutect2 on germline samples, should I exclude -L argument for bedfile?
Should the -L argument for bedfile also be removed from the GenomicsDBImport in Step 2?
I can also run cnvkit on these samples and I understand I can compare cnvkit outputs to purecn outputs for log ratios and that both should be similar. Could you let me know which file from cnvkit and which output from purecn should be compared? I would like to use this as an additional step to check that I am producing the right output.

Thank you!

lima1 commented 2 years ago

--off-target simply fills the gaps, so all the 8056 positions should be present in both. But if you filtered the BAM for intervals, the gaps are emtpy. So try the BAMs not filtered for intervals. And yes, just following best practices with duplicates marked. Removing them shouldn't be a big deal if you really want to or your pipeline does it automatically. Some UMI pipelines do that automatically for example.
Fine to use -L, but don't forget in that case to include the padding (--interval-padding). Depending on coverage, up to 200 makes sense. Not using the padding will give you only half the number of informative SNPs. Not providing -L will give you huge VCFs, with most of the variants not being informative.
Use exactly the same for GenomicsImport. That's what I used in my Mutect2 tests many months ago (should be still valid I hope):

gatk GenomicsDBImport -R /db/ngdx/references/hg38/Reference/PATCHED/sequence_u2af1_fix.v1.2020_04_01.fa -L files/gatk4_hm2_hg38.preprocessed.interval_list --interval-padding 200 --merge-input-intervals --genomicsdb-workspace-path files/gatk4_m2_hm2_pon_db -V sample_lists/normal_vcfs_gatk4_hm2_2022-01-10.list

I think the .cnr file contains the log ratio in a log2 column

Hope that helps, Markus

lima1 / PureCN

PureCN vs Sequenza results #251