lima1 / PureCN

Copy number calling and variant classification using targeted short read sequencing
https://bioconductor.org/packages/devel/bioc/html/PureCN.html
Artistic License 2.0
128 stars 32 forks source link

Mutect2 Matched samples purity estimates fails #120

Closed npatel-ah closed 1 year ago

npatel-ah commented 4 years ago

Hello Team,

I am trying to use PureCN with a matched tumor and normal samples, however, it's throwing an error. I am attaching the log file here. I have used CNVkit for segmentation and Mutect2 (GATK 4.1.7.0) for variant detection. Tumor01.PureCN.log

I tried a few things to troubleshoot, none seems to be working.

  1. First, I ran it without the "dbinfoflag" flag but it produced below error. Note that I did provide the "popafinfofield" flag.

    WARN [2020-05-12 15:28:37] vcf.file has no DB info field for membership in germline databases. Found and used somatic status instead.
    INFO [2020-05-12 15:28:38] 0 (0.0%) variants annotated as likely germline (DB INFO flag).
  2. So then I annotated VCF file with common SNPs (1kg AF > 1%) and used the COMMON as db flag and then got another error as shown in the log file.

  3. Also, I tried different option for fun segmentation (PBCBS isn't working)

  4. And Changed thresholds for --minpurity 0.05 and --maxpurity 0.99.

Rscript $PURECN_DIR/PureCN.R 
        --out PureCN_Analysis/$TUMOR_ID \
    --sampleid $TUMOR_ID \
    --tumor TumorCNN/$TUMOR_ID.cnr \
    --segfile TumorCNN/$TUMOR_ID.seg \
    --vcf ${TUMOR_ID}.filtered.anno.vcf \
    --snpblacklist hg19_simpleRepeats.bed \
    --genome hg19 \
    --popafinfofield gnomAD_exomes_controls_AF \
    --dbinfoflag COMMON \
    --force --postoptimize --seed 123 \
    --outvcf \
    --cores 12 \
    --funsegmentation Hclust \
    --parallel >& $TUMOR_ID.PureCN.log

Any suggestion to resolve the error is appreciated. If there is an issue with samples itself as stated in log file, what would be an alternative way to confirm the behavior?

A note, I don't have normal samples to run NormalDB.

Best, nihir

lima1 commented 4 years ago

Looks like you filter out known SNPs. We need the SNPs for allele-specific copy number calling (and PSCBS requires them for annotation). M2 requires the -genotype-germline-sites flag to call germline variants.

There is also something wrong with your off-target reads. For WES, they don't add much. If you cannot figure out why the log-ratio standard deviation is so high, you probably get better results without those.

npatel-ah commented 4 years ago

Thank for your quick reply. I indeed use the "-genotype-germline-sites" to detect germline variants with Mutect2 and assuming the downstream filtering of M2 only assigns a flag to the FILTER field and not remove actual variants. In terms of common variants, I do see them VCF, in fact below line from log confirms it, as the COMMON flag assigned to AF > 1% in 1000g.

48496 (69.6%) variants annotated as likely germline (COMMON INFO flag).

How would I run analysis without using "log-ratio standard deviation ". Is "--undosd" is an appropriate choice here?

just a note I am using xGen IDT BAIT file for all the analyses.

Best, Nihir

lima1 commented 4 years ago

I would re-run CNVkit without the off-target feature. The lines log-ratio standard deviation display the noise of the tumor vs normal coverage log2-ratio. In your case, the off-target ratios are super noisy, likely the reason you get the crash. You can also try running our normalization instead of CNVkit.

Not sure I understand your comment you don't have normal samples to build the NormalDB. You have the normals to run Mutect?

npatel-ah commented 4 years ago

Thanks again, understood now I will try PureCN normalization. I have a tumor and a matched normal sample. M2 was ran with the Pair. But if understand correctly, for NormalDB it requires some minimum number of samples.

npatel-ah commented 4 years ago

Hello Markus, Thanks for all of your suggestions. Before giving up on CNVkit I have tested it with one of the pair of Tumor and Normal bam files from a panel sequencing as well updated PureCN to 1.18.

This time the SD for log-ratios seems OK however, after running almost till end I got below error, for which I can't find any related post.

One thing I came across was the SOMATIC flag from issue #108 as well noticed in this tutorial section 2.1

For M2 do I need to run VariantAnnotation to add SOMATIC flag ? and is that what is causing all the issues? Attached here the log file. PureCN_V1-18_Panel.log

Error: BiocParallel errors
  element index: 1, 2, 3, 4, 5, 6, ...
  first error: missing value where TRUE/FALSE needed
In addition: Warning message:
In .bcfHeaderAsSimpleList(header) :
  duplicate keys in header will be forced to unique rownames
Execution halted

Thanks again for your time, Nihir

lima1 commented 4 years ago

Hi Nihir,

looks like there is an issue where the mapping bias is NaN. This is likely causing the crash. I'll investigate, thanks for reporting. You should not need the SOMATIC flag for M2, it should automatically generate it.

The log file looks better now, indeed.

Markus

lima1 commented 4 years ago

Can you try again with the M2 VCF as it is generated, i.e. without adding the COMMON flag etc.?

npatel-ah commented 4 years ago

Hello Markus,

Here's the error. When I don't annotate with DB it seems to predict all variants as Somatic. How is it identifying if the variant is somatic or nor ? I don't see any mention of the somatic flag on recent M2 documentation.

INFO [2020-05-18 13:46:04] Using BiocParallel for parallel optimization.
INFO [2020-05-18 13:46:04] Loading coverage files...
INFO [2020-05-18 13:46:05] Found log2-ratio in tumor coverage data.
WARN [2020-05-18 13:46:05] Allosome coverage missing, cannot determine sex.
WARN [2020-05-18 13:46:05] Allosome coverage missing, cannot determine sex.
INFO [2020-05-18 13:46:06] Using 27582 intervals (10611 on-target, 16971 off-target).
INFO [2020-05-18 13:46:06] Ratio of mean on-target vs. off-target read counts: NaN
INFO [2020-05-18 13:46:06] Mean off-target bin size: 146326
INFO [2020-05-18 13:46:06] Loading VCF...
INFO [2020-05-18 13:46:06] Found 4343 variants in VCF file.
INFO [2020-05-18 13:46:06] Removing 464 triallelic sites.
WARN [2020-05-18 13:46:06] vcf.file has no DB info field for membership in germline databases. Found and used somatic status instead.
INFO [2020-05-18 13:46:06] 0 (0.0%) variants annotated as likely germline (DB INFO flag).
FATAL [2020-05-18 13:46:06] VCF either contains no germline variants or variants are not properly

FATAL [2020-05-18 13:46:06] annotated.

FATAL [2020-05-18 13:46:06]

FATAL [2020-05-18 13:46:06] This is most likely a user error due to invalid input data or

FATAL [2020-05-18 13:46:06] parameters (PureCN 1.18.0).

Error: VCF either contains no germline variants or variants are not properly
annotated.

This is most likely a user error due to invalid input data or
parameters (PureCN 1.18.0).
In addition: Warning message:
In .bcfHeaderAsSimpleList(header) :
  duplicate keys in header will be forced to unique rownames
Execution halted

Thanks for your continuous support on the issues.

Best. Nihir

lima1 commented 4 years ago

Did you run Mutect with -genotype-germline-sites? Otherwise it will only call somatic.

npatel-ah commented 4 years ago

Yes, I enabled the flag. Here is my command line for M2

    $GATK Mutect2 -R $GENOME \
    -L $TartgetInterval \
    -ip 50 \
    -I $TUMOR \
    -I $NORMAL \
    -tumor $TUMOR_ID \
    -normal $NORMAL_ID \
    --genotype-germline-sites true \
    --genotype-pon-sites true \
    -germline-resource $gnomAD \
    --native-pair-hmm-threads 24 \
    --native-pair-hmm-use-double-precision true \
    --disable-sequence-dictionary-validation \
    --f1r2-tar-gz $TUMOR_ID.f1r2.tar.gz \
    -O $TUMOR_ID.unfiltered.vcf.gz
lima1 commented 4 years ago

Hmmm. 4000 variants is a weird number if this is whole exome. It's too high for only somatic, but far too low if it includes germline. Do you see something weird when you look at the P_GERMLINE and POP_AF fields? I.e. does it cover the whole expected range for 0 to 0.5-1?

npatel-ah commented 4 years ago

Sorry if I didn't clarify, the last data is from Panel sequencing of about 400 genes. For my WES run, I was getting variants in a range of 70k to 80K. But I think you have identified the culprit and it seems to be the "POPAF", which is always > 1 and because it's -log10 of POPAF. Also, I believe "P_GERMLINE" no longer exits or is replaced with "GERMQ". Here is the header

##INFO=<ID=CONTQ,Number=1,Type=Float,Description="Phred-scaled qualities that alt allele are not due to contamination">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=ECNT,Number=1,Type=Integer,Description="Number of events in this haplotype">
##INFO=<ID=GERMQ,Number=1,Type=Integer,Description="Phred-scaled quality that alt alleles are not germline variants">
##INFO=<ID=MBQ,Number=R,Type=Integer,Description="median base quality">
##INFO=<ID=MFRL,Number=R,Type=Integer,Description="median fragment length">
##INFO=<ID=MMQ,Number=R,Type=Integer,Description="median mapping quality">
##INFO=<ID=MPOS,Number=A,Type=Integer,Description="median distance from end of read">
##INFO=<ID=NALOD,Number=A,Type=Float,Description="Negative log 10 odds of artifact in normal with same allele fraction as tumor">
##INFO=<ID=NCount,Number=1,Type=Integer,Description="Count of N bases in the pileup">
##INFO=<ID=NLOD,Number=A,Type=Float,Description="Normal log 10 likelihood ratio of diploid het or hom alt genotypes">
##INFO=<ID=PON,Number=0,Type=Flag,Description="site found in panel of normals">

##INFO=<ID=POPAF,Number=A,Type=Float,Description="negative log 10 population allele frequencies of alt alleles">

##INFO=<ID=ROQ,Number=1,Type=Float,Description="Phred-scaled qualities that alt allele are not due to read orientation artifact">
##INFO=<ID=RPA,Number=R,Type=Integer,Description="Number of times tandem repeat unit is repeated, for each allele (including reference)">
##INFO=<ID=RU,Number=1,Type=String,Description="Tandem repeat unit (bases)">
##INFO=<ID=SEQQ,Number=1,Type=Integer,Description="Phred-scaled quality that alt alleles are not sequencing errors">
##INFO=<ID=STR,Number=0,Type=Flag,Description="Variant is a short tandem repeat">
##INFO=<ID=STRANDQ,Number=1,Type=Integer,Description="Phred-scaled quality of strand bias artifact">
##INFO=<ID=STRQ,Number=1,Type=Integer,Description="Phred-scaled quality that alt alleles in STRs are not polymerase slippage errors">
##INFO=<ID=TLOD,Number=A,Type=Float,Description="Log 10 likelihood ratio score of variant existing versus not existing">
lima1 commented 4 years ago

Oh, ok. I'll add code to check that POPAF is on log-scale and support for GERMQ. Thanks for reporting!

lima1 commented 4 years ago

Hi Nihir, have a look when you have a chance. I don't have matched tumor/normal test data readily available right now, but the latest commit should work.

npatel-ah commented 4 years ago

Hello Markus,

Wow! That was fast. Thank you very much for the quick response. I just tested the code from the latest commit and worked perfectly fine without any additional flags (i.e dbinfoflag or popinfoflag). It can successfully detect the somatic mutation and runs without any error.

Seems like no additional annotations are required for Matched samples ran with latest M2.

Here are the lines from log file just for information

INFO [2020-05-18 18:04:34] Loading VCF...
INFO [2020-05-18 18:04:35] Found 4343 variants in VCF file.
INFO [2020-05-18 18:04:35] Removing 464 triallelic sites.
WARN [2020-05-18 18:04:35] Found GERMQ info field with Phred scaled germline probabilities.
WARN [2020-05-18 18:04:35] vcf.file has no DB info field for membership in germline databases. Found and used somatic status instead.
INFO [2020-05-18 18:04:35] 2623 (67.6%) variants annotated as likely germline (DB INFO flag).
INFO [2020-05-18 18:04:35] Tumor_O-170000133 is tumor in VCF file.
INFO [2020-05-18 18:04:35] 28 homozygous and 3 heterozygous variants on chrX.
INFO [2020-05-18 18:04:35] Sex from VCF: M (Fisher's p-value: < 0.0001, odds-ratio: 15.03).
INFO [2020-05-18 18:04:36] Removing 1358 non heterozygous (in matched normal) germline SNPs.
INFO [2020-05-18 18:04:36] Removing 879 variants with AF < 0.030 or AF >= 1.000 or less than 6 supporting reads or depth < 15.
INFO [2020-05-18 18:04:39] Removing 48 blacklisted variants.
INFO [2020-05-18 18:04:39] Removing 0 low quality variants with BQ < 25.
INFO [2020-05-18 18:04:39] Total size of targeted genomic region: 2.22Mb (2.79Mb with 50bp padding).
INFO [2020-05-18 18:04:39] 13.8% of targets contain variants.
INFO [2020-05-18 18:04:39] Removing 135 variants outside intervals.
INFO [2020-05-18 18:04:39] Found SOMATIC annotation in VCF.
INFO [2020-05-18 18:04:39] Setting somatic prior probabilities for somatic variants to 0.999000 or to 0.000100 otherwise.
INFO [2020-05-18 18:04:39] Found SOMATIC annotation in VCF. Setting mapping bias to 0.994.
INFO [2020-05-18 18:04:39] Excluding 285 novel or poor quality variants from segmentation.
INFO [2020-05-18 18:04:39] Sample sex: ?
INFO [2020-05-18 18:04:39] Segmenting data...
INFO [2020-05-18 18:04:39] Loaded provided segmentation file Tumor_1 seg (format DNAcopy).
INFO [2020-05-18 18:04:39] Re-centering provided segment means (offset -0.0523).
INFO [2020-05-18 18:04:39] Setting prune.hclust.h parameter to 0.150000.
INFO [2020-05-18 18:04:39] Found 56 segments with median size of 35.05Mb.
INFO [2020-05-18 18:04:39] Removing 7 variants outside segments.
INFO [2020-05-18 18:04:39] Using 1452 variants.
INFO [2020-05-18 18:04:39] Mean standard deviation of log-ratios: 0.30
INFO [2020-05-18 18:04:39] Mean standard deviation of on-target log-ratios only: 0.24
INFO [2020-05-18 18:04:39] Mean standard deviation of off-target log-ratios only: 0.32
INFO [2020-05-18 18:04:39] 2D-grid search of purity and ploidy...

Best, Nihir

lima1 commented 4 years ago

Great. Closing now. Have a look at the Sampleid.pdf output, the SNPs should cleanly follow the provided segmentations. Also check the Sampleid_variants.csv file (or vcf if you provided --vcf). There might be improvements possible, for example variants that should be filtered based on their flags. Please open new issues with those suggestions.

lima1 commented 4 years ago

I just found a bug that essentially turned off all flag filtering for M2 VCFs.

npatel-ah commented 4 years ago

No worries, thanks for the heads up. Let me know whenever you have an update. Meanwhile, I will take a closer look at the results as you suggested.

lima1 commented 4 years ago

I pushed a commit already. It should remove most variants not labeled as PASS, except germline and PoN.

dakyung99 commented 1 year ago

Hi, Regarding this issue, should I provide variants with PASS labeled after Mutect2 filter run? or should unfiltered raw Mutect2 output work for purecn input?

Thanks a lot! DK

lima1 commented 1 year ago

Just follow the somatic best practices as closely as possible: https://gatk.broadinstitute.org/hc/en-us/articles/360035531132 . So yes, include all filtering steps. To turn the GenomicsDB directory into a PureCN mapping bias RDS file, you can use the Docker image which comes with the genomicsdb R package.

If you have matched normals, be sure to provide --genotype-germline-sites true to not remove high quality germline SNPs. Also see #320.

dakyung99 commented 1 year ago

@lima1 Thank you so much for prompt reply, I've ve got one more question regarding germline variants VCFs- should I input germline called from matched normal?(e.g.) blood sample) or input germline mutations calls from tumour sample?

Again, Thanks for providing this wonderful tool!

lima1 commented 1 year ago

You just provide the VCF as generated by Mutect2 with the --genotype-germline-sites. It needs the tumor SNP allelic fractions, but will use the normal allelic fractions for bias estimation when available.