griffithlab / pVACtools

http://www.pvactools.org
BSD 3-Clause Clear License

ref-transcript-mismatch-reporter does not work #1058

Closed xmy1990 closed 7 months ago

xmy1990 commented 7 months ago

Installation Type

Standalone

pVACtools Version / Docker Image

3.1.1

Python Version

No response

Operating System

No response

Describe the bug

Hello, I ran ref-transcript-mismatch-reporter (VAtools 5.1.0) on my test_vep.vcf as follows:

ref-transcript-mismatch-reporter test_vep.vcf --filter hard --output-vcf test.vcf

The error still existed, the variant is as below: chr12 48238361 . G GCCTCAATGAGGAGCACTCCAAGCAGTACCGCTGCCTCTCCTTCCAGCC . clustered_events AS_FilterStatus=SITE;AS_SB_TABLE=101,4|1,6;DP=118;ECNT=5;GERMQ=93;MBQ=37,34;MFRL=52,194;MMQ=60,60;MPOS=49;NALOD=1.48;NLOD=8.75;POPAF=6;TLOD=19.3;CSQ=CCTCAATGAGGAGCACTCCAAGCAGTACCGCTGCCTCTCCTTCCAGCC|stop_gained&protein_altering_variant|HIGH|VDR|7421|Transcript|NM_001364085.1|protein_coding|10/10||NM_001364085.1:c.1451_1452insGGCTGGAAGGAGAGGCAGCGGTACTGCTTGGAGTGCTCCTCATTGAGG|NP_001351014.1:p.Asn484delinsLysAlaGlyArgArgGlySerGlyThrAlaTrpSerAlaProHisTer|1611-1612|1451-1452|484|N/KAGRRGSGTAWSAPH*G|aac/aaGGCTGGAAGGAGAGGCAGCGGTACTGCTTGGAGTGCTCCTCATTGAGGc|||-1||EntrezGene|||rseq_mrna_nonmatch&rseq_5p_mismatch||||OK|||||||||||||||MEAMAASTSLPDPGDFDRNVPRICGVCGDRATGFHFNAMTCEGCKGFFRRSMKRKALFTCPFNGDCRITKDNRRHCQACRLKRCVDIGMMKEFILTDEEVQRKREMILKRKEEEALKDSLRPKLSEEQQRIIAILLDAHHKTYDPTYSDFCQFRPPVRVNDGGGSHPSRPNSRHTPSFSGDSSSSCSDHCITSSDMMDSSSFSNLDLSEEDSDDPSVTLELSQLSMLPHLADLVSYSIQKVIGFAKMIPGFRDLTSEDQIVLLKSSAIEVIMLRSNESFTMDDMSWTCGNQDYKYRVSDVTKAGHSLELIEPLIKFQVGLKKLNLHEEEHVLLMAICIVSPDRPGVQDAALIEAIQDRLSNTLQTYIRCRHPPPGSHLLYAKMIQKLADLRSLNEEHSKQYRCLSFQPECSMKLTPLVLEVFGNEISLGQPVAVPGWGCSSRATCQARGWRLLSSPPHPVWGSAPPLPPPLSTQPILSPVQPNPFPAGFSPVP GT:AD:AF:DP:F1R2:F2R1:SB 0/0:29,0:0.0318:29:17,0:12,0:29,0,0,0 0/1:76,7:0.0936:83:57,1:19,0:72,4,1,6

The variant was not filtered out. Please help.

Thanks!

How to reproduce this bug

ref-transcript-mismatch-reporter test_vep.vcf --filter hard --output-vcf test.vcf

Input files

No response

Log output

ERROR: There was a mismatch between the actual wildtype amino acid sequence (P) and the expected amino acid sequence (N). Did you use the same reference build version for VEP that you used for creating the VCF? OrderedDict([('chromosome_name', 'chr12'), ('start', '48238361'),
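
For context, here is a minimal sketch of the kind of comparison this error message describes: the reference amino acid reported by VEP (the part of Amino_acids before the slash) is checked against the residue found at Protein_position in the WildtypeProtein sequence. This is not the VAtools implementation. The function name `wildtype_mismatch` and the toy protein are assumptions made up for illustration; only the three VEP field names come from the annotation above.

```python
def wildtype_mismatch(wildtype_protein, protein_position, amino_acids):
    """Return (expected, actual) when the residues disagree, otherwise None.

    Hypothetical helper for illustration only; not part of VAtools.
    """
    # Protein_position can be a range such as "484-485"; use its start.
    start = protein_position.split("-")[0]
    if not start.isdigit():
        return None  # position unknown (empty or "?"), nothing to compare

    # Amino_acids looks like "N/K": reference residue(s) before the slash.
    expected = amino_acids.split("/")[0]
    if expected in ("", "-"):
        return None  # no reference residue to compare against

    index = int(start) - 1  # VEP positions are 1-based
    actual = wildtype_protein[index:index + len(expected)]
    return None if actual == expected else (expected, actual)


# Toy data (made up): position 5 of "MEAMPAS" is P, but the annotation expects N.
toy_protein = "MEAMPAS"
mismatch = wildtype_mismatch(toy_protein, "5", "N/K")
match = wildtype_mismatch(toy_protein, "3", "A/T")
```

In this toy case `mismatch` is `("N", "P")`, mirroring the shape of the message above (expected N, actual P), while `match` is `None` because position 3 really is A.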

Output files

No response

susannasiebert commented 7 months ago

@xmy1990 thank you for your interest in pVACtools. I'm happy to investigate the issue you're encountering. Can you please attach the problematic variant entry as a VCF file? Because VCF headers vary between files, I can't debug this issue without a proper VCF file. In particular, the VEP CSQ header changes depending on how you ran VEP, and I need the specific VEP header that matches the VEP CSQ annotation field in this variant. I also need all of the metadata headers for the different FORMAT fields in order to parse the VCF entry correctly.
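
To illustrate why the header matters, here is a minimal sketch of how a CSQ entry can be decoded. The pipe-separated values carry no names of their own, so the names have to be read from the `Format:` list in the `##INFO=<ID=CSQ,...>` header line. The helper names `csq_field_names` and `parse_csq` are hypothetical, and the shortened header and entry are made-up toy data; real tools may handle this differently.

```python
import re


def csq_field_names(header_lines):
    """Return the CSQ sub-field names declared in the CSQ INFO header line."""
    for line in header_lines:
        if line.startswith("##INFO=<ID=CSQ,"):
            # The field order is listed after "Format: " inside the Description.
            found = re.search(r'Format: ([^"]+)"', line)
            if found:
                return found.group(1).split("|")
    raise ValueError("no CSQ header with a 'Format:' description found")


def parse_csq(csq_value, field_names):
    """Turn the value of CSQ=... into one dict per annotated transcript."""
    records = []
    for entry in csq_value.split(","):  # one entry per transcript
        records.append(dict(zip(field_names, entry.split("|"))))
    return records


# Toy header and entry (shortened, made up for illustration).
toy_header = [
    "##fileformat=VCFv4.2",
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations '
    'from Ensembl VEP. Format: Allele|Consequence|SYMBOL|Protein_position|Amino_acids">',
]
names = csq_field_names(toy_header)
records = parse_csq("T|missense_variant|VDR|10|N/K", names)
```

Without the header, the fourth value in the toy entry is just `10`. With it, `records[0]["Protein_position"]` is `"10"`. If VEP is run with different options or plugins, the order and number of fields change, which is why the header from the same file is needed.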

xmy1990 commented 7 months ago

Thanks a lot, @susannasiebert. The problematic variant entry is attached as a VCF file. Thanks!

xmy1990 commented 7 months ago
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=FAIL,Description="Fail the site if all alleles fail but for different reasons.">
##FILTER=<ID=base_qual,Description="alt median base quality">
##FILTER=<ID=clustered_events,Description="Clustered events observed in the tumor">
##FILTER=<ID=contamination,Description="contamination">
##FILTER=<ID=duplicate,Description="evidence for alt allele is overrepresented by apparent duplicates">
##FILTER=<ID=fragment,Description="abs(ref - alt) median fragment length">
##FILTER=<ID=germline,Description="Evidence indicates this site is germline, not somatic">
##FILTER=<ID=haplotype,Description="Variant near filtered variant on same haplotype.">
##FILTER=<ID=low_allele_frac,Description="Allele fraction is below specified threshold">
##FILTER=<ID=map_qual,Description="ref - alt median mapping quality">
##FILTER=<ID=multiallelic,Description="Site filtered because too many alt alleles pass tumor LOD">
##FILTER=<ID=n_ratio,Description="Ratio of N to alt exceeds specified ratio">
##FILTER=<ID=normal_artifact,Description="artifact_in_normal">
##FILTER=<ID=orientation,Description="Orientation bias detected by the orientation bias mixture model">
##FILTER=<ID=panel_of_normals,Description="Blacklisted site in panel of normals">
##FILTER=<ID=position,Description="median distance of alt variants from end of reads">
##FILTER=<ID=slippage,Description="Variant near filtered variant on same haplotype.">
##FILTER=<ID=strand_bias,Description="Evidence for alt allele comes from one read direction only">
##FILTER=<ID=strict_strand,Description="Evidence for alt allele is not represented in both directions">
##FILTER=<ID=weak_evidence,Description="Mutation does not meet likelihood threshold">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions of alternate alleles in the tumor">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=F1R2,Number=R,Type=Integer,Description="Count of reads in F1R2 pair orientation supporting each allele">
##FORMAT=<ID=F2R1,Number=R,Type=Integer,Description="Count of reads in F2R1 pair orientation supporting each allele">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phasing set (typically the position of the first variant in the set)">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##INFO=<ID=AS_FilterStatus,Number=1,Type=String,Description="Filter status for each allele, as assessed by ApplyRecalibration. Note that the VCF filter field will reflect the most lenient/sensitive status across all alleles.">
##INFO=<ID=AS_SB_TABLE,Number=1,Type=String,Description="Allele-specific forward/reverse read counts for strand bias tests. Includes the reference and alleles separated by |.">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=ECNT,Number=1,Type=Integer,Description="Number of events in this haplotype">
##INFO=<ID=GERMQ,Number=1,Type=Integer,Description="Phred-scaled quality that alt alleles are not germline variants">
##INFO=<ID=MBQ,Number=R,Type=Integer,Description="median base quality">
##INFO=<ID=MFRL,Number=R,Type=Integer,Description="median fragment length">
##INFO=<ID=MMQ,Number=R,Type=Integer,Description="median mapping quality">
##INFO=<ID=MPOS,Number=A,Type=Integer,Description="median distance from end of read">
##INFO=<ID=NALOD,Number=A,Type=Float,Description="Negative log 10 odds of artifact in normal with same allele fraction as tumor">
##INFO=<ID=NLOD,Number=A,Type=Float,Description="Normal log 10 likelihood ratio of diploid het or hom alt genotypes">
##INFO=<ID=PON,Number=0,Type=Flag,Description="site found in panel of normals">
##INFO=<ID=POPAF,Number=A,Type=Float,Description="negative log 10 population allele frequencies of alt alleles">
##INFO=<ID=ROQ,Number=1,Type=Float,Description="Phred-scaled qualities that alt allele are not due to read orientation artifact">
##INFO=<ID=RPA,Number=R,Type=Integer,Description="Number of times tandem repeat unit is repeated, for each allele (including reference)">
##INFO=<ID=RU,Number=1,Type=String,Description="Tandem repeat unit (bases)">
##INFO=<ID=STR,Number=0,Type=Flag,Description="Variant is a short tandem repeat">
##INFO=<ID=STRQ,Number=1,Type=Integer,Description="Phred-scaled quality that alt alleles in STRs are not polymerase slippage errors">
##INFO=<ID=TLOD,Number=A,Type=Float,Description="Log 10 likelihood ratio score of variant existing versus not existing">
##SentieonCommandLine.TNfilter=<ID=TNfilter,Version="sentieon-genomics-202112.05",Date="2024-01-31T08:17:32Z",CommandLine="/sga_dev/zb-liaowanjun/sentieon-genomics-202112.05/libexec/driver -r /data2/data_share/pzx/reference/hs37d5/hs37d5.fa --algo TNfilter --tumor_sample T-4032 --normal_sample PB-4032 -v /sga_dev/zb-liaowanjun/sample36_joint/T-4032_PB-4032/matched_tmp/T-4032_jointTMP.vcf /sga_dev/zb-liaowanjun/sample36_joint/T-4032_PB-4032/matched_tmp/T-4032_jointunfiltered.vcf">
##SentieonCommandLine.TNhaplotyper2=<ID=TNhaplotyper2,Version="sentieon-genomics-202112.05",Date="2024-01-31T06:48:28Z",CommandLine="/sga_dev/zb-liaowanjun/sentieon-genomics-202112.05/libexec/driver -t 15 -r /data2/data_share/pzx/reference/hs37d5/hs37d5.fa -i /data2/dev_projects/xmy/TNB/validation_data/test1/PRJNA298330/T-4032/realigned/T-4032_final.bam -i /data2/dev_projects/xmy/TNB/validation_data/test1/PRJNA298330/PB-4032/realigned/PB-4032_final.bam --interval /sga_dev/panel_validation/V710_panel/bed/sort_KST700_v3_pd100_merged.bed --algo TNhaplotyper2 --call_germline_sites --min_init_tumor_lod 0 --min_tumor_lod 0.5 --prune_factor -1 --min_normal_lod 0 --tumor_sample T-4032 --normal_sample PB-4032 /sga_dev/zb-liaowanjun/sample36_joint/T-4032_PB-4032/matched_tmp/T-4032_jointTMP.vcf">
##contig=<ID=chr1,length=249250621,assembly=b37>
##contig=<ID=chr2,length=243199373,assembly=b37>
##contig=<ID=chr3,length=198022430,assembly=b37>
##contig=<ID=chr4,length=191154276,assembly=b37>
##contig=<ID=chr5,length=180915260,assembly=b37>
##contig=<ID=chr6,length=171115067,assembly=b37>
##contig=<ID=chr7,length=159138663,assembly=b37>
##contig=<ID=chr8,length=146364022,assembly=b37>
##contig=<ID=chr9,length=141213431,assembly=b37>
##contig=<ID=chr10,length=135534747,assembly=b37>
##contig=<ID=chr11,length=135006516,assembly=b37>
##contig=<ID=chr12,length=133851895,assembly=b37>
##contig=<ID=chr13,length=115169878,assembly=b37>
##contig=<ID=chr14,length=107349540,assembly=b37>
##contig=<ID=chr15,length=102531392,assembly=b37>
##contig=<ID=chr16,length=90354753,assembly=b37>
##contig=<ID=chr17,length=81195210,assembly=b37>
##contig=<ID=chr18,length=78077248,assembly=b37>
##contig=<ID=chr19,length=59128983,assembly=b37>
##contig=<ID=chr20,length=63025520,assembly=b37>
##contig=<ID=chr21,length=48129895,assembly=b37>
##contig=<ID=chr22,length=51304566,assembly=b37>
##contig=<ID=chrX,length=155270560,assembly=b37>
##contig=<ID=chrY,length=59373566,assembly=b37>
##contig=<ID=chrM,length=16569,assembly=b37>
##contig=<ID=GL000207.1,length=4262,assembly=b37>
##contig=<ID=GL000226.1,length=15008,assembly=b37>
##contig=<ID=GL000229.1,length=19913,assembly=b37>
##contig=<ID=GL000231.1,length=27386,assembly=b37>
##contig=<ID=GL000210.1,length=27682,assembly=b37>
##contig=<ID=GL000239.1,length=33824,assembly=b37>
##contig=<ID=GL000235.1,length=34474,assembly=b37>
##contig=<ID=GL000201.1,length=36148,assembly=b37>
##contig=<ID=GL000247.1,length=36422,assembly=b37>
##contig=<ID=GL000245.1,length=36651,assembly=b37>
##contig=<ID=GL000197.1,length=37175,assembly=b37>
##contig=<ID=GL000203.1,length=37498,assembly=b37>
##contig=<ID=GL000246.1,length=38154,assembly=b37>
##contig=<ID=GL000249.1,length=38502,assembly=b37>
##contig=<ID=GL000196.1,length=38914,assembly=b37>
##contig=<ID=GL000248.1,length=39786,assembly=b37>
##contig=<ID=GL000244.1,length=39929,assembly=b37>
##contig=<ID=GL000238.1,length=39939,assembly=b37>
##contig=<ID=GL000202.1,length=40103,assembly=b37>
##contig=<ID=GL000234.1,length=40531,assembly=b37>
##contig=<ID=GL000232.1,length=40652,assembly=b37>
##contig=<ID=GL000206.1,length=41001,assembly=b37>
##contig=<ID=GL000240.1,length=41933,assembly=b37>
##contig=<ID=GL000236.1,length=41934,assembly=b37>
##contig=<ID=GL000241.1,length=42152,assembly=b37>
##contig=<ID=GL000243.1,length=43341,assembly=b37>
##contig=<ID=GL000242.1,length=43523,assembly=b37>
##contig=<ID=GL000230.1,length=43691,assembly=b37>
##contig=<ID=GL000237.1,length=45867,assembly=b37>
##contig=<ID=GL000233.1,length=45941,assembly=b37>
##contig=<ID=GL000204.1,length=81310,assembly=b37>
##contig=<ID=GL000198.1,length=90085,assembly=b37>
##contig=<ID=GL000208.1,length=92689,assembly=b37>
##contig=<ID=GL000191.1,length=106433,assembly=b37>
##contig=<ID=GL000227.1,length=128374,assembly=b37>
##contig=<ID=GL000228.1,length=129120,assembly=b37>
##contig=<ID=GL000214.1,length=137718,assembly=b37>
##contig=<ID=GL000221.1,length=155397,assembly=b37>
##contig=<ID=GL000209.1,length=159169,assembly=b37>
##contig=<ID=GL000218.1,length=161147,assembly=b37>
##contig=<ID=GL000220.1,length=161802,assembly=b37>
##contig=<ID=GL000213.1,length=164239,assembly=b37>
##contig=<ID=GL000211.1,length=166566,assembly=b37>
##contig=<ID=GL000199.1,length=169874,assembly=b37>
##contig=<ID=GL000217.1,length=172149,assembly=b37>
##contig=<ID=GL000216.1,length=172294,assembly=b37>
##contig=<ID=GL000215.1,length=172545,assembly=b37>
##contig=<ID=GL000205.1,length=174588,assembly=b37>
##contig=<ID=GL000219.1,length=179198,assembly=b37>
##contig=<ID=GL000224.1,length=179693,assembly=b37>
##contig=<ID=GL000223.1,length=180455,assembly=b37>
##contig=<ID=GL000195.1,length=182896,assembly=b37>
##contig=<ID=GL000212.1,length=186858,assembly=b37>
##contig=<ID=GL000222.1,length=186861,assembly=b37>
##contig=<ID=GL000200.1,length=187035,assembly=b37>
##contig=<ID=GL000193.1,length=189789,assembly=b37>
##contig=<ID=GL000194.1,length=191469,assembly=b37>
##contig=<ID=GL000225.1,length=211173,assembly=b37>
##contig=<ID=GL000192.1,length=547496,assembly=b37>
##contig=<ID=NC_007605,length=171823,assembly=b37>
##contig=<ID=hs37d5,length=35477943,assembly=b37>
##reference=/xx/hs37d5.fa
##tumor_sample=Tumor-666
##normal_sample=Normal-666
##bcftools_filterVersion=1.11+htslib-1.11
#VEP="v103" time="2024-02-01 13:52:04" cache=/xx/homo_sapiens_refseq/103_GRCh37" ensembl-variation=103.06320c4 ensembl=103.4c8d44a ensembl-io=103.353f93a ensembl-funcgen=103.b53bef4 1000genomes="phase3" COSMIC="90" ClinVar="201912" ESP="20141103" HGMD-PUBLIC="20194" assembly="GRCh37.p13" dbSNP="153" gencode="GENCODE 19" genebuild="2011-04" gnomAD="r2.1" polyphen="2.2.2" refseq="2019-10-24 23:10:14 - GCF_000001405.25_GRCh37.p13_genomic.gff" regbuild="1.0" sift="sift5.2.2"
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID|TSL|REFSEQ_MATCH|REFSEQ_OFFSET|GIVEN_REF|USED_REF|BAM_EDIT|HGVS_OFFSET|gnomAD_AF|gnomAD_AFR_AF|gnomAD_AMR_AF|gnomAD_ASJ_AF|gnomAD_EAS_AF|gnomAD_FIN_AF|gnomAD_NFE_AF|gnomAD_OTH_AF|gnomAD_SAS_AF|CLIN_SIG|SOMATIC|PHENO|FrameshiftSequence|WildtypeProtein">
##FrameshiftSequence=Predicted sequence for frameshift mutations
##WildtypeProtein=The normal, non-mutated protein sequence
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Normal-666  Tumor-666
chr12   48238361    .   G   GCCTCAATGAGGAGCACTCCAAGCAGTACCGCTGCCTCTCCTTCCAGCC   .   clustered_events    AS_FilterStatus=SITE;AS_SB_TABLE=101,4|1,6;DP=118;ECNT=5;GERMQ=93;MBQ=37,34;MFRL=52,194;MMQ=60,60;MPOS=49;NALOD=1.48;NLOD=8.75;POPAF=6;TLOD=19.3;CSQ=CCTCAATGAGGAGCACTCCAAGCAGTACCGCTGCCTCTCCTTCCAGCC|stop_gained&protein_altering_variant|HIGH|VDR|7421|Transcript|NM_001364085.1|protein_coding|10/10||NM_001364085.1:c.1451_1452insGGCTGGAAGGAGAGGCAGCGGTACTGCTTGGAGTGCTCCTCATTGAGG|NP_001351014.1:p.Asn484delinsLysAlaGlyArgArgGlySerGlyThrAlaTrpSerAlaProHisTer|1611-1612|1451-1452|484|N/KAGRRGSGTAWSAPH*G|aac/aaGGCTGGAAGGAGAGGCAGCGGTACTGCTTGGAGTGCTCCTCATTGAGGc|||-1||EntrezGene|||rseq_mrna_nonmatch&rseq_5p_mismatch||||OK|||||||||||||||MEAMAASTSLPDPGDFDRNVPRICGVCGDRATGFHFNAMTCEGCKGFFRRSMKRKALFTCPFNGDCRITKDNRRHCQACRLKRCVDIGMMKEFILTDEEVQRKREMILKRKEEEALKDSLRPKLSEEQQRIIAILLDAHHKTYDPTYSDFCQFRPPVRVNDGGGSHPSRPNSRHTPSFSGDSSSSCSDHCITSSDMMDSSSFSNLDLSEEDSDDPSVTLELSQLSMLPHLADLVSYSIQKVIGFAKMIPGFRDLTSEDQIVLLKSSAIEVIMLRSNESFTMDDMSWTCGNQDYKYRVSDVTKAGHSLELIEPLIKFQVGLKKLNLHEEEHVLLMAICIVSPDRPGVQDAALIEAIQDRLSNTLQTYIRCRHPPPGSHLLYAKMIQKLADLRSLNEEHSKQYRCLSFQPECSMKLTPLVLEVFGNEISLGQPVAVPGWGCSSRATCQARGWRLLSSPPHPVWGSAPPLPPPLSTQPILSPVQPNPFPAGFSPVP   GT:AD:AF:DP:F1R2:F2R1:SB    0/0:29,0:0.0318:29:17,0:12,0:29,0,0,0   0/1:76,7:0.0936:83:57,1:19,0:72,4,1,6
xmy1990 commented 7 months ago

Hello @susannasiebert, is there any progress on this issue? Thanks.

susannasiebert commented 7 months ago

My apologies, I only replied to your issue in the VAtools repository. This issue should be fixed in VAtools 5.1.1. With that version, this variant and others like it should now be filtered out.

xmy1990 commented 7 months ago

Thanks for the quick response. I filtered it with https://github.com/griffithlab/VAtools/issues/74#issuecomment-1931105231 Thanks!