ERROR: There was a mismatch between the actual wildtype amino acid sequence (Q) and the expected amino acid sequence (L). Did you use the same reference build version for VEP that you used for creati ng the VCF?

jackyenbioinfo commented 5 years ago

pvactools version: 1.3.3
Python version: 3.5
Operating System: linux

Describe the bug A clear and concise description of what the bug is.

I use a different reference genome for alignment and getting an error message like described above. To Reproduce What is the pVACtools command you ran? Please also attach any input files that can be used to reproduce this problem.

/home/lee/NGS/anaconda/envs/py35/bin/pvacseq run /media/NGS_data/pvactools/CS436935_S5_bam_readcount_expression_annotated.vcf Test HLA-A32:01,HLA-A03:01,HLA-B15:01,HLA-B14:02,HLA-C03:03,HLA-C08:02,DRB1*1927.0 MHCflurry MHCnuggetsI MHCnuggetsII NNalign NetMHC PickPocket SMM SMMPMBEC SMMalign /media/NGS_data/pvactools/CS436935_S5 -e 8,9

Log Output

The log that gets written to standard out

Output File If your bug is related to the contents of an output file, please paste the output file here.

Expected behavior A clear and concise description of what you expected to happen.

susannasiebert commented 5 years ago

As the error message indicated, since you used a different reference genome between alignment and VEP annotation, the VEP annotation is incorrect. The predict protein change and transcript protein sequence do not match up. This error is expected and there for the purpose of preventing incorrect epitope prediction. You need to use the same reference for VEP annotation as for alignment or realign your sample against the reference you want to use for annotation.

You can also remove the VCF entries that are causing this error although we can't guarantee that your results will be correct even if your pVACseq run succeeds.

jackyenbioinfo commented 5 years ago

Hello Thanks. Is there's anyway to align the sequence and still able to run pvacseq with the different built? Because there's a specific version of the reference genome that we have been using and we are hoping to continue using for our project. What would you suggest? Thanks

susannasiebert commented 5 years ago

Without more information about which reference genome you've been using for variant calling and how you've been running VEP, we won't be able to answer that question.

jackyenbioinfo commented 5 years ago

Hi, we are using "hs37d5" version ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz.

susannasiebert commented 5 years ago

How did you run VEP? Which version of VEP and the VEP cache are you using?

jackyenbioinfo commented 5 years ago

Hello, this is the command I used. /home/lee/NGS/pVACtools/ensembl-vep/vep \ --input_file /media/NGS_data/pvactools/CS436935_S5.vcf --output_file /media/NGS_data/pvactools/CS436935_S5_raw_vep.vcf \ --format vcf --vcf --symbol --terms SO --tsl \ --hgvs --fasta /home/lee/reference/hs37d5/hs37d5.fa \ --offline --cache /home/lee/.vep \ --plugin Downstream --plugin Wildtype

here's the version information Versions: ensembl : 96.7a35428 ensembl-funcgen : 96.9c3a0cd ensembl-io : 96.6e65b30 ensembl-variation : 96.70d2777 ensembl-vep : 96.0

Thanks Jack

m-two commented 5 years ago

Hi Jack,

The discrepancy you report is likely a common germline variant (dbSNP site). Refseq NM and NP transcript/protein sequences are not based on the reference sequence and frequently contain common germline variants compared to the genomic reference. The ENST and ENSP transcript/protein sequences are based on the genome reference and match the sequence exactly. If you have the genomic position or dbSNP ID for this site I can explain in greater detail.

-Mike M.

susannasiebert commented 5 years ago

Could you please also attach information about the variant that is causing this error? It should've been appended to the error message.

m-two commented 5 years ago

Hi Jack,

You may need to invoke --assembly=GRCh37 in your vep command, in addition to specifying the reference fasta. I believe the default assembly is assumed to be GRCh38 since Ensembl release 74 or 75. Last I heard, the annotation utilized for GRCh37 is still based on the data as of Feb 2014 when Ensembl 75 was released. http://feb2014.archive.ensembl.org/. If you find annotation differences, you may wish to refer to that archive source when comparing transcript alleles and protein amino acid residues.

Good Luck,

-Mike M

jackyenbioinfo commented 5 years ago

Hi @m-two Thanks! I'll give it a try and post my results here. Thanks for your help. Jack

m-two commented 5 years ago

On a similar note there was a recent announcement regarding "Matched Annotation from NCBI and EBI (MANE)" yesterday: http://www.ensembl.info/2019/04/16/we-are-making-mane-changes/

susannasiebert commented 5 years ago

@jackyenbioinfo did you have any success reannotating your VCF? I would still be interested to see the variant that is causing this problem so that we can do some testing on our end.

tdw1221 commented 5 years ago

Hi, I'm having a similar issue where I believe I use the correct version of reference genome, from the original vcf header I found out: ##reference=GRCh37-lite.fa, so I believed this is GRCH37, and I used 94_GRCh37 for vep annotation. This works out for all other vcf files in this batch except for one specific vcf, where I got an error message:

ERROR: There was a mismatch between the actual wildtype amino acid sequence (QQQQQHSNQTSNWSP) and the expected amino acid sequence (QQQQQQQQQQQQQQQ). Did you use the same reference build version for VEP that you used for creating the VCF? OrderedDict([('chromosome_name', '4'), ('start', '140811063'), ('stop', '140811105'), ('reference', 'TTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGC'), ('variant', 'TTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGC'), ('gene_name', 'MAML3'), ('transcript_name', 'ENST00000398940'), ('transcript_support_level', 'NA'), ('amino_acid_change', 'QQQQQQQQQQQQQQQ/QQQQQQQQQQQQQ'), ('codon_change', 'caGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAa/caGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAa'), ('ensembl_gene_id', 'ENSG00000196782'), ('hgvsc', ''), ('hgvsp', ''), ('wildtype_amino_acid_sequence', 'MSPAEQLKQMAAQQQQRAKLMQQKQQQQQQQQQQQQQQHSNQTSNWSPLGPPSSPYGAAFTAEKPNSPMMYPQAFNNQNPIVPPMANNLQKTTMNNYLPQNHMNMINQQPNNLGTNSLNKQHNILTYGNTKPLTHFNADLSQRMTPPVANPNKNPLMPYIQQQQQQQQQQQQQQQQQQPPPPQLQAPRAHLSEDQKRLLLMKQKGVMNQPMAYAALPSHGQVSVKVTHSSSSSCGAFDCRCVMFYKSTVVGSNCLQKTSLHPQVVARFGVFFFHFS'), ('downstream_amino_acid_sequence', ''), ('fusion_amino_acid_sequence', ''), ('variant_type', 'inframe_del'), ('protein_position', '34-37'), ('transcript_expression', 'NA'), ('gene_expression', 'NA'), ('normal_depth', 'NA'), ('normal_vaf', 'NA'), ('tdna_depth', '24'), ('tdna_vaf', 'NA'), ('trna_depth', 'NA'), ('trna_vaf', 'NA'), ('index', '21538.MAML3.ENST00000398940.inframe_del.34-37QQQQQQQQQQQQQQQ/QQQQQQQQQQQQQ'), ('protein_length_change', '')])

I also saw some other warinings in basically all other run of pvacseq, however since they did not give me an error I just ignore them:

Wildtype sequence length is shorter than desired peptide sequence length at position (22, 23263601, 23263602). Using wildtype sequence length (15) instead. Warning: Mismatch between the number of alternate alleles and number of values in the AD field for genotype Call(sample=TCGA-EB-A85J-10B-01D-A34X-08, CallData(GT=0/1, AD=[7, 13], DP=20, GQ=54, PL=[233, 0, 54, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None], SDP=None, RD=None, FREQ=None, PVAL=None, RBQ=None, ABQ=None, RDF=None, RDR=None, ADF=None, ADR=None))

Thanks

susannasiebert commented 5 years ago

We've added some information about this error to our documentation: https://pvactools.readthedocs.io/en/latest/pvacseq/errors.html. The problem is with your VEP annotation so I recommend contacting VEP for further assistance.

susannasiebert commented 5 years ago

Since there has been no more activity on this ticket, I'm resolving it. In the next couple of months we're planning on creating a validation tool that can be used to identify and remove variants with mismatched reference/transcript bases.

boyangzhao commented 3 years ago

Hi @susannasiebert - just following up on this. Was there ever a validation tool made to identify and remove mismatched variants? We have trying to use pvacseq generate fasta tool (pvacseq version 2.0.0) on germline variant calls and seem to see a similar error. We have gone through checklist from the Common Errors page of documentation, and confirm that we have the same build for VEP and for alignment, have also correctly defined --fasta and --assembly arguments. We routinely use pvacseq on somatic variants on the same assemblies and same arguments with no issues. This is the first time seeing this, and it was with a germline variant call vcf as input.

The error for us is,

2021-04-23T15:51:22.170221540Z ERROR: There was a mismatch between the actual wildtype amino acid sequence (REQEERLREQEEERLREQEERLCEQEERLCEQEERLCEQEKL) and the expected amino acid sequence (REQEERLREQEERLCEQEERLCEQEERLREQEERLCEQEERL). Did you use the same reference build version for VEP that you used for creating the VCF?
2021-04-23T15:51:22.170264859Z OrderedDict([('chromosome_name', '15'), ('start', '82344920'), ('stop', '82345046'), ('reference', 'ATAGCCTCTCCTCCTGTTCACATAGCCTCTCCTCCTGTTCACGTAGCCTCTCCTCCTGTTCACACAGCCTCTCCTCCTGTTCACATAGCCTCTCCTCCTGTTCACGTAGCCTCTCCTCCTGTTCACG'), ('variant', 'A'), ('gene_name', 'GOLGA6L10'), ('transcript_name', 'ENST00000619556.4'), ('transcript_support_level', '5'), ('amino_acid_change', 'REQEERLREQEERLCEQEERLCEQEERLREQEERLCEQEERL/-'), ('codon_change', 'CGTGAACAGGAGGAGAGGCTACGTGAACAGGAGGAGAGGCTATGTGAACAGGAGGAGAGGCTGTGTGAACAGGAGGAGAGGCTACGTGAACAGGAGGAGAGGCTATGTGAACAGGAGGAGAGGCTA/-'), ('ensembl_gene_id', 'ENSG00000278662'), ('hgvsc', 'ENST00000619556.4:c.712_798del'), ('hgvsp', 'ENSP00000483647.2:p.Arg238_Leu266del'), ('wildtype_amino_acid_sequence', 'MWPQPRLPPHPAMSEKTQQGKLAAAKKKLKAYWQRKSPGIPAGANRKKKVNGSSPDTATSGGYHSPGDSATGIYGEGRASSTTLQDLESQYQELAVALDSSSAIISQLTENINSLVRTSKEEKKHEIHLVQKLGRSLFKLKNQTAEPLAPEPPAGPSKVEQLQDETNHLRKELESVGRQLQAEVENNQMLSLLNRRQEERLREQEERLHEQEERLHEQEERLCEQEEERLCEQEERLREQEERLREQEEERLREQEERLCEQEERLCEQEERLCEQEKLPGQERLLEEVEKLLEQERRQEEQERLLERERLLDEVEELLEQERLRQQDERLWQQETLRELERLRELERLRELERMLELGWEALYEQRAEPRSGFEELNNENKSTLQLEQQVKELEKSGGAEEPRGSESAAAARPVPGAPVPQGAWMCGQAGWTPQEHPGLSGEAVGTGEAAGGAEEAACHSFRAAENRELNITII'), ('frameshift_amino_acid_sequence', ''), ('fusion_amino_acid_sequence', ''), ('variant_type', 'inframe_del'), ('protein_position', '238-266'), ('transcript_expression', 'NA'), ('gene_expression', 'NA'), ('normal_depth', 'NA'), ('normal_vaf', 'NA'), ('tdna_depth', '77'), ('tdna_vaf', '0.33766233766233766'), ('trna_depth', 'NA'), ('trna_vaf', 'NA'), ('index', '10807.GOLGA6L10.ENST00000619556.4.inframe_del.238-266REQEERLREQEERLCEQEERLCEQEERLREQEERLCEQEERL/-'), ('protein_length_change', '')])

Not sure if this is related to the comment made by m-two earlier,

The discrepancy you report is likely a common germline variant (dbSNP site).

susannasiebert commented 3 years ago

I made a tool but haven't had a chance yet to make a new VAtools release. It's currently in this PR: https://github.com/griffithlab/VAtools/pull/44. If you are familiar/comfortable with git and python you can checkout that branch and install VAtools directly from that branch to give it a try. I would be interested to hear feedback on the tool.

boyangzhao commented 3 years ago

Thanks!! I just ran it on the vcf file and it worked (below were the stats). Also, just ran the filtered vcf through pvacseq generate fasta and finished without problems. Thanks!

Total number of variants: 95243
Total number of processable variants (at least one missense, inframe indels, or frameshift transcript): 13384
Total number of variants with mismatched annotations: 2
Percentage of processable variants with mismatched annotations: 0.01%
Percentage of variants with mismatched annotations: 0.0%

Total number of transcripts: 765267
Total number of processable transcripts (missense, inframe indels, frameshifts): 34735
Total number of transcripts with mismatched annotations: 2
Percentage of processable transcripts with mismatched annotations: 0.01%
Percentage of all transcripts with mismatched annotations: 0.0%

As feedback: it was quite easy to use. One thing was it wasn't apparent to me on the first run (I just skimmed the help text) that without the --filter, the output was just outputting the mismatches, but I reread the help and all worked out in the end. I guess the logic is in the name of the tool (to report mismatches), and as secondary function to return the filtered list.

iichelhadi commented 3 years ago

I am having a similar issue with my annotation. I ran bwa alignment using UCSC hg38 reference genome build and Mutect2 with the same reference. After running VEP and trying to run pvacseq I got a similar error as boyangzhao. I tried VAtools with hard filter to remove mismatches but it didn't seem to work. For VEP I used the reference fasta that VEP suggests "Homo_sapiens.GRCh38.dna.toplevel.fa". Funny enough for another sequencing dataset I didn't get any issues. I tried running the bwa alignment with the same reference "Homo_sapiens.GRCh38.dna.toplevel.fa" but it doesn't seem to work with Mutect2. Any help would be greatly appreciated

susannasiebert commented 3 years ago

@iichelhadi I'm sorry you're running into problems. Could you provide more information for us to debug this issue?

The error message you're receiving
The pVACseq command you tried to run
The pVACtools version you are using
The unfiltered VCF file
The hard-filtered VCF file after running VAtools
The VAtools command you ran to create your hard-filtered VCF file Feel free to make a new ticket with all of this information.

iichelhadi commented 3 years ago

@iichelhadi I'm sorry you're running into problems. Could you provide more information for us to debug this issue?

* The error message you're receiving

* The pVACseq command you tried to run

* The pVACtools version you are using

* The unfiltered VCF file

* The hard-filtered VCF file after running VAtools

* The VAtools command you ran to create your hard-filtered VCF file
  Feel free to make a new ticket with all of this information.

Hi Susanna, Thank you for the reply. I repeated the whole process with a new reference from ensembl Homo_sapiens.GRCh38.dna.primary_assembly.fa hoping it might solve the issue. For VEP I am using version 103 and I use the fasta argument using the ensembl reference Homo_sapiens.GRCh38.dna.toplevel.fa which in terms of dna sequence should be identical to the Homo_sapiens.GRCh38.dna.primary_assembly.fa. When I tried running pvactools I got a different error.

Generating Variant Peptide FASTA and Key File
Traceback (most recent call last):
  File "/home/svu/phaei/.local/bin/pvacseq", line 8, in <module>
    sys.exit(main())
  File "/home/svu/phaei/.local/lib/python3.8/site-packages/tools/pvacseq/main.py", line 95, in main
    args[0].func.main(args[1])
  File "/home/svu/phaei/.local/lib/python3.8/site-packages/tools/pvacseq/run.py", line 122, in main
    pipeline.execute()
  File "/home/svu/phaei/.local/lib/python3.8/site-packages/lib/pipeline.py", line 458, in execute
    self.generate_combined_fasta()
  File "/home/svu/phaei/.local/lib/python3.8/site-packages/lib/pipeline.py", line 173, in generate_combined_fasta
    generate_combined_fasta.main(params)
  File "/home/svu/phaei/.local/lib/python3.8/site-packages/tools/pvacseq/generate_protein_fasta.py", line 155, in main
    generate_fasta(args.flanking_sequence_length, downstream_sequence_length, temp_dir, proximal_variants_tsv)
  File "/home/svu/phaei/.local/lib/python3.8/site-packages/tools/pvacseq/generate_protein_fasta.py", line 102, in generate_fasta
    fasta_generator.execute()
  File "/home/svu/phaei/.local/lib/python3.8/site-packages/lib/fasta_generator.py", line 138, in execute
    position = int(line['protein_position'].split('-', 1)[0]) - 1
ValueError: invalid literal for int() with base 10: ''

the pvacseq command I ran: pvacseq run --iedb-install-directory /home/svu/phaei/IEDB --binding-threshold 500 --n-threads 24 --net-chop-method "cterm" --netmhc-stab --exclude-NAs \ --normal-sample-name 2440_N vcf_files2/2440_somatic.filtered.VEP.vcf 2440_T HLA-A*11:01 MHCflurry MHCnuggetsI NetMHC NetMHCpan pvacseq/2440/ The pvactools version I have installed 2.0.2

Would you still like me to open a new ticket for this issue?

regards

EL

susannasiebert commented 3 years ago

Yes, please make a new issue for this error and please also attach your input VCF to the issue for further debugging.

iichelhadi commented 3 years ago

Yes, please make a new issue for this error and please also attach your input VCF to the issue for further debugging.

done. Hope to hear from you soon.

griffithlab / pVACtools

ERROR: There was a mismatch between the actual wildtype amino acid sequence (Q) and the expected amino acid sequence (L). Did you use the same reference build version for VEP that you used for creati ng the VCF? #368