griffithlab / pVACtools

http://www.pvactools.org
BSD 3-Clause Clear License

Indels processing #120

Closed. immuneAI closed this issue 6 years ago

immuneAI commented 6 years ago

Hi guys, the Wildtype and Downstream plugins are not able to annotate a GATK indels VCF file. Amino acid sequences were not generated during VEP processing. Do I need to change any parameter here?

susannasiebert commented 6 years ago

What parameters did you use to annotate your VCF with VEP? Maybe paste the whole VEP command. Most likely the Downstream and Wildtype plugins were not found in the VEP plugins directory.

immuneAI commented 6 years ago

I have no problem annotating missense variants with the same command and plugin path. VEP command:

variant_effect_predictor.pl -i Indel_GATK_Tumor.vcf --output_file VEP_Indel.vcf --vcf --symbol --terms SO --dir /reference/software/vep --symbol --dir_plugins /mnt/lustre/reference/software/vep/Plugins --plugin Downstream --plugin Wildtype --pick --coding_only --offlice GRCh37

susannasiebert commented 6 years ago

Interesting, what is the CSQ header in your annotated indels VCF?

immuneAI commented 6 years ago

INFO=

susannasiebert commented 6 years ago

So it looks like the Downstream plugin runs successfully, since the DownstreamProtein field exists in the header. Can you please double check that the Wildtype.pm plugin file is still in /mnt/lustre/reference/software/vep/Plugins? Maybe it got deleted by accident.
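The header check described here can be sketched in Python. This is only an illustration, not pVACseq code: the helper name `csq_fields` and the example header line are hypothetical, and the sketch assumes the usual VEP convention of listing the pipe-separated field names after `Format: ` in the CSQ INFO header.

```python
import re

def csq_fields(header_line):
    """Return the field names listed after 'Format: ' in a VEP CSQ header line."""
    match = re.search(r'Format: ([^">]+)', header_line)
    if match is None:
        return []
    return match.group(1).strip().split("|")

# Hypothetical header line, NOT the one from this issue.
example_header = (
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations '
    'from Ensembl VEP. Format: Allele|Consequence|SYMBOL|Amino_acids|DownstreamProtein">'
)

fields = csq_fields(example_header)
# A plugin that ran adds its field to the header; a missing name suggests it did not run.
missing = [name for name in ("DownstreamProtein", "WildtypeProtein") if name not in fields]
print(missing)
```

With this example header, only WildtypeProtein is reported as missing, which is the situation described in this comment.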

immuneAI commented 6 years ago

Nope, the Wildtype plugin is there in its path. I have no problem with missense variants; I am stuck only with indels.

susannasiebert commented 6 years ago

I would suggest checking the VEP annotation log to see if it throws a warning about the Wildtype plugin. I'm not sure why it wouldn't annotate it. The fact that the WildtypeProtein field isn't in the header tells me that the plugin isn't run at all, which in my experience only happens when VEP can't find the plugin. If the log doesn't contain any warnings, I suggest putting in a ticket with Ensembl.

immuneAI commented 6 years ago

I have checked through the VEP logs and have been going through every step in detail for a long time. I have raised the issue on ensembl-vep. Please find their response at https://github.com/Ensembl/ensembl-vep/issues/153

susannasiebert commented 6 years ago

What do the logs say about the Wildtype plugin? They should at least mention that VEP is trying to run it. Please post more detail; otherwise I won't be able to help you with your issue. If you could post an example of a log file, that would be useful.

Can you upgrade VEP to the latest version, as suggested in Ensembl/ensembl-vep#153, and try to reannotate your files?

immuneAI commented 6 years ago

I figured it out: it was the VEP version, and annotation now succeeds. But when I ran pvacseq, the indels are missing from the input TSV file. It throws this warning and continues with only the missense variants.

Executing MHC Class I predictions
Converting .vcf to TSV
Allele number not found in list of alleles
Allele number not found in list of alleles
Allele number not found in list of alleles
Allele number not found in list of alleles
Allele number not found in list of alleles
Completed
Splitting TSV into smaller chunks
Generating Variant Peptide FASTA and Key Files
Generating Variant Peptide FASTA and Key Files - Entries 1-132
Completed
Processing entries for Allele HLA-A*02:01 and Epitope Length 9 - Entries 1-132
Running IEDB on Allele HLA-A*02:01 and Epitope Length 9 with Method NetMHCpan - Entries 1-132

susannasiebert commented 6 years ago

Can you please post an example indel VCF entry that gets filtered out as well as the CSQ header from the VCF?

immuneAI commented 6 years ago

25_test.vcf.gz

Check this one

susannasiebert commented 6 years ago

None of the indels in your VCF are annotated by VEP as a frameshift_variant, inframe_insertion, or inframe_deletion, which are the only three indel types supported by pVACseq.

immuneAI commented 6 years ago

This one has frameshift_variant and deletion entries. Please have a look. Thanks for the help! Indel_filtered.vcf.gz

susannasiebert commented 6 years ago

I see a couple of things going on here where VEP seems to have changed its output format, so it differs from what pVACseq is expecting.

(1) The Allele field in the CSQ entries for these indels is the literal string deletion instead of the actual alt allele. pVACseq uses this field to find the appropriate CSQ entries for each alternate allele, since a multi-alt site has CSQ entries for each alternate allele. Right now pVACseq isn't able to find matching CSQ entries in your VCF because of that.

(2) Once (1) is resolved, the Consequence string format in the CSQ entries has changed as well. Right now pVACseq expects multiple entries in the Consequence field to be separated by a &. In your VCF, they are separated by a . instead. pVACseq splits on the delimiter and checks each consequence individually against the supported consequence types instead of doing a fuzzy match on the whole consequence string. Because of this, pVACseq doesn't detect the frameshift_variant or inframe_deletion consequences.
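Both points can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the actual pVACseq implementation: the function names, the dictionary layout of a CSQ entry, and the example values are hypothetical, and the `.` delimiter is taken from the description in this comment.

```python
import re

# Consequence types named in this thread as supported (missense plus three indel types).
SUPPORTED = {"missense_variant", "frameshift_variant", "inframe_insertion", "inframe_deletion"}

def entries_for_alt(csq_entries, alt):
    """Point (1): keep only the CSQ entries whose Allele field equals the alt allele."""
    return [entry for entry in csq_entries if entry["Allele"] == alt]

def supported_consequences(consequence_field, delimiters="&"):
    """Point (2): split on the delimiter(s), then check each term exactly (no fuzzy match)."""
    pattern = "[" + re.escape(delimiters) + "]"
    terms = [term for term in re.split(pattern, consequence_field) if term]
    return [term for term in terms if term in SUPPORTED]

# Hypothetical CSQ entry whose Allele is the literal string "deletion".
entries = [{"Allele": "deletion", "Consequence": "inframe_deletion.splice_region_variant"}]

print(entries_for_alt(entries, "-"))                                       # no match on the alt allele
print(supported_consequences(entries[0]["Consequence"]))                   # "&" only: nothing recognised
print(supported_consequences(entries[0]["Consequence"], delimiters="&."))  # both delimiters accepted
```

Accepting both delimiters is one possible way to tolerate the new format; the fix that was actually merged may differ.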

I'm not sure if these formatting changes are because of the version of VEP being used or if it's because of how VEP was run. Either way, I'm happy to add support for this new format. Do you happen to have any examples for an inframe_insertion? All the examples I'm seeing are for deletions and I would like to make sure we parse all types of variants correctly in the new format.

Unfortunately, I will leave on vacation for two weeks on Friday. It's unlikely that a new release will be ready by then. I will work on a pull request tomorrow and am happy to give you instructions on how to install pVACtools directly from a GitHub branch so that you can try out the bugfix until it is released into production.

immuneAI commented 6 years ago

Here is what I have done so far: annotated with VEP 91 and filtered the variants with filter_vep.pl --filter "consequence is frameshift_variant or consequence is inframe_deletion or consequence is inframe_insertion"

Deletion_FramshiftVariant.vcf.gz

This one has deletions and frameshifts. Please check these entries.

pvacseq command:

pvacseq run --iedb-install-directory /software/IEDB_MHC/IEDB_MHC-20170201 -b 500 -c 1 -t -e 9 -a sample_name --netmhc-stab --net-chop-method cterm --fasta-size=600 input.vcf 10999999 HLA-A*01:01 NetMHCpan/pVAC_Output/Test_T

Executing MHC Class I predictions
Converting .vcf to TSV
Completed
The TSV file is empty. Please check that the input VCF contains missense, inframe indel, or frameshift mutations.

susannasiebert commented 6 years ago

Ok, let me know if you encounter any cases like this for an inframe_insertion. For now I will add support for this new format for inframe_deletion and frameshift_variant.

susannasiebert commented 6 years ago

For the Indel_filtered.vcf.gz file I'm also running into the problem where the frameshift variants do not actually have any contents for the DownstreamProtein sequence field. The other variants don't have any contents for the Amino_acids field. This causes downstream errors. You might want to try to re-annotate it with the same workflow you used to create Deletion_FramshiftVariant.vcf.gz. That file doesn't have the same problems. I'm still working on creating a PR to fix some of the issues you are seeing and add more warning messages when these variants are skipped because of problems with the DownstreamProtein and Amino_acids fields.
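A minimal sketch of the kind of check that could produce such warnings is shown below. It assumes a CSQ entry has already been parsed into a dictionary keyed by field name; the function name `skip_reason`, the message wording, and the example entries are hypothetical and are not taken from the pull request.

```python
def skip_reason(entry):
    """Return why a parsed CSQ entry cannot be processed, or None if it looks usable."""
    consequences = entry.get("Consequence", "").split("&")
    if "frameshift_variant" in consequences:
        # Frameshifts need the downstream protein sequence from the Downstream plugin.
        if not entry.get("DownstreamProtein"):
            return "frameshift variant without a DownstreamProtein sequence"
    elif not entry.get("Amino_acids"):
        # Other supported variants need the amino acid change.
        return "variant without an Amino_acids value"
    return None

# Hypothetical entries, for illustration only.
examples = [
    {"Consequence": "frameshift_variant", "DownstreamProtein": "", "Amino_acids": "K/X"},
    {"Consequence": "inframe_deletion", "DownstreamProtein": "", "Amino_acids": ""},
    {"Consequence": "inframe_deletion", "DownstreamProtein": "", "Amino_acids": "KE/K"},
]
for entry in examples:
    reason = skip_reason(entry)
    if reason is not None:
        print("Warning: skipping " + reason)
```

Reporting the reason instead of silently dropping the entry makes an empty TSV easier to diagnose.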

susannasiebert commented 6 years ago

I made a pull request to address some of these problems you are seeing (#129). You can install pVACtools from the branch by running pip install -e git+git://github.com/griffithlab/pVACtools@vep_formats#egg=pvactool. You might need to uninstall your current version of pVACtools first (pip uninstall pvactools) to avoid conflicts.

susannasiebert commented 6 years ago

I just made a new release (version 1.0.6) that incorporates this hotfix.