Number of variants used from the VEP annotated VCF file

ahwanpandey commented 5 years ago

According to the following statement:

pVACseq makes predictions for all transcripts of a variant that were annotated as missense_variant, inframe_insertion, inframe_deletion, inframe protein_altering_variant, or frameshift_variant by VEP as long as the transcript was not also annotated as start_lost. In addition, pVACseq only includes variants that were called as homozygous or heterozygous variant. Variants that were not called in the sample specified are skipped (determined by examining the GT genotype field in the VCF).

How do I find the actual number of variants that pVACseq is using downstream for the prediction?

susannasiebert commented 5 years ago

All variants that are supported by pVACseq will be written to an intermediate .tsv file located in the MHC_Class_I and MHC_Class_II directories. Please note that each variant may have multiple entries, one for each transcript. For more information on the output files pVACseq creates, please see our documentation.
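Because the intermediate file has one row per transcript, counting rows overstates the number of variants. A minimal Python sketch of counting unique variants follows. The column names (`chromosome_name`, `start`, `stop`, `reference`, `variant`) and the example rows are assumptions for illustration; check them against the header of your own .tsv file before relying on the result.

```python
import csv
import io

def count_variants(tsv_handle,
                   key_columns=("chromosome_name", "start", "stop",
                                "reference", "variant")):
    """Return (row_count, unique_variant_count) for a tab-separated file.

    key_columns are ASSUMED column names that identify a variant;
    adjust them to match the header of the actual intermediate file.
    """
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    rows = 0
    unique = set()
    for row in reader:
        rows += 1
        # Rows that differ only by transcript collapse to one key.
        unique.add(tuple(row[c] for c in key_columns))
    return rows, len(unique)

# Hypothetical data: two transcripts for one variant, plus a second variant.
example = io.StringIO(
    "chromosome_name\tstart\tstop\treference\tvariant\ttranscript_name\n"
    "chr1\t100\t101\tA\tT\tENST0001\n"
    "chr1\t100\t101\tA\tT\tENST0002\n"
    "chr2\t500\t501\tG\tC\tENST0003\n"
)
print(count_variants(example))
```

For a real run, pass an open file handle instead of the in-memory example, e.g. `with open(path, newline="") as fh: count_variants(fh)`.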

ahwanpandey commented 5 years ago

Thanks @susannasiebert !

I have a few more questions for you. I am new to the neoantigen prediction landscape so hopefully my questions don't sound too ignorant.

1) I am running HLA-VBSeq to HLA-type the patient data. It seems HLA-VBSeq only outputs HLA-A, HLA-B, HLA-C, DQA1, DQB1 and DRB1 types. Will I be missing potential neoantigens if I am not including, say, HLA-E, HLA-F, DPA1, DPB1, etc.? I am also thinking of including HLAminer results, since it seems to support more genes.

2) HLA-VBSeq can output up to 8-digit resolution HLA types. So far, for my test, I am feeding pVACseq 4-digit resolution HLA types. I noticed that if I go up to 6 digits, a lot of them are incompatible. Is it safe to just stick with 4 digits to get a meaningful result?

3) Does it suffice to annotate the variants with the VEP flags "--coding_only" and "--no_intergenic", and then feed pVACseq only the variants that have a CSQ annotation? I ask because the RNA count addition step is faster if I can give it this smaller list of variants supported by pVACseq rather than all the somatic variants found. Is there any harm in doing this?

susannasiebert commented 5 years ago

1) I hope that one of our bioinformaticians can chime in, but I don't believe we run HLA-E, HLA-F, or DPA in our immunotherapy pipeline. Most of the algorithms don't support them anyway, so you're probably OK leaving them off.

2) Yes, 4 digits is sufficient, since most, if not all, prediction algorithms in pVACseq only go to that resolution anyway.

3) We haven't tried running our annotation with those flags, but that should work. Alternatively, you can grep your VCF for entries with missense_variant, inframe_insertion, inframe_deletion, protein_altering_variant, or frameshift_variant. Those are all of the consequence types pVACseq supports.
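The grep suggestion in point 3 can also be written as a small Python sketch that counts variants with at least one transcript carrying a supported consequence and no start_lost annotation. It assumes the VCF has a VEP `CSQ` INFO field whose header description lists the sub-fields after `Format:`; the example lines are made up, and the genotype (GT) check mentioned earlier in this thread is not applied here.

```python
import re

# Consequence types named in this thread as supported by pVACseq.
SUPPORTED = {
    "missense_variant", "inframe_insertion", "inframe_deletion",
    "protein_altering_variant", "frameshift_variant",
}

def consequence_index(csq_header_line):
    """Locate the 'Consequence' sub-field in the VEP CSQ header line."""
    match = re.search(r'Format: ([^">]+)', csq_header_line)
    if match is None:
        raise ValueError("CSQ header has no 'Format:' description")
    return match.group(1).strip().split("|").index("Consequence")

def count_supported_variants(lines):
    """Count VCF records with at least one supported transcript annotation."""
    idx = None
    count = 0
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##INFO=<ID=CSQ"):
            idx = consequence_index(line)
            continue
        if not line or line.startswith("#"):
            continue
        if idx is None:
            raise ValueError("no CSQ header seen before the first record")
        info = line.split("\t")[7]
        csq = next((f[4:] for f in info.split(";") if f.startswith("CSQ=")), None)
        if csq is None:
            continue
        for entry in csq.split(","):          # one entry per transcript
            terms = set(entry.split("|")[idx].split("&"))
            if terms & SUPPORTED and "start_lost" not in terms:
                count += 1
                break                          # count each record once
    return count

# Hypothetical, heavily simplified VCF content.
example_vcf = [
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|Feature">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tT\t.\tPASS\tCSQ=T|missense_variant|ENST0001,T|intron_variant|ENST0002",
    "chr1\t200\t.\tG\tC\t.\tPASS\tCSQ=C|synonymous_variant|ENST0003",
    "chr2\t300\t.\tATG\tA\t.\tPASS\tCSQ=A|start_lost&frameshift_variant|ENST0004",
]
print(count_supported_variants(example_vcf))
```

The intermediate .tsv file written by pVACseq remains the authoritative record of what was used; this sketch is only a pre-filter estimate.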

ahwanpandey commented 5 years ago

Hey @susannasiebert

I actually figured out that HLA-VBSeq supports a lot more genes, it just doesn't output them by default. I am planning on including all the human HLA genes mentioned here. https://ghr.nlm.nih.gov/primer/genefamily/hla Does this seem reasonable to you? Is there a list of HLA types you could share that you use in your pipelines for human HLA sites?
Sounds good!
Makes sense!

Thank you.

susannasiebert commented 5 years ago

I’m not exactly sure what you’re asking. It doesn’t really matter which HLA types can be predicted by an HLA typing tool or to what resolution if those HLA types aren’t supported by the prediction algorithms pVACseq supports. That list is available using the pvacseq valid_alleles command.
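As a rough illustration of that point, the Python sketch below trims typed alleles to two fields (4-digit resolution) and splits them by whether they appear in a list of supported alleles. The `valid` set here is a made-up stand-in; in practice it would be filled from the output of the pvacseq valid_alleles command, and the allele naming format should be matched to what that command prints.

```python
import re

def to_two_field(allele):
    """Trim an allele such as 'HLA-A*02:01:01:01' to 'HLA-A*02:01'."""
    match = re.match(r"^([A-Za-z0-9-]+\*\d+:\d+)", allele)
    if match is None:
        raise ValueError(f"unrecognised allele format: {allele!r}")
    return match.group(1)

def split_supported(alleles, valid):
    """Return (supported, unsupported) lists of two-field alleles."""
    supported, unsupported = [], []
    for allele in alleles:
        short = to_two_field(allele)
        (supported if short in valid else unsupported).append(short)
    return supported, unsupported

# Hypothetical inputs: 'valid' stands in for the real supported-allele list.
valid = {"HLA-A*02:01", "DRB1*11:01"}
typed = ["HLA-A*02:01:01:01", "HLA-E*01:01:01", "DRB1*11:01:01"]
print(split_supported(typed, valid))
```

Alleles that land in the unsupported list would not contribute predictions regardless of how confidently the typing tool called them.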

susannasiebert commented 5 years ago

I think I've answered your question so I will close this issue but feel free to reopen or make a new issue for any additional questions/problems you might have.

griffithlab / pVACtools

Number of variants used from the VEP annotated VCF file #370