griffithlab / pVACtools

http://www.pvactools.org
BSD 3-Clause Clear License

help with pVACseq customization #1101

Closed javierAPC closed 6 months ago

javierAPC commented 6 months ago

Hi, I'm working on a project to extract features from PAAD patients from the ICGC database for neoantigen immunogenicity prediction in ML. This project follows the steps taken by the model creators. I'm currently at this step of the process: "High confidence somatic variants affecting protein-coding genes, following GENCODE v38 annotation, were used to generate tumor-specific mutations (25mers) and class-I neo-peptides (8-12mers). Neo-peptides that also match a WT sequence are discarded."

They don't use pVACtools, but I plan to use it to:

  1. Generate the mers from the VCFs.
  2. Obtain some of the features (data from NetMHCpan, NetStabpan, and NetChop).

I'm using this command:

pvacseq run APGI-AU_DO34584_hc_gatk-mutect2_ann_fil.vcf.gz SA410803 "alleles" NetMHCpan output_dir -p APGI-AU_DO34584_hc_gatk-mutect2_ann_fil.vcf.gz -e 8,9,10,11,12 --normal-sample-name SA410795 --iedb-install-directory /opt/iedb -t 9 -d 12

I noticed that this command doesn't have a --flanking_sequence_length argument to change the default of 10. I'm also not interested in the filter step, but the stability and cleavage data are only added after it. So my question is: which parts of the code do I need to change in order to establish my new flanking_sequence_length and disable the hard filter?

I know that I can just use the Optional Downstream Analysis Tools, but I would like to know if there's a more direct way to get what I want.

susannasiebert commented 6 months ago

Thank you for your interest in pVACtools. I'm happy to provide guidance for your questions.

> Establish my new flanking_sequence_length

When running pVACseq, we do not provide an option to specify a flanking sequence length. Instead, the tool creates a temporary set of fasta files whose flanking length depends on each selected epitope length: for 8mers we create a temporary fasta file with a flanking length of 7, and so on for each epitope length. This keeps candidate windows that would not contain the mutation out of the prediction calls.

We also create a "master" fasta file, for the user's reference, with a flanking length of your max epitope length - 1. In your case, this would have a flanking length of 11 for a total peptide length of 23, because your largest selected epitope length is 12. This fasta can be found under the MHC_Class_I directory and should be named SA410803.fasta.
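The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the numbers quoted in this comment, not pVACtools code: the function names are hypothetical, and it assumes a single mutated residue in the middle of the peptide (a missense variant).

```python
def flank_for_epitope(epitope_length: int) -> int:
    """Flanking length at which every window of this size still overlaps the mutation."""
    return epitope_length - 1


def peptide_length(flank: int) -> int:
    """Total peptide length: flank on each side plus one mutated residue (assumption)."""
    return 2 * flank + 1


# Epitope lengths from the -e 8,9,10,11,12 option in the question.
epitope_lengths = [8, 9, 10, 11, 12]

# One temporary fasta per epitope length, each with its own flank.
per_length_flanks = {k: flank_for_epitope(k) for k in epitope_lengths}

# The "master" fasta uses the largest epitope length.
master_flank = flank_for_epitope(max(epitope_lengths))

print(per_length_flanks)
print(master_flank, peptide_length(master_flank))
print(peptide_length(12))  # flank needed for the 25mers in the question
```

Under these assumptions the master fasta has a flank of 11 and 23-residue peptides, while a flank of 12 gives the 25mers asked about.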

If you require a fasta file with a flanking length of 12, for a total peptide length of 25, you can run the standalone pvacseq generate_protein_fasta command with your desired flanking length. Please ensure that you provide the same parameters to this command as you used for your pVACseq run, where appropriate.
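As a rough sketch, the command line for that standalone step could be assembled like this. The positional order (input VCF, flanking length, output file) and the -s and -p options are assumptions here and should be checked against pvacseq generate_protein_fasta --help for the installed version; the output file name is hypothetical. The snippet only builds and prints the command, it does not run it.

```python
import shlex


def generate_protein_fasta_cmd(input_vcf, flank, output_fasta,
                               sample_name=None, phased_vcf=None,
                               mutant_only=False):
    """Build an argv list for pvacseq generate_protein_fasta (argument layout assumed)."""
    cmd = ["pvacseq", "generate_protein_fasta"]
    if sample_name:
        cmd += ["-s", sample_name]   # assumed: sample name option
    if phased_vcf:
        cmd += ["-p", phased_vcf]    # assumed: same meaning as -p in pvacseq run
    if mutant_only:
        cmd.append("--mutant-only")
    # Assumed positional order: input VCF, flanking length, output fasta.
    cmd += [input_vcf, str(flank), output_fasta]
    return cmd


vcf = "APGI-AU_DO34584_hc_gatk-mutect2_ann_fil.vcf.gz"
cmd = generate_protein_fasta_cmd(
    vcf, 12, "SA410803_25mer.fasta",  # hypothetical output name
    sample_name="SA410803", phased_vcf=vcf, mutant_only=True,
)
print(shlex.join(cmd))
```

Once the arguments are verified, the list can be passed to subprocess.run(cmd, check=True).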

> disable the hard filter

We do not provide an option to disable the hard filters in pVACseq. After the all_epitopes.tsv file is created, you can abort your run and use that file as input to the standalone pvacseq net_chop and pvacseq netmhc_stab commands.

Alternatively, you can use the pvacseq generate_protein_fasta command to create your desired peptide fasta and then run NetMHCpan, NetStabpan, and NetChop standalone (outside of pVACtools) on the resulting fasta. The advantage of using pvacseq run is that the all_epitopes.tsv file contains a lot of meta-information (positions, genes, transcripts, TSL, biotype, etc.), as well as coverage and expression information if the VCF file has those annotations. The file also includes the matched wildtype epitope for each neoantigen candidate and its predicted binding affinity. If none of that data is desired or necessary, then simply using pvacseq generate_protein_fasta would be the way to go. You can also use the --mutant-only option of this command to include only mutated sequences in the fasta file.
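If you go the standalone route, the step quoted in the question (8-12mers, discarding any neo-peptide that also matches a WT sequence) could be sketched as below. This is a minimal illustration, not part of pVACtools: the function name is hypothetical, the sequences are toy data, and "matches a WT sequence" is interpreted here as an exact substring match against the matched wildtype peptide only, not the whole proteome.

```python
def neo_peptides(mutant: str, wildtype: str, lengths=range(8, 13)) -> list[str]:
    """Return unique k-mers of `mutant` that do not occur anywhere in `wildtype`."""
    kept = []
    seen = set()
    for k in lengths:
        for start in range(len(mutant) - k + 1):
            kmer = mutant[start:start + k]
            # Windows that miss the mutation are identical to WT and get dropped here.
            if kmer in wildtype or kmer in seen:
                continue
            seen.add(kmer)
            kept.append(kmer)
    return kept


# Toy 25mer with one substitution at the central position (index 12, P -> G).
wildtype = "ACDEFGHIKLMNPQRSTVWYACDEF"
mutant = wildtype[:12] + "G" + wildtype[13:]

peptides = neo_peptides(mutant, wildtype)
print(len(peptides))
print(peptides[:3])
```

For a real analysis the wildtype check would more likely run against the full reference proteome, which this sketch does not attempt.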

susannasiebert commented 6 months ago

Closing this issue due to inactivity.