griffithlab / pVACtools

http://www.pvactools.org
BSD 3-Clause Clear License

pVacFuse output filtered.tsv column description #1014

Closed min-codes closed 7 months ago

min-codes commented 1 year ago

Hello there, I would like to seek clarification for these columns in the filtered.tsv output file:

  1. Start / Stop column: Does "start" correspond to the genomic coordinates of the 5' partner of the fusion transcript (i.e. before the breakpoint), and "stop" to those of the 3' partner? For example:

8 / 8 143579779 / 143574785 143597744 / 143577943 ENST00000532400_ENST00000435154 EEF1D_NAPRT

  2. Best IC50 Score: Is the unit nM?
  3. Predicted Stability / Half Life / Stability Rank: What is the unit, and does a larger value mean higher stability? Is there documentation about how they are calculated?
  4. Reference Match: Why do some peptides have both TRUE and FALSE values? In the filtered.tsv file I see the same peptide showing up many times, where the only differences between the rows are the transcript ID and the reference match. The difference in transcript ID makes sense to me, but I'm unsure about the "reference match" column.

Also, I'd like to suggest removing spaces from the output headers and content to ease file handling 💯

Thank you!
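Spaces in the header can also be handled when the file is read. The following is a minimal sketch, not part of pVACtools: `normalize_header` and `read_rows` are hypothetical helper names, the `Gene Name` column and the sample values are assumptions made up for illustration, and only `Best IC50 Score` is taken from the question above.

```python
import csv
import io
import re


def normalize_header(name):
    """Turn a header such as 'Best IC50 Score' into 'best_ic50_score'."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def read_rows(handle):
    """Yield each TSV row as a dict keyed by normalized column names."""
    reader = csv.reader(handle, delimiter="\t")
    header = [normalize_header(col) for col in next(reader)]
    for fields in reader:
        yield dict(zip(header, fields))


# Made-up two-column sample standing in for a real filtered.tsv.
sample = "Gene Name\tBest IC50 Score\nEEF1D_NAPRT\t123.4\n"
rows = list(read_rows(io.StringIO(sample)))
print(rows[0]["best_ic50_score"])
```

For a real run, `read_rows(open(path, newline=""))` would replace the `io.StringIO` sample, with `path` pointing at the filtered.tsv file.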

susannasiebert commented 1 year ago

Thank you for your interest in pVACfuse. Some of your questions can be answered in our documentation, particularly the section on output files in pVACfuse.

(1) The values to the left of the / are the 5p coordinates and the ones to the right are the 3p coordinates.

(2) Correct, the unit is nM.

(3) These values are directly reported from NetMHCstabpan so please see their documentation for details. You generally want a high Predicted Stability and a high Half Life. The Stability Rank is the percentile of the Predicted Stability, and NetMHCstabpan considers everything within the 0.5 percentile a good binder.

(4) This column simply reports whether or not a reference match was found. You can check the .reference_matches file for details on where in the reference proteome a match was found. For each epitope we query for a larger region around the mutation of interest, since immunotherapies will usually include a larger region around a neoantigen candidate. The exact query sequence can be found in the reference_matches file. The situation you described can happen since different transcripts will code for (slightly) different peptide sequences. The difference could be just outside of the neoantigen candidate and make the reference proteome query sequence slightly different between two transcripts. Without seeing your output files, though, I can't say for certain whether the behavior you are seeing is a bug or not. If you would like to send me the output files for this particular run, I would be happy to take a look to confirm.
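The 5p / 3p layout from answer (1) can be unpacked with a few lines of Python. This is only a sketch: `split_fusion_field` is a hypothetical helper, and it assumes that the three slash-separated fields in the example row from the question are, in order, the chromosome, start, and stop values.

```python
def split_fusion_field(value):
    """Split a '5p / 3p' field into its two halves, stripped of whitespace."""
    five_prime, three_prime = (part.strip() for part in value.split("/"))
    return five_prime, three_prime


# Values copied from the example row in the question; the assignment of
# each field to chromosome, start, and stop is an assumption.
chromosome = "8 / 8"
start = "143579779 / 143574785"
stop = "143597744 / 143577943"

chrom_5p, chrom_3p = split_fusion_field(chromosome)
start_5p, start_3p = (int(x) for x in split_fusion_field(start))
stop_5p, stop_3p = (int(x) for x in split_fusion_field(stop))

print(f"5p partner: chr{chrom_5p}:{start_5p}-{stop_5p}")
print(f"3p partner: chr{chrom_3p}:{start_3p}-{stop_3p}")
```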

min-codes commented 1 year ago

Thank you for your reply @susannasiebert. How about these columns? I am trying to determine if "True" or "False" is good/bad. The first one is pretty self-explanatory, but how about the other 3 (i.e. is True good or bad)? I tried looking into the literature but couldn't find much information. Are these outputs from a specific tool?

Regarding (4), I understand that the algorithm only returns "TRUE" for an exact reference match. For verification purposes, I did a manual search (i.e. blastp against the nr database) for one of the peptides that has "FALSE" in the reference match column. This 11-mer peptide (ASLPSSWDYRK) has a 9-residue match with cytoskeleton associated protein 5 and a couple of other proteins (see screenshot). While this is a more biological question, do you think a 9 out of 11 residue match is high enough for it to be considered a bad candidate?

[Image: screenshot of the blastp results]

susannasiebert commented 1 year ago

These are outputs from vaxrank. For the most part, how to interpret them depends heavily on your specific vaccine manufacturing company. E.g., one company we use categorically excludes cysteines. If you have more specific guidance from your manufacturer, you can use the problematic amino acid feature to more specifically exclude/mark such peptides.
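As an illustration of what marking such peptides amounts to, here is a minimal sketch. It is not the pVACtools feature itself and does not reproduce its option syntax: `find_problematic_residues` is a hypothetical helper, the default of flagging only cysteine follows the example above, and `ACDKC` is a made-up peptide.

```python
def find_problematic_residues(peptide, problematic=frozenset("C")):
    """Return 1-based (position, residue) pairs for every flagged residue."""
    return [
        (position, residue)
        for position, residue in enumerate(peptide.upper(), start=1)
        if residue in problematic
    ]


# The 11-mer from this thread contains no cysteine, so nothing is flagged.
print(find_problematic_residues("ASLPSSWDYRK"))

# Made-up peptide with cysteines at positions 2 and 5.
print(find_problematic_residues("ACDKC"))
```

A manufacturer-specific rule set would be passed as the second argument, for example `frozenset("CM")`.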

Regarding your reference match query, I'm seeing the same thing. One thing to note is that we use the refseq_select database by default, but that one returns one of the two matches you are seeing, so I will investigate why pVACseq didn't catch this one.

We generally exclude any reference match that is at least 8 amino acids long. That is widely considered the minimum epitope length for the class I MHC complex, so such peptides could reasonably bind.
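The 8-residue rule can be illustrated with a plain substring search. This is a sketch under stated assumptions, not how pVACtools finds matches (it queries BLASTp): `longest_shared_stretch` and `MIN_MATCH_LENGTH` are hypothetical names, the reference fragment is invented for illustration and is not a real protein sequence, and only contiguous exact matches are considered.

```python
MIN_MATCH_LENGTH = 8  # threshold taken from the rule described above


def longest_shared_stretch(peptide, reference):
    """Return the longest contiguous substring of `peptide` found in `reference`."""
    for length in range(len(peptide), 0, -1):
        for start in range(len(peptide) - length + 1):
            candidate = peptide[start:start + length]
            if candidate in reference:
                return candidate
    return ""


peptide = "ASLPSSWDYRK"
# Hypothetical reference fragment, invented for illustration only.
reference = "MKTLLPSSWDYRKGEN"

shared = longest_shared_stretch(peptide, reference)
exclude = len(shared) >= MIN_MATCH_LENGTH
print(shared, len(shared), exclude)
```

With this invented fragment the shared stretch is 9 residues long, so the peptide would be excluded under the rule.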

susannasiebert commented 8 months ago

I apologize for the delay in investigating this issue further. I believe that part of the issue you are seeing is the word size parameter we are using when querying BLASTp. From my reading, it looks like the word size needs to be no more than half the length of the query sequence in order to not accidentally exclude any matches. Our chosen word size (7) seems to result in BLASTp not returning all matches for this particular sequence (ASLPSSWDYRK). After adjusting the word size to 5, I get the same results between online BLASTp and the API for both the refseq_select_prot and refseq_protein databases supported by pVACtools. I will make a bugfix PR to fix the word size we use.
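The rule of thumb above (word size no more than half the query length) can be written down as a small helper. This is only a sketch of the arithmetic and not necessarily how the bugfix is implemented: `safe_word_size` is a hypothetical name, and the lower bound of 2 is an assumption.

```python
DEFAULT_WORD_SIZE = 7  # the word size described above as too large for this query


def safe_word_size(query, default=DEFAULT_WORD_SIZE, minimum=2):
    """Cap the word size at half the query length (rounded down).

    The lower bound `minimum` is an assumption made for this sketch.
    """
    half = len(query) // 2
    return max(minimum, min(default, half))


# The 11-residue sequence from this thread: 11 // 2 = 5.
print(safe_word_size("ASLPSSWDYRK"))
```

For the 11-mer this yields 5, which agrees with the adjusted value reported above. Longer queries keep the default of 7.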

As a side note, neither refseq_select nor refseq_protein seems to return the two specific hits that you noted for the nr database. However, this might be expected, since the databases include different sequences.

susannasiebert commented 7 months ago

The issue with the BLASTp word size has been fixed in version 4.0.7. Please give it a try and let me know if your results now look as expected.