griffithlab / pVACtools

http://www.pvactools.org
BSD 3-Clause Clear License

Improve selection of best peptide in the aggregate report #865

Closed (susannasiebert closed 1 year ago)

susannasiebert commented 1 year ago

Right now, for each variant, we select the best peptide like so: Sort all entries in the all_epitopes file for the variant by lowest MT IC50, lowest TSL, and highest WT IC50. Then select the first entry.

In some cases this might lead to a situation where there is an epitope candidate that binds only slightly worse but has a better TSL/biotype, and it is not selected.
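A minimal sketch of the current selection and the case described above. The field names (`transcript`, `mt_ic50`, `tsl`, `wt_ic50`) are hypothetical stand-ins for the all_epitopes columns, and the values are made up for illustration:

```python
def pick_best_current(entries):
    """Current rule: lowest MT IC50, then lowest TSL, then highest WT IC50."""
    ranked = sorted(
        entries,
        # Negate WT IC50 so that higher values sort first within ties.
        key=lambda e: (e["mt_ic50"], e["tsl"], -e["wt_ic50"]),
    )
    return ranked[0]


# Hypothetical entries for one variant: T1 binds marginally better,
# but T2 has the better transcript support level.
example_entries = [
    {"transcript": "T1", "mt_ic50": 45.0, "tsl": 3, "wt_ic50": 900.0},
    {"transcript": "T2", "mt_ic50": 50.0, "tsl": 1, "wt_ic50": 900.0},
]

best_current = pick_best_current(example_entries)
print(best_current["transcript"])
```

Because MT IC50 is the first sort key, TSL only breaks exact IC50 ties, so T1 is picked here even though T2 has stronger transcript support.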

We discussed the following algorithm to determine the best peptide. For a set of epitope candidates (all_epitopes file entries) of the same variant:

  1. Select all entries with Biotype "protein_coding". If there are none, go back to the original set of entries.
  2. From the set in step 1, further filter to select all entries with a TSL < maximum-transcript-support-level. If this results in no entries, go back to the set from step 1. (Note: the default maximum-transcript-support-level is 1.)
  3. Determine the lowest IC50 value from the set in step 2 and select all entries with that IC50.
  4. Determine the lowest TSL from the set in step 3 and select all entries with that TSL.
  5. If more than one entry remains, select the one with the longest transcript peptide sequence length.
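The five steps can be sketched roughly as follows. This is an illustration only, not the pVACtools implementation: the field names (`biotype`, `tsl`, `mt_ic50`, `protein_length`) are hypothetical, a missing TSL is represented as `None` and assumed to rank worst, and the cutoff is treated as inclusive (`<=`) on the assumption that a strict comparison against the default of 1 would never match anything.

```python
def tsl_rank(entry):
    # Assumption: entries without a TSL (None) rank worse than any numeric TSL.
    return float("inf") if entry["tsl"] is None else entry["tsl"]


def pick_best_proposed(entries, max_tsl=1):
    if not entries:
        raise ValueError("no entries for this variant")

    # Step 1: prefer protein_coding transcripts; fall back to all entries.
    coding = [e for e in entries if e["biotype"] == "protein_coding"]
    pool = coding or list(entries)

    # Step 2: prefer entries within the TSL cutoff; fall back to step 1's set.
    supported = [e for e in pool if tsl_rank(e) <= max_tsl]
    pool = supported or pool

    # Step 3: keep only entries with the lowest MT IC50.
    best_ic50 = min(e["mt_ic50"] for e in pool)
    pool = [e for e in pool if e["mt_ic50"] == best_ic50]

    # Step 4: keep only entries with the lowest TSL.
    best_tsl = min(tsl_rank(e) for e in pool)
    pool = [e for e in pool if tsl_rank(e) == best_tsl]

    # Step 5: break remaining ties by the longest transcript protein sequence.
    return max(pool, key=lambda e: e["protein_length"])


# Hypothetical candidates for one variant.
candidates = [
    {"transcript": "A", "biotype": "protein_coding", "tsl": 3, "mt_ic50": 45.0, "protein_length": 300},
    {"transcript": "B", "biotype": "protein_coding", "tsl": 1, "mt_ic50": 50.0, "protein_length": 400},
    {"transcript": "C", "biotype": "nonsense_mediated_decay", "tsl": 1, "mt_ic50": 40.0, "protein_length": 120},
    {"transcript": "D", "biotype": "protein_coding", "tsl": 1, "mt_ic50": 50.0, "protein_length": 450},
]

best_proposed = pick_best_proposed(candidates)
print(best_proposed["transcript"])
```

In this made-up example, C is dropped in step 1 despite having the lowest IC50, A is dropped in step 2 because of its TSL, and the tie between B and D is resolved by length in step 5.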

Furthermore, the tiering logic will add a requirement for the transcript to be of Biotype "protein_coding" in addition to the TSL < maximum-transcript-support-level requirement we already added.

The transcript ID of the best peptide will be written into a new column and the best transcript set, best transcript, and best epitope will be highlighted in the transcript detail tables.

m-two commented 1 year ago

The TSL doesn't seem very useful for filtering. Many places have a list of standardized transcripts they prefer for annotation. CCDS and MANE Select might be more useful than TSL.

Also, if you use the aggregated file as input and it has the wrong annotation, then you have to go back to the VCF anyway.

m-two commented 1 year ago

In other cases two or even three isoforms may be expressed and multiple neoepitopes originating from one mutation would be desirable.

m-two commented 1 year ago

It also occurs that the Class I and Class II aggregated reports contain neoepitopes that can't be mapped to a single transcript isoform.

gschang commented 1 year ago

Mike has a good point. We discussed it at the meeting yesterday, and we agree that this issue isn't simple at all. However, this PR is a great start. Our goal here is to improve prioritization of transcript isoforms.

Using the "protein_coding" transcript biotype information (but not the gene biotype) would also work for us as a first step, though the Ensembl annotation database has several other biotypes for protein-coding transcripts. They are rare.

susannasiebert commented 1 year ago

From https://github.com/griffithlab/pVACtools/pull/870#issuecomment-1306756067 (credit @toddajohnson):

I was wondering about the impact of pVACtools's TSL filter last week, so I summarized genes in the Gencode build 38 (ENSEMBL 104) gtf for min. and max. TSL as well as whether transcripts were labeled with other annotations used by Ensembl to prioritize gene models, such as MANE, CCDS, etc. Of 19670 protein-coding genes, 1043 only had transcripts with tslNA. From reviewing those tslNA-only genes and Ensembl's TSL description, it looks to me like excluding tslNA would remove every single-exon gene. Of course, those may be small by size, but is that really the desired behavior? Also, excluding tslNA and setting a default of TSL<=1 (while including tx. with no TSL annotation) would remove about 3000 (15%) of protein-coding genes. It seems that adding the ability to annotate using other Ensembl flags like MANE, APPRIS, Ensembl canonical, etc. might be more robust.
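A per-gene TSL summary of this kind could be sketched as below. This is not the script used for the numbers above. It assumes GENCODE-style GTF attributes (`gene_id`, `gene_type`, `transcript_support_level`), treats a missing TSL attribute as "NA", and runs on made-up lines rather than a real annotation file:

```python
import re
from collections import defaultdict

# Matches quoted GTF attributes such as: gene_id "ENSG..."; (unquoted ones are skipped).
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def summarize_tsl(gtf_lines):
    """Map each protein-coding gene_id to the set of TSL values of its transcripts."""
    tsl_by_gene = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        if attrs.get("gene_type") != "protein_coding":
            continue
        raw = attrs.get("transcript_support_level", "NA").strip()
        # Keep only the leading token, in case the value carries a trailing note.
        level = raw.split()[0] if raw else "NA"
        tsl_by_gene[attrs["gene_id"]].add(level)
    return tsl_by_gene


def genes_with_only_na(tsl_by_gene):
    """Genes whose transcripts all lack a numeric TSL."""
    return sorted(g for g, levels in tsl_by_gene.items() if levels == {"NA"})


def make_line(feature, attributes):
    # Helper for building made-up GTF records for the example.
    return "\t".join(["chr1", "HAVANA", feature, "1", "100", ".", "+", ".", attributes])


example_gtf = [
    "##description: made-up example",
    make_line("gene", 'gene_id "G2"; gene_type "protein_coding";'),
    make_line("transcript", 'gene_id "G1"; gene_type "protein_coding"; transcript_support_level "1";'),
    make_line("transcript", 'gene_id "G1"; gene_type "protein_coding"; transcript_support_level "NA";'),
    make_line("transcript", 'gene_id "G2"; gene_type "protein_coding"; level 2; transcript_support_level "NA";'),
    make_line("transcript", 'gene_id "G3"; gene_type "lncRNA"; transcript_support_level "1";'),
]

tsl_summary = summarize_tsl(example_gtf)
na_only_genes = genes_with_only_na(tsl_summary)
print(na_only_genes)
```

The same loop could collect `tag` values (for example MANE or CCDS labels) per gene, but since `tag` repeats within a record, that would need a list per key rather than the plain `dict` used here.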