Closed by susannasiebert 1 year ago
TSL doesn't seem very useful for filtering. Many places have a list of standardized transcripts they prefer for annotation. CCDS and MANE Select might be more useful than TSL.
Also, if you use the aggregated file as input and it has the wrong annotation, then you have to go back to the VCF anyway.
In other cases two or even three isoforms may be expressed and multiple neoepitopes originating from one mutation would be desirable.
It also occurs that the Class I and Class II aggregated reports contain neoepitopes that can't be mapped to a single transcript isoform.
Mike has a good point. We discussed it at the meeting yesterday, and we agree that this issue isn't simple at all. However, this PR is a great start. Our goal here is to improve prioritization of transcript isoforms.
Filtering on the "protein_coding" transcript biotype (not the gene biotype) would also work for us as a first step, though the Ensembl annotation database has several other biotypes for protein-coding transcripts. They are rare.
From https://github.com/griffithlab/pVACtools/pull/870#issuecomment-1306756067 (credit @toddajohnson):
I was wondering about the impact of pVACtools's TSL filter last week, so I summarized genes in the Gencode build 38 (Ensembl 104) GTF by min. and max. TSL, as well as by whether their transcripts were labeled with other annotations used by Ensembl to prioritize gene models, such as MANE, CCDS, etc. Of 19670 protein-coding genes, 1043 only had transcripts with tslNA. From reviewing those tslNA-only genes and Ensembl's TSL description, it looks to me like excluding tslNA would remove every single-exon gene. Of course, those may be small in size, but is that really the desired behavior? Also, excluding tslNA and setting a default of TSL<=1 (while including transcripts with no TSL annotation) would remove about 3000 (15%) of protein-coding genes. It seems that adding the ability to annotate using other Ensembl flags like MANE, APPRIS, Ensembl canonical, etc. might be more robust.
Right now, for each variant, we select the best peptide like so: Sort all entries in the all_epitopes file for the variant by lowest MT IC50, lowest TSL, and highest WT IC50. Then select the first entry.
In some cases this might lead to a situation where a candidate with only slightly worse binding but a better TSL/biotype is passed over.
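The current selection rule can be sketched roughly as below. This is a minimal illustration, not the pVACtools implementation: the dictionary keys (`transcript`, `mt_ic50`, `tsl`, `wt_ic50`), the transcript IDs, and the function name `pick_best_current` are all hypothetical, the rows are made up, and TSL is assumed to be an integer for every entry.

```python
# Hypothetical all_epitopes entries for one variant (made-up values,
# illustrative field names; not the real column names of the file).
candidates = [
    {"transcript": "ENST_A", "mt_ic50": 48.0, "tsl": 5, "wt_ic50": 900.0},
    {"transcript": "ENST_B", "mt_ic50": 52.0, "tsl": 1, "wt_ic50": 1200.0},
    {"transcript": "ENST_C", "mt_ic50": 300.0, "tsl": 1, "wt_ic50": 2000.0},
]


def pick_best_current(entries):
    """Sort by lowest MT IC50, then lowest TSL, then highest WT IC50,
    and return the first entry."""
    ordered = sorted(
        entries,
        # Negating WT IC50 makes a higher value sort earlier.
        key=lambda e: (e["mt_ic50"], e["tsl"], -e["wt_ic50"]),
    )
    return ordered[0]


best = pick_best_current(candidates)
```

With these rows the rule picks `ENST_A` because MT IC50 is the primary sort key, even though `ENST_B` binds almost as well (52 vs. 48) and has a much better TSL (1 vs. 5).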
We discussed the following algorithm to determine the best peptide. For a set of epitope candidates (all_epitopes file entries) of the same variant:
Furthermore, the tiering logic will add a requirement for the transcript to have the biotype "protein_coding", in addition to the TSL <= maximum-transcript-support-level requirement we already added.
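The combined transcript requirement for tiering could look something like the sketch below. The function name, its parameters, and the `allow_missing_tsl` switch are hypothetical and not part of pVACtools. Two further assumptions: the TSL cutoff is treated as inclusive, and a missing TSL (tslNA) fails unless explicitly allowed, which is exactly the behavior questioned earlier in this thread.

```python
from typing import Optional


def passes_transcript_requirements(
    biotype: str,
    tsl: Optional[int],
    maximum_tsl: int,
    allow_missing_tsl: bool = False,
) -> bool:
    """Return True if a transcript meets both tiering requirements."""
    # Requirement 1: the transcript biotype (not the gene biotype)
    # must be "protein_coding".
    if biotype != "protein_coding":
        return False
    # Assumption: how a missing TSL is treated is a policy choice,
    # so it is exposed as a hypothetical switch here.
    if tsl is None:
        return allow_missing_tsl
    # Requirement 2: TSL must not exceed the cutoff (assumed inclusive).
    return tsl <= maximum_tsl
```

Keeping the missing-TSL decision as an explicit parameter makes it easy to see how many candidates, such as those from single-exon genes, would be dropped under each policy.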
The transcript ID of the best peptide will be written into a new column and the best transcript set, best transcript, and best epitope will be highlighted in the transcript detail tables.