Adding generated proteins to reference database is erroneous

nf-core / mhcquant

Identify and quantify MHC eluted peptides from mass spectrometry raw data

https://nf-co.re/mhcquant

MIT License


Adding generated proteins to reference database is erroneous #248

Closed. jonasscheid closed this 2 months ago

jonasscheid commented 1 year ago

Description of the bug

The process GENERATE_PROTEINS_FROM_VCF and the underlying script variants2fasta.py need a complete overhaul. A few issues I ran into:

At least one variant of the test data set produces additional protein entries, of which 4 are redundant. There should actually be only one mutated protein added based on the variant, since it is a missense mutation.
The whole mutated protein is added instead of only the mutated peptides, which could introduce falsely annotated neo-epitopes.
Parsing VEP files results in multiple issues. E.g. the protein position is expected to be an integer, but with VEP annotations v4.1 protein_position is provided as a position/length string.
Somehow VEP-annotated variants and their corresponding transcript IDs are not found in Ensembl, even though they are present on the website.
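
For the protein position issue, a more tolerant parser could look roughly like the sketch below. It assumes the field may arrive as a bare integer, as `position/length`, or as a range, optionally with `?` for an unknown boundary. The function name `parse_protein_position` is hypothetical and not part of variants2fasta.py.

```python
import re
from typing import Optional, Tuple

# Assumed input shapes: "132", "132/393", "10-12", "10-12/200", "?-12/200".
_POSITION = re.compile(r"^(\?|\d+)(?:-(\?|\d+))?(?:/(\d+))?$")


def parse_protein_position(value: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) as 1-based inclusive positions, or None if unparseable."""
    match = _POSITION.match(value.strip())
    if match is None:
        return None
    start, end, _length = match.groups()
    if end is None:
        # Single position: start and end coincide.
        end = start
    if start == "?" and end == "?":
        return None
    # If only one boundary is known, fall back to it for both.
    if start == "?":
        start = end
    if end == "?":
        end = start
    return int(start), int(end)
```

Returning `None` instead of raising lets the caller decide whether to skip or log a variant whose position cannot be read.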

Leon-Bichmann commented 1 year ago

Hi Jonas, I just saw that you opened this issue, and you are absolutely right: that script is not a great one and could be improved a lot.

About the first two points though, I think they are not so important, and you might want to consider the following:
-> The redundant entries should not make much of a difference in the search and results, and you might want to keep track of all possible transcripts covering this variant.
-> Entering only the mutated peptides is not as easy as one would think, given the length variants and the question of including or excluding overlaps. Comet should handle this better internally.
-> Finally, there was a merge between theoretical/predicted neoepitopes and MS search identifications somewhere downstream in the pipeline, in the script "resolve_neoepitopes.py". Back when we wrote this part, we thought that would be the best way to deal with it.

If you have a better approach for this feel free to improve upon this part.
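
To make the length-variant and overlap problem concrete, here is a minimal sketch of one possible approach: instead of the whole mutated protein, emit only a window around a single substituted residue. The helper `mutation_window` and the default `max_len` of 12 are assumptions for illustration, not what the pipeline does.

```python
def mutation_window(protein: str, pos: int, max_len: int = 12) -> str:
    """Return the stretch of `protein` spanning every peptide of length
    `max_len` that covers the substituted residue at 0-based index `pos`."""
    if not 0 <= pos < len(protein):
        raise ValueError("pos is outside the protein sequence")
    # A peptide of length max_len covering pos starts at most max_len - 1
    # residues upstream and ends at most max_len - 1 residues downstream.
    start = max(0, pos - (max_len - 1))
    end = min(len(protein), pos + max_len)
    return protein[start:end]
```

Note that peptides shorter than `max_len` taken from this window do not necessarily cover the mutated residue, so wildtype peptides can still be matched against the mutated entry. This is the overlap issue mentioned above, and it would still need to be resolved downstream.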

jonasscheid commented 1 year ago

Hey Leon!

Thanks for the input!

Regarding point 1: Suppose you have a mutated sequence SYFPEITHISIFPEITHI, where the substring SI(mutated here)FPEITHI is altered. Then you would let Comet search for the wildtype subsequence SYFPEITHIS twice and potentially annotate a hit with a mutated accession, even though it's wildtype. I would guess that is somehow resolved downstream with these python scripts. Maybe it would make sense to immediately pass a mutated accession through the pipeline and filter for them at the end or something. But I guess I need to understand first what's happening in these python scripts 😄 .
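
The ambiguity in this example can be sketched as follows. The wildtype sequence used in the usage note is a hypothetical counterpart invented for illustration (the thread only gives the mutated sequence), and the peptide length range of 8 to 12 as well as the function names are assumptions.

```python
from typing import Set


def kmers(sequence: str, min_len: int = 8, max_len: int = 12) -> Set[str]:
    """All substrings of `sequence` with length between min_len and max_len."""
    return {
        sequence[i:i + k]
        for k in range(min_len, max_len + 1)
        for i in range(len(sequence) - k + 1)
    }


def mutation_specific_peptides(wildtype: str, mutated: str,
                               min_len: int = 8, max_len: int = 12) -> Set[str]:
    """Peptides that occur in the mutated protein but nowhere in the wildtype."""
    return kmers(mutated, min_len, max_len) - kmers(wildtype, min_len, max_len)
```

With the assumed wildtype `SYFPEITHISYFPEITHI` and the mutated `SYFPEITHISIFPEITHI`, the peptide `SYFPEITHIS` is shared by both sequences and is therefore excluded, while `SIFPEITHI` is kept as mutation-specific. Only hits from the latter set would justify a mutated accession.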

Regarding point 2: I agree; however, I think we can now benefit from the implementation in nf-core/epitopeprediction. There, more complex variants are also handled, e.g. transcripts with multiple mutation sites.

Leon-Bichmann commented 1 year ago

Ok sure, and it sounds good to replace parts of it with epitopeprediction, especially for point 2. I just wanted to add quickly, without going too much into detail on the SYFPEITHI sequence, that the way it was done for point 1 was intentional and not a bug, because we couldn't figure out a better way back then.