Closed jonasscheid closed 2 months ago
Hi Jonas, I just saw that you opened this issue, and you are absolutely right: that script is not a great one and could be improved a lot.
About the first two points though, I think they are not so important, and you might want to consider the following:
-> The redundant entries should not make much of a difference in the search and results, and you might want to keep track of all possible transcripts covering this variant.
-> Entering only the mutated peptides is not as easy as one would think, given the length variants and the question of including or excluding overlaps. Comet should handle this better internally.
-> Finally, there was a merge between theoretical/predicted neoepitopes and MS search identifications somewhere downstream in the pipeline script "resolve_neoepitopes.py". Back when we wrote this part, we thought that would be the best way to deal with it.
If you have a better approach for this, feel free to improve upon this part.
Hey Leon!
Thanks for the input!
Regarding point 1:
Suppose you have a mutated sequence SYFPEITHISIFPEITHI, where the substring SI(mutated here)FPEITHI is altered. Then you would let Comet search for the wild-type subsequence SYFPEITHIS twice and potentially annotate a hit with a mutated accession, even though it's wild type. I would guess that this is somehow resolved downstream by these Python scripts. Maybe it would make sense to pass a mutated accession through the pipeline from the start and filter for it at the end, or something like that. But I guess I first need to understand what's happening in these Python scripts 😄 .
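To make the overlap concrete, here is a minimal sketch of the situation described above. It is not what variants2fasta.py does. The wild-type sequence is a hypothetical one (assuming a single Y-to-I substitution), and the helper name `peptides` and the 8 to 11 length range are assumptions for illustration only.

```python
def peptides(seq, min_len=8, max_len=11):
    """Return the set of all substrings of seq with length min_len..max_len."""
    return {
        seq[i:i + k]
        for k in range(min_len, max_len + 1)
        for i in range(len(seq) - k + 1)
    }


# Hypothetical wild-type entry; only the mutated sequence is from the example above.
wildtype = "SYFPEITHISYFPEITHI"
mutated = "SYFPEITHISIFPEITHI"

wt_peptides = peptides(wildtype)
mut_peptides = peptides(mutated)

# Peptides present in both entries: a hit on one of these could be annotated
# with the mutated accession even though the peptide itself is wild type.
shared = wt_peptides & mut_peptides

# Peptides that really cover the mutated residue.
mutated_only = mut_peptides - wt_peptides

print(sorted(shared))
print(sorted(mutated_only))
```

Under these assumptions, SYFPEITHIS ends up in `shared` and SIFPEITHI in `mutated_only`, which is the distinction a downstream filter on the accession alone cannot make.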
Regarding point 2: I agree, however I think we can now benefit from the implementation in nf-core/epitopeprediction. There, more complex variants are also handled, e.g. transcripts with multiple mutation sites.
Ok sure, and it sounds good to replace parts of it with epitopeprediction, especially for point 2. Without going into too much detail on the SYFPEITHI sequence, I just wanted to add quickly that the way it was done for point 1 was intentional, not a bug, because we didn't figure out a better way back then.
Description of the bug
The process GENERATE_PROTEINS_FROM_VCF and the underlying script variants2fasta.py need to get a complete overhaul. A few issues I ran into:
protein_postion is provided as position/length string
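As a rough sketch of how such a position/length string could be split into integers: the function and type names below are hypothetical and not taken from variants2fasta.py, and the support for a start-end range before the slash is an assumption about the annotation format.

```python
from typing import NamedTuple


class ProteinPosition(NamedTuple):
    start: int   # first affected residue (1-based, as given in the annotation)
    end: int     # last affected residue; equals start for a single position
    length: int  # total protein length


def parse_protein_position(value: str) -> ProteinPosition:
    """Parse 'position/length' (or, by assumption, 'start-end/length')."""
    position, sep, length = value.strip().partition("/")
    if not sep:
        raise ValueError(f"expected 'position/length', got {value!r}")
    start, _, end = position.partition("-")
    # A single position has no '-', so end falls back to start.
    return ProteinPosition(int(start), int(end or start), int(length))
```

Values that cannot be converted (for example a missing position) raise ValueError here, so the caller has to decide whether to skip or report such records.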