Add a KINOMEscan handler to obtain the amino acid sequences of the kinases in the DiscoverX KINOMEscan panel. The information currently available for each entry only discloses:
NCBI Protein accession number
Mutations applied (substitutions, insertions, deletions, ...) using common biological notation (S123P, T165ins, etc.).
Synthesized subsequence
Todos
Notable points that this PR has either accomplished or will accomplish.
[X] Add a Biosequence class to handle mutations and subsequences specified with biological notation, which is 1-indexed and often includes bounds on both ends.
[X] Add a KINOMEscan class that takes the original spreadsheet shared by the DiscoverX team, retrieves the sequences, applies the mutations, and cuts each sequence to the desired length.
[ ] Check whether my assumptions about what the biological notation specifies are correct (especially whether interval ends are included).
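To make the indexing assumption in the last item concrete, here is a minimal sketch of how 1-indexed, end-inclusive notation maps onto Python slicing. The function names `cut` and `substitute` are hypothetical and are not the `Biosequence` API from this PR. The inclusive-ends reading is exactly the assumption that still needs checking.

```python
import re


def cut(sequence: str, start: int, end: int) -> str:
    """Return residues start..end, 1-indexed, with BOTH ends included (assumption)."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError(f"Invalid interval {start}-{end} for length {len(sequence)}")
    # 1-indexed inclusive [start, end] corresponds to 0-indexed half-open [start-1, end)
    return sequence[start - 1:end]


def substitute(sequence: str, mutation: str) -> str:
    """Apply a single substitution written like 'S123P' (old residue, position, new residue)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"Not a substitution: {mutation!r}")
    old, position, new = match.group(1), int(match.group(2)), match.group(3)
    if not 1 <= position <= len(sequence):
        raise ValueError(f"Position {position} is out of range")
    index = position - 1  # convert the 1-indexed position to a 0-indexed offset
    if sequence[index] != old:
        raise ValueError(f"Expected {old} at position {position}, found {sequence[index]}")
    return sequence[:index] + new + sequence[index + 1:]
```

Checking the original residue before replacing it is a cheap way to catch off-by-one errors or a mismatched accession early.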
Questions
[ ] Biosequence can handle substitutions combined with either an insertion or a deletion, but not both at once, because the indices shift after each insertion or deletion. Some bookkeeping would be needed to deal with an arbitrary number of mutations (regardless of type), but I wonder if this is necessary. Only one entry in the KINOMEscan panel presents this problem.
[ ] Some mutations are not fully specified: Sins (which could be a Ser insertion, but the position is not specified) and ITD (internal tandem duplication, only applicable to FLT3, but I couldn't find the specific change in sequence). How do we deal with these? Right now, I am just NaNing them out...
[ ] Can we redistribute the DiscoverX spreadsheet? I guess we can, but just wanted to double check before adding it to the git history.
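On the bookkeeping question above, one possible approach (a sketch only, not what this PR implements) is to apply edits from the highest position to the lowest, so that positions to the left of each edit remain valid in the original numbering. The `Edit` tuple layout and the `apply_edits` name are hypothetical. The sketch assumes that edits do not overlap and that an insertion goes immediately after the given position.

```python
from typing import List, Tuple

# (kind, start, end, residues): positions are 1-indexed, inclusive, and refer
# to the ORIGINAL (unmutated) sequence. This layout is a hypothetical choice.
Edit = Tuple[str, int, int, str]


def apply_edits(sequence: str, edits: List[Edit]) -> str:
    """Apply non-overlapping edits right to left so earlier positions never shift."""
    result = sequence
    for kind, start, end, residues in sorted(edits, key=lambda e: e[1], reverse=True):
        if kind == "sub":
            # Replace residues start..end with the new residues
            result = result[:start - 1] + residues + result[end:]
        elif kind == "del":
            # Drop residues start..end
            result = result[:start - 1] + result[end:]
        elif kind == "ins":
            # Insert after position `start` (assumption about the notation)
            result = result[:start] + residues + result[start:]
        else:
            raise ValueError(f"Unknown edit kind: {kind!r}")
    return result


# One substitution, one deletion and one insertion on the same sequence
mutated = apply_edits(
    "ABCDEFGH",
    [("sub", 2, 2, "X"), ("del", 4, 5, ""), ("ins", 7, 7, "ZZ")],
)
print(mutated)  # AXCFGZZH
```

Since only one panel entry needs this, special-casing that entry may still be the simpler option.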