hubin-keio / Spike_NLP

Use NLP to study the spike protein in SARS-CoV-2 virus.

Prepare Models to use with pretrained embeddings #9

Open Michal-Babins opened 1 year ago

Michal-Babins commented 1 year ago

Task: Prepare a series of models to test pre-trained embeddings on binding affinity data.

Models to use: Graph Neural Net, Densely Connected NN, LSTM

To Consider: Simple Logistic Regression, Random Forest, SVM

Michal-Babins commented 1 year ago

@kae-gi we need to make sure we can load the pretrained model and use it to embed the RBD sequences. Using the best-performing saved model, test this on the DMS dataset (binding_Kds.csv). Make sure you can load binding_Kds.csv and transform the reference sequence so that it contains the corresponding mutation. You can find how to do that here: https://github.com/hubin-keio/ASM_NGS_2022/blob/master/models/blstm.py (BindingDataset class). Then embed the mutated sequence with the BERT model and use the binding affinity as the prediction target. Once this is done, we can move on to comparing our model's predictions against the others.
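For reference, the mutation step could look roughly like the sketch below. It is not the BindingDataset code from the linked file. It assumes substitutions are written as space-separated labels such as `N1D L5A` (wild-type residue, 1-based position, mutant residue); the function name `apply_mutations`, the `offset` parameter, and the toy reference sequence are all hypothetical.

```python
import re

# Assumed label format: wild-type residue, 1-based position, mutant residue (e.g. "N1D").
_MUTATION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def apply_mutations(ref_seq: str, substitutions: str, offset: int = 0) -> str:
    """Return a copy of ref_seq with every space-separated substitution applied.

    offset is subtracted from each position, so labels numbered against a
    longer parent protein can be mapped onto a shorter reference sequence.
    """
    seq = list(ref_seq)
    for label in substitutions.split():
        match = _MUTATION.match(label)
        if match is None:
            raise ValueError(f"unrecognized mutation label: {label!r}")
        wild_type, mutant = match.group(1), match.group(3)
        index = int(match.group(2)) - offset - 1  # convert to a 0-based index
        if not 0 <= index < len(seq):
            raise ValueError(f"position out of range in {label!r}")
        if seq[index] != wild_type:
            raise ValueError(f"{label!r} expects {wild_type} but found {seq[index]}")
        seq[index] = mutant
    return "".join(seq)


# Toy reference sequence, used only for illustration.
mutated = apply_mutations("NITNLCPF", "N1D L5A")
```

Checking the wild-type residue before substituting catches numbering mismatches between the CSV and the reference sequence early, which is the most likely failure when the two use different position offsets.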

hubin-keio commented 1 year ago

We will need to use the best-performing model to generate the embeddings and use ONLY the embeddings as input to the LSTM and the other benchmark models. Nothing beyond the embeddings is needed for those models.

Perhaps it would be good to generate a CSV file with all the sequence identifiers, sequences, and embeddings so that the file can be reused across the different models.
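One possible layout for such a file is sketched below, using only the standard library. The column names (`seq_id`, `sequence`, `embedding`) and the choice to store each embedding as space-separated floats in a single column are assumptions for illustration, not an agreed format.

```python
import csv
import io

# Hypothetical column names for the shared embeddings file.
FIELDS = ["seq_id", "sequence", "embedding"]


def write_embeddings(handle, records):
    """Write (seq_id, sequence, embedding) tuples as CSV rows."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for seq_id, sequence, embedding in records:
        writer.writerow({
            "seq_id": seq_id,
            "sequence": sequence,
            # repr() of a float round-trips exactly through float().
            "embedding": " ".join(repr(float(x)) for x in embedding),
        })


def read_embeddings(handle):
    """Read the rows back as (seq_id, sequence, list-of-floats) tuples."""
    return [
        (row["seq_id"], row["sequence"], [float(x) for x in row["embedding"].split()])
        for row in csv.DictReader(handle)
    ]


# Round trip through an in-memory buffer; for a real file, open it with newline="".
buffer = io.StringIO()
write_embeddings(buffer, [("wt", "NITN", [0.1, -2.5])])
buffer.seek(0)
rows = read_embeddings(buffer)
```

Keeping the embedding in one column keeps the header stable regardless of embedding dimension. For large embedding matrices a binary format would be more compact, but a CSV is easy to inspect and to load from any of the benchmark models.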

Michal-Babins commented 1 year ago

@kae-gi we will need a clustering visualization of all the betacoronaviruses using embeddings from the RBD model.

We will also need to train BERT on the AlphaSeq dataset and use the best-performing model to predict binding affinity. The AlphaSeq data covers the scFv region, which forms the paratope that binds to the epitope.

Get the phylogeny-level distribution of the current RBD relative to wild type.