Find a machine learning model and hyperparameters with maximal precision on our current dataset

rvanasa commented 4 years ago

An extremely important challenge in antibody design is making "no-go" predictions to filter out antibodies that would not be able to bind to the target antigen. Since machine learning has been shown to improve the precision of these guesses, we can use a neural network or other machine learning model to significantly improve the results of the antibody screening process.

The dataset we have created has 512 columns and about 125,000 rows with boolean labels. This is a very similar setup to the famous MNIST handwritten digit classification task.

Because this task has lots of possible approaches, this is a perfect entry point if you want to learn how to design neural networks and/or have a clever idea for how to tackle this challenge.

Recommended Python packages:

Pandas (loading / manipulating tabulated data such as csv files)
NumPy (input data for most ML models)
Keras (included in TensorFlow for creating deep learning models)

Relevant papers:

rvanasa commented 4 years ago

Note: we want to focus on precision rather than accuracy, because in our case the "certainty" of each positive prediction (few false positives) is more important than the overall fraction of correct predictions.
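To make the distinction concrete, here is a minimal sketch using made-up labels (not the project's data). The function name `precision_and_accuracy` and both example "models" are assumptions for illustration only. It shows two sets of predictions with the same accuracy but different precision.

```python
import numpy as np

def precision_and_accuracy(y_true, y_pred):
    """Return (precision, accuracy) for boolean label arrays."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    true_pos = np.sum(y_true & y_pred)
    false_pos = np.sum(~y_true & y_pred)
    # Precision: of everything predicted positive, how much really is positive.
    predicted_pos = true_pos + false_pos
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    # Accuracy: fraction of all predictions that match the label.
    accuracy = np.mean(y_true == y_pred)
    return float(precision), float(accuracy)

# Made-up example: 10 candidates, 4 of which are true positives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# A cautious model that only flags the two cases it is most sure about.
cautious = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# A liberal model that flags more candidates, including two false positives.
liberal = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

cautious_scores = precision_and_accuracy(y_true, cautious)
liberal_scores = precision_and_accuracy(y_true, liberal)
print("cautious (precision, accuracy):", cautious_scores)
print("liberal  (precision, accuracy):", liberal_scores)
```

Both example models get 8 of 10 predictions right, but only the cautious one never raises a false positive, which is the property this issue asks to optimize.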

Woodsamr commented 3 years ago

Hi Ryan,

I am a student of biomedical informatics and I came across your Medium post. I am interested in using the dataset in my deep learning project for the class. I want to understand the dataset; do you have any reference that I can look up? Thank you for sharing your findings, they are extremely helpful.

Thanks, Sam

rvanasa commented 3 years ago

Sure! You can find table and column descriptions in the Kaggle dataset for this project: https://www.kaggle.com/rvanasa/monoclonal-antibodies

Our deep learning model in network_training.ipynb uses a similar input structure to the random forest mentioned in this paper (https://www.sciencedirect.com/science/article/pii/S2211124718316851).

I'm guessing that features_contacts.csv is the dataset you'll find the most interesting, since it combines all the features we're using for our model. You can experiment with this Colab notebook (https://colab.research.google.com/github/rvanasa/deep-antibody/blob/master/network_training.ipynb) to see how we set up the train/test data.

Let me know if there's anything else I could point you towards. Good luck on the project :)

Woodsamr commented 3 years ago

Ryan,

Thank you for your prompt response. I will review and let you know if I have any questions.

Regards, Sam.

AsmaaAlafify commented 3 years ago

Hi Ryan,

I am a student in the Bioinformatics Department. I am interested in using the data (10 files in Kaggle -> https://www.kaggle.com/rvanasa/monoclonal-antibodies/version/1?select=cdr_flattened.csv) in my graduation project. I want to understand the dataset; do you have any article or paper that explains the ten files and their relationship to each other? I would also like to understand the code on GitHub (https://github.com/rvanasa/deep-antibody), if you have any references. Thanks for sharing this information. Thanks, Asmaa

rvanasa commented 3 years ago

Hi Asmaa,

This project was heavily inspired by the paper Computational Design of Epitope-Specific Functional Antibodies (https://www.sciencedirect.com/science/article/pii/S2211124718316851), which explains the process of identifying existing antibodies that could potentially be modified to target a new antigen. I would recommend using this paper as a lens for understanding the purpose of this data.

Most of the data files are only used to construct features_contacts.csv, which is used to train our deep learning model (https://colab.research.google.com/github/rvanasa/deep-antibody/blob/master/antibody_analysis.ipynb).

Here is a brief description of each CSV file:

cdr_flattened contains the CDR residue sequences of antibodies in RCSB (https://www.rcsb.org/) using the Contact definition (more info: http://www.bioinf.org.uk/abs/info.html).
contacts_preprocessed is the distance between close antibody/antigen contact points in the RCSB models, usually in the CDR regions.
cov_preprocessed is a cleaned-up version of the CoV-AbDab (http://opig.stats.ox.ac.uk/webapps/covabdab/) database file.
docked_preprocessed contains the residue sequences for docked antibody-antigen pairs in the dataset.
docked_secondary associates secondary structure (predicted by DSSP, https://swift.cmbi.umcn.nl/gv/dssp/DSSP_3.html) to the residue sequences.
features_contacts is essentially the dataset used for Computational Design of Epitope-Specific Functional Antibodies, which combines information from the rest of the CSV files.
thera_preprocessed is a cleaned-up version of the Thera-SAbDab (http://opig.stats.ox.ac.uk/webapps/newsabdab/therasabdab/search/) database file.
thera_prioritized assigns a somewhat arbitrary "priority" for how much useful data is available for a therapeutic in Thera-SAbDab. We used this to improve our validation data.
windows_ag lists all the 9-residue subsequences of each antigen (positions 1-9, 2-10, 3-11, etc.), which is used to generate negative example cases for the model.
windows_cdr is the equivalent for each antibody CDR region. Since the CDR regions remain the same (whereas the antigen epitope can change depending on the antibody), we only care about these windows rather than every position on the antibody.
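
The 9-residue windows described for windows_ag and windows_cdr can be sketched as follows. This is only an illustration of the sliding-window idea, not the code that produced the CSV files; the function name `residue_windows` and the example sequence are made up.

```python
def residue_windows(sequence, size=9):
    """Return (start_position, window) for every contiguous window.

    Positions are 1-indexed, so the first window covers positions 1-9,
    the second covers 2-10, and so on.
    """
    return [
        (start + 1, sequence[start:start + size])
        for start in range(len(sequence) - size + 1)
    ]

# Made-up 12-residue sequence, for illustration only.
example = "ACDEFGHIKLMN"
windows = residue_windows(example)
for position, window in windows:
    print(position, window)
```

A sequence of length n yields n - 8 windows, and a sequence shorter than 9 residues yields none.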

One way to understand the data is by using the structure visualizations provided by RCSB (for example, SARS-CoV-2: https://www.rcsb.org/3d-view/6W41/1). Each CSV file essentially stores a different useful aspect of these 3D models.

Hopefully this will be enough to get started. Let me know if you would like additional clarification. Cheers!

AsmaaAlafify commented 3 years ago

Thank you for all your assistance. I'll read more about the data. Best regards.

AsmaaAlafify commented 3 years ago

Hi Ryan,

I am a student in the Bioinformatics Department. I used the network training code (https://github.com/rvanasa/deep-antibody/blob/master/network_training.ipynb) in my graduation project. Can you please tell me how you set all the values of the y-actual "Label" to one? After using the generate_batch function, the y-actual records came out as zeros and ones. Why are some of them zeros and others ones, and what does this label depend on? Thanks in advance, Asmaa Sheham

rvanasa commented 3 years ago

Hello Asmaa,

Glad to hear that the code is helpful for your graduation project. The model classifies individual pairs of potential contact positions between the antibody and the antigen, so the generate_batch function labels "correct" pairs (amino acids known to contact each other) with a 1 and "incorrect" pairs (not in contact with each other) with a 0. This way, you will have a relatively balanced dataset for training.

Does this answer your question? Let me know if I could clarify further.

Cheers!

AsmaaAlafify commented 3 years ago

Thank you for your prompt response. I understand that, but my question is: on what basis do you label each record with a zero or a one? In other words, what information makes you label a given row with 1 or 0? Thanks in advance, Asmaa Sheham

rvanasa commented 3 years ago

Hi Asmaa,

Here's how the labels work for the 6W41 (https://www.rcsb.org/structure/6W41) light chain, as a specific example:

Since this particular light chain has 221 amino acids and SARS-CoV-2 has 231, there are 221 * 231 = 51051 possible pairs (rows) which might contact each other when the antibody binds to the antigen. The pairs which are less than 3 angstroms apart (in the 3D model) are labeled with a 1, and the rest are labeled 0. Since the number of real contacts is relatively low compared to all possible pairs, the 0 rows are randomly selected from CDR regions on the light chain, which are the most likely areas for contacts to occur.
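
As a rough sketch of the labeling rule just described (and not the actual generate_batch implementation), the following uses a random distance matrix in place of real 3D coordinates. The function name `label_pairs`, the CDR index ranges, and the exact 1:1 ratio of negatives to positives are all assumptions for illustration; only the 3 angstrom threshold, the 221 x 231 shape, and the idea of drawing 0 rows from CDR regions come from the explanation above.

```python
import numpy as np

def label_pairs(distances, cdr_mask, threshold=3.0, seed=0):
    """Label antibody/antigen residue pairs from a distance matrix.

    distances: array of shape (n_antibody, n_antigen), in angstroms.
    cdr_mask:  boolean array of shape (n_antibody,), True for CDR positions.
    Returns (pairs, labels), where pairs has shape (n_rows, 2) and holds
    (antibody_index, antigen_index) for each row.
    """
    rng = np.random.default_rng(seed)
    contact = distances < threshold
    # Every pair closer than the threshold becomes a positive (label 1).
    positives = np.argwhere(contact)
    # Candidate negatives: non-contacting pairs whose antibody residue is in a CDR.
    candidates = np.argwhere(~contact & cdr_mask[:, None])
    # Assumption: sample as many negatives as positives to keep the rows balanced.
    n_neg = min(len(positives), len(candidates))
    chosen = rng.choice(len(candidates), size=n_neg, replace=False)
    negatives = candidates[chosen]
    pairs = np.concatenate([positives, negatives])
    labels = np.concatenate([
        np.ones(len(positives), dtype=int),
        np.zeros(n_neg, dtype=int),
    ])
    return pairs, labels

# Synthetic stand-in for the 221 x 231 light chain / antigen distance matrix.
rng = np.random.default_rng(42)
distances = rng.uniform(2.0, 40.0, size=(221, 231))

# Hypothetical CDR positions (0-indexed); real CDR boundaries differ per antibody.
cdr_mask = np.zeros(221, dtype=bool)
for start, stop in [(23, 35), (49, 56), (88, 97)]:
    cdr_mask[start:stop] = True

pairs, labels = label_pairs(distances, cdr_mask)
print("possible pairs:", distances.size)
print("positive rows:", int(labels.sum()), "negative rows:", int((labels == 0).sum()))
```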

Essentially, we score each amino acid in the antibody CDR regions against every amino acid in the antigen, generating a 2D "heatmap" indicating which parts strongly interact with each other. In order to calculate the final score for an antibody-antigen pair, you can average, sum, or take the maximum score of all relevant predictions (depending on how you're using the model).
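
The three aggregation options mentioned above can be sketched like this. The heatmap values are made up, and `aggregate_scores` is a hypothetical helper rather than a function from the repository; in practice each cell would be the model's predicted contact score for one CDR residue / antigen residue pair.

```python
import numpy as np

def aggregate_scores(heatmap, method="mean"):
    """Collapse a 2D grid of pair scores into one antibody-antigen score."""
    reducers = {"mean": np.mean, "sum": np.sum, "max": np.max}
    if method not in reducers:
        raise ValueError(f"unknown method: {method!r}")
    return float(reducers[method](heatmap))

# Made-up scores: rows are CDR residues, columns are antigen residues.
heatmap = np.array([
    [0.1, 0.9, 0.2],
    [0.0, 0.3, 0.1],
])

mean_score = aggregate_scores(heatmap, "mean")
sum_score = aggregate_scores(heatmap, "sum")
max_score = aggregate_scores(heatmap, "max")
print(mean_score, sum_score, max_score)
```

The mean is less sensitive to antibody and antigen length, the sum rewards many moderate contacts, and the maximum highlights the single strongest predicted contact, so the right choice depends on how the score is used.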

Let me know whether this makes more sense, and I will continue looking for ways to explain this as best as possible.

AsmaaAlafify commented 3 years ago

Yes, we understand this very well. Thank you for your great effort.
