shaimaaK / AI-based-QSAR-for-Alzheimers-disease

AI-based Quantitative Structure-Activity Relationship (QSAR) study for Alzheimer's disease

AI-based-QSAR-for-Alzheimers-disease

This project, an AI-based Quantitative Structure-Activity Relationship (QSAR) study for Alzheimer's disease, was implemented as part of the Data Mining course in my Master's degree in AI. It analyzes the quantitative structure-activity relationship between the Amyloid beta A4 protein and Alzheimer's disease: the bioactivity measured against the protein is predicted, as a pIC50 value, from the molecular structure.


Implementation Remarks

This project is implemented on Google Colab, so all additional packages are documented and installed using shell commands in the Colab notebook. The dataset was fetched from ChEMBL in 2021 using the ChEMBL webresource client API. Because ChEMBL is regularly updated, a snapshot of the dataset is saved for reference.

Libraries Used

Data Retrieval API

Data Manipulation

Data Visualization

Protein Descriptors Computations

Preprocessing Steps

Step 1: Access the ChEMBL database and filter it to extract the data for Alzheimer's disease where the protein studied is the Amyloid beta A4 protein.
Step 2: Handle missing, duplicated, and null data.
Step 3: Simplify the simplified molecular-input line-entry system (SMILES) notation, e.g. handle disconnections in the SMILES notation.
Step 4: Transform attribute types according to the nature of each attribute.
Step 5: Discretize bioactivity into three levels (active, intermediate, inactive) according to the standard value, then eliminate the intermediate rows to focus on active/inactive instances.
Step 6: Normalize the IC50 value by computing pIC50 (the negative base-10 logarithm of IC50 expressed in molar units).
Step 7: Generate PaDEL descriptors from the SMILES notation using a GitHub project.
Step 8: Drop the identifier attribute.
Step 9: Reduce dimensionality using the VarianceThreshold method.
Step 10: Split the data into training and testing sets with a 67%/33% ratio.
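
Steps 5 and 6 can be illustrated with a minimal, standard-library sketch. It assumes that IC50 values are given in nM and that the class cut-offs are 1,000 nM (active) and 10,000 nM (inactive). Neither the unit nor the thresholds are stated in this README, and the molecule identifiers are hypothetical.

```python
import math

# Assumed cut-offs in nM; the README does not state the thresholds used.
ACTIVE_MAX_NM = 1_000.0
INACTIVE_MIN_NM = 10_000.0


def bioactivity_class(ic50_nm):
    """Step 5: bin an IC50 value (assumed nM) into one of three levels."""
    if ic50_nm <= ACTIVE_MAX_NM:
        return "active"
    if ic50_nm >= INACTIVE_MIN_NM:
        return "inactive"
    return "intermediate"


def pic50(ic50_nm):
    """Step 6: pIC50 = -log10(IC50 in mol/L) = 9 - log10(IC50 in nM)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return 9.0 - math.log10(ic50_nm)


def prepare(records):
    """Label each (molecule_id, ic50_nm) pair, drop intermediates, add pIC50."""
    rows = []
    for molecule_id, ic50_nm in records:
        label = bioactivity_class(ic50_nm)
        if label == "intermediate":
            continue  # keep only active/inactive instances
        rows.append((molecule_id, label, pic50(ic50_nm)))
    return rows


# Hypothetical identifiers and values, for illustration only.
example = [("mol_a", 100.0), ("mol_b", 5_000.0), ("mol_c", 100_000.0)]
prepared = prepare(example)
for row in prepared:
    print(row)
```

Working on the logarithmic pIC50 scale compresses IC50 values that span several orders of magnitude into a narrow range, which is generally easier for a regression model to fit.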

Exploratory Data Analysis

[Figure: class balance]

Machine Learning Regression Problem

This is a regression problem: the input to the model is the PaDEL descriptor, which represents the fingerprint/descriptor of a molecule, and the output is the predicted bioactivity as a continuous pIC50 value, hence the name Quantitative Structure-Activity Relationship (QSAR). First, the LazyRegressor library is used to narrow the candidates down to four well-performing regression models. These models are then compared to select the best-performing one, which is further optimized by tuning its hyperparameters.
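
To make the regression setup concrete, the sketch below prepares a feature matrix and target the way preprocessing steps 9 and 10 describe, using scikit-learn. The data is synthetic (random bits standing in for PaDEL fingerprints, random values standing in for pIC50), and the variance cut-off of 0.8 * (1 - 0.8) and the random seed are assumptions, since the README does not give them.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the PaDEL fingerprint matrix: 120 molecules x 50 bits.
X = rng.integers(0, 2, size=(120, 50)).astype(float)
X[:, :10] = 0.0  # ten constant columns that carry no information

# Synthetic stand-in for the continuous pIC50 target.
y = rng.uniform(4.0, 9.0, size=120)

# Step 9: drop low-variance columns (the threshold value is an assumption).
selector = VarianceThreshold(threshold=0.8 * (1 - 0.8))
X_reduced = selector.fit_transform(X)

# Step 10: 67% training / 33% testing split.
X_train, X_test, y_train, y_test = train_test_split(
    X_reduced, y, test_size=0.33, random_state=42
)

print(X.shape, "->", X_reduced.shape)
print(X_train.shape, X_test.shape)
```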

Try Majority of Regression models using LazyRegressor

The LazyRegressor library runs 40 regression models, including Support Vector Machine (SVM), Random Forest (RF), AdaBoost regressor, decision tree regressor, and many more. The performance of the regression models is evaluated according to the R-squared value, Root Mean Square Error (RMSE), and computation time.

[Figures: R-squared, RMSE, and computation time of the LazyRegressor models]

Evaluate four regression models and optimize the best model

The following four regression models are selected:

  1. Random Forest with 80 estimators
  2. Gradient Boost with 80 estimators
  3. Support Vector Machine with Radial Basis Function (rbf) kernel
  4. K Nearest Neighbor with k=10

where the performance is evaluated according to:

  1. Mean Absolute Error (MAE)
  2. R squared
  3. Computation time
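
A minimal sketch of such a comparison with scikit-learn follows. It uses the four model settings listed above, but runs on synthetic data rather than the project's dataset. Measuring time over fit plus predict, and the random seeds, are assumptions; the README does not say exactly what the execution time covers.

```python
import time

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic stand-in data: binary descriptors and a pIC50-like target.
X = rng.integers(0, 2, size=(150, 30)).astype(float)
y = 5.0 + 0.5 * X[:, :5].sum(axis=1) + rng.normal(0.0, 0.1, size=150)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

models = {
    "Random Forest": RandomForestRegressor(n_estimators=80, random_state=42),
    "Gradient Boost": GradientBoostingRegressor(n_estimators=80, random_state=42),
    "Support Vector Machine": SVR(kernel="rbf"),
    "K Nearest Neighbor": KNeighborsRegressor(n_neighbors=10),
}

results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    elapsed = time.perf_counter() - start  # assumed scope: fit + predict
    results[name] = {
        "r2": r2_score(y_test, y_pred),
        "mae": mean_absolute_error(y_test, y_pred),
        "seconds": elapsed,
    }

for name, scores in results.items():
    print(f"{name}: R2={scores['r2']:.3f} MAE={scores['mae']:.3f} "
          f"time={scores['seconds']:.4f}s")
```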

According to the table below, the most promising regression model is random forest, so its parameters are optimized using the grid search method to improve its performance.

According to the grid search with cross-validation (cv = 3), the best parameter values are n_estimators = 800 and max_depth = 8.
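
A grid search of this kind can be sketched with scikit-learn's GridSearchCV. The example below tunes n_estimators and max_depth with cv = 3 on synthetic data. The candidate values are kept deliberately small so that the example runs quickly, and both the grid and the scoring metric are assumptions, not the grid used in the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Synthetic stand-in data: binary descriptors and a pIC50-like target.
X = rng.integers(0, 2, size=(90, 20)).astype(float)
y = 5.0 + 0.5 * X[:, :5].sum(axis=1) + rng.normal(0.0, 0.1, size=90)

# Assumed, deliberately small grid; the project's actual grid is not given.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [4, 8],
}

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid=param_grid,
    cv=3,                                # 3-fold cross-validation
    scoring="neg_mean_absolute_error",   # assumed metric
)
search.fit(X, y)

best_params = search.best_params_
print(best_params)
print(f"best cross-validated MAE: {-search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` holds a random forest refitted on all the supplied data with the selected parameters, which can then be evaluated on the held-out test set.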

Model                        R2 Score   MAE     Execution Time
Random Forest                0.7045     0.549   0.0091
Gradient Boosted Regressor   0.692      0.61    0.0015
K Nearest Neighbor           0.68       0.61    0.0016
Support Vector Machine       0.708      0.61    0.0018
Optimized Random Forest      0.928      0.29    0.0702

