phi-grib / flame

Modeling framework for eTRANSAFE project
GNU General Public License v3.0

PLSR errors with some data scaling combinations #152

Closed manuelpastor closed 5 years ago

manuelpastor commented 5 years ago

PLSR with scale = True produces wrong results. The r2 values obtained in the optimization are large negative numbers. The issue must be investigated ASAP.

manuelpastor commented 5 years ago

Example using caco2.sdf with default parameters except for Conformal = False and method = PLS-R

With StandardScaler and PLS-R scale = False, it runs fine:

D:\Usuarios\manuel\documentos\soft\flame\flame>flame -c build  -e CACO-PLS -p cacopls.yaml -f ..\mols\caco2.sdf
INFO - Starting building model CACO-PLS with file ..\mols\caco2.sdf and parameters cacopls.yaml
INFO - Running with input type: molecule
INFO - Starting normalization...
INFO - Computing molecular descriptors with methods ['RDKit_md']...
INFO - Computing RDKit descriptors...
INFO - Building model using internal toolkit : Sci-kit learn
INFO - Data scaled using StandarScaler
INFO - cv is: LeaveOneOut()
INFO - Starting model building
INFO - Starting model validation
INFO - Model finished successfully
       nobj ( number of objects ) : 100.0
       nvarx ( number of predictor variables ) : 200.0
       model ( model type ) : PLSR quantitative
       scoringR ( Scoring P ) : 0.2174
       R2 ( Determination coefficient ) : 0.6391
       SDEC ( Standard Deviation Error of the Calculations ) : 0.4663
       scoringP ( Scoring P ) : 0.3319
       Q2 ( Determination coefficient in cross-validation ) : 0.4491
       SDEP ( Standard Deviation Error of the Predictions ) : 0.5761

If we set PLS-R scale = True, we obtain absurd Q2 results:

(flame) D:\Usuarios\manuel\documentos\soft\flame\flame>flame -c build  -e CACO-PLS -p cacopls.yaml -f ..\mols\caco2.sdf
INFO - Starting building model CACO-PLS with file ..\mols\caco2.sdf and parameters cacopls.yaml
INFO - Running with input type: molecule
INFO - Recycling data from D:\Usuarios\manuel\documentos\soft\flame\flame\models\CACO-PLS\dev\data.pkl
INFO - Building model using internal toolkit : Sci-kit learn
INFO - Data scaled using StandarScaler
INFO - cv is: LeaveOneOut()
INFO - Starting model building
INFO - Starting model validation
INFO - Model finished successfully
       nobj ( number of objects ) : 100.0
       nvarx ( number of predictor variables ) : 200.0
       model ( model type ) : PLSR quantitative
       scoringR ( Scoring P ) : 0.2174
       R2 ( Determination coefficient ) : 0.6391
       SDEC ( Standard Deviation Error of the Calculations ) : 0.4663
       scoringP ( Scoring P ) : 33290243.172
       Q2 ( Determination coefficient in cross-validation ) : -55261436.7437
       SDEP ( Standard Deviation Error of the Predictions ) : 5769.7698

If we set the scaler to None, we obtain different, equally absurd results in validation:

(flame) D:\Usuarios\manuel\documentos\soft\flame\flame>flame -c build  -e CACO-PLS -p cacopls.yaml -f ..\mols\caco2.sdf
INFO - Starting building model CACO-PLS with file ..\mols\caco2.sdf and parameters cacopls.yaml
INFO - Running with input type: molecule
INFO - Recycling data from D:\Usuarios\manuel\documentos\soft\flame\flame\models\CACO-PLS\dev\data.pkl
INFO - Building model using internal toolkit : Sci-kit learn
INFO - Data scaled using StandarScaler
INFO - cv is: LeaveOneOut()
INFO - Starting model building
INFO - Starting model validation
INFO - Model finished successfully
       nobj ( number of objects ) : 100.0
       nvarx ( number of predictor variables ) : 200.0
       model ( model type ) : PLSR quantitative
       scoringR ( Scoring P ) : 0.53
       R2 ( Determination coefficient ) : 0.1202
       SDEC ( Standard Deviation Error of the Calculations ) : 0.728
       scoringP ( Scoring P ) : 794446036.4527
       Q2 ( Determination coefficient in cross-validation ) : -1318771687.0669
       SDEP ( Standard Deviation Error of the Predictions ) : 28185.9191
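
To make the scale of these numbers easier to read, here is a minimal sketch of how such validation metrics are commonly computed. It assumes that `scoringP` is the mean squared error of the cross-validated predictions, that SDEP is its square root, and that Q2 is one minus the ratio of the prediction error sum of squares to the total sum of squares. The logged values are consistent with that reading (SDEP squared equals `scoringP` in every run), but it remains an assumption about flame's internals. The function name `cv_metrics` and the data are hypothetical. The point is that a single wild leave-one-out prediction is enough to push Q2 to a huge negative value.

```python
import math


def cv_metrics(y_true, y_pred):
    """Return MSE, Q2 and SDEP for cross-validated predictions.

    Assumed definitions (not taken from flame's source code):
      scoringP = PRESS / n
      Q2       = 1 - PRESS / SS_total
      SDEP     = sqrt(PRESS / n)
    """
    n = len(y_true)
    y_mean = sum(y_true) / n
    press = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_total = sum((yt - y_mean) ** 2 for yt in y_true)
    mse = press / n
    return {"scoringP": mse, "Q2": 1.0 - press / ss_total, "SDEP": math.sqrt(mse)}


# Hypothetical data: 100 objects, every prediction off by exactly 0.5.
y_true = [float(i % 5) for i in range(100)]
good_pred = [yt + 0.5 if i % 2 else yt - 0.5 for i, yt in enumerate(y_true)]

# Same predictions, except one object is predicted wildly wrong.
bad_pred = list(good_pred)
bad_pred[0] = 1.0e5

good = cv_metrics(y_true, good_pred)
bad = cv_metrics(y_true, bad_pred)

print("reasonable CV :", good)
print("one outlier   :", bad)
```

With 99 unchanged predictions and one extreme one, Q2 drops from a positive value to the order of minus tens of millions, the same pattern as in the logs above.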

manuelpastor commented 5 years ago

We have found that the raw descriptor matrix contains extremely large values in the variable IPc (variance > 1e30). This variable in general, and the particularly large IPc value for cyclosporin, dominate the projection and produce abnormal results.

To mitigate the problem we will:

  1. Include an "RDKit md black list" that allows problematic variables to be removed. IPc will be included by default.

  2. Analyze the variance of the X matrix variables and create a log file. In the future we will include an "X sanitize" option to filter out variables/values meeting certain criteria (very high or low variance, extreme outliers, etc.).
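
The two mitigation steps could be sketched roughly as follows. This is an illustration only, not the code that went into flame: the names `BLACKLIST`, `drop_blacklisted` and `variance_report`, the thresholds, and the descriptor values are all hypothetical. It uses only the Python standard library so that it runs without dependencies; a real implementation would more likely operate on NumPy arrays. Matching is case-insensitive because the exact casing of the descriptor name in the computed matrix is not shown in this thread.

```python
import statistics

# Hypothetical default black list; the thread only says IPc is included by default.
BLACKLIST = {"IPc"}


def drop_blacklisted(names, rows, blacklist=BLACKLIST):
    """Remove black-listed columns (case-insensitive name match)."""
    banned = {name.lower() for name in blacklist}
    keep = [j for j, name in enumerate(names) if name.lower() not in banned]
    kept_names = [names[j] for j in keep]
    kept_rows = [[row[j] for j in keep] for row in rows]
    return kept_names, kept_rows


def variance_report(names, rows, low=1e-8, high=1e10):
    """Return (name, variance, flag) per column; thresholds are arbitrary."""
    report = []
    for j, name in enumerate(names):
        var = statistics.pvariance([row[j] for row in rows])
        if var > high:
            flag = "high"
        elif var < low:
            flag = "low"
        else:
            flag = "ok"
        report.append((name, var, flag))
    return report


# Hypothetical descriptor matrix: one object has an enormous IPc value.
names = ["MolWt", "IPc", "NumHDonors"]
rows = [
    [180.2, 1.2e3, 1.0],
    [312.4, 4.5e4, 2.0],
    [1202.6, 3.0e17, 5.0],
    [250.3, 8.9e3, 2.0],
]

report = variance_report(names, rows)
for name, var, flag in report:
    print(f"{name:12s} variance={var:.3e} [{flag}]")

flagged = [name for name, _, flag in report if flag != "ok"]
kept_names, kept_rows = drop_blacklisted(names, rows)
print("flagged:", flagged)
print("kept   :", kept_names)
```

Writing the report to a log file before modeling makes it possible to spot a variable like IPc without inspecting the matrix by hand, and the black list removes it before it reaches the scaler or the PLS projection.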