Closed by manuelpastor 5 years ago
Example using caco2.sdf with default settings, except Conformal = False and method = PLS-R.
With StandardScaler and the PLS-R scale option set to False, it runs fine:
D:\Usuarios\manuel\documentos\soft\flame\flame>flame -c build -e CACO-PLS -p cacopls.yaml -f ..\mols\caco2.sdf
INFO - Starting building model CACO-PLS with file ..\mols\caco2.sdf and parameters cacopls.yaml
INFO - Running with input type: molecule
INFO - Starting normalization...
INFO - Computing molecular descriptors with methods ['RDKit_md']...
INFO - Computing RDKit descriptors...
INFO - Building model using internal toolkit : Sci-kit learn
INFO - Data scaled using StandarScaler
INFO - cv is: LeaveOneOut()
INFO - Starting model building
INFO - Starting model validation
INFO - Model finished successfully
nobj ( number of objects ) : 100.0
nvarx ( number of predictor variables ) : 200.0
model ( model type ) : PLSR quantitative
scoringR ( Scoring P ) : 0.2174
R2 ( Determination coefficient ) : 0.6391
SDEC ( Standard Deviation Error of the Calculations ) : 0.4663
scoringP ( Scoring P ) : 0.3319
Q2 ( Determination coefficient in cross-validation ) : 0.4491
SDEP ( Standard Deviation Error of the Predictions ) : 0.5761
If we set the PLS-R scale option to True, we obtain absurd Q2 results:
(flame) D:\Usuarios\manuel\documentos\soft\flame\flame>flame -c build -e CACO-PLS -p cacopls.yaml -f ..\mols\caco2.sdf
INFO - Starting building model CACO-PLS with file ..\mols\caco2.sdf and parameters cacopls.yaml
INFO - Running with input type: molecule
INFO - Recycling data from D:\Usuarios\manuel\documentos\soft\flame\flame\models\CACO-PLS\dev\data.pkl
INFO - Building model using internal toolkit : Sci-kit learn
INFO - Data scaled using StandarScaler
INFO - cv is: LeaveOneOut()
INFO - Starting model building
INFO - Starting model validation
INFO - Model finished successfully
nobj ( number of objects ) : 100.0
nvarx ( number of predictor variables ) : 200.0
model ( model type ) : PLSR quantitative
scoringR ( Scoring P ) : 0.2174
R2 ( Determination coefficient ) : 0.6391
SDEC ( Standard Deviation Error of the Calculations ) : 0.4663
scoringP ( Scoring P ) : 33290243.172
Q2 ( Determination coefficient in cross-validation ) : -55261436.7437
SDEP ( Standard Deviation Error of the Predictions ) : 5769.7698
If we set the scaler to None, we obtain different, equally absurd validation results:
(flame) D:\Usuarios\manuel\documentos\soft\flame\flame>flame -c build -e CACO-PLS -p cacopls.yaml -f ..\mols\caco2.sdf
INFO - Starting building model CACO-PLS with file ..\mols\caco2.sdf and parameters cacopls.yaml
INFO - Running with input type: molecule
INFO - Recycling data from D:\Usuarios\manuel\documentos\soft\flame\flame\models\CACO-PLS\dev\data.pkl
INFO - Building model using internal toolkit : Sci-kit learn
INFO - Data scaled using StandarScaler
INFO - cv is: LeaveOneOut()
INFO - Starting model building
INFO - Starting model validation
INFO - Model finished successfully
nobj ( number of objects ) : 100.0
nvarx ( number of predictor variables ) : 200.0
model ( model type ) : PLSR quantitative
scoringR ( Scoring P ) : 0.53
R2 ( Determination coefficient ) : 0.1202
SDEC ( Standard Deviation Error of the Calculations ) : 0.728
scoringP ( Scoring P ) : 794446036.4527
Q2 ( Determination coefficient in cross-validation ) : -1318771687.0669
SDEP ( Standard Deviation Error of the Predictions ) : 28185.9191
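A quick consistency check on the three reports above can help confirm that only the cross-validation figures are off. The sketch below assumes, from the numbers alone and not from the flame source, that scoringR and scoringP are mean squared errors, that SDEC and SDEP are their square roots, and that R2 and Q2 are computed as 1 - MSE / var(y). Under those assumptions every run should imply the same variance of y.

```python
import math

# Values copied from the three reports above:
# (label, scoringR, R2, SDEC, scoringP, Q2, SDEP)
runs = [
    ("StandardScaler, scale False", 0.2174, 0.6391, 0.4663, 0.3319, 0.4491, 0.5761),
    ("scale True", 0.2174, 0.6391, 0.4663, 33290243.172, -55261436.7437, 5769.7698),
    ("scaler None", 0.53, 0.1202, 0.728, 794446036.4527, -1318771687.0669, 28185.9191),
]


def implied_var_y(mse, r2_like):
    # Assumption: r2_like = 1 - mse / var(y), solved here for var(y).
    return mse / (1.0 - r2_like)


checks = []
for label, mse_fit, r2, sdec, mse_cv, q2, sdep in runs:
    checks.append({
        "label": label,
        "sdec_from_mse": math.sqrt(mse_fit),   # compare with reported SDEC
        "sdep_from_mse": math.sqrt(mse_cv),    # compare with reported SDEP
        "var_y_from_fit": implied_var_y(mse_fit, r2),
        "var_y_from_cv": implied_var_y(mse_cv, q2),
    })

for c in checks:
    print(c)
```

If the implied variance of y agrees across the fitted and cross-validated figures, the metric arithmetic itself is probably sound and the problem lies in the leave-one-out predictions.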
We have found that the raw descriptor matrix contains extremely large values in the variable IPc (variance > 1e30). This variable in general, and the particularly large IPc value for cyclosporin, dominate the projection and produce abnormal results.
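The effect can be reproduced on a toy scale. The sketch below is not PLS and does not use the caco2 data: it fits an ordinary one-variable least-squares line to invented values where a single object has an x value many orders of magnitude larger than the rest. The fitted statistic stays in a normal range, while the leave-one-out prediction for the extreme object is a wild extrapolation that drives Q2 to a large negative number.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def one_minus_ratio(ys, preds):
    """1 - SS_residual / SS_total, used here for both R2 and Q2."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Invented data: nine ordinary objects plus one with an extreme x value,
# standing in for a compound whose descriptor is orders of magnitude larger.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 1.0e6]
y = [0.1, 0.3, 0.2, 0.5, 0.4, 0.7, 0.6, 0.9, 0.8, 1.0]

# Fitted values using all objects.
a, b = fit_line(x, y)
r2 = one_minus_ratio(y, [a + b * xi for xi in x])

# Leave-one-out: each object is predicted by a model that never saw it.
loo_preds = []
for i in range(len(x)):
    a_i, b_i = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    loo_preds.append(a_i + b_i * x[i])
q2 = one_minus_ratio(y, loo_preds)

print(f"R2 = {r2:.4f}")
print(f"Q2 = {q2:.4e}")
print(f"LOO prediction for the extreme object = {loo_preds[-1]:.1f}")
```

When the extreme object is left out, the slope is estimated from the ordinary objects only and then multiplied by the extreme x value, so a single squared error dominates the whole cross-validation sum.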
To mitigate the problem we will:
Include an "RDKit md black list" that allows problematic variables to be removed. IPc will be included in it by default.
Analyze the variance of the X matrix variables and write it to a log file. In the future we will include an "X sanitize" option to filter out variables/values meeting certain criteria (very high or low variance, extreme outliers, etc.).
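The two mitigation steps could look roughly like the sketch below. It is only an illustration of the idea, not the flame implementation: the function names, the `DEFAULT_BLACKLIST` constant, the variance thresholds and the descriptor values are all hypothetical.

```python
import statistics

# Hypothetical default black list; the text above only says IPc will be on it.
DEFAULT_BLACKLIST = ("IPc",)


def drop_blacklisted(names, rows, blacklist=DEFAULT_BLACKLIST):
    """Remove black-listed columns from a row-major X matrix."""
    keep = [j for j, name in enumerate(names) if name not in blacklist]
    kept_names = [names[j] for j in keep]
    kept_rows = [[row[j] for j in keep] for row in rows]
    return kept_names, kept_rows


def variance_report(names, rows, high=1e6, low=1e-12):
    """Return one (name, variance, flag) tuple per column.

    The thresholds are arbitrary placeholders, not values from the project.
    """
    report = []
    for j, name in enumerate(names):
        var = statistics.pvariance([row[j] for row in rows])
        if var > high:
            flag = "HIGH"
        elif var < low:
            flag = "LOW"
        else:
            flag = "ok"
        report.append((name, var, flag))
    return report


# Invented values for illustration only; not taken from caco2.sdf.
names = ["MolWt", "IPc", "NumHDonors"]
rows = [
    [180.2, 1.2e3, 1.0],
    [250.3, 4.5e4, 1.0],
    [1202.6, 3.0e17, 1.0],
]

# Step 1: report per-variable variance (this is what would go to the log file).
for name, var, flag in variance_report(names, rows):
    print(f"{name:12s} variance={var:.3e} {flag}")

# Step 2: drop the black-listed variables before modelling.
clean_names, clean_rows = drop_blacklisted(names, rows)
print(clean_names)
```

Keeping the report separate from the filtering means the log can be inspected before deciding which variables belong on the black list.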
PLSR with scale = True produces wrong results. The r2 values obtained in the optimization are large negative numbers. The issue must be investigated ASAP.