plbenveniste / lung-treatment-response

Machine learning model for lung treatment response
MIT License

Data-preprocessing and merger of clinical and radiomics datasets #4

Open plbenveniste opened 3 months ago

plbenveniste commented 3 months ago

This issue details the work done to pre-process the clinical and radiomics datasets and to merge them. The work was done in three steps:

1. Pre-process the clinical dataset

This was done using data_preprocessing/1_clinical_data_preprocessing.py. The data was extracted from the csv file, the columns were renamed, and the extra lines containing statistics (i.e. everything after line 181) were removed. Each column was then converted to the appropriate type (string, numeric or date). After that, new columns were created to hold the length of time between the end of the treatment and the various relapses or death. Finally, the preprocessed data was saved to a csv file.
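As an illustration of these steps, here is a minimal pandas sketch. It is not the content of the actual script: the column names, the rename mapping, the `n_patient_rows` parameter and the tiny in-memory csv are all hypothetical and only stand in for the real clinical file.

```python
import io

import pandas as pd

# Hypothetical miniature of the raw clinical csv (illustrative columns only).
raw_csv = io.StringIO(
    "ID,End of treatment,Date of death,Age\n"
    "1,2020-01-10,2021-01-10,67\n"
    "2,2019-06-01,,72\n"
    "Mean,,,69.5\n"  # trailing statistics row that must be dropped
)


def preprocess_clinical(csv_file, n_patient_rows):
    """Rename columns, drop trailing statistics rows, fix types, add durations."""
    df = pd.read_csv(csv_file)
    # Hypothetical rename mapping.
    df = df.rename(columns={
        "ID": "subject_id",
        "End of treatment": "treatment_end",
        "Date of death": "death_date",
        "Age": "age",
    })
    # Keep only the patient rows; anything after them is summary statistics.
    df = df.iloc[:n_patient_rows].copy()
    # Convert each column to the appropriate type.
    df["subject_id"] = df["subject_id"].astype(str)
    df["age"] = pd.to_numeric(df["age"], errors="coerce")
    for col in ("treatment_end", "death_date"):
        df[col] = pd.to_datetime(df[col], errors="coerce")
    # Length of time between the end of treatment and the event, in days.
    df["days_to_death"] = (df["death_date"] - df["treatment_end"]).dt.days
    return df


clinical = preprocess_clinical(raw_csv, n_patient_rows=2)
print(clinical)
```

Patients without a recorded event keep a missing value in the duration column, which makes it easy to tell them apart later.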

2. Pre-process the radiomics dataset

This was done using data_preprocessing/2_radiomics_data_preprocessing.py. The radiomics dataset consisted of multiple csv files stored in a folder. First, we defined which columns should be kept and which should be removed. We then built a column combining the subject ID and the nodule ID. For each csv file, we had to look at the stored values because the measurements were taken several times.
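The loading and ID-building part could look roughly like the sketch below. The file layout, the `subject` and `nodule` column names, the `KEEP_COLUMNS` selection and the `subject_nodule_id` format are assumptions made for the example. The handling of repeated measurements is deliberately left out, since it depends on details of the real files.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical selection of columns to keep; everything else is discarded.
KEEP_COLUMNS = ["subject", "nodule", "MORPHOLOGICAL_Volume"]


def load_radiomics(folder):
    """Read every csv in `folder`, keep selected columns, build a combined ID."""
    frames = []
    for csv_path in sorted(Path(folder).glob("*.csv")):
        df = pd.read_csv(csv_path)
        frames.append(df[KEEP_COLUMNS])
    merged = pd.concat(frames, ignore_index=True)
    # Single key identifying one nodule of one subject (assumed format).
    merged["subject_nodule_id"] = (
        merged["subject"].astype(str) + "_" + merged["nodule"].astype(str)
    )
    return merged.drop(columns=["subject", "nodule"])


# Tiny made-up folder so that the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    header = "subject,nodule,MORPHOLOGICAL_Volume,unused\n"
    Path(tmp, "a.csv").write_text(header + "1,1,10.5,x\n")
    Path(tmp, "b.csv").write_text(header + "2,1,7.0,y\n2,2,3.2,z\n")
    radiomics = load_radiomics(tmp)

print(radiomics)
```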

3. Merge both datasets

This was done using data_preprocessing/3_merging_preprocessed_datasets.py. The final step was to merge both pre-processed datasets. This step was simple: for each patient nodule in the preprocessed clinical dataset, we found the corresponding entry in the preprocessed radiomics dataset.
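A merge of this kind can be sketched with `DataFrame.merge`. The key name `subject_nodule_id` and the two small dataframes are made up for the example; the actual script may use different keys or a different join strategy.

```python
import pandas as pd

# Made-up stand-ins for the two preprocessed datasets.
clinical = pd.DataFrame({
    "subject_nodule_id": ["1_1", "2_1", "3_1"],
    "death": [0, 1, 0],
})
radiomics = pd.DataFrame({
    "subject_nodule_id": ["1_1", "2_1", "2_2"],
    "MORPHOLOGICAL_Volume": [10.5, 7.0, 3.2],
})

# Keep only nodules present in both datasets; `validate` raises an error
# if a key is duplicated on either side.
merged = clinical.merge(
    radiomics, on="subject_nodule_id", how="inner", validate="one_to_one"
)

# An outer merge with `indicator=True` lists the nodules that found no match.
outer = clinical.merge(
    radiomics, on="subject_nodule_id", how="outer", indicator=True
)
unmatched = outer.loc[outer["_merge"] != "both", "subject_nodule_id"].tolist()

print(merged)
print("Unmatched nodules:", unmatched)
```

Checking the unmatched keys is a cheap way to catch ID formatting differences between the two sources before any modelling is done.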

plbenveniste commented 3 months ago

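In the script data_preprocessing/5_eliminating_radiomics_features.py, I explored the removal of certain features using the following three techniques, applied one after the other: variance thresholding, correlation thresholding between features, and thresholding on the correlation with the target.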

For each of the above, the thresholds were selected so that the end result had 10 features remaining.
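The three filters can be sketched as follows with pandas and scikit-learn. This is an illustration under assumptions, not the code of the script: the function names, the threshold values and the synthetic data are hypothetical, and the last step keeps the `n_features` most correlated features as a stand-in for tuning a threshold until a fixed number remains.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold


def drop_low_variance(X, threshold=0.0):
    """Remove features whose variance does not exceed `threshold`."""
    selector = VarianceThreshold(threshold=threshold)
    selector.fit(X)
    return X.loc[:, selector.get_support()]


def drop_correlated(X, threshold=0.9):
    """For each highly correlated pair, drop the feature that comes second."""
    corr = X.corr().abs()
    # Keep the strict upper triangle so every pair is looked at once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)


def keep_top_target_correlated(X, y, n_features):
    """Keep the features with the largest absolute correlation with y."""
    target_corr = X.corrwith(y)
    keep = target_corr.abs().sort_values(ascending=False).index[:n_features]
    return X[keep], target_corr[keep]


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = pd.DataFrame({
    "a": a,
    "b": a + 0.01 * rng.normal(size=200),  # near-duplicate of "a"
    "c": np.ones(200),                     # constant feature
    "d": rng.normal(size=200),
    "e": rng.normal(size=200),
})
y = pd.Series((a > 0).astype(int))

X_var = drop_low_variance(X)
X_corr = drop_correlated(X_var)
X_final, final_corr = keep_top_target_correlated(X_corr, y, n_features=2)
print(list(X_var.columns), list(X_corr.columns), list(X_final.columns))
```

Note that the last filter uses the target, so in a train/test setting it should be fitted on the training split only to avoid leaking label information into the evaluation.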

This was done for each model (death prediction, local relapse and distant relapse):

This is the output of the code for death prediction:

Initial number of features:  120

Model performance without any feature removal
ROC AUC Score:  0.7587412587412588
Brier score: 0.20039204887934706
Average precision: 0.6666666666666666
Average Recall: 0.36363636363636365
Accuracy Score:  0.7567567567567568
AUC-PR score: 0.6097461097461097

Number of features after variance thresholding: 95
Number of features removed by variance thresholding: 25

Model performance after variance thresholding
ROC AUC Score:  0.7517482517482518
Brier score: 0.22800967386850912
Average precision: 0.5
Average Recall: 0.36363636363636365
Accuracy Score:  0.7027027027027027
AUC-PR score: 0.5264127764127764

Number of features after correlation thresholding: 32
Number of features removed by correlation thresholding: 63

Model performance after feature selection based on correlation
ROC AUC Score:  0.7937062937062938
Brier score: 0.20656020649994028
Average precision: 0.5
Average Recall: 0.2727272727272727
Accuracy Score:  0.7027027027027027
AUC-PR score: 0.49447174447174447

Number of features after correlation with target thresholding: 10
Number of features removed by correlation with target thresholding: 22

Model performance after feature selection based on correlation with target
ROC AUC Score:  0.6783216783216783
Brier score: 0.2297800234826858
Average precision: 0.5555555555555556
Average Recall: 0.45454545454545453
Accuracy Score:  0.7297297297297297
AUC-PR score: 0.5861315861315861

Correlation of remaining features with target variable:
INTENSITY-BASED_MeanIntensity                     0.295758
INTENSITY-BASED_IntensitySkewness                -0.381733
INTENSITY-BASED_IntensityKurtosis                 0.297396
INTENSITY-BASED_10thIntensityPercentile           0.288538
INTENSITY-BASED_AreaUnderCurveCIVH                0.378669
INTENSITY-BASED_RootMeanSquareIntensity          -0.284614
INTENSITY-HISTOGRAM_IntensityHistogramMean        0.352289
INTENSITY-HISTOGRAM_IntensityHistogramVariance   -0.312502
NGTDM_Complexity                                 -0.283298
NGTDM_Strength                                   -0.357125
dtype: float64
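For reference, metrics of the kind printed above can be obtained from scikit-learn along the lines of the sketch below. The mapping between the printed labels and the functions is an assumption (in particular, "AUC-PR score" is computed here as the area under the precision-recall curve), and the `evaluate` helper, the 0.5 decision threshold and the toy labels are hypothetical.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    auc,
    brier_score_loss,
    precision_recall_curve,
    precision_score,
    recall_score,
    roc_auc_score,
)


def evaluate(y_true, y_prob, threshold=0.5):
    """Compute ranking, calibration and classification metrics."""
    # Hard predictions are only needed for precision, recall and accuracy.
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    precision, recall, _ = precision_recall_curve(y_true, y_prob)
    return {
        "roc_auc": roc_auc_score(y_true, y_prob),
        "brier": brier_score_loss(y_true, y_prob),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "accuracy": accuracy_score(y_true, y_pred),
        "auc_pr": auc(recall, precision),
    }


# Toy labels and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
scores = evaluate(y_true, y_prob)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

ROC AUC, the Brier score and AUC-PR use the predicted probabilities, whereas precision, recall and accuracy depend on the chosen decision threshold.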

This is the output of the code for local relapse prediction:

Initial number of features:  120

Model performance without any feature removal
ROC AUC Score:  0.275
Brier score: 0.15038857248809673
Average precision: 0.0
Average Recall: 0.0
Accuracy Score:  0.8378378378378378
AUC-PR score: 0.06756756756756757

Number of features after variance thresholding: 95
Number of features removed by variance thresholding: 25

Model performance after variance thresholding
ROC AUC Score:  0.45625000000000004
Brier score: 0.1338224235894703
/Users/plbenveniste/anaconda3/envs/venv_lung_response/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1517: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
Average precision: 0.0
Average Recall: 0.0
Accuracy Score:  0.8648648648648649
AUC-PR score: 0.5675675675675675

Number of features after correlation thresholding: 32
Number of features removed by correlation thresholding: 63

Model performance after feature selection based on correlation
ROC AUC Score:  0.50625
Brier score: 0.15946981532981797
Average precision: 0.0
Average Recall: 0.0
Accuracy Score:  0.8108108108108109
AUC-PR score: 0.06756756756756757

Number of features after correlation with target thresholding: 10
Number of features removed by correlation with target thresholding: 22

Model performance after feature selection based on correlation with target
ROC AUC Score:  0.58125
Brier score: 0.13970229495919118
Average precision: 0.0
Average Recall: 0.0
Accuracy Score:  0.8378378378378378
AUC-PR score: 0.06756756756756757

Correlation of remaining features with target variable:
MORPHOLOGICAL_Volume                           0.096985
INTENSITY-BASED_StandardDeviation              0.148638
INTENSITY-BASED_MaximumIntensity               0.074359
INTENSITY-BASED_IntensityInterquartileRange    0.100201
INTENSITY-BASED_IntensityRange                 0.081047
INTENSITY-BASED_IntensityBasedEnergy           0.112164
INTENSITY-BASED_TotalLesionGlycolysis         -0.168269
GLCM_DifferenceAverage                         0.115111
GLCM_DifferenceVariance                        0.126764
NGTDM_Contrast                                 0.103546
dtype: float64
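The UndefinedMetricWarning in the local relapse output is raised by scikit-learn when the model predicts no positive sample, so precision has a zero denominator. The sketch below reproduces that situation with made-up labels (4 positives among 40 samples, chosen only for illustration) and shows the `zero_division` parameter mentioned in the warning.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up, imbalanced labels: 4 positives among 40 samples.
y_true = [1] * 4 + [0] * 36
# A classifier that never predicts the positive class.
y_pred = [0] * 40

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 returns 0.0 without emitting the warning.
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred, zero_division=0)

print(f"accuracy={accuracy:.3f} precision={precision:.1f} recall={recall:.1f}")
```

With a rare positive class, accuracy can therefore stay high even when no positive case is detected, which is why precision, recall and the ranking metrics are worth reading alongside it.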

This is the output of the code for distant relapse prediction:

Initial number of features:  120

Model performance without any feature removal
ROC AUC Score:  0.5515873015873016
Brier score: 0.31458805858776767
Average precision: 0.18181818181818182
Average Recall: 0.2222222222222222
Accuracy Score:  0.5675675675675675
AUC-PR score: 0.2966147966147966

Number of features after variance thresholding: 95
Number of features removed by variance thresholding: 25

Model performance after variance thresholding
ROC AUC Score:  0.5873015873015872
Brier score: 0.29986175139174986
Average precision: 0.2857142857142857
Average Recall: 0.4444444444444444
Accuracy Score:  0.5945945945945946
AUC-PR score: 0.4326469326469326

Number of features after correlation thresholding: 32
Number of features removed by correlation thresholding: 63

Model performance after feature selection based on correlation
ROC AUC Score:  0.6468253968253967
Brier score: 0.28619899263551973
Average precision: 0.25
Average Recall: 0.3333333333333333
Accuracy Score:  0.5945945945945946
AUC-PR score: 0.37274774774774777

Number of features after correlation with target thresholding: 10
Number of features removed by correlation with target thresholding: 22

Model performance after feature selection based on correlation with target
ROC AUC Score:  0.5793650793650793
Brier score: 0.2600482894882669
Average precision: 0.3333333333333333
Average Recall: 0.4444444444444444
Accuracy Score:  0.6486486486486487
AUC-PR score: 0.45645645645645644

Correlation of remaining features with target variable:
MORPHOLOGICAL_Compacity                       -0.157922
MORPHOLOGICAL_CentreOfMassShift                0.059428
INTENSITY-BASED_IntensityInterquartileRange    0.058027
INTENSITY-BASED_AreaUnderCurveCIVH             0.057422
GLCM_SumVariance                              -0.055135
GLCM_ClusterShade                              0.072939
GLCM_ClusterProminence                        -0.102530
GLRLM_RunLengthNonUniformity                  -0.064647
NGTDM_Contrast                                 0.122924
GLSZM_LargeZoneLowGreyLevelEmphasis           -0.066517
dtype: float64