Evaluating existing SCT models

plbenveniste commented 1 week ago

This issue reports the work done to evaluate the existing models.

The existing models are the following:

sct_deepseg_lesion
sct_deepseg -t seg_sc_ms_lesion_stir_psir
sct_deepseg -t seg_ms_lesion_mp2rage

plbenveniste commented 1 week ago

I created the file evaluation/test_sct_models.py to evaluate the predictions of the 3 lesion segmentation models in SCT.

It computes the Dice score, lesion PPV, lesion sensitivity and lesion F1 score.
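As a rough illustration of these four metrics, here is a minimal, dependency-free sketch. It is not the code in evaluation/test_sct_models.py: masks are represented as sets of voxel coordinates, lesions are taken to be 6-connected components, and a lesion counts as detected on any voxel overlap. The function names (`dice`, `lesions`, `lesion_metrics`) and the empty-mask conventions are all assumptions made for the example.

```python
from collections import deque

# 6-connectivity in 3D (assumption: the real script may use another connectivity)
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def dice(pred, gt):
    """Voxel-wise Dice between two sets of (x, y, z) voxel coordinates."""
    if not pred and not gt:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2 * len(pred & gt) / (len(pred) + len(gt))


def lesions(mask):
    """Split a voxel set into connected components, one set per lesion."""
    remaining, components = set(mask), []
    while remaining:
        seed = remaining.pop()
        component, queue = {seed}, deque([seed])
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in NEIGHBOURS:
                neighbour = (x + dx, y + dy, z + dz)
                if neighbour in remaining:
                    remaining.remove(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


def lesion_metrics(pred, gt):
    """Lesion-wise (PPV, sensitivity, F1); any voxel overlap counts as a match."""
    pred_lesions, gt_lesions = lesions(pred), lesions(gt)
    tp_pred = sum(1 for p in pred_lesions if p & gt)  # predicted lesions touching the GT
    tp_gt = sum(1 for g in gt_lesions if g & pred)    # GT lesions that were detected
    ppv = tp_pred / len(pred_lesions) if pred_lesions else 0.0
    sensitivity = tp_gt / len(gt_lesions) if gt_lesions else 0.0
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity) if ppv + sensitivity else 0.0
    return ppv, sensitivity, f1


# Toy example: two GT lesions, two predicted lesions, one of each overlapping
gt = {(0, 0, 0), (1, 0, 0), (5, 5, 5)}
pred = {(1, 0, 0), (2, 0, 0), (9, 9, 9)}
print(dice(pred, gt), lesion_metrics(pred, gt))
```

With real NIfTI volumes the same logic would more likely be written with array operations and a labelling routine; the set-based form is only meant to make the definitions explicit.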

It is currently running on the test set using:

python evaluation/test_sct_models.py --msd-data-path ~/net/ms-lesion-agnostic/msd_data/dataset_2024-07-24_seed42_lesionOnly.json --output-path ~/net/ms-lesion-agnostic/evaluating_existing_models/evaluation_output

plbenveniste commented 1 week ago

Because the initial code was taking too long to run (around 90 h), I decided to split it into 3 files, one per model:

python evaluation/test_sct_deepseg_lesion.py --msd-data-path ~/net/ms-lesion-agnostic/msd_data/dataset_2024-07-24_seed42_lesionOnly.json --output-path ~/net/ms-lesion-agnostic/evaluating_existing_models/evaluation_sct_deepseg_lesion
python evaluation/test_sct_deepseg_psir_stir.py --msd-data-path ~/net/ms-lesion-agnostic/msd_data/dataset_2024-07-24_seed42_lesionOnly.json --output-path ~/net/ms-lesion-agnostic/evaluating_existing_models/evaluation_sct_deepseg_psir-stir
python evaluation/test_sct_deepseg_mp2rage.py --msd-data-path ~/net/ms-lesion-agnostic/msd_data/dataset_2024-07-24_seed42_lesionOnly.json --output-path ~/net/ms-lesion-agnostic/evaluating_existing_models/evaluation_sct_deepseg_mp2rage
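If the machine has enough memory and compute, the three scripts could also be launched side by side rather than one after the other. The sketch below is only one possible way to do that and is not something described in this issue: `BASE`, `JOBS`, `build_commands` and `run_all` are hypothetical names, while the script names and arguments are copied from the three commands above.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Paths copied from the commands above; "~" is expanded explicitly because
# subprocess does not go through a shell here.
BASE = Path("~/net/ms-lesion-agnostic").expanduser()
MSD_JSON = BASE / "msd_data" / "dataset_2024-07-24_seed42_lesionOnly.json"
OUT_ROOT = BASE / "evaluating_existing_models"

# (script, output folder) pairs, one per model
JOBS = [
    ("evaluation/test_sct_deepseg_lesion.py", "evaluation_sct_deepseg_lesion"),
    ("evaluation/test_sct_deepseg_psir_stir.py", "evaluation_sct_deepseg_psir-stir"),
    ("evaluation/test_sct_deepseg_mp2rage.py", "evaluation_sct_deepseg_mp2rage"),
]


def build_commands():
    """Return one argument list per evaluation script."""
    return [
        [sys.executable, script,
         "--msd-data-path", str(MSD_JSON),
         "--output-path", str(OUT_ROOT / out_dir)]
        for script, out_dir in JOBS
    ]


def run_all(commands, max_workers=3):
    """Run each command in its own process; return the exit codes in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return [result.returncode for result in pool.map(subprocess.run, commands)]


# Usage (from the repository root): exit_codes = run_all(build_commands())
```

Threads are enough here because each worker only waits on a child process; the heavy lifting happens in the three separate Python processes.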

plbenveniste commented 1 week ago

For the sct_deepseg_lesion model

I then plotted the desired curves using:

python evaluation/plot_performance.py --pred-dir-path ~/net/ms-lesion-agnostic/evaluating_existing_models/evaluation_sct_deepseg_lesion/ --data-json-path ~/net/ms-lesion-agnostic/msd_data/dataset_2024-07-24_seed42_lesionOnly.json  --split test

Output:

Dice score per contrast (mean ± std)
PSIR (n=60): 0.0068 ± 0.0098
STIR (n=11): 0.3676 ± 0.2831
T2star (n=83): 0.5117 ± 0.2076
T2w (n=358): 0.3206 ± 0.2679
UNIT1 (n=57): 0.0070 ± 0.0084

[Figure: dice_scores_contrast]
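For reference, a per-contrast summary in the format printed above can be produced along these lines. This is a sketch, not the code of evaluation/plot_performance.py: the `summarise` function and its input format are hypothetical, the scores in the example are toy values, and the population standard deviation is an assumption (the script may use the sample standard deviation instead).

```python
from collections import defaultdict
from statistics import mean, pstdev


def summarise(scores, metric_name="Dice"):
    """Build the summary lines from (contrast, value) pairs, one pair per image."""
    by_contrast = defaultdict(list)
    for contrast, value in scores:
        by_contrast[contrast].append(value)

    lines = [f"{metric_name} score per contrast (mean ± std)"]
    for contrast in sorted(by_contrast):
        values = by_contrast[contrast]
        # pstdev = population standard deviation (assumed convention)
        lines.append(
            f"{contrast} (n={len(values)}): {mean(values):.4f} ± {pstdev(values):.4f}"
        )
    return lines


# Toy values, not taken from the evaluation
for line in summarise([("T2w", 0.2), ("T2w", 0.4), ("PSIR", 0.0)]):
    print(line)
```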

Here is the output for the other metrics

```console
PPV score per contrast (mean ± std)
PSIR (n=60): 0.0222 ± 0.1354
STIR (n=11): 0.4864 ± 0.4037
T2star (n=83): 0.6010 ± 0.2895
T2w (n=358): 0.6079 ± 0.4153
UNIT1 (n=57): 0.0097 ± 0.0526

F1 score per contrast (mean ± std)
PSIR (n=60): 0.0077 ± 0.0441
STIR (n=11): 0.4037 ± 0.3222
T2star (n=83): 0.6396 ± 0.2281
T2w (n=358): 0.5059 ± 0.3690
UNIT1 (n=57): 0.0088 ± 0.0464

Sensitivity score per contrast (mean ± std)
PSIR (n=60): 0.0395 ± 0.1839
STIR (n=11): 0.4500 ± 0.3738
T2star (n=83): 0.8102 ± 0.2478
T2w (n=358): 0.5221 ± 0.4007
UNIT1 (n=57): 0.0085 ± 0.0458
```

![f1_scores_contrast](https://github.com/user-attachments/assets/8d9ea969-d603-413c-8a2d-a08e64a192d3)
![ppv_scores_contrast](https://github.com/user-attachments/assets/32ca8c6e-ffc1-4ccb-8239-37c192bc422c)
![sensitivity_scores_contrast](https://github.com/user-attachments/assets/68887091-f328-4a16-8ac1-26a4d5435f07)

For the MP2RAGE model

python evaluation/plot_performance.py --pred-dir-path ~/net/ms-lesion-agnostic/evaluating_existing_models/evaluation_sct_deepseg_mp2rage/ --data-json-path ~/net/ms-lesion-agnostic/msd_data/dataset_2024-07-24_seed42_lesionOnly.json  --split test

Output:

Dice score per contrast (mean ± std)
PSIR (n=60): 0.2135 ± 0.1760
STIR (n=11): 0.0110 ± 0.0126
T2star (n=83): 0.0074 ± 0.0223
T2w (n=358): 0.0067 ± 0.0127
UNIT1 (n=57): 0.4549 ± 0.1944

[Figure: dice_scores_contrast]

Output for the other metrics:

```console
PPV score per contrast (mean ± std)
PSIR (n=60): 0.3733 ± 0.2918
STIR (n=11): 0.0000 ± 0.0000
T2star (n=83): 0.0000 ± 0.0000
T2w (n=358): 0.1425 ± 0.3500
UNIT1 (n=57): 0.3298 ± 0.1770

F1 score per contrast (mean ± std)
PSIR (n=60): 0.3943 ± 0.2621
STIR (n=11): 0.0000 ± 0.0000
T2star (n=83): 0.0000 ± 0.0000
T2w (n=358): 0.0000 ± 0.0000
UNIT1 (n=57): 0.4422 ± 0.1937

Sensitivity score per contrast (mean ± std)
PSIR (n=60): 0.5506 ± 0.3480
STIR (n=11): 0.0000 ± 0.0000
T2star (n=83): 0.0000 ± 0.0000
T2w (n=358): 0.0000 ± 0.0000
UNIT1 (n=57): 0.8224 ± 0.2470
```

![f1_scores_contrast](https://github.com/user-attachments/assets/f1486bf5-1b40-4e2b-acde-bbe44ace85dc)
![ppv_scores_contrast](https://github.com/user-attachments/assets/cd2c7400-a383-4cfb-a6fc-4821b8beccb7)
![sensitivity_scores_contrast](https://github.com/user-attachments/assets/f0b55b8d-cf54-4510-9359-a1df4497a7c8)

For the PSIR and STIR model

python evaluation/plot_performance.py --pred-dir-path ~/net/ms-lesion-agnostic/evaluating_existing_models/evaluation_sct_deepseg_psir-stir/ --data-json-path ~/net/ms-lesion-agnostic/msd_data/dataset_2024-07-24_seed42_lesionOnly.json  --split test

Output:

Dice score per contrast (mean ± std)
PSIR (n=60): 0.5701 ± 0.2660
STIR (n=11): 0.5984 ± 0.2237
T2star (n=83): 0.1312 ± 0.1538
T2w (n=358): 0.2213 ± 0.2134
UNIT1 (n=57): 0.0023 ± 0.0016

[Figure: dice_scores_contrast]

For the other metrics:

```console
PPV score per contrast (mean ± std)
PSIR (n=60): 0.6672 ± 0.3478
STIR (n=11): 0.6605 ± 0.3430
T2star (n=83): 0.1235 ± 0.1475
T2w (n=358): 0.4306 ± 0.4165
UNIT1 (n=57): 0.0000 ± 0.0000

F1 score per contrast (mean ± std)
PSIR (n=60): 0.6381 ± 0.3240
STIR (n=11): 0.6494 ± 0.2915
T2star (n=83): 0.1815 ± 0.1940
T2w (n=358): 0.3392 ± 0.3560
UNIT1 (n=57): 0.0000 ± 0.0000

Sensitivity score per contrast (mean ± std)
PSIR (n=60): 0.7138 ± 0.3415
STIR (n=11): 0.7462 ± 0.3294
T2star (n=83): 0.4796 ± 0.4512
T2w (n=358): 0.5556 ± 0.4181
UNIT1 (n=57): 0.0000 ± 0.0000
```

![f1_scores_contrast](https://github.com/user-attachments/assets/5a605d96-cafb-4c74-afc8-9c6eb8049c19)
![ppv_scores_contrast](https://github.com/user-attachments/assets/30260cd4-6d59-4739-b17e-de6a7ff85968)
![sensitivity_scores_contrast](https://github.com/user-attachments/assets/deff0629-8a7d-4e25-bf4f-3649fd4814d6)
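To compare the three models at a glance, the mean Dice scores reported above for the test split can be put side by side. The snippet below is just a convenience sketch: the numbers are copied from the three Dice outputs in this comment, while the dictionary layout and the `best_model_per_contrast` helper are hypothetical.

```python
# Mean Dice per contrast on the test split, copied from the outputs above
DICE = {
    "sct_deepseg_lesion": {
        "PSIR": 0.0068, "STIR": 0.3676, "T2star": 0.5117, "T2w": 0.3206, "UNIT1": 0.0070,
    },
    "seg_ms_lesion_mp2rage": {
        "PSIR": 0.2135, "STIR": 0.0110, "T2star": 0.0074, "T2w": 0.0067, "UNIT1": 0.4549,
    },
    "seg_sc_ms_lesion_stir_psir": {
        "PSIR": 0.5701, "STIR": 0.5984, "T2star": 0.1312, "T2w": 0.2213, "UNIT1": 0.0023,
    },
}


def best_model_per_contrast(dice):
    """Map each contrast to the model with the highest mean Dice."""
    contrasts = next(iter(dice.values()))
    return {c: max(dice, key=lambda model: dice[model][c]) for c in contrasts}


for contrast, model in best_model_per_contrast(DICE).items():
    print(f"{contrast}: {model} ({DICE[model][contrast]:.4f})")
# PSIR: seg_sc_ms_lesion_stir_psir (0.5701)
# STIR: seg_sc_ms_lesion_stir_psir (0.5984)
# T2star: sct_deepseg_lesion (0.5117)
# T2w: sct_deepseg_lesion (0.3206)
# UNIT1: seg_ms_lesion_mp2rage (0.4549)
```

Each model comes out on top for different contrasts, and no single model scores well across all five.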

plbenveniste commented 6 days ago

I then evaluated the SCT models for segmenting spinal lesions on the external testing set (ms-basel-2018 and ms-basel-2020).

For sct_deepseg_lesion

I ran the following command:

python evaluation/test_sct_lesion_external_dataset.py --input-folder ~/net/ms-lesion-agnostic/data --output-path ~/net/ms-lesion-agnostic/evaluating_existing_models/external_evaluation/

Output:


Dice score per contrast (mean ± std)
PD (n=31): 0.0046 ± 0.0114
T1w (n=22): 0.0673 ± 0.2120
T2w (n=24): 0.3272 ± 0.3372

[Figure: dice_scores_contrast]

Here is the output for the other metrics

```console
PPV score per contrast (mean ± std)
PD (n=31): 0.0613 ± 0.2076
T1w (n=22): 0.1136 ± 0.3060
T2w (n=24): 0.3993 ± 0.3877

F1 score per contrast (mean ± std)
PD (n=31): 0.0189 ± 0.0651
T1w (n=22): 0.0657 ± 0.2186
T2w (n=24): 0.4000 ± 0.3717

Sensitivity score per contrast (mean ± std)
PD (n=31): 0.0130 ± 0.0461
T1w (n=22): 0.2849 ± 0.4499
```

![f1_scores_contrast](https://github.com/user-attachments/assets/3dc470aa-e264-4044-b491-bf8b7868bc2e)
![ppv_scores_contrast](https://github.com/user-attachments/assets/c83bf1b6-ad95-4d79-a9d6-9bd749547dc3)
![sensitivity_scores_contrast](https://github.com/user-attachments/assets/9f6d237f-b510-4d8f-bcb4-b29155f41181)

For sct_deepseg mp2rage

I ran the following command:

python evaluation/test_sct_mp2rage_external_dataset.py --input-folder ~/net/ms-lesion-agnostic/data --output-path ~/net/ms-lesion-agnostic/evaluating_existing_models/external_evaluation/

Output:

Dice score per contrast (mean ± std)
PD (n=31): 0.0034 ± 0.0118
T1w (n=22): 0.0559 ± 0.2116
T2w (n=24): 0.2864 ± 0.4308

[Figure: dice_scores_contrast]

Here is the output for the other metrics

```console
PPV score per contrast (mean ± std)
PD (n=31): 0.0000 ± 0.0000
T1w (n=22): 0.0455 ± 0.2132
T2w (n=24): 0.2500 ± 0.4423

F1 score per contrast (mean ± std)
PD (n=31): 0.0000 ± 0.0000
T1w (n=22): 0.0455 ± 0.2132
T2w (n=24): 0.2500 ± 0.4423

Sensitivity score per contrast (mean ± std)
PD (n=31): 0.0000 ± 0.0000
T1w (n=22): 0.2727 ± 0.4558
```

![f1_scores_contrast](https://github.com/user-attachments/assets/04e38d19-21b4-4c19-b9fa-c55e198569fa)
![ppv_scores_contrast](https://github.com/user-attachments/assets/be9505b1-62a3-4e0c-828b-7939e3ca1e78)
![sensitivity_scores_contrast](https://github.com/user-attachments/assets/e9b18e8c-4ec1-4eb4-b81d-7f8a231fe29e)

For sct_deepseg psir-stir

I ran the following command:

python evaluation/test_sct_psir-stir_external_dataset.py --input-folder ~/net/ms-lesion-agnostic/data --output-path ~/net/ms-lesion-agnostic/evaluating_existing_models/external_evaluation/

Output:


Dice score per contrast (mean ± std)
PD (n=31): 0.0036 ± 0.0119
T1w (n=22): 0.2774 ± 0.4529
T2w (n=24): 0.2510 ± 0.3996

[Figure: dice_scores_contrast]

Here is the output for the other metrics

```console
PPV score per contrast (mean ± std)
PD (n=31): 0.0000 ± 0.0000
T1w (n=22): 0.2727 ± 0.4558
T2w (n=24): 0.2792 ± 0.4128

F1 score per contrast (mean ± std)
PD (n=31): 0.0000 ± 0.0000
T1w (n=22): 0.2727 ± 0.4558
T2w (n=24): 0.2812 ± 0.4154

Sensitivity score per contrast (mean ± std)
PD (n=31): 0.0000 ± 0.0000
T1w (n=22): 0.2727 ± 0.4558
```

![f1_scores_contrast](https://github.com/user-attachments/assets/6fe18614-53d5-4f05-b3b5-47c25f3e0c98)
![ppv_scores_contrast](https://github.com/user-attachments/assets/8e114f7f-7de5-4fd9-b762-b249ec5a2dc3)
![sensitivity_scores_contrast](https://github.com/user-attachments/assets/77c30a30-78b0-4cc6-bc44-1665f4e80796)
