mahmoodlab / CLAM

Open source tools for computational pathology - Nature BME
http://clam.mahmoodlab.org

evaluation script vs evaluation during training #286

Open mhoibo opened 1 week ago

mhoibo commented 1 week ago

Hi, thank you for your great work. I have been using CLAM for a classification task and noticed that during training the validation loss decreases and the final accuracies are fairly good, but when I evaluate with eval.py (on the same split of the evaluation set) I get completely different AUC and accuracy values. Do you know why that could be?

The summary file in the results folder, the TensorBoard logs, and the .pkl files created during training all agree with each other, but when I reload the model checkpoints with the eval.py script I get very different results on the same data. For training and for eval.py I used commands like the ones below (example):

CUDA_VISIBLE_DEVICES=1 python main.py --drop_out 0.25 --max_epochs 100 --early_stopping --lr 2e-4 --k 10 --exp_code <model name> --results_dir <path to save> --weighted_sample --bag_loss ce --inst_loss svm --task <task> --model_type clam_sb --log_data --data_root_dir <path to features> --embed_dim 1024

python eval.py --k 10 --models_exp_code <model name> --save_exp_code <path to save eval results> --task <task> --model_type clam_sb --results_dir <path to training results> --data_root_dir <path to features> --embed_dim 1024 --drop_out 0.25 --model_size small --split val
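One thing I am also double-checking on my side is whether eval.py rebuilds the model with exactly the same architecture flags that were used for training (--model_type, --model_size, --drop_out, --embed_dim), and whether the saved weights actually line up with the freshly built model. A minimal sketch of that check (the checkpoint path and the CLAM_SB constructor arguments are assumptions from my setup, not something prescribed by the repo):

```python
# Sanity check: do the saved weights line up with the model that eval.py rebuilds?
# Checkpoint path and constructor arguments below are assumptions; adjust to your run.
import torch
from models.model_clam import CLAM_SB

ckpt_path = "results/<model name>_s1/s_0_checkpoint.pt"  # hypothetical fold-0 checkpoint
state = torch.load(ckpt_path, map_location="cpu")

# Mirror whatever --model_size / --drop_out / --embed_dim / n_classes you trained with.
model = CLAM_SB(size_arg="small", dropout=0.25, n_classes=2, embed_dim=1024)

model_keys = set(model.state_dict().keys())
ckpt_keys = set(state.keys())
print("in checkpoint but not in model:", sorted(ckpt_keys - model_keys))
print("in model but not in checkpoint:", sorted(model_keys - ckpt_keys))

# Shape mismatches usually mean a different model_size or embed_dim at eval time.
for k in sorted(model_keys & ckpt_keys):
    if model.state_dict()[k].shape != state[k].shape:
        print("shape mismatch:", k, model.state_dict()[k].shape, state[k].shape)
```

If there are missing/unexpected keys or shape mismatches, the eval-time flags do not reproduce the training-time architecture, which could explain the different numbers.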

I would appreciate any pointers on how to solve or understand this.

I checked that the splits are the same. I also checked that the checkpoints are not updated at every epoch during training, so the best (early-stopping) model should be what is saved in the checkpoint, and that same checkpoint is then used both for the evaluation at the end of training and by eval.py. Am I correct?
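If the architecture check comes back clean, I was also planning to compare the per-slide probabilities logged during training with the ones eval.py writes out, to see whether they diverge for every slide or only in how the summary metric is computed. A rough sketch, assuming the training run saved per-fold results as split_{k}_results.pkl (a dict keyed by slide id with 'prob' and 'label') and eval.py wrote a fold_{k}.csv with p_0/p_1 columns; the exact file names and layout here are assumptions from my own runs:

```python
# Compare per-slide predictions: training-time logging vs eval.py output.
# File names and result structure below are assumptions; adjust to your runs.
import pickle
import numpy as np
import pandas as pd

fold = 0
with open(f"results/<model name>_s1/split_{fold}_results.pkl", "rb") as f:
    train_results = pickle.load(f)  # expected: {slide_id: {"prob": ..., "label": ...}, ...}

eval_df = pd.read_csv(f"eval_results/EVAL_<save_exp_code>/fold_{fold}.csv")  # hypothetical path
prob_cols = [c for c in eval_df.columns if c.startswith("p_")]

for _, row in eval_df.iterrows():
    sid = str(row["slide_id"])
    if sid not in train_results:
        print("slide missing from training results:", sid)
        continue
    train_prob = np.asarray(train_results[sid]["prob"]).ravel()
    eval_prob = row[prob_cols].to_numpy(dtype=float)
    if not np.allclose(train_prob, eval_prob, atol=1e-3):
        print(sid, "train:", np.round(train_prob, 3), "eval:", np.round(eval_prob, 3))
```

If the probabilities match slide for slide but the summary AUC/accuracy still differ, the problem is in how the metrics are aggregated; if they differ everywhere, it points back at the checkpoint or at mismatched flags.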