Hi @jonasboh and thank you for opening this issue!
Indeed, the inference interface has changed in the meantime, and you're right that the WORC example is incomplete regarding evaluation and inference.
I will get back to you later today with an extended tutorial that will include an evaluation of the test subset and inference methods for new data.
Thanks, Piotr
Hey @jonasboh
I've pushed changes with #31 that extend the WORC example; could you check if that is what you need? They include the evaluation of each split: by default, the full dataset is split into training, validation, and test sets. The idea is to train on the training set, select the best model and preprocessing on the validation set, and leave the test set for a final, unbiased evaluation of the model.
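For illustration only, here is a minimal sketch of such a three-way split using scikit-learn (this is not the AutoRadiomics API; the file name, label column, and split ratios below are assumptions):

```python
# Illustrative sketch only - not the AutoRadiomics API.
# Assumes a feature table with a binary "label" column (hypothetical names).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("worc_features.csv")  # hypothetical feature table

# Split off the test set first, then carve a validation set out of the rest.
train_val_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)
train_df, val_df = train_test_split(
    train_val_df,
    test_size=0.25,  # 0.25 of 0.8 = 0.2 of the full dataset
    stratify=train_val_df["label"],
    random_state=42,
)

# Train candidate models on train_df, pick the best one on val_df,
# and evaluate on test_df only once, for the final unbiased estimate.
```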
The other option is to run inference with a whole trained pipeline (including the feature extractor, preprocessor, and model) on separate, new data. Currently you'd need to provide paths to the image and the segmentation for this.
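As a rough conceptual sketch of what such an inference step involves (not the autorad inference API itself; this uses pyradiomics and a saved scikit-learn model directly, and all file paths are assumptions; the updated notebook shows the actual calls):

```python
# Conceptual sketch only - not the autorad inference API.
# Extract features from a new image/segmentation pair and feed them
# to a previously saved model. All paths are hypothetical.
import joblib
import pandas as pd
from radiomics import featureextractor

image_path = "new_case/image.nii.gz"
segmentation_path = "new_case/segmentation.nii.gz"

# Extract radiomics features for the new case.
extractor = featureextractor.RadiomicsFeatureExtractor()
raw_features = extractor.execute(image_path, segmentation_path)

# Keep the feature values, dropping pyradiomics' diagnostics entries.
features = {k: v for k, v in raw_features.items() if not k.startswith("diagnostics_")}
X_new = pd.DataFrame([features])

# Load a saved preprocessing + model pipeline and predict for the new case.
pipeline = joblib.load("results/best_pipeline.joblib")  # hypothetical file
print(pipeline.predict_proba(X_new))
```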
In order for the updated notebook to work, you'd need to install autorad from source, with `pip install -e .` run from the repo root directory.
Hi @pwoznicki
thank you for the correction. It looks much better and is more self-explanatory now. I set a threshold for the waterfall plot and then it also performed nicely.
I was just wondering how I can reproduce the performance reported in the paper, as it is the same dataset (WORC). I ran the train_with_cross_validation_test setting on the entire WORC dataset. For example, for Desmoid I think I got 0.75 AUROC on the test set with the example, while you report 0.78 on CV and 0.90 on the test set. I guess this needs some more intensive optimization?
Hi @jonasboh
To be honest, the sample size in WORC is fairly small; hence the choice of the random seed, and thus the actual split into train/validation/test sets, can vary quite a bit between your experiments and ours, even with the same source code. If you want to dive deeper into this effect, I recommend "Medical Risk Prediction Models With Ties to Machine Learning" by Thomas A. Gerds and Michael W. Kattan.
We could maybe fix the random seed for the notebooks (@pwoznicki do we already?) to make the results reproducible.
Kind regards
Hey @jonasboh,
glad to hear the previous fix helped.
The current version should perform no worse, and possibly better, than the version used in our 2022 paper (it includes a few fixes and improvements). That said, it will be hard to reproduce the exact results from the paper with it, because of the different splits and those fixes. However, we have a second repository dedicated to the experiments from the paper; I will make it public and let you know :) It has instructions and uses the appropriate version of AutoRadiomics from April 2022.
@laqua-stack regarding the seed: the Trainer class takes it as an argument on init, and from my testing it seemed that this should be enough to make the whole pipeline reproducible. It would be great if at some point you could verify that this is the case.
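For reference, a minimal, generic seeding sketch (library-agnostic, not the autorad-specific mechanism; the seed value is arbitrary):

```python
# Generic reproducibility sketch - not specific to the autorad Trainer.
# Fixing these seeds is usually what makes splits and training repeatable.
import os
import random

import numpy as np

SEED = 42  # arbitrary value
os.environ["PYTHONHASHSEED"] = str(SEED)
random.seed(SEED)
np.random.seed(SEED)

# In AutoRadiomics, the same seed would additionally be passed to the
# Trainer when it is instantiated (see the Trainer's init arguments).
```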
As a follow-up to my previous answer, the exact experiments for the 2022 paper can now be found at https://github.com/pwoznicki/radiomics-benchmark
I hope that helps, feel free to open an issue in there in case something is unclear.
Thanks
Hello,
I just noticed that your small example in the Supplement, referred to as Figure S1, does not work, as the Inferrer function has been modified in the meantime.
I was just wondering how we can evaluate the best trained model on the test data, as this does not seem to be included in the WORC example.
Best, Jonas