mikevoets / jama16-retina-replication

JAMA 2016; 316(22) Replication Study
https://doi.org/10.1371/journal.pone.0217541
MIT License

0.94 AUC not reproducible #12

Open fbmoreira opened 5 years ago

fbmoreira commented 5 years ago

Following your README step by step and training several models, only one has reached 0.76 AUC on EyePACS so far. It's not clear to me whether the reported 0.94 AUC used a single model or an ensemble. I'll try an ensemble, but most of the models I am running end up around 0.52 AUC, which means they are likely not contributing much.

Are there any undocumented reasons why the code would not reproduce the paper's results? Maybe a different seed for the distribution of images into the folders? I used the --only_gradable flag; it's also not clear whether your paper used all images or only the gradable ones.

Thank you!

mikevoets commented 5 years ago

When it comes to AUC, we also experienced fluctuating results; for example, training sometimes stopped at 0.60 AUC. Please run it a couple more times to get better results. This code is exactly the code that produced the results in our paper, without any modifications. For the latest version of our paper we used all images.

The original paper proposes evaluating the linear average of predictions from an ensemble of 10 trained models. To create such an ensemble from the code in this repo, use the -lm parameter. To specify an ensemble, the model paths should be comma-separated or match a regular expression, for example: -lm=./tmp/model-1,./tmp/model-2,./tmp/model-3
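For example, to evaluate an ensemble of three models on the EyePACS test set, something along these lines should work (the model paths are just examples; double-check the flag combination against evaluate.py):

```sh
# Evaluate the linear average of predictions from three trained models on
# the EyePACS test set. The -e and -lm flags are the ones used in this
# thread; the ./tmp/model-* paths are placeholders for wherever your
# trained models were saved.
./evaluate.py -e -lm=./tmp/model-1,./tmp/model-2,./tmp/model-3
```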

fbmoreira commented 5 years ago

Do you believe running eyepacs.sh --redistribute before training new models could result in more varied models and thus a better ensemble?

Thank you for your answer!

mikevoets commented 5 years ago

Yes, in combination with applying a different seed with the --seed parameter. Otherwise there will be no difference between the distributions.

In our study we did not redistribute, though. We distributed only once, with the default seed in the script, and all our models and the ensemble were trained on that single image distribution.
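If you do want to try it, a sketch would look something like this (the seed value is arbitrary; check eyepacs.sh for the exact flag syntax):

```sh
# Reshuffle the EyePACS images into new train/validation/test folders with
# a non-default seed, then train a model on that new distribution.
# The seed value 1234 is arbitrary; verify the flag syntax in eyepacs.sh.
./eyepacs.sh --redistribute --seed=1234
python train.py
```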

fbmoreira commented 5 years ago

Sorry for the slow reply. The maximum AUC I got in my tests was 0.90, using evaluate.py on EyePACS with the default seed. I suppose I could eventually reach 0.94 or higher, but I think it will be better to move to an ensemble to get better results more quickly.

Sayyam-Jain commented 5 years ago

> Sorry for the slow reply. The maximum AUC I got in my tests was 0.90, using evaluate.py on EyePACS with the default seed. I suppose I could eventually reach 0.94 or higher, but I think it will be better to move to an ensemble to get better results more quickly.

Can you please explain how you achieved similar results? Thanks

fbmoreira commented 5 years ago

I trained about 10 models with the default seed and ran evaluate.py on the ones that got the highest AUC during training cross-validation. One of the models, which had 0.77 AUC during cross-validation, gave me 0.90 AUC when running ./evaluate.py -e...

So I do not have any specific tip: just train more models until you get something good :P
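Roughly, my workflow was something like the sketch below (the model output paths are illustrative, and where train.py actually saves its checkpoints may differ on your setup):

```sh
# Train ten models on the same image distribution; each run starts from a
# different random initialization, so the resulting models differ.
for i in $(seq 1 10); do
  python train.py
  # Keep each run's checkpoint under its own path (adjust the source path
  # to wherever train.py saves the model on your machine).
  mv ./tmp/model ./tmp/model-$i
done

# Evaluate each model individually on the EyePACS test set and keep the
# ones with the highest AUC.
for i in $(seq 1 10); do
  ./evaluate.py -e -lm=./tmp/model-$i
done
```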

mikevoets commented 5 years ago

@fbmoreira You are experiencing exactly what we experienced when we trained our models. Some are bad (around 0.70 AUC), most are OK-ish (~0.85 AUC), and some are better and exceed 0.90 AUC on evaluation. What we learned is that combining all these models (both the bad and the better ones) in an ensemble always yields a better result.

Sayyam-Jain commented 5 years ago

@fbmoreira Apologies for the basic question (still new to deep learning), but can you please explain what you mean by training different models? Did you use different neural network architectures or something else? Please explain.

mikevoets commented 5 years ago

@Sayyam-Jain When you run python train.py twice, you'll train two different models. That's because the weights of the network are initialized differently (randomly) every time. Because the starting points of the network's parameters differ, each run essentially produces a different network with different results. The neural network architecture stays the same. Hope this explains it well.

@fbmoreira NB: The random seed random.seed(432) is not intended to fix the initialization of the weights, which is why you get different results every time; that is intentional. The seed is only meant to fix the shuffling order of the various data augmentations, here: https://github.com/mikevoets/jama16-retina-replication/blob/master/lib/dataset.py#L42.

fbmoreira commented 5 years ago

I didn't say a thing about fixed initialization o_O I knew the seed was only for the dataset partition, since it is in the eyepacs.sh script and has nothing to do with the network itself. I did read your augmentation code; I found it curious that you did not perform vertical flips as well.

Reading your code, it was clear to me that you initialize the Inception v3 model with ImageNet weights, and I assume the only (small) source of randomness is in the top-layer initialization.

I think your results were better when omitting --only_gradable because the introduced noise might have helped the network generalize better, hence your higher AUC. Another thing that might help in the future is introducing Gaussian or salt-and-pepper noise as a form of augmentation, although hemorrhages and microaneurysms might be small enough to be indistinguishable from the noise, so I am not sure.

mikevoets commented 5 years ago

Ah ok, excuse me for my misunderstanding!

Regarding vertical flips in data augmentation: the objective of this project was to replicate the model and reproduce the results reported in the original paper. Since that team did not vertically flip images in their augmentation, we did not either.

Regarding your last point: it does seem likely that the noise in non-gradable images improves generalization and reduces the chance of overfitting. However, I am not sure how large the effect of training the network with wrong labels for those non-gradable images is. I still lean towards using only gradable images and applying random data augmentation to them, but during our project we did not test whether this actually leads to better results.

slala2121 commented 4 years ago

Using the ensemble of pretrained models, I get an AUC of 0.91 on the test dataset rather than 0.95. I followed the instructions for downloading the dataset. Should I be getting 0.95? Does something need to be changed?

mikevoets commented 1 year ago

Hey @slala2121, just to confirm: did you download the models from https://figshare.com/articles/dataset/Trained_neural_network_models/8312183? Also, which TensorFlow and Python versions did you run with?