mikevoets / jama16-retina-replication

JAMA 2016; 316(22) Replication Study
https://doi.org/10.1371/journal.pone.0217541
MIT License

Making trained models available #11

Closed · susmariababy closed this issue 4 years ago

susmariababy commented 5 years ago

Hi,

Since all this training and replication takes a lot of resources and time, would it be possible to make the trained models available to the public?

Thanks

mikevoets commented 5 years ago

Yes, I will definitely look into that soon. Do you have any recommendations regarding repositories where I can upload model files?

larsab commented 5 years ago

We could use UiT's open data portal (https://dataverse.no/dataverse/uit) or figshare (https://figshare.com/)

mikevoets commented 5 years ago

@susmariababy We have uploaded the model files here: https://doi.org/10.6084/m9.figshare.8312183. We describe in our readme how to run evaluation with these models: https://github.com/mikevoets/jama16-retina-replication#evaluation.

Have fun!

sdsawtelle commented 5 years ago

Regarding the pretrained models downloadable at https://doi.org/10.6084/m9.figshare.8312183, is it correct that they are meant to be run on a GPU?

I downloaded model-1 and set up a pipeline following your code to evaluate it on a small test set of 10 images on my machine (which only has CPU).

Initially in evaluate.py I received the error: ValueError: Cannot feed value of shape (10, 299, 299, 3) for Tensor 'x:0', which has shape '(?, 3, ?, ?)'

It seemed like the graph was expecting channels-first, so as a hack I forced an image transpose in the _parse_example function of lib/dataset.py. This let the input reach the model, but evaluation then failed with: InvalidArgumentError (see above for traceback): Default MaxPoolingOp only supports NHWC on device type CPU [[Node: max_pooling2d/MaxPool = MaxPool[T=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 3, 3], padding="VALID", strides=[1, 1, 2, 2], _device="/job:localhost/replica:0/task:0/device:CPU:0"](activation_2/Relu)]]

So again it seemed like the graph was set up to run on GPU (with data_format="NCHW")?

I am very new to tf + keras and CNNs in general, so apologies if the above is completely off base. Do you think it would be reasonable (or possible?) for me to modify the model-1 graph to work with channels-last on CPU?
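
For reference, the transpose hack amounted to roughly this (a standalone sketch with a dummy batch, not the actual change I made to _parse_example):

import tensorflow as tf

# Dummy stand-in for a decoded channels-last batch of 10 images.
images_nhwc = tf.zeros([10, 299, 299, 3], dtype=tf.float32)

# Move the channel axis to position 1 (NHWC -> NCHW) so the batch matches
# the graph's expected 'x:0' shape of (?, 3, ?, ?).
images_nchw = tf.transpose(images_nhwc, perm=[0, 3, 1, 2])  # -> (10, 3, 299, 299)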

mikevoets commented 5 years ago

Usually it should not be a problem to evaluate on CPU a model graph that was trained on GPU(s). The issue here was that we called tf.keras.backend.set_image_data_format(image_data_format) in evaluate.py. This internally overrides the image data format for the loaded model graph to channels_last when no GPUs are available on the machine running evaluate.py. That is wrong, because the model was trained on GPUs with channels_first.

I have pushed a fix: the code now ensures that the setting is always channels_first, so this is no longer an issue with the latest source code. I also removed the unnecessary call to the Keras backend so the image data format setting is no longer overridden.
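
For anyone stuck on an older checkout, a workaround along these lines should also do it: pin the data format yourself before building the input pipeline and loading the models, rather than letting it be derived from GPU availability (a sketch, not the actual diff in the repo):

import tensorflow as tf

# The saved model graphs were built with channels_first (NCHW), so force
# that format explicitly instead of deriving it from whether a GPU is present.
tf.keras.backend.set_image_data_format('channels_first')
print(tf.keras.backend.image_data_format())  # channels_first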

fbmoreira commented 5 years ago

I evaluated all the available models individually, and one ensemble consisting of the ten models. The ensemble got the best result, with an AUC of 0.90763158. Were these the same networks you used in the paper? If so, what could be the source of the difference from the reported 0.94 AUC? Perhaps a more selective ensemble?

mikevoets commented 5 years ago

Yes, they were the same as we used in the paper.

What is the exact command you used to evaluate with an ensemble?

fbmoreira commented 5 years ago

python3 evaluate.py -e -lm ../MVNETS/model-1,../MVNETS/model-2,../MVNETS/model-3,../MVNETS/model-4,../MVNETS/model-5,../MVNETS/model-6,../MVNETS/model-7,../MVNETS/model-8,../MVNETS/model-9,../MVNETS/model-10 -so ../MVNETS/eyepacs-pointsensemble2.csv

Numpy version: 1.16.4
Tensorflow version: 1.13.1

fbmoreira commented 5 years ago

The TensorFlow version is different because our server got a CUDA update and I had to rebuild it. A lot of code became deprecated and some of the save/restore functions did not work for train.py, but commenting out the code that reloads the model and generates the operating points solved that.

The most relevant thing, and a possible explanation: the seed is the default (42), BUT there is a small chance that the image distribution was done with a different seed (I created a new folder with 500x500 images and had more or less abandoned this one). Let me check that before I annoy you further with this :P

fbmoreira commented 5 years ago

Finished checking: the distribution was correct; I used exactly the same images with the default seed. That leaves a possible TensorFlow change as responsible for the difference, or some undetected fault in the images (missing images, perhaps? I followed the instructions to obtain the EyePacs images).

I'll see if I can allocate a clean machine from our cluster, build tf 1.12 (I think that's your version?), and reproduce on it later this week to check that hypothesis.

mikevoets commented 5 years ago

@fbmoreira I tested some weeks ago with both tf 1.12.0 and 1.13.1 and still got the same results (0.94 AUC on the EyePacs test set) when I ran evaluate.py with the models trained with earlier versions of tf. I had some deprecation errors, but none of these errors were fatal. So I did not have to modify any function calls to make evaluate.py work with tf >= 1.12.

I think that you may have some differences in your EyePacs test set. Are you certain the images are the same as the images you would get after running the eyepacs.sh script with default parameters?

If you still have the ./data/eyepacs/pool folder containing all Kaggle EyePacs images (88K images in total), you can rebuild the test set in another folder by running the eyepacs.sh script with the --redistribute parameter and --output_dir=/path/to/unpack. Then point evaluate.py at that folder: python evaluate.py -e -lm=<all the models> --data_dir=/path/to/unpack/test (note the /test after /path/to/unpack, which eyepacs.sh creates).

Hope this helps.

fbmoreira commented 5 years ago

I already did --redistribute with the default parameters. I have also checked the number of images: I have 88892 images in total, with 65337 in class 0, 6203 in class 1, 13151 in class 2, 2087 in class 3, and 1941 in class 4. I have not been able to allocate any other machine (the current one is working on higher-resolution images), and I won't be able to for a while due to deadlines from other students... I wonder if someone else has tried to evaluate and was able to do so?

mikevoets commented 4 years ago

@fbmoreira Have you been able to figure out the issue yourself? Can I help you with anything else?

fbmoreira commented 4 years ago

Unfortunately, I have not. I'll make a machine reservation to try a brand-new installation, but it might take a while since we are under UPS (no-break) maintenance this week.

fbmoreira commented 4 years ago

First run on a clean machine, split into bin2 using --only_gradable (a difference from last time, as I suspected it could be a potential source of difference):

Brier score: 0.08247, AUC: 0.95840889
Confusion matrix at operating threshold 0.500
[[ 243   35]
 [ 451 8061]]
Specificity: 0.9957, Sensitivity: 0.3501 at Operating Threshold 0.5000.

That's even better than what was reported in the paper! I'll redistribute without the only_gradable flag now and see if it's still short of the AUC in the paper.

p.s.: do you plan on updating the code to TF 2.0?

fbmoreira commented 4 years ago

Second run, now with all images:

Brier score: 0.1414, AUC: 0.93264472
Confusion matrix at operating threshold 0.500
[[ 681    7]
 [2501 8089]]

Neither result quite hits the mark of your AUC, which is a bit odd. Perhaps they updated the Kaggle dataset?

mikevoets commented 4 years ago

Good to hear that the AUC is more or less in line with the AUC reported in our paper.

The AUC in our paper is for an ensemble of 10 models (linearly averaged), so individual models will have varying AUCs. Did you try creating an ensemble?
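
To be concrete about what the ensemble does: the per-model predicted probabilities are linearly averaged for each image, and the AUC is computed on those averaged predictions. A minimal numpy sketch with placeholder data (not the actual evaluate.py code):

import numpy as np
from sklearn.metrics import roc_auc_score

n_models, n_images = 10, 8790
rng = np.random.RandomState(42)
y_true = rng.randint(0, 2, size=n_images)    # placeholder rDR labels
preds = rng.rand(n_models, n_images)         # placeholder per-model probabilities

ensemble_pred = preds.mean(axis=0)           # linear (unweighted) average
print(roc_auc_score(y_true, ensemble_pred))  # ensemble AUC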

fbmoreira commented 4 years ago

Yes, I used the same command as previously reported for both runs: python3 evaluate.py -e -lm ../MVNETS/model-1,../MVNETS/model-2,../MVNETS/model-3,../MVNETS/model-4,../MVNETS/model-5,../MVNETS/model-6,../MVNETS/model-7,../MVNETS/model-8,../MVNETS/model-9,../MVNETS/model-10 -so ../MVNETS/eyepacs-pointsensemble3.csv

mikevoets commented 4 years ago

Hmm, strange. Can you check if the rDR distribution in your evaluation data set is the same as reported here: 8790 images in total in the subfolders under ./data/eyepacs/bin2/test, with 7.9% of those images having rDR and located in subfolder /1?

fbmoreira commented 4 years ago

bin2, distributed with --only_gradable: 8790 images in total; 8096 images in subfolder /0, 694 images in subfolder /1 (7.895%, so yea, 7.9%).
bin5, distributed with all images: 11278 images in total; 8096 images in subfolder /0, 3182 images in subfolder /1 (28.214%).

Basically, this is what is confusing: you have mentioned using all images to train the models (even in the posts above you assume default values for eyepacs.sh), so a new user is led to simply run eyepacs.sh with default values. But if your results are to be reproduced, two executions of eyepacs.sh are needed: one to generate the training set with all images, and another to generate the clean test set composed only of gradable images.

A possible correction in the script would be to always generate the test set by removing the ungradable images.

mikevoets commented 4 years ago

In the most recent version of our paper we report an AUC of 0.951 for evaluation of the Kaggle test set that the eyepacs.sh script generates (with default values), after training an ensemble with all images (i.e. including ungradable images).

So to reproduce the results, eyepacs.sh should be run without the --only_gradable flag. Not sure why this gives you an AUC of 0.93 while it should give you an AUC of 0.95, but it may be that the Kaggle data has been modified. I will check this and come back to you.

mikevoets commented 4 years ago

I re-downloaded the Kaggle data set, ran the eyepacs.sh script without flags, and re-downloaded the neural network models from Figshare. The evaluate.py script gave the following output:

→ python evaluate.py -e -lm=./tmp/models/model-\*                     [15f2af3]
Numpy version: 1.15.1
Tensorflow version: 1.12.0

Evaluating: ./data/eyepacs/bin2/test,
Saving operating thresholds metrics at: ./tmp/test_op_pts.csv,
Using operating treshold: 0.5,

Trying to load model(s):
./tmp/models/model-7
./tmp/models/model-5
./tmp/models/model-10
./tmp/models/model-6
./tmp/models/model-9
./tmp/models/model-2
./tmp/models/model-3
./tmp/models/model-8
./tmp/models/model-1
./tmp/models/model-4
Brier score: 0.08418, AUC: 0.95067859
Confusion matrix at operating threshold 0.500
[[ 228   29]
 [ 466 8067]]
Specificity: 0.9964, Sensitivity: 0.3285 at Operating Threshold 0.5000.

The models still perform with an AUC of 0.951 on the current Kaggle test data set, as expected. This means that there is still something different in your set-up compared to the standard set-up.

Looking back to what you wrote regarding data distribution in the Kaggle test set:

bin5, distributed with all images: 11278 images. 8096 images in subfolder /0, 3182 images in subfolder /1 (28.214%)

This does not sound right. What do you mean by bin5? In the standard set-up, the Kaggle test set should reside in ./data/eyepacs/bin2/test and have a folder 0 with 8096 jpg images, and another folder 1 with 694 images. The Kaggle test set should have this image distribution regardless of how you run eyepacs.sh.

fbmoreira commented 4 years ago

To detail the previous post: I first ran a distribution with the --only_gradable flag, so the images went to /bin2. Then I reran eyepacs.sh with --redistribute and --output_dir=data/eyepacs/bin5, so I could have both directories (bin2 with --only_gradable and bin5 without --only_gradable) and compare the resulting image datasets.

The Kaggle test set has different distributions depending on the --only_gradable flag, as shown in eyepacs.sh lines 169-178:

# Distribution numbers for data sets with ungradable images.
if echo "$@" | grep -c -- "--only_gradable" >/dev/null; then
  bin2_0_cnt=39202
  bin2_0_tr_cnt=31106
  bin2_1_tr_cnt=12582
else
  bin2_0_cnt=48784
  bin2_0_tr_cnt=40688
  bin2_1_tr_cnt=16458
fi

So, in my head, that's why the datasets for bin2 and bin5 are different. And the smaller dataset was the one obtained with --only_gradable in bin2, which makes sense, as --only_gradable is supposed to remove ungradable images...

Now, why you get the same number of images when running WITHOUT --only_gradable as I get when running WITH --only_gradable puzzles me. What is the distribution of your test dataset when you run eyepacs.sh WITH --only_gradable?

mikevoets commented 4 years ago

The Kaggle test set has different distributions depending on the --only_gradable flag, as shown in eyepacs.sh lines 169-178:

Nope, the distribution in the test set should stay the same regardless of the flag. In the pool with ungradable images still included (which you get after running the script without flags), the number of images in bin2/test/0 is bin2_0_cnt - bin2_0_tr_cnt = 48784 - 40688 = 8096. The number of images in bin2/test/1 is the number of images in subfolders pool/2, pool/3 and pool/4 (find pool/[2-4] -iname "*.jpg" | wc -l => 17152) minus bin2_1_tr_cnt (16458), which equals 694. The same applies to the test set from the pool with ungradable images removed, but there the number of images in subfolders pool/2 to pool/4 is 13276: bin2_0_cnt - bin2_0_tr_cnt = 39202 - 31106 = 8096 and 13276 - 12582 = 694.

Back to your issue: what has happened here is that you ran eyepacs.sh --only_gradable, which unpacks the images into the pool folder and then removes all ungradable images from that pool. When you afterwards run eyepacs.sh --redistribute --output_dir=/other/path, it reuses that same pool, i.e. the one with only gradable images. The distributions then come out wrong, because the script assumes that the pool still includes ungradable images, and they end up as you wrote: 8096 images in subfolder 0 and 3182 images in subfolder 1, because bin2_1_cnt - bin2_1_tr_cnt = 13276 - 16458 = -3182 (which gets interpreted as 3182 by the tail command that limits the number of images).
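
To make the arithmetic concrete, here is a small Python sketch using the numbers from eyepacs.sh quoted above (not code from the repo):

# Pool that still includes ungradable images (default run):
bin2_0_cnt, bin2_0_tr_cnt = 48784, 40688
pool_2_to_4 = 17152                          # find pool/[2-4] -iname "*.jpg" | wc -l
bin2_1_tr_cnt = 16458
print(bin2_0_cnt - bin2_0_tr_cnt)            # 8096 images in bin2/test/0
print(pool_2_to_4 - bin2_1_tr_cnt)           # 694 images in bin2/test/1

# Pool with only gradable images, but redistributed as if ungradable
# images were still included (your bin5 case):
pool_2_to_4_gradable = 13276
print(pool_2_to_4_gradable - bin2_1_tr_cnt)  # -3182, read as 3182 by the tail command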

So your pool with only gradable images cannot be used to generate test sets including ungradable images. If you still have the Kaggle data set zip files in data/eyepacs, you may recreate a pool with ungradable images and distribute the data sets correctly by running the script like this: ./eyepacs.sh --pool_dir=data/eyepacs/pool_incl_ungradable --output_dir=data/eyepacs/bin2_incl_ungradable.

It's tedious that you have to run the script like that again, as it can take hours; sorry about that. But right now there is no other way to achieve the correct data set distribution with ungradable images than to effectively re-run the script and assign the images to a new pool.

fbmoreira commented 4 years ago

Ok, thank you for your attention. I think we can close this issue.