agitter opened this issue 7 years ago
https://doi.org/10.1101/094276

~100k training images pulled from structured EMRs (might be a helpful connection between these two parts of the Categorize section). Need to look carefully at how they decided to stop training. Were there separate validation and test sets?
The section "Patient and Image Selection" describes the automated queries they performed to extract images that they expected would show macular pathology in both eyes.
The authors created an independent validation set using 20% of the patients in each group. Regarding @agitter's point:
> Care was taken to ensure that the validation set and the training set contained images from mutually exclusive groups of patients (i.e. no single patient contributed images to both the training and validation set).
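For concreteness, a patient-exclusive split like the one quoted can be expressed with scikit-learn's GroupShuffleSplit. This is my own sketch with illustrative placeholder data, not the authors' code; only the 20% fraction comes from the paper:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Illustrative stand-ins: one row per image, tagged with the patient it came from.
images = np.random.rand(1000, 128, 128)      # placeholder image data
labels = np.random.randint(0, 2, size=1000)  # e.g. AMD vs. normal
patient_ids = np.random.randint(0, 200, size=1000)

# 80/20 split; the groups argument is what enforces the quoted
# "mutually exclusive groups of patients" property.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=0)
train_idx, val_idx = next(splitter.split(images, labels, groups=patient_ids))

# No patient contributes images to both sets.
assert set(patient_ids[train_idx]).isdisjoint(patient_ids[val_idx])
```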
This, however, confuses me:
> At each iteration, the loss of the model was recorded, and at every 500 iterations, the performance of the neural network was assessed using cross-validation with the validation set. The training was stopped when the loss of the model decreased and the accuracy of the validation set decreased.
I have two guesses here:

1. "Cross-validation" is used loosely, and the reported performance was measured on data that did not influence any training decisions.
2. The same validation set was used both to decide when to stop training and to report performance, leaving no truly held-out test set.

I'm leaning towards 2 but will email the authors a link to this thread to see if we can get clarification.
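To make reading 2 concrete, here is a minimal, entirely illustrative training loop that matches the quoted procedure; `train_step` and `evaluate` are hypothetical stand-ins, not the authors' code:

```python
import random

def train_step():
    return random.random()   # stand-in for one optimizer step's training loss

def evaluate(dataset):
    return random.random()   # stand-in for accuracy on the validation set

validation_set = None        # placeholder
prev_loss, best_val_acc = float("inf"), 0.0
for iteration in range(100_000):
    loss = train_step()      # loss recorded at each iteration
    if iteration % 500 == 0:
        val_acc = evaluate(validation_set)
        # Stopping rule as quoted: training loss still falling,
        # but validation accuracy falling too (an overfitting signal).
        if loss < prev_loss and val_acc < best_val_acc:
            break
        prev_loss, best_val_acc = loss, max(best_val_acc, val_acc)
```

Because the validation set drives the `break` under this reading, any metric later reported on it is not computed from an untouched test set.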
Argument for why we should discuss this: it connects the value of EHR systems to imaging data in the context of deep neural networks, which makes it central to our review.
Argument for why we might not want to: performance metrics are difficult to interpret if the validation set has some contamination due to multiple evaluations against it. We may not want to discuss it if we're worried about the validity of the results.
For the time being, I'm going to table this paper because of these concerns. I'll email the authors a link to this thread and see if we can get some clarity.
Ok - email sent! I am hoping that we can get some clarity so that we can determine how to best discuss this work.
I have received a reply. Asking for permission to post it.
Got permission to post it. Pulling out the two parts from Aaron's reply that are pertinent to this discussion:
> We followed what we have seen done in methodology in the computer vision world specifically related to ImageNet. Specifically we noted that in the AlexNet paper (https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf) they noted a less than 0.1% difference between the test and validation sets, and for the paper they used the rates interchangeably. We noted this trend in several other deep learning computer vision papers. We did not have a separate test set because we believed it would be very difficult to create an idiosyncratic dependency in real life data that would substantially affect our results.
> Our paper has been accepted to Ophthalmology Retina after peer review and is currently in print. I wish I had received your comments earlier, as I could have run another test set made from patients separated in time from the original data.
A draft discussion of this work is in #167