greenelab / deep-review

A collaboratively written review paper on deep learning, genomics, and precision medicine
https://greenelab.github.io/deep-review/

Deep learning is effective for the classification of OCT images of normal versus Age-related Macular Degeneration #158

Open agitter opened 7 years ago

agitter commented 7 years ago

https://doi.org/10.1101/094276

Objective: The advent of Electronic Medical Records (EMR) with large electronic imaging databases along with advances in deep neural networks with machine learning has provided a unique opportunity to achieve milestones in automated image analysis. Optical coherence tomography (OCT) is the most commonly obtained imaging modality in ophthalmology and represents a dense and rich dataset when combined with labels derived from the EMR. We sought to determine if deep learning could be utilized to distinguish normal OCT images from images from patients with Age-related Macular Degeneration (AMD).

Design: EMR and OCT database study.

Subjects: Normal and AMD patients who had a macular OCT.

Methods: Automated extraction of an OCT imaging database was performed and linked to clinical endpoints from the EMR. OCT macula scans were obtained by Heidelberg Spectralis, and each OCT scan was linked to EMR clinical endpoints extracted from EPIC. The central 11 images were selected from each OCT scan of two cohorts of patients: normal and AMD. Cross-validation was performed using a random subset of patients. Receiver operator curves (ROC) were constructed at an independent image level, macular OCT level, and patient level.

Main outcome measure: Area under the ROC.

Results: Of a recent extraction of 2.6 million OCT images linked to clinical datapoints from the EMR, 52,690 normal macular OCT images and 48,312 AMD macular OCT images were selected. A deep neural network was trained to categorize images as either normal or AMD. At the image level, we achieved an area under the ROC of 92.78% with an accuracy of 87.63%. At the macula level, we achieved an area under the ROC of 93.83% with an accuracy of 88.98%. At a patient level, we achieved an area under the ROC of 97.45% with an accuracy of 93.45%. Peak sensitivity and specificity with optimal cutoffs were 92.64% and 93.69% respectively.

Conclusions: Deep learning techniques achieve high accuracy and are effective as a new image classification technique. These findings have important implications in utilizing OCT in automated screening and the development of computer aided diagnosis tools in the future.
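
As a reference point for the three reporting levels in the abstract, here is a minimal sketch of how per-image network scores could be pooled to the macula and patient level before computing ROC AUC. The paper does not say how image-level predictions were aggregated; mean pooling and the toy data frame below are assumptions for illustration only.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical frame: one row per OCT image, with its macula scan ID,
# patient ID, true label (1 = AMD, 0 = normal), and the network's score.
df = pd.DataFrame({
    "patient_id": ["p1"] * 4 + ["p2"] * 4,
    "macula_id":  ["p1_od", "p1_od", "p1_os", "p1_os",
                   "p2_od", "p2_od", "p2_os", "p2_os"],
    "label":      [1, 1, 1, 1, 0, 0, 0, 0],
    "score":      [0.9, 0.8, 0.7, 0.95, 0.2, 0.4, 0.1, 0.3],
})

# Image-level AUC: every image counted independently.
image_auc = roc_auc_score(df["label"], df["score"])

# Macula-level AUC: pool the images of each macular OCT scan
# (mean pooling is an assumption; the paper does not state the rule).
macula = df.groupby("macula_id").agg(label=("label", "max"),
                                     score=("score", "mean"))
macula_auc = roc_auc_score(macula["label"], macula["score"])

# Patient-level AUC: pool all of a patient's images.
patient = df.groupby("patient_id").agg(label=("label", "max"),
                                       score=("score", "mean"))
patient_auc = roc_auc_score(patient["label"], patient["score"])

print(image_auc, macula_auc, patient_auc)
```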

Need to look carefully at how they decided to stop training. Were there separate validation and test sets?

cgreene commented 7 years ago

~100k training images pulled from structured EMRs (might be a helpful connection between these two parts of the categorize section).

The section "Patient and Image Selection" describes the automated queries they performed to extract images that they expected would show macular pathology in both eyes.

The authors created an independent validation set using 20% of the patients in each group. Regarding @agitter's point:

Care was taken to ensure that the validation set and the training set contained images from mutually exclusive group of patients (i.e. no single patient contributed images to both the training and validation set).
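
A patient-disjoint split like this can be made explicit with a grouped splitter. The following is a minimal sketch: the 80/20 patient split mirrors the paper's description, but the arrays and the use of scikit-learn's GroupShuffleSplit are illustrative assumptions, not the authors' code.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Illustrative arrays: one entry per image, plus the patient each image came from.
rng = np.random.default_rng(0)
images = np.arange(1000)                    # stand-ins for image indices/paths
labels = rng.integers(0, 2, size=1000)      # 1 = AMD, 0 = normal (made up)
patients = rng.integers(0, 100, size=1000)  # patient ID per image (made up)

# Hold out 20% of *patients* (not 20% of images), so that no patient
# contributes images to both the training and validation sets.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, val_idx = next(splitter.split(images, labels, groups=patients))

assert set(patients[train_idx]).isdisjoint(patients[val_idx])
```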

This, however, confuses me:

At each iteration, the loss of the model was recorded, and at every 500 iterations, the performance of the neural network was assessed using cross-validation with the validation set. The training was stopped when the loss of the model decreased and the accuracy of the validation set decreased.

I have two guesses here:

  1. The authors trained using cross-validation (this & the abstract are the only times cross-validation is mentioned, so I think this might not be the case...). The independent validation set is retained for subsequent use and not accessed at this stage.
  2. The nominal validation set is used as a stopping criterion. This may compromise its independence to some degree. In practice, the effect may be minor, but in any case it is impossible to judge. It would be nice to see results from a different experimental design that allows us to precisely measure performance.

I'm leaning towards 2 but will email the authors a link to this thread to see if we can get clarification.
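
If guess 2 is what happened, the training loop would look roughly like the sketch below: the validation set is checked every 500 iterations and doubles as the stopping criterion. Everything here (the stand-in loss and accuracy functions, the simulated curves) is hypothetical and only illustrates the rule quoted above.

```python
import random

# Hypothetical stand-ins for one optimization step and one validation pass;
# in the paper these would be a CNN update and an evaluation on the held-out
# validation patients. The curves are simulated, not real results.
def train_step(step):
    return 1.0 / (1 + 0.01 * step) + random.uniform(-0.02, 0.02)  # decaying loss

def validation_accuracy(step):
    return min(0.9, 0.5 + 0.002 * step) - max(0.0, 0.0005 * (step - 300))  # rises, then falls

best_acc, prev_loss = 0.0, float("inf")
for step in range(1, 5001):
    loss = train_step(step)
    if step % 500 == 0:
        acc = validation_accuracy(step)
        # Stopping rule as described in the paper: training loss still
        # decreasing while validation-set accuracy has started to decrease.
        if loss < prev_loss and acc < best_acc:
            print(f"stopping at iteration {step}")
            break
        best_acc, prev_loss = max(best_acc, acc), loss
```

The point is just that whatever set serves this role is consulted repeatedly during training, so the numbers later reported on it are not a fully independent performance estimate.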


Argument for why we should discuss this: it connects the value of EHR systems with imaging data in the context of deep neural networks. Thus it is central to our review.

Argument for why we might not want to: performance metrics are difficult to interpret if the validation set has some contamination due to multiple evaluations against it. We may not want to discuss it if we're worried about the validity of the results.

For the present time, I'm going to table this paper because of the concerns. I'll email the authors a link to this and see if we can get some clarity.

cgreene commented 7 years ago

Ok - email sent! I am hoping that we can get some clarity so that we can determine how to best discuss this work.

cgreene commented 7 years ago

I have received a reply. Asking for permission to post it.

cgreene commented 7 years ago

Got permission to post it. Pulling out the two parts from Aaron that are pertinent to this discussion:

We followed what we have seen done in methodology in the computer vision world, specifically related to ImageNet. Specifically we noted that in the AlexNet paper (https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf) they noted a less than 0.1% difference between the test and validation sets, and for the paper they used the rates interchangeably. We noted this trend in several other deep learning computer vision papers. We did not have a separate test set because we believed it would be very difficult to create an idiosyncratic dependency in real life data that would substantially affect our results.

Our paper has been accepted to Ophthalmology Retina after peer review and is currently in print. I wish I had received your comments earlier as I could have run another test set made from patients separated in time from the original data.
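
For reference, the "patients separated in time" test set that Aaron mentions could be carved out with something as simple as the sketch below; the table, column names, and cutoff date are all hypothetical.

```python
import pandas as pd

# Hypothetical per-patient table with the date of each patient's first OCT;
# the column names and dates are illustrative, not from the paper.
patients = pd.DataFrame({
    "patient_id": ["p1", "p2", "p3", "p4"],
    "first_scan_date": pd.to_datetime(
        ["2013-05-01", "2014-02-10", "2015-08-19", "2016-01-05"]),
})

# A temporally held-out test set: everyone first scanned after a cutoff date
# is reserved for testing and never touched during training or model selection.
cutoff = pd.Timestamp("2015-01-01")
train_val_patients = patients.loc[patients["first_scan_date"] < cutoff, "patient_id"]
test_patients = patients.loc[patients["first_scan_date"] >= cutoff, "patient_id"]
```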

cgreene commented 7 years ago

A draft discussion of this work is in #167