nyukat / breast_cancer_classifier

Deep Neural Networks Improve Radiologists' Performance in Breast Cancer Screening
https://ieeexplore.ieee.org/document/8861376
GNU Affero General Public License v3.0

Testing on other data sets #9

Closed Tonthatj closed 5 years ago

Tonthatj commented 5 years ago

Hi, so I was testing the model against another dataset of mammograms, and was wondering whether the input image dimensions have to match exactly. Your sample cropped photos are (2440×3656) & (2607×3818), and ours are (1993×4396) & (2133×4906).

P_00005, RIGHT, CC, MALIGNANT, 0.1293, 0.0123
P_00005, RIGHT, MLO, MALIGNANT, 0.1293, 0.0123
P_00007, LEFT, CC, BENIGN, 0.3026, 0.1753
P_00007, LEFT, MLO, BENIGN, 0.3026, 0.1753

As you can see, the probabilities for benign and malignancy (respectively) are incredibly low. Out of a dataset of 200, the model only accurately predicted ~10 of them.

kjgeras commented 5 years ago

@Tonthatj If I understand you correctly, this is what you should expect. This classifier is trained on a very imbalanced data set; only a small fraction of the training examples contain a malignancy. This skews the classifier towards predicting a very low probability of malignancy. What this classifier should be good at is distinguishing between malignant and non-malignant cases (which is captured by AUC). It will not necessarily provide accurate probability estimates.
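For illustration, a sketch of checking ranking quality (AUC) rather than raw probabilities, assuming scikit-learn is available; the labels and predictions below are made up:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up example: 1 = malignant, 0 = not malignant
labels = np.array([0, 0, 1, 0, 1])
pred_malignant = np.array([0.01, 0.02, 0.12, 0.01, 0.30])

# AUC measures how well malignant cases are ranked above the rest,
# so uniformly low probabilities can still give a high AUC.
print(roc_auc_score(labels, pred_malignant))  # 1.0: perfect ranking here
```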

Tonthatj commented 5 years ago

So what is the threshold for determining whether the study as a whole is malignant or not?

kjgeras commented 5 years ago

There is no universally "correct" threshold. It depends on the dataset you are planning to apply it to. One way to pick a sensible threshold using validation data is the following: assuming that you expect p% of cancers in your dataset, sort the validation examples according to the predicted probability of malignancy and take the top p%; the lowest estimated probability of malignancy in that set is your threshold.
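A minimal sketch of this procedure, assuming the validation predictions are in a numpy array and `expected_cancer_rate` is your estimate of p:

```python
import numpy as np

def pick_threshold(val_probs, expected_cancer_rate):
    """Return the threshold implied by an expected cancer rate.

    val_probs: predicted probabilities of malignancy on validation data.
    expected_cancer_rate: expected fraction of cancers, e.g. 0.05 for 5%.
    """
    k = max(1, int(round(expected_cancer_rate * len(val_probs))))
    top_k = np.sort(val_probs)[-k:]  # the top p% of predictions
    return top_k[0]                  # lowest probability within the top p%

# Made-up usage: expect 5% cancers
val_probs = np.random.rand(1000)
threshold = pick_threshold(val_probs, 0.05)
predicted_malignant = val_probs >= threshold
```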

kjgeras commented 5 years ago

By the way, I'm not sure if I'm interpreting the numbers in the first post above correctly, but they look weird to me. If you are getting some strange results such as AUC = 0.5, it's probably because you are preprocessing the data differently than we did. Look at the tech report if that is the case: https://cs.nyu.edu/~kgeras/reports/datav1.0.pdf

Tonthatj commented 5 years ago

Hi @kjgeras, we followed all of your preprocessing instructions, with the exception that our DICOM images have larger dimensions. Do you think this is the reason for the poor AUC? All of our cropped image resolutions are larger than the 2290 × 1890 you mentioned in the tech report.

kjgeras commented 5 years ago

It's difficult to answer this question based on what you wrote. You have to do the preprocessing and image normalization exactly the same way as we do. If we differ in just one detail, you are going to get random predictions.
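One detail of this kind is per-image normalization. As a hedged illustration only (verify against the repository's actual data loading code), here is the sort of standardization step worth double-checking:

```python
import numpy as np

def standardize(image):
    """Zero-mean, unit-variance standardization of a single image.

    Illustrative only: confirm against the repository's data loading
    code. A mismatch here (e.g. scaling to [0, 1] instead) can be
    enough to produce random-looking predictions.
    """
    image = image.astype(np.float32)
    image -= image.mean()
    image /= max(image.std(), 1e-5)  # guard against constant images
    return image
```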

Tonthatj commented 5 years ago

@kjgeras I have DICOMs with resolutions of 3016 × 4616. Would I have to resize them to 1942 × 2677 and then run your preprocessing to get an accurate result? Can I not use DICOM images unless they exactly match the resolution you used?

kjgeras commented 5 years ago

It is hard to say. I think we didn't have any images that large in our data set, but that doesn't mean it necessarily wouldn't work. I would suggest the following debugging strategy:

  1. If the AUC you are getting for your dataset is relatively low but clearly not random (i.e. >0.6 and <0.8), then it is possible that the difference in performance is coming from some change in the distribution of the data to which our model might not be robust. The difference might be the size of the images, the contrast, digital vs. non-digital mammography, a difference in the definition of the labels, etc.; there are multiple possibilities here, and if several such problems accumulate, they can degrade performance. The good news in that case is that you could retrain our model with your data (even if your dataset is relatively small) to fix it.

  2. If the AUC is really low (i.e. <0.6) and the predictions look strange (e.g. are always the same, regardless of the example), the problem is almost certainly coming from a difference in preprocessing or normalization. You need to check every little detail to make sure that you are doing it exactly the way we did it. (A quick sanity check for degenerate predictions is sketched below.)
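As a quick sanity check for case 2, you can look at the spread of the predictions; this sketch assumes they are stored in a hypothetical predictions.npy:

```python
import numpy as np

preds = np.load("predictions.npy")  # hypothetical file of model outputs
print("min/max/std:", preds.min(), preds.max(), preds.std())
# A near-zero std (the model outputs the same value for every example)
# points to a preprocessing/normalization mismatch rather than a
# genuine distribution shift.
```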

Tonthatj commented 5 years ago

Just double checking: you cropped your DICOM images before running them through crop_mammogram.py?

jpatrickpark commented 5 years ago

We do two stages of cropping:

  1. We remove the background from the DICOM images in order to improve loading time; this is done by crop_mammogram.

  2. From the cropped images, which can be of arbitrary size after the first stage, we further crop (or pad) each image to a specific size (2642x1977 or 2974x1748) in order to feed the model; this is done by data_loading.augmentation.random_augmentation_best_center (see the sketch below).

So it is okay if your DICOM files have a different resolution. Please make sure you are following both stages in the right order.
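For intuition, a simplified sketch of the second stage (center-crop or zero-pad to a fixed size); this is a stand-in for illustration, not the repository's actual function, which handles the cropping window differently:

```python
import numpy as np

def crop_or_pad_center(image, target_height, target_width):
    """Center-crop or zero-pad a 2D image to a fixed size.

    Simplified stand-in for illustration; see the repository's
    data loading code for the real behavior.
    """
    # Crop if the image is larger than the target
    top = max((image.shape[0] - target_height) // 2, 0)
    left = max((image.shape[1] - target_width) // 2, 0)
    image = image[top:top + target_height, left:left + target_width]
    # Pad with zeros if the image is smaller than the target
    pad_h = target_height - image.shape[0]
    pad_w = target_width - image.shape[1]
    return np.pad(image, ((pad_h // 2, pad_h - pad_h // 2),
                          (pad_w // 2, pad_w - pad_w // 2)))

# Made-up usage: bring an arbitrary-size cropped image to a model input size
img = np.zeros((3016, 4616))
out = crop_or_pad_center(img, 2974, 1748)
assert out.shape == (2974, 1748)
```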

Tonthatj commented 5 years ago

So do the png files produced by crop_mammogram.py need to be of a specific size (2642x1977 or 2974x1748) to get an accurate result? Or should it provide a reasonable result as long as they all come in two specific sizes?

jpatrickpark commented 5 years ago

There's no image size requirement for files produced by crop_mammogram.

jpatrickpark commented 5 years ago

Please feel free to run the classifiers again now that we have updated the source code.

Tonthatj commented 5 years ago

Hey, I am trying to perform transfer learning on another dataset. Unfortunately, I am running into a problem: I have the prediction results and the corresponding truth values in numpy arrays, and therefore cannot use the torch loss functions. When creating my own cross entropy function, I cannot call .backward() on the loss that I computed.

Is it possible for you to share your training code? I would like to look at how you calculate the loss.
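For reference, a minimal sketch of the usual pattern: keep predictions as torch tensors instead of converting to numpy, so the autograd graph survives and .backward() works. The shapes and labels below are made up:

```python
import torch
import torch.nn.functional as F

# Stand-in for model outputs; calling .numpy() on them would detach
# the values from the autograd graph and break .backward().
logits = torch.randn(8, 2, requires_grad=True)
targets = torch.tensor([0, 1, 0, 0, 1, 1, 0, 1])  # ground-truth labels

loss = F.cross_entropy(logits, targets)  # stays inside the graph
loss.backward()                          # gradients reach the model

# If the labels arrive as a numpy array, convert them with
# torch.from_numpy(labels).long() before computing the loss.
```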

zphang commented 5 years ago

Could you post the code you're running and the error message you're getting?