lishen / end2end-all-conv

Deep Learning to Improve Breast Cancer Detection on Screening Mammography

Inconsistent results on DDSM testset #5

Open taijizhao opened 6 years ago

taijizhao commented 6 years ago

Hello Li, First, congratulations on your excellent work, and thank you for sharing the code. It's really helpful for people like me who are starting to work on mammography.

However, when I ran a simple test of your trained whole-image models on the DDSM test set, I got AUC scores much lower than reported. I used the CBIS-DDSM dataset, converted all images to PNG, and resized them to 1152x896. I then took the official test set (CalcTest and MassTest), labeling "MALIGNANT" as positive and "BENIGN" and "BENIGN WITHOUT CALLBACK" as negative, which amounts to 649 images in total. Using your notebook example_model_test.ipynb, I tested the three models provided on the project homepage (ddsm_resnet50_s10_[512-512-1024]x2.h5, ddsm_vgg16_s10_512x1.h5, ddsm_vgg16_s10_[512-512-1024]x2_hybrid.h5). For the three models, I got AUCs of 0.69 (ResNet), 0.75 (VGG), and 0.71 (hybrid), which are much lower than the reported 0.86, 0.83, and 0.85, respectively.

I am admittedly using a different test set, since you mentioned in your paper that you randomly split the DDSM data into training and test sets. But in that case my test set should partially overlap with your training set, which would lead to better, rather than worse, performance. Do you have an idea where this discrepancy in performance comes from? Some preprocessing, for example? Or did I do something obviously wrong?

Thank you very much! Best regards,
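(For reference, the evaluation described above amounts to something like the sketch below. The preprocessing, input scaling, and output layout are assumptions, not the settings from example_model_test.ipynb, which remains the authoritative reference.)

# Hypothetical sketch of whole-image AUC evaluation; paths, preprocessing and
# output layout are assumptions, not the notebook's actual settings.
import numpy as np
import cv2
from keras.models import load_model
from sklearn.metrics import roc_auc_score

model = load_model('ddsm_vgg16_s10_512x1.h5')  # one of the released models

def load_png(path):
    # Read the converted 1152x896 PNG as grayscale and add batch/channel axes.
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE).astype('float32')
    return img[np.newaxis, :, :, np.newaxis]

def evaluate(test_items):
    # test_items: list of (png_path, label) with MALIGNANT -> 1 and
    # BENIGN / BENIGN WITHOUT CALLBACK -> 0, from CalcTest and MassTest.
    y_true, y_score = [], []
    for path, label in test_items:
        # Take the last output unit as the malignancy score (layout assumed).
        y_score.append(model.predict(load_png(path))[0, -1])
        y_true.append(label)
    return roc_auc_score(y_true, y_score)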

lishen commented 6 years ago

@taijizhao ,

The official test set was not available when I did the study, so it could not be part of the train set. It is actually more like another hold-out set.

Unfortunately, the scores are not as good as the ones on the test set I used. One thing you need to check is whether the contrast is automatically adjusted when you convert to PNG. I used "convert -auto-level" to perform the conversion.

I also offer two reasons why the performance is worse:

  1. The official test set is intrinsically more difficult (e.g., more subtle cases) to classify than the test set I used.
  2. The official test set contains cases whose distributions do not bear similarity to the train set I used for model generation.

If you want to improve the scores on the official test set, you should train your own model on the official train set.

As a side note (unpublished): I was able to achieve a single-model AUC of 0.85 on the official test set when combining the CC and MLO views. Maybe you can do even better.
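(For illustration only: one simple way to combine the two views is to average the per-view scores for each breast before computing the AUC. This is a sketch with assumed column names, not necessarily how the 0.85 figure above was obtained.)

# Hypothetical sketch: average CC and MLO scores per breast, then compute AUC.
# The DataFrame columns ('breast_id', 'score', 'label') are assumptions.
from sklearn.metrics import roc_auc_score

def per_breast_auc(df):
    # df has one row per image (CC or MLO view) with its predicted score.
    per_breast = df.groupby('breast_id').agg(score=('score', 'mean'),
                                             label=('label', 'max'))
    return roc_auc_score(per_breast['label'], per_breast['score'])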

xuranzhao711 commented 6 years ago

@lishen Thank you very much for your kind explanations! I'll experiment more. Just one thing I want to make clear: when converting DICOM to PNG, SHOULD or SHOULD NOT the contrast be adjusted? Actually, I did the conversion with the dicom and opencv packages, something like this:

import dicom
import cv2

# Read the DICOM file and extract the raw pixel array (no contrast adjustment).
img = dicom.read_file(dicom_filename)
img = img.pixel_array
# Resize to 896x1152 (width x height) and save as PNG.
img = cv2.resize(img, (896, 1152), interpolation=cv2.INTER_CUBIC)
cv2.imwrite(save_path + png_save_name, img)

In this way, I think, the contrast is not adjusted? And regarding your comment:

I used "convert -auto-level" to perform the conversion.

Which Python package is this "convert -auto-level" command from? Thank you again!

lishen commented 6 years ago

@xuranzhao711 The way you converted the images, no contrast adjustment was done. Whether or not you adjust the contrast is not a matter of right or wrong, but it is important to be consistent between model training and evaluation.

convert is simply a Linux command from ImageMagick. It is widely available. This is the command I used:

convert -auto-level {} -resize 896x1152! ../ConvertedPNGs/{/.}.png
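(For anyone who prefers to stay in Python, a rough equivalent of this preprocessing is sketched below. A simple min/max stretch is not guaranteed to be bit-identical to ImageMagick's -auto-level, but it is the same idea.)

# Approximate '-auto-level' in Python: stretch the pixel range to the full
# 16-bit range before resizing and saving, so training and evaluation match.
import dicom  # same pydicom < 1.0 API as in the snippet above
import cv2
import numpy as np

def convert_dicom(dicom_filename, png_filename):
    # Read the raw pixel array and stretch its range to [0, 1].
    img = dicom.read_file(dicom_filename).pixel_array.astype(np.float32)
    img = (img - img.min()) / max(float(img.max() - img.min()), 1.0)
    # Map to the full 16-bit range, resize to 896x1152, write a 16-bit PNG.
    img = (img * 65535).astype(np.uint16)
    img = cv2.resize(img, (896, 1152), interpolation=cv2.INTER_CUBIC)
    cv2.imwrite(png_filename, img)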

yueliukth commented 5 years ago

Hi @lishen! I'm trying to reproduce your algorithm on the official DDSM train/val/test split, but I observe a relatively large AUC gap (around 8%) between the val and test sets. So far, the best val AUC I have achieved is 83%, while the test AUC of the same model is 75%. I was wondering whether you observed the same, or at least a similar, AUC gap when you trained and tested on this new official split. Otherwise, I guess it means my model is somehow overfitting. Thank you in advance! Looking forward to your reply.

lishen commented 5 years ago

@irisliuyue, it's actually common to observe such a gap between val and test sets. Sometimes the val AUC is even lower than the test AUC. It means the val and test sets have different distributions, and unfortunately it's hard to make them more even. If you can afford the computation, simply do multiple splits or use (nested) cross-validation.
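(As an illustration of the multiple-splits idea, patient-level cross-validation with scikit-learn could look like the sketch below; the grouping column and the training routine are placeholders, not part of this repository.)

# Hypothetical sketch: repeated patient-level splits to estimate AUC spread.
# 'patient_ids' keeps all images of one patient in the same fold;
# train_and_score(train_idx, test_idx) is a placeholder for your own pipeline
# and should return the test AUC of one split.
import numpy as np
from sklearn.model_selection import GroupKFold

def cv_auc(image_paths, labels, patient_ids, train_and_score, n_splits=5):
    aucs = []
    for train_idx, test_idx in GroupKFold(n_splits=n_splits).split(
            image_paths, labels, groups=patient_ids):
        aucs.append(train_and_score(train_idx, test_idx))
    return np.mean(aucs), np.std(aucs)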

yueliukth commented 5 years ago

Hi @lishen, thanks for your reply!

a) What did you mean by multiple splits? Do you suggest mixing all train/val/test images and splitting them, or just mixing train/val and splitting while leaving the official test set untouched?

I don't see a huge difference between my train and val AUC scores, so my model generalises well to the unseen validation set (but not to the unseen test set). So I guess that if I mix train and val and then do cross-validation, the test performance won't take a huge leap anyway.

b) And you are right, I did notice that sometimes the val AUC is even lower than the test AUC, but it's very rare. In general, from my observations, my test AUC is mostly around 8% lower than my validation AUC. One explanation could be that val and test differ in something systematic, for example by having different distributions, as you said. I did try to plot histograms of reading difficulty across the train/val/test sets according to the BIRADS assessment, but the distributions are almost identical.

So I was wondering whether you have any advice on how to demonstrate different distributions across datasets? Thanks!
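(As a footnote to point b) above, that kind of distribution check can be sketched in a few lines of pandas; the column names here are assumptions.)

# Hypothetical sketch: compare BIRADS assessment distributions across splits.
# Assumes a DataFrame with one row per case, a 'split' column taking values
# 'train'/'val'/'test', and a 'birads' column with the assessment category.
import pandas as pd

def birads_distribution(df):
    # Normalized counts per split, so the three distributions are comparable.
    return (df.groupby('split')['birads']
              .value_counts(normalize=True)
              .unstack(fill_value=0))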