leoll2 / MedicalCNN

Abnormality detection in mammogram images using Deep Convolutional Neural Networks
MIT License
18 stars · 6 forks

Train-Test Split & Class Distribution #4

Closed FSKhan19 closed 3 years ago

FSKhan19 commented 3 years ago

Hello leoll2, I hope you are doing well. First, I want to thank you for sharing your source code and experiments with the AI community. I have some questions regarding your numpy tensor and png folder. I am working on benign vs malignant classification, so I first looked into your png images and found these results:

Split - count - split percentage:
Train images: 2676 - 79.93%
Test images: 672 - 20.07%
Total images: 3348

Then I looked into more for each class distribution

Train

BENIGN: 1568 - 58.59%
MALIGNANT: 1108 - 41.41%
Total: 2676

Test

BENIGN: 406 - 60.42%
MALIGNANT: 266 - 39.58%
Total: 672
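
For reference, here is a minimal sketch of how I computed these counts, assuming a `png_images/<split>/<class>/*.png` layout (the folder names are placeholders, not necessarily the ones in this repo):

```python
from pathlib import Path

# Assumed layout: png_images/<split>/<class>/*.png (placeholder names)
root = Path("png_images")

for split in ("Train", "Test"):
    class_dirs = [d for d in sorted((root / split).iterdir()) if d.is_dir()]
    counts = {d.name: len(list(d.glob("*.png"))) for d in class_dirs}
    total = sum(counts.values())
    print(split)
    for cls, n in counts.items():
        print(f"  {cls}: {n} - {100 * n / total:.2f}%")
    print(f"  Total: {total}")
```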

As you can see, train images = 2676 and test images = 672. But when I checked your Jupyter notebook, it reports Train size: 2676 and Test size: 336.

Where did the other test images go?

Then I explored further and checked your train and validation generators: the train generator contains 17 batches and the validation generator 5, so train = 17 × 128 = 2176 and validation = 5 × 128 = 640, and 2176 + 640 = 2816, which is more than the number of train images. Why?

My last question: what criteria did you use for image rescaling?

Thank you & regards Farhan Shahid

leoll2 commented 3 years ago

Hello Farhan, thanks for appreciating my project. Good catch: I see that the notebook reports 336 test images. The loading code is very straightforward, so I trust it more than numbers reported elsewhere. 672 is indeed double 336, so could it be that I accidentally also counted the labels or the related baseline images? In any case, the results should remain valid.

As for the train-validation split, I don't know where you found the numbers 17, 5 and 128. It looks like the batch size is 32 and there are 66 and 17 minibatches for training and validation, respectively.
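
If it helps, a minimal, self-contained sketch of how to read those numbers directly from Keras generators; the shapes, sizes and the validation_split value below are placeholders, not the notebook's actual settings:

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Toy stand-ins for the real tensors; shapes and sizes are placeholders.
x = np.random.rand(300, 64, 64, 3).astype("float32")
y = np.random.randint(0, 2, size=(300,))

datagen = ImageDataGenerator(validation_split=0.2)
train_gen = datagen.flow(x, y, batch_size=32, subset="training")
val_gen = datagen.flow(x, y, batch_size=32, subset="validation")

# len() is the number of minibatches per epoch; .n is the number of samples.
print("train:", len(train_gen), "batches of", train_gen.batch_size, "|", train_gen.n, "samples")
print("val:  ", len(val_gen), "batches of", val_gen.batch_size, "|", val_gen.n, "samples")
```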

I didn't do the DDSM pre-processing myself, so I don't have the rescaling code. AFAIK, the abnormality patches were extracted from the CBIS-DDSM images according to the binary mask, then resized with OpenCV to 150x150.
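
For illustration only, a minimal sketch of that kind of pipeline (crop the patch from the mask's bounding box, then resize with OpenCV); the file names are placeholders and this is not necessarily the exact pre-processing used for the provided tensors:

```python
import cv2
import numpy as np

# Placeholder file names, not the actual CBIS-DDSM paths.
image = cv2.imread("mammogram.png", cv2.IMREAD_GRAYSCALE)
mask = cv2.imread("roi_mask.png", cv2.IMREAD_GRAYSCALE)

# Bounding box of the abnormality, taken from the binary mask.
ys, xs = np.where(mask > 0)
y0, y1 = ys.min(), ys.max() + 1
x0, x1 = xs.min(), xs.max() + 1

# Crop the patch and resize it to the network's input size.
patch = image[y0:y1, x0:x1]
patch = cv2.resize(patch, (150, 150), interpolation=cv2.INTER_AREA)
cv2.imwrite("patch_150x150.png", patch)
```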

FSKhan19 commented 3 years ago

Hi, thanks for the quick feedback. The reason I am asking is that when I use your numpy tensor for training I get the same results as in your research work, but if I do the same thing with your png_images I get very low results, so I am confused about why this is happening. By the way, I downloaded the CBIS-DDSM dataset from the official repository and pre-processed it the same way you did, but got results different from yours. Can I know where you got the numpy tensor from, so I can talk with him/her?

Regarding the train-validation split, I was talking about your VGG16_2_class notebook. I am attaching an image to show exactly what I was doing; my calculation error was due to rounding.

[Screenshot: 2021-06-12_184822]

Thanks again.

leoll2 commented 3 years ago

Try to visualize both your png images and the numpy tensor, and check if there is any noticeable difference between the two. Suggested sanity checks:

* aspect ratio

* number of channels

* scale (i.e. 0-1 or 0-255, ...)
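
As an illustration of those checks, a minimal sketch (the file names "sample.png" and "tensor.npy" are placeholders for one of the png images and the provided numpy tensor):

```python
import numpy as np
from PIL import Image

# Placeholder paths: one png image and the provided numpy tensor.
img = np.asarray(Image.open("sample.png"))
tensor = np.load("tensor.npy")

# shape covers aspect ratio and channel count; min/max reveals the scale.
for name, arr in (("png", img), ("npy", tensor)):
    print(f"{name}: shape={arr.shape} dtype={arr.dtype} "
          f"min={arr.min()} max={arr.max()}")
```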

The number of images is not divisible by the batch size, hence the rounding error. There is a line inside the fit_generator call which looks like:

`steps_per_epoch=int(0.8*n_train_img) // 128,`

It causes the last minibatch of each epoch, whose size is smaller than the other minibatches, to simply be discarded. This is pretty common in ML and not really an issue because the dataset is shuffled every time.
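
To make the arithmetic concrete, a minimal sketch of what that integer division does; the 2676 image count comes from this thread and the 0.8 split and 128 come from the quoted line, used here only as example values:

```python
# How the integer division drops the trailing partial minibatch.
n_train_img = 2676                                   # example value from the thread
batch_size = 128                                     # value in the quoted line

n_used = int(0.8 * n_train_img)                      # 2140 images feed the generator
steps_per_epoch = n_used // batch_size               # 16 full minibatches
leftover = n_used - steps_per_epoch * batch_size     # 92 images skipped this epoch

print(steps_per_epoch, leftover)
```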

FSKhan19 commented 3 years ago

> Try to visualize both your png images and the numpy tensor, and check if there is any noticeable difference between the two. Suggested sanity checks:
>
> * aspect ratio
>
> * number of channels
>
> * scale (i.e. 0-1 or 0-255, ...)
>
> The number of images is not divisible by the batch size, hence the rounding error. There is a line inside the fit_generator call which looks like:
>
> `steps_per_epoch=int(0.8*n_train_img) // 128,`
>
> It causes the last minibatch of each epoch, whose size is smaller than the other minibatches, to simply be discarded. This is pretty common in ML and not really an issue because the dataset is shuffled every time.

OK, I will check. Where did you get the numpy file from?

leoll2 commented 3 years ago

The files were provided to me as part of a university assignment. I no longer have contact with those people, so I'm afraid I can't help in this regard.