ieee8023 / covid-chestxray-dataset

We are building an open database of COVID-19 cases with chest X-ray or CT images.

Sharing my data #21

Closed AleGiovanardi closed 4 years ago

AleGiovanardi commented 4 years ago

Hi, I am doing some research on this topic, applying deep-learning CNNs to build an automated computer-vision-based scanner that detects COVID-positive and COVID-negative scans.

Here you can find my dataset; I am currently building a CT scan dataset to try to train a model on CT scans in addition to RX scans. https://github.com/AleGiovanardi/covidhelper/tree/master/dataset/covidct

I also have a source of new RX and CT scans directly from an Italian hospital, so I will update it periodically. You are welcome to take any data from my repo that is missing from here.

You can also find code that trains a model, saves it, and lets you use it to test detection on scans, based on Adrian Rosebrock's tutorial on PyImageSearch. I am constantly working to improve its performance and accuracy.
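
To make that concrete, here is a minimal sketch of the train/save/reload cycle the script follows, assuming a Keras/TensorFlow setup with a VGG16 base as in the PyImageSearch tutorial; the layer sizes and file names are illustrative, not the exact values in my repo.

```python
# Minimal sketch of the train/save/load cycle (assumed Keras/TensorFlow setup;
# the real script follows the PyImageSearch tutorial more closely).
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import AveragePooling2D, Dense, Dropout, Flatten, Input
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.optimizers import Adam

base = VGG16(weights="imagenet", include_top=False,
             input_tensor=Input(shape=(224, 224, 3)))
base.trainable = False                                  # fine-tune only the new head

head = AveragePooling2D(pool_size=(4, 4))(base.output)
head = Flatten()(head)
head = Dense(64, activation="relu")(head)
head = Dropout(0.5)(head)
head = Dense(2, activation="softmax")(head)             # covid vs. normal

model = Model(inputs=base.input, outputs=head)
model.compile(loss="binary_crossentropy",
              optimizer=Adam(learning_rate=1e-3),
              metrics=["accuracy"])

# model.fit(train_data, validation_data=val_data, epochs=100, batch_size=8)
# model.save("covid_detector.h5")                       # illustrative filename
# model = load_model("covid_detector.h5")               # reload later to test new scans
```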

Also, thanks for your great work, it inspired me a lot!

ieee8023 commented 4 years ago

I believe you are using the same images from this repo? It would help me if you could identify the ones that are not present in this dataset, along with their sources, so I can correctly add the metadata.

The normal cases you are using are pediatric X-rays, so your model just has to predict age rather than the presence of the infection.

AleGiovanardi commented 4 years ago

I added more than 250 cases to the normal dataset and more to the COVID positives. I also built a CT scan dataset that contains many confirmed positive cases. I trained some models on CT instead of RX, and in my experiments CT looks much more reliable. I also debugged my model with Grad-CAM and found that it attends to the right regions of the image while learning. Again, you can check the updates in my repo.
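
For reference, a rough sketch of the Grad-CAM check I mean, assuming a TensorFlow/Keras model; the convolutional layer name is illustrative and would need to match the actual backbone.

```python
# Rough Grad-CAM sketch (assumed TensorFlow/Keras model; layer name is illustrative).
import numpy as np
import tensorflow as tf

def grad_cam(model, image, last_conv_layer="block5_conv3"):
    """Return a heatmap of the regions driving the predicted class.

    `image` is a single preprocessed float array of shape (H, W, 3).
    """
    grad_model = tf.keras.models.Model(
        inputs=model.inputs,
        outputs=[model.get_layer(last_conv_layer).output, model.output],
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        class_channel = preds[:, tf.argmax(preds[0])]
    grads = tape.gradient(class_channel, conv_out)       # d(score) / d(feature map)
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))      # global-average-pool the gradients
    cam = tf.reduce_sum(conv_out[0] * weights, axis=-1)  # weighted sum of feature maps
    cam = tf.nn.relu(cam) / (tf.reduce_max(cam) + 1e-8)  # keep positive evidence, normalize
    return cam.numpy()                                   # upsample/overlay on the scan to inspect
```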

Kind regards.

RazaGR commented 4 years ago

@AleGiovanardi have you seen ieee8023's comment to Adrian Rosebrock here? https://github.com/ieee8023/covid-chestxray-dataset/issues/20#issuecomment-600154029

I find the main issue with your work is the evaluation. You simply cannot claim such high performance for the model: you are only using train and validation sets and no external test set. You should also perform some sort of cross-validation to assess how well the model generalizes.
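
Something along these lines would be a reasonable starting point: a sketch only, assuming images and labels are already loaded as NumPy arrays `X`/`y` and that `build_model()` stands in for the training code above.

```python
# Sketch of a held-out test set plus stratified k-fold cross-validation
# (assumes X, y are NumPy arrays and build_model() is a placeholder).
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)  # the test split is never touched during training or model selection

scores = []
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in folds.split(X_trainval, y_trainval):
    model = build_model()                                # placeholder for the training code
    model.fit(X_trainval[train_idx], y_trainval[train_idx],
              validation_data=(X_trainval[val_idx], y_trainval[val_idx]),
              epochs=25, batch_size=8, verbose=0)
    scores.append(model.evaluate(X_trainval[val_idx], y_trainval[val_idx], verbose=0)[1])

print(f"CV accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
# Only after model selection: report final metrics once on (X_test, y_test).
```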

Have you complied with the above?

AleGiovanardi commented 4 years ago

I test the models against unknown images; images used for testing are NOT used for training the model. The "original" dataset from the tutorial has two main issues: first, the normal-case training set is composed mainly of pediatric images; second, the dataset is really small.

I am trying to fix this by adding more cases to both the normal and COVID datasets. I currently have 264 normal cases and 51 confirmed COVID-positive cases for RX training of my model. I am also building a CT dataset for further testing on that kind of scan, because CT currently performs considerably better and is more precise (and is widely used in many countries).
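
As a side note, with 264 normal vs. 51 COVID images the classes are imbalanced, so something like class weighting may help; a small sketch, assuming integer labels (0 = normal, 1 = COVID).

```python
# Sketch of class weighting for the 264 vs. 51 imbalance (assumed integer labels).
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 264 + [1] * 51)                 # label vector for illustration
weights = compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)
class_weight = dict(enumerate(weights))            # roughly {0: 0.60, 1: 3.09}

# model.fit(..., class_weight=class_weight)        # penalize missed COVID cases more
```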

I am currently adding more cases daily as I find them on the web, and I also have a direct source at a hospital in central Italy that can provide fresh RX and CT scan material.

I have run many tests, and although accuracy and sensitivity are quite high, the dataset needs to be expanded further with confirmed COVID-positive cases before we can test it properly.

Anyway, I tested my models with both RX and CT images, and the initial results are limited but encouraging.

In the training currently running I use 100 epochs with a batch size of 8: train for 29 steps, validate on 60 samples. I then test the model against brand-new images in this way (see the sketch after this list):

1. Test against all negatives to find false positives.
2. Test against all positives to find false negatives.
3. Test against a mixed batch of images (the more the better) containing only 1 positive and all the rest negative; then the inverse.
4. Test a 50/50 positive/negative batch of images.
5. Test a single random image (either negative or positive).
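
These manual batch tests amount to counting false positives and false negatives; a hedged sketch of doing that with a confusion matrix, assuming a saved model file and preprocessed test arrays `X_test`/`y_test` that were never used in training (names are illustrative).

```python
# Sketch: count FP/FN on a held-out test set (assumed saved Keras model,
# integer labels in y_test, preprocessed images in X_test).
import numpy as np
from sklearn.metrics import confusion_matrix
from tensorflow.keras.models import load_model

model = load_model("covid_detector.h5")            # illustrative filename
probs = model.predict(X_test, batch_size=8)
preds = np.argmax(probs, axis=1)                   # 0 = normal, 1 = covid (assumed encoding)

tn, fp, fn, tp = confusion_matrix(y_test, preds, labels=[0, 1]).ravel()
sensitivity = tp / (tp + fn)                       # how many positives we catch
specificity = tn / (tn + fp)                       # how many negatives we clear
print(f"FP={fp} FN={fn} sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```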

I will start heavier training runs as soon as possible, increasing the amount of data augmentation and varying the learning rate and batch size to collect more empirical results.
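
For the augmentation part, this is roughly what I have in mind, assuming Keras' `ImageDataGenerator`; the rotation and shift ranges are illustrative, not final values.

```python
# Illustrative augmentation setup (Keras ImageDataGenerator; ranges are examples).
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=15,          # small rotations only, X-rays are roughly upright
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    fill_mode="nearest",
)

# model.fit(augmenter.flow(X_train, y_train, batch_size=8),
#           validation_data=(X_val, y_val), epochs=100)
```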

The results are quite accurate, but we must remember the dataset is ridiculously small. Still, I think that under medical supervision, and with more training and a larger dataset, we can obtain better results as development goes on.

PS: An important point I noted: given the small training dataset, recognition is still highly sensitive to the composition of the dataset itself. If you train a model on images that all look similar (the smaller the dataset, the stronger this effect), submitting an image that is cropped differently or has different gamma and overall coloration introduces a lot of noise.
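
One possible mitigation is to normalize every image the same way before training and inference; a sketch of what I mean, assuming OpenCV, with illustrative CLAHE parameters.

```python
# Sketch of a consistent preprocessing step to reduce gamma/crop sensitivity
# (assumes OpenCV; CLAHE parameters are illustrative).
import cv2
import numpy as np

def normalize_scan(path, size=(224, 224)):
    """Load an RX/CT image, equalize contrast, and rescale to [0, 1]."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    gray = clahe.apply(gray)                       # local histogram equalization
    gray = cv2.resize(gray, size)
    rgb = cv2.cvtColor(gray, cv2.COLOR_GRAY2RGB)   # 3 channels for an ImageNet backbone
    return rgb.astype("float32") / 255.0
```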