oeg-upm / covid19

Contributions to analyze the COVID-19 Open Research Dataset (CORD-19)

Automatic generation of medical image dataset #7

Open Serge3006 opened 4 years ago

Serge3006 commented 4 years ago

Maybe we can generate a medical image dataset by parsing the COVID-19 papers related to diagnosis techniques using CT scans, X-ray images, etc. A paper's results section is usually packed with images: examples, graphs, and so on.

A first proposal for the system pipeline could be:

1. Parse all the PDFs and extract all the figures.
2. Possibly a second parsing pass to extract subfigures.
3. Classify the figures (not all of them are medical images). At this point we could use a subset of the dataset to train a classifier, or use embeddings to separate medical images from the rest. We could also leverage information from the captions.
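As a rough sketch of the caption-based part of step 3, a simple keyword heuristic could pre-filter figure candidates before training a real classifier. The keyword list and the `looks_medical` / `filter_figures` helpers below are illustrative assumptions, not a settled design:

```python
import re

# Hypothetical keyword heuristic: flag figures whose caption mentions
# common medical-imaging terms. The term list is illustrative and would
# need tuning against the actual CORD-19 captions.
MEDICAL_TERMS = re.compile(
    r"\b(ct\b|computed tomography|x[- ]?ray|radiograph|chest|lung|lesion|opacit)",
    re.IGNORECASE,
)

def looks_medical(caption: str) -> bool:
    """Return True if the caption suggests a medical image."""
    return bool(MEDICAL_TERMS.search(caption))

def filter_figures(figures):
    """Keep (image_path, caption) pairs whose caption looks medical."""
    return [(img, cap) for img, cap in figures if looks_medical(cap)]
```

This would only be a cheap first pass; an image-level classifier (step 3 proper) would still be needed to catch mislabeled captions.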

I don't know if we'll be able to extract a decent amount of images; it depends on how many papers address this type of diagnosis, but we can give it a try.

Using these images we could then train classification models with transfer learning, and we could also apply visualization techniques to explain which regions of the image matter most, etc.

Important Links: http://ai2-website.s3.amazonaws.com/publications/Siegel16eccv.pdf https://github.com/ieee8023/covid-chestxray-dataset

idafensp commented 4 years ago

This sounds like something interesting. I'm not sure about the conclusions that could be derived from the images, but generally speaking, having them with some annotations extracted using NLP would be valuable and of use to many people.

In our group at ULPGC we have experience working with medical imaging (it is indeed our main topic), and I have been working on it lately. We also have cases of applying transfer learning to segmentation.

So if you are going for this and think I can help, I would be happy to do so.

Serge3006 commented 4 years ago

Hi Idafen, that would be great. Yeah, the NLP-annotated dataset would also be valuable, even if it's just a few images. Tomorrow I will start parsing some of the PDFs I've been given access to; there are more than 8,000. I will try extracting the figures and captions, and then we can think about the medical images themselves, where I think you can help a lot :). Let's talk on Slack.

idafensp commented 4 years ago

Sounds great. My point here is that the information we can extract from the images alone will most likely not be very relevant, as they are probably going to be single samples and low-quality figures. But if we are able to relate them to the rest of the text and/or datasets linked in the paper, that could be highly relevant.