qxiaobu / FLANNEL


get_covid_data_dict.py Missing Classifications? #2

Open beyerch opened 3 years ago

beyerch commented 3 years ago

While reviewing the metadata.csv file for the covid-chestxray-dataset-master information, it appears that there are many classifications that are not accounted for. Some of these omissions make sense, but others look like they should be included. This could be due to a change in the format of this file; however, could you confirm this is intended?

In the image below, the distinct findings are shown on the right-hand side. The yellow-highlighted items are forms of pneumonia that would not get picked up by the Python script's logic, as they are not included in the lists defined in the processing script.

image

qxiaobu commented 3 years ago

Yeah, there are richer findings in the latest COVID dataset. Unfortunately, I used an older version from March 2020, which had fewer than 200 images and raw labels in the findings. Maybe you can try it using the latest data. Looking forward to your update.

beyerch commented 3 years ago

Happy to update; however, I just want to confirm how you think they should be handled.

Assume anything under Pneumonia/Bacterial/.... goes to pneumonia_bacteria.

Assume anything under Pneumonia/Viral/... goes to pneumonia_virus.

Should the other items simply be excluded at this point (e.g. Tuberculosis, unknown, todo, Pneumonia/Fungal, Pneumonia/Aspiration)?
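
To make the proposed handling concrete, here is a minimal sketch of the mapping described above. It is not the logic in get_covid_data_dict.py; the function name `map_finding` and the `"covid"` label are hypothetical, and the check for COVID findings is an assumption added so that a COVID entry nested under a viral pneumonia path would not fall into pneumonia_virus. Everything that matches neither rule is excluded by returning `None`.

```python
from typing import Optional


def map_finding(finding: Optional[str]) -> Optional[str]:
    """Map a raw finding string to a class name, or None to exclude it.

    Hypothetical helper: the class names pneumonia_bacteria and
    pneumonia_virus come from the discussion; "covid" is an assumed label.
    """
    text = (finding or "").strip().lower()

    # Assumption: COVID findings are recognisable by the substring "covid"
    # and must be checked first, in case they sit under Pneumonia/Viral/.
    if "covid" in text:
        return "covid"

    parts = [p.strip() for p in text.split("/")]
    if len(parts) >= 2 and parts[0] == "pneumonia":
        if parts[1] == "bacterial":
            return "pneumonia_bacteria"
        if parts[1] == "viral":
            return "pneumonia_virus"

    # Tuberculosis, unknown, todo, Pneumonia/Fungal, Pneumonia/Aspiration, ...
    return None
```

Returning `None` rather than raising keeps the exclusion decision in one place, so the calling code can simply skip rows whose mapped label is `None`.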

beyerch commented 3 years ago

Also, do you know if there is a way to acquire the March 2020 dataset? In the short-term, I'm looking to recreate the work that you did and having the same starting point would make it easier.

qxiaobu commented 3 years ago

In the early stage, labeled images were few and coarse, so we aimed to distinguish COVID images from pneumonia (viral + bacterial) images and healthy images. We then selected the Kaggle data as complementary data for the task, since it has many pneumonia and healthy images. Your data has richer, more fine-grained labels, which makes it more valuable for COVID recognition.

I am not sure whether the other items have enough images for training. If they do, it may be better to do multi-class classification with a larger label set (far more than 4 classes). If not, it is also reasonable to exclude the other items and focus on pneumonia and COVID.
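
One way to check whether the other items have enough images is to count rows per finding in metadata.csv. The sketch below is only illustrative: it assumes the label column is named `finding`, and the threshold of 50 images is an arbitrary placeholder, not a value from this project. The helper names are hypothetical.

```python
import csv
from collections import Counter


def load_rows(path):
    """Read a metadata CSV into a list of dicts (one per image row)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def count_findings(rows, column="finding"):
    """Count rows per raw finding; the column name is an assumption."""
    return Counter((row.get(column) or "").strip() for row in rows)


def split_by_support(counts, min_images=50):
    """Split findings into those with enough images and those without.

    min_images is an arbitrary illustrative threshold.
    """
    keep = {k: n for k, n in counts.items() if n >= min_images}
    drop = {k: n for k, n in counts.items() if n < min_images}
    return keep, drop
```

For example, `keep, drop = split_by_support(count_findings(load_rows("metadata.csv")))` would list the findings that could become their own classes and those that are better excluded.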

qxiaobu commented 3 years ago

I have uploaded the metadata.csv to FLANNEL/original data/. It may help in finding the corresponding images.