lindawangg / COVID-Net

COVID-Net Open Source Initiative

Dataset generation is not working #122

Closed: AFAgarap closed this issue 3 years ago

AFAgarap commented 3 years ago

Issue Template

Before posting, have you looked at the FAQ page?

Yes. My question is not addressed there.

Description

Please include a summary of the issue. The dataset generation notebooks (create_COVIDx.ipynb and create_COVIDx_binary.ipynb) might be out of date. When I ran them, both failed even though I had changed the paths of the dataset folders.

Please include the steps to reproduce. I followed the steps in COVIDx.md.

List any additional libraries that are affected. None

Steps to Reproduce

I followed the steps in data generation.

Expected behavior

The output shown in the notebooks.

Actual behavior

When I remove the following line,

imagename = patientid.split('(')[0] + ' ('+ patientid.split('(')[1] + '.' + row['FORMAT'].lower()

The 4th cell of create_COVIDx_binary.ipynb passes with the following output,

Data distribution from covid datasets:
{'negative': 373, 'normal': 0, 'pneumonia': 57, 'COVID-19': 1770}

This is okay, right? But when I get to the 6th cell, this is the output,

Key:  negative
Test patients:  ['ANON148', 'ANON6', 'ANON152', 'ANON93', 'ANON2', 'ANON193', 'ANON156', 'ANON28', 'ANON143', 'ANON186', 'ANON15', 'ANON65', 'ANON128', 'ANON168', 'ANON120', 'ANON194', 'ANON216', 'ANON131', 'ANON175', 'ANON141']
Key:  pneumonia
Test patients:  ['8', '31']
Key:  COVID-19
Test patients:  ['19', '20', '36', '42', '86', '94', '97', '117', '132', '138', '144', '150', '163', '169', '174', '175', '179', '190', '191COVID-00024', 'COVID-00025', 'COVID-00026', 'COVID-00027', 'COVID-00029', 'COVID-00030', 'COVID-00032', 'COVID-00033', 'COVID-00035', 'COVID-00036', 'COVID-00037', 'COVID-00038', 'ANON24', 'ANON45', 'ANON126', 'ANON106', 'ANON67', 'ANON153', 'ANON135', 'ANON44', 'ANON29', 'ANON201', 'ANON191', 'ANON234', 'ANON110', 'ANON112', 'ANON73', 'ANON220', 'ANON189', 'ANON30', 'ANON53', 'ANON46', 'ANON218', 'ANON240', 'ANON100', 'ANON237', 'ANON158', 'ANON174', 'ANON19', 'ANON195', 'COVID-19(119)', 'COVID-19(87)', 'COVID-19(70)', 'COVID-19(94)', 'COVID-19(215)', 'COVID-19(77)', 'COVID-19(213)', 'COVID-19(81)', 'COVID-19(216)', 'COVID-19(72)', 'COVID-19(106)', 'COVID-19(131)', 'COVID-19(107)', 'COVID-19(116)', 'COVID-19(95)', 'COVID-19(214)', 'COVID-19(129)']
---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
<ipython-input-9-37cbccc040e2> in <module>
     67             if patient[3] == 'sirm':
     68                 image = cv2.imread(os.path.join(ds_imgpath[patient[3]], patient[1]))
---> 69                 gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
     70                 patient[1] = patient[1].replace(' ', '')
     71                 cv2.imwrite(os.path.join(savepath, 'train', patient[1]), gray)

error: OpenCV(4.2.0) /io/opencv/modules/imgproc/src/color.cpp:182: error: (-215:Assertion failed) !_src.empty() in function 'cvtColor'

Environment

Python 3.7.7 (default, Jul 21 2020, 10:29:19) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.

LucasPMoreira commented 3 years ago

I ran into similar issues trying to run the create_COVIDx code.

First, the xlrd package, which pandas used to open Excel files, no longer supports the .xlsx file format, so the code that opens the metadata file in the 3rd cell does not work. The xlrd developers suggest installing the openpyxl package and adapting the read call to use that engine, as below:

sirm_csv = pd.read_excel(sirm_csvpath, engine='openpyxl')

I also got an error running the line:

imagename = patientid.split('(')[0] + ' (' + patientid.split('(')[1] + '.' + row['FORMAT'].lower()

As far as I could inspect, the metadata file 'COVID.metadata.xlsx' has the user ID string without the parentheses, while the actual image filenames have them. The code above tries to split the string on a parenthesis character that does not exist in the metadata file.
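The failure described above can be reproduced without the dataset. The sketch below is illustrative only: `build_imagename_old` copies the notebook expression, while `build_imagename_safe` and the sample IDs (`COVID-19(119)`, `COVID 119`) are assumptions about what the old and new metadata values look like, not values taken from the actual spreadsheet.

```python
def build_imagename_old(patientid, fmt):
    # Same expression as the notebook line quoted above.
    return patientid.split('(')[0] + ' (' + patientid.split('(')[1] + '.' + fmt.lower()


def build_imagename_safe(patientid, fmt):
    """Hypothetical helper: accept IDs with or without parentheses."""
    if '(' in patientid:
        # Old style, e.g. 'COVID-19(119)'
        prefix, rest = patientid.split('(', 1)
        prefix = prefix.strip()
        number = rest.rstrip(')').strip()
    else:
        # Assumed new style, e.g. 'COVID 119'
        prefix, sep, number = patientid.rpartition(' ')
        if not sep:
            raise ValueError('unrecognised patient id: {!r}'.format(patientid))
    return '{} ({}).{}'.format(prefix, number, fmt.lower())


if __name__ == '__main__':
    print(build_imagename_safe('COVID-19(119)', 'PNG'))
    print(build_imagename_safe('COVID 119', 'PNG'))
    try:
        build_imagename_old('COVID 119', 'PNG')
    except IndexError as exc:
        # No '(' in the string, so split('(') yields a single element.
        print('old expression fails:', exc)
```

With an ID that lacks a parenthesis, `split('(')` returns a one-element list, so indexing `[1]` raises `IndexError`; that matches the error reported for this line.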

@AFAgarap I don't believe that removing the line you mentioned would help you (or us). That line rebuilds the file name with the space that the actual file names have between the 'COVID-19' string and the user ID, so that the images can be opened. I think this is what breaks your 6th cell: without the filename editing step, you will not be able to open the images later.

Anyway, the code needs some changes, and I also suggest updating the notebooks.
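One way to check this diagnosis before running the 6th cell: `cv2.imread` returns `None` when it cannot read a file, and passing that `None` to `cv2.cvtColor` produces the `!_src.empty()` assertion shown in the traceback. The sketch below uses only the standard library to list expected file names that are missing on disk; `find_missing_images` and the sample names in the comment are hypothetical, not part of the notebooks.

```python
import os


def find_missing_images(image_dir, filenames):
    """Return the names in `filenames` that are not regular files in `image_dir`."""
    return [
        name for name in filenames
        if not os.path.isfile(os.path.join(image_dir, name))
    ]


# Example call (paths and names are placeholders):
# missing = find_missing_images('/path/to/sirm/images', ['COVID-19 (119).png'])
# if missing:
#     print('Files the notebook would fail to read:', missing[:10])
```

If the returned list is non-empty, the mismatch is in the file names rather than in OpenCV itself.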

AFAgarap commented 3 years ago

Sorry, I wasn't able to update here. But I finally got it working. I changed something in their code, particularly in the filename.

GiovanniTurri commented 3 years ago

The issue is caused by an update of the Kaggle dataset.

I solved it by renaming the files to "COVID-19" (instead of the new "COVID") so that the enumeration with parentheses is added by the OS.

I also edited "COVID.metadata.xlsx" using =CONCATENATE("COVID-19(", E2,")"), where column E holds 1 to 1199, so that the old scripts still work.

Exactly as @LucasPMoreira said

And installed Pillow to remove the cv2.cvtColor error
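For anyone who prefers not to edit the spreadsheet by hand, the CONCATENATE formula mentioned above can be mirrored in Python. This is a minimal sketch: `old_style_ids` is a hypothetical helper, and writing the result back into the metadata file is left out because the exact column layout is not described in this thread.

```python
def old_style_ids(count, prefix="COVID-19"):
    """Build IDs like 'COVID-19(1)' ... 'COVID-19(count)', mirroring
    =CONCATENATE("COVID-19(", E2, ")") with column E running from 1 to count."""
    return ["{}({})".format(prefix, i) for i in range(1, count + 1)]


# 1199 matches the range of column E stated in the comment above.
ids = old_style_ids(1199)
```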

AFAgarap commented 3 years ago

Cell 4 was the actual problem for me. I solved it by removing the line I referenced and replacing it with the following:

imagename = "COVID ({}).png".format(imagename.rsplit(".png")[0].split("COVID ")[1])

That's the only part I changed. The same change applies to both the binary and the multiclass dataset generation notebooks.
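To make the effect of that replacement line concrete, here it is wrapped in a function and applied to a sample name. The input `"COVID 12.png"` is an assumption about what `imagename` holds at that point in the cell; the thread does not show the actual value.

```python
def rewrite_imagename(imagename):
    # Same expression as the replacement line quoted above.
    return "COVID ({}).png".format(imagename.rsplit(".png")[0].split("COVID ")[1])


# Step by step for the assumed input "COVID 12.png":
#   rsplit(".png")[0]  -> "COVID 12"
#   split("COVID ")[1] -> "12"
#   format(...)        -> "COVID (12).png"
example = rewrite_imagename("COVID 12.png")
```

Note that the expression relies on the name containing the prefix `"COVID "` (with a trailing space); a name without it, such as an old-style `COVID-19(12).png`, raises `IndexError`.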

AFAgarap commented 3 years ago

"And installed Pillow to remove the cv2.cvtColor error"

I didn't run into any OpenCV problem related to Pillow, since I already had Pillow installed before this.

AlexSWong commented 3 years ago

This issue has been resolved with the release of the COVIDx V7A and V7B datasets. In addition to covering a larger patient cohort, the generation scripts have been updated to reflect changes to the file structures of the other databases.

AFAgarap commented 3 years ago

Resolved in 11635f7662284ca7b3075e814b33fd93bc94c127