lindawangg / COVID-Net

COVID-Net Open Source Initiative

Add RICORD data creation and include in train/test split #131

Closed andyzzhao closed 3 years ago

andyzzhao commented 3 years ago

Pull Request Template

Description

New workflow for creating COVIDx dataset:

  1. Download required images and files
  2. Run create_ricord_data/create_ricord_dataset.ipynb
  3. Run create_COVIDx.ipynb and/or create_COVIDx_binary.ipynb
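
Steps 2 and 3 could also be scripted instead of being run by hand. The following is only a sketch, assuming Jupyter with nbconvert is installed, the images and files from step 1 are already downloaded, and the command is run from the repository root. The helper names `nbconvert_command` and `run_all` are hypothetical and not part of this PR.

```python
import subprocess

# Notebook order taken from the workflow above; paths are relative to the repo root.
# Either of the last two entries can be dropped if only one COVIDx variant is needed.
NOTEBOOKS = [
    "create_ricord_data/create_ricord_dataset.ipynb",
    "create_COVIDx.ipynb",
    "create_COVIDx_binary.ipynb",
]


def nbconvert_command(notebook):
    """Build the CLI call that executes one notebook in place (hypothetical helper)."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--inplace",
        notebook,
    ]


def run_all(notebooks=NOTEBOOKS):
    """Run the notebooks in order, stopping at the first failure."""
    for notebook in notebooks:
        subprocess.run(nbconvert_command(notebook), check=True)
```

Calling `run_all()` would then execute the three notebooks in sequence; `check=True` raises as soon as one of them fails, so later notebooks never run on incomplete data.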

How Has This Been Tested?

create_COVIDx.ipynb and create_COVIDx_binary.ipynb were both run to confirm that 200 RICORD images were added to the test set and the remaining RICORD images were added to the train set.
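
For reference, the check described above can be sketched as follows. This is an illustration only: the notebooks may select the 200 test images differently (for example, by patient), and the function names and image IDs here are made up.

```python
import random


def split_ricord(image_ids, n_test=200, seed=0):
    """Hold out n_test images for test; everything else goes to train.

    Assumption: a seeded random hold-out. The real notebooks may use
    another selection rule.
    """
    ids = sorted(image_ids)  # fixed starting order so the shuffle is reproducible
    if len(ids) < n_test:
        raise ValueError("not enough images to hold out n_test for test")
    random.Random(seed).shuffle(ids)
    return ids[n_test:], ids[:n_test]


def check_split(train, test, n_test=200):
    """Confirm the test set has the expected size and shares no image with train."""
    assert len(test) == n_test
    assert not set(train) & set(test)


# Usage with made-up image IDs.
train_ids, test_ids = split_ricord([f"ricord_{i}.png" for i in range(1000)])
check_split(train_ids, test_ids)
```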

haydengunraj commented 3 years ago

@andyzzhao this looks good, but we might hold off on merging until we're ready to release this along with the BIMCV data. We'll discuss at tomorrow's meeting.

haydengunraj commented 3 years ago

@mayaliliya in the v8 label files, have we fixed the issues related to inconsistent labels that we had in the v7 dataset (e.g., #126)?

mayaliliya commented 3 years ago

> @mayaliliya in the v8 label files, have we fixed the issues related to inconsistent labels that we had in the v7 dataset (e.g., #126)?

I fixed it in the data.py script with the brute-force pop method. I think we should just merge this, and next week I will open a separate PR that fixes this bug properly and also addresses the duplicate issue more thoroughly (i.e., seeing if we can scavenge more images rather than removing all images with the same URL base).
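
To make the duplicate question concrete, here is a rough sketch of the two options. It assumes "URL base" means the URL with its final path component removed, which may not match what data.py actually does; all function names and URLs are hypothetical.

```python
import posixpath
from collections import defaultdict
from urllib.parse import urlsplit


def url_base(url):
    """Assumed definition: scheme, host, and directory part of the path."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}{posixpath.dirname(parts.path)}"


def drop_all_duplicates(urls):
    """Remove every image whose URL base is shared with another image."""
    groups = defaultdict(list)
    for url in urls:
        groups[url_base(url)].append(url)
    return [url for url in urls if len(groups[url_base(url)]) == 1]


def keep_one_per_base(urls):
    """One possible way to scavenge: keep the first image seen for each URL base."""
    seen = set()
    kept = []
    for url in urls:
        base = url_base(url)
        if base not in seen:
            seen.add(base)
            kept.append(url)
    return kept


# Usage with made-up URLs.
urls = [
    "https://example.com/a/1.png",
    "https://example.com/a/2.png",
    "https://example.com/b/3.png",
]
print(drop_all_duplicates(urls))
print(keep_one_per_base(urls))
```

With these three URLs, the first approach keeps only the image under `/b`, while the second also keeps one image from `/a`.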