Duplicate filenames in train/test text files as per `create_COVIDx_v2.ipynb` and a way to resolve the issue

zeeshannisar commented 4 years ago

Hi, and first of all thank you so much for sharing the data and the network. Although you have removed the duplicates from test_COVIDx.txt, as far as I can tell there are still some duplicate filenames in train_COVIDx.txt. Could you please add the following function at the end of your create_COVIDx_v2.ipynb notebook? It drops the duplicate entries and sorts all of the train/test images into subfolders named after their labels (normal, pneumonia and COVID-19), as follows.

1- train
           |_____ normal
                    |_____ 7,966 images
           |_____ pneumonia
                    |_____ 5,442 images
           |_____ COVID-19
                    |_____ 92 images

2- test
           |_____ normal
                    |_____ 885 images
           |_____ pneumonia
                    |_____ 594 images
           |_____ COVID-19
                    |_____ 10 images
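
To confirm the duplicates described above before changing anything, the split file can be scanned for filenames that appear more than once. This is only a minimal sketch: `find_duplicate_filenames` is a hypothetical helper, the sample lines are made up, and it assumes each line has the three space-separated columns `patientid filename label` with no spaces inside a filename.

```python
from collections import Counter


def find_duplicate_filenames(lines):
    """Return {filename: count} for filenames listed more than once.

    Assumes each non-empty line looks like 'patientid filename label'.
    """
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        counts[parts[1]] += 1
    return {name: n for name, n in counts.items() if n > 1}


# Made-up sample lines, not real COVIDx entries.
sample = [
    "p1 img_a.png normal",
    "p2 img_b.png pneumonia",
    "p1 img_a.png normal",
]
print(find_duplicate_filenames(sample))  # {'img_a.png': 2}
```

On a real split file, the same helper could be fed an open file handle, for example `find_duplicate_filenames(open('train_COVIDx.txt'))`; an empty result would mean no filename is repeated.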

The function is as follows:

import os
import shutil

import pandas as pd
from tqdm.notebook import tqdm  # replaces the deprecated tqdm_notebook


def ArrangeData_LabelNamedFolders(file_path, folder_path, dest_folder_path, indicator):
    """Copy each image into a subfolder named after its label, ignoring duplicate entries."""
    print('{} Operation'.format(indicator))
    # Each line of the split file: patientid filename label
    df = pd.read_csv(file_path, sep=' ', names=['patientid', 'filename', 'label'])
    # Keep only the first entry for every filename
    df = df.drop_duplicates(subset='filename', keep='first')
    labelFolders = df.label.unique()
    print(labelFolders)
    for labelFolder in labelFolders:
        os.makedirs(os.path.join(dest_folder_path, str(labelFolder)), exist_ok=True)
    # Map filename -> label once instead of filtering the DataFrame per image
    labels = dict(zip(df['filename'], df['label']))
    for imageName in tqdm(sorted(os.listdir(folder_path))):
        if imageName not in labels:
            continue  # file on disk that is not listed in the split file
        src = os.path.join(folder_path, imageName)
        dest = os.path.join(dest_folder_path, str(labels[imageName]), imageName)
        shutil.copy(src, dest)

train_file = 'train_split_v2.txt'
train_folder = './data/train'
dest_train_folder = './categorize data/train'

test_file = 'test_split_v2.txt'
test_folder = './data/test'
dest_test_folder = './categorize data/test'

ArrangeData_LabelNamedFolders(test_file, test_folder, dest_test_folder, indicator='Test')
ArrangeData_LabelNamedFolders(train_file, train_folder, dest_train_folder, indicator='Train')
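
After both calls have run, the result can be compared with the per-class counts listed earlier in this comment. The following is a rough sketch rather than part of the proposed function: `count_images_per_class` is a hypothetical helper, and it simply counts every regular file in each class subfolder without checking that the file is an image.

```python
import os


def count_images_per_class(split_dir):
    """Return {class_folder: number_of_files} for one split directory."""
    counts = {}
    for entry in sorted(os.listdir(split_dir)):
        class_dir = os.path.join(split_dir, entry)
        if os.path.isdir(class_dir):
            counts[entry] = sum(
                1
                for name in os.listdir(class_dir)
                if os.path.isfile(os.path.join(class_dir, name))
            )
    return counts


if __name__ == "__main__":
    # Destination folders used in the calls above.
    for split_dir in ("./categorize data/train", "./categorize data/test"):
        if os.path.isdir(split_dir):
            print(split_dir, count_images_per_class(split_dir))
```

If the printed numbers differ from the expected ones, the split file and the image folder are probably out of sync, for instance because some listed files were never downloaded.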

sfleisch commented 4 years ago

I generate a dict with the DICOM files to convert as the keys, then run the conversion. Since dict keys are unique, each file is only converted once.
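
The dict-based idea can be sketched as below. No code was posted with this comment, so this is only a guess at the approach: `unique_by_filename`, the sample rows, and the `print` placeholder standing in for the conversion step are all hypothetical.

```python
def unique_by_filename(rows):
    """Keep the first (patientid, label) pair seen for every filename."""
    by_name = {}
    for patientid, filename, label in rows:
        # setdefault leaves an existing key untouched, so repeats are ignored
        by_name.setdefault(filename, (patientid, label))
    return by_name


# Made-up rows, not real dataset entries.
rows = [
    ("p1", "img_a.dcm", "normal"),
    ("p2", "img_b.dcm", "pneumonia"),
    ("p1", "img_a.dcm", "normal"),
]

files_to_convert = unique_by_filename(rows)
for filename, (patientid, label) in files_to_convert.items():
    # Placeholder for the real conversion step.
    print(filename, patientid, label)
```

Because Python dicts preserve insertion order, the files are visited in the order they first appear in the list.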

lindawangg commented 4 years ago

Duplicates are now removed in the new version.

lindawangg / COVID-Net, issue #26