lindawangg / COVID-Net

COVID-Net Open Source Initiative

COVIDx7A dataset issue with text file #126

Closed Electro1111 closed 3 years ago

Electro1111 commented 3 years ago

I believe there might be an issue with the label text files. In the data loader script:

for i in range(len(batch_files)):
    sample = batch_files[i].split()

    if self.is_training:
        folder = 'train'
    else:
        folder = 'test'

    x = process_image_file(os.path.join(self.datadir, folder, sample[1]),
                           self.top_percent,
                           self.input_shape[0])

batch_files[i] is a single line of the .txt file, and sample[1] takes the item at index 1 (the second token) of line.split().

However, in the train_covidx7A.txt file this does not hold for the sirm entries:

'ANON136 DX.1.2.840.113564.1722810162.20200405112431725920.1203801020003.png COVID-19 actmed\n',
'ANON188 DX.1.2.840.113564.1722810162.20200405142816863980.1203801020003.png COVID-19 actmed\n',
'ANON68 DX.1.2.840.113564.1722810162.20200420135116095500.1203801020003.png COVID-19 actmed\n',
'COVID 1 COVID(1).png COVID-19 sirm\n',
'COVID 2 COVID(2).png COVID-19 sirm\n',

For example, there is a space between "COVID" and "2" in the last line, so line.split()[1] returns the number 2 rather than the file name. This will likely cause errors in training.
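The mismatch can be reproduced with two of the lines quoted above. This is only a small illustration of how `str.split()` tokenizes them, not the loader itself:

```python
# Two label lines copied from the excerpt above.
regular_line = "ANON136 DX.1.2.840.113564.1722810162.20200405112431725920.1203801020003.png COVID-19 actmed\n"
sirm_line = "COVID 2 COVID(2).png COVID-19 sirm\n"

regular_tokens = regular_line.split()
sirm_tokens = sirm_line.split()

# The regular line has 4 tokens, so index 1 is the file name.
print(len(regular_tokens), regular_tokens[1])

# The sirm line has 5 tokens because the ID contains a space,
# so index 1 is "2" and the file name has moved to index 2.
print(len(sirm_tokens), sirm_tokens[1], sirm_tokens[2])
```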

calzoom commented 3 years ago

I got around this inconsistency by using line_split[-2] and line_split[-3] for the label and filename, respectively.
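A minimal sketch of that workaround, assuming every line ends with the filename, the label, and a data source token in that order. The helper name `parse_from_end` is hypothetical and not part of the repository:

```python
def parse_from_end(line):
    """Return (filename, label), counting tokens from the end of the line.

    Assumes the line ends with '<filename> <label> <source>'; a line
    without a trailing source token would be parsed incorrectly.
    """
    line_split = line.split()
    return line_split[-3], line_split[-2]


# Works for both the regular and the sirm layout quoted in this thread.
print(parse_from_end("COVID 2 COVID(2).png COVID-19 sirm\n"))
```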

Electro1111 commented 3 years ago

Yeah, but that won't work on the test set, because the test set is inconsistent: the rsna lines don't have the rsna data source label, and the sirm lines have the same extra-space issue as in the training file. So the indices for image name and class would be:

1,2 or -2,-1 for rsna images (no data source label)
1,2 or -3,-2 for non-rsna, non-sirm images
2,3 or -3,-2 for sirm images

So you should probably condition the indices on the data source label in the test set as well, or on the length of line.split().
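One way to condition on the length of line.split() is sketched below. It assumes only the three layouts described in this thread exist (3, 4, or 5 tokens); the function name `parse_label_line` and the 3-token example line are made up for illustration:

```python
def parse_label_line(line):
    """Return (filename, label) by choosing indices from the token count.

    Assumed layouts (taken from this thread, not verified against the file):
      3 tokens: id filename label            (no data source)
      4 tokens: id filename label source
      5 tokens: id-part id-part filename label source   (ID contains a space)
    """
    tokens = line.split()
    if len(tokens) in (3, 4):
        return tokens[1], tokens[2]
    if len(tokens) == 5:
        return tokens[2], tokens[3]
    raise ValueError(f"unexpected number of tokens ({len(tokens)}): {line!r}")


print(parse_label_line("COVID 2 COVID(2).png COVID-19 sirm\n"))
# Hypothetical 3-token line standing in for an entry without a data source.
print(parse_label_line("some-id some-image.png pneumonia\n"))
```

Raising on any other token count means a new layout would fail loudly instead of silently loading the wrong file.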

On Sun, Feb 7, 2021 at 1:42 PM Japjot Singh notifications@github.com wrote:

I got around this inconsistency by using line_split[-2] and line_split[-3] for the label and filename respectively


calzoom commented 3 years ago

Can you point me to which files don't follow the format? I looked through the test set, and every image has the label as the second-to-last element, with the filename immediately before it.

Electro1111 commented 3 years ago

In the first 197 lines of https://github.com/lindawangg/COVID-Net/blob/master/labels/test_COVIDx7A.txt, the class label is the LAST element, because there is no label for the data source (rsna).

After line 197, the last element is the data source and the second-to-last is the class name.

But then in lines 284-300 there is an extra space in the first element, so it gets split in two:

COVID 70 COVID(70).png COVID-19 sirm
COVID 72 COVID(72).png COVID-19 sirm
COVID 77 COVID(77).png COVID-19 sirm
COVID 81 COVID(81).png COVID-19 sirm
COVID 87 COVID(87).png COVID-19 sirm
COVID 94 COVID(94).png COVID-19 sirm
COVID 95 COVID(95).png COVID-19 sirm
COVID 106 COVID(106).png COVID-19 sirm
COVID 107 COVID(107).png COVID-19 sirm
COVID 116 COVID(116).png COVID-19 sirm
COVID 119 COVID(119).png COVID-19 sirm
COVID 129 COVID(129).png COVID-19 sirm
COVID 131 COVID(131).png COVID-19 sirm
COVID 213 COVID(213).png COVID-19 sirm
COVID 214 COVID(214).png COVID-19 sirm
COVID 215 COVID(215).png COVID-19 sirm
COVID 216 COVID(216).png COVID-19 sirm
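To find such lines without reading the file by hand, a rough check is to group lines by token count and note where each layout first appears. The helper `summarize_layouts` below is a hypothetical sketch; it accepts any iterable of lines, such as an open label file:

```python
from collections import Counter


def summarize_layouts(lines):
    """Count lines per token count and record the first line number of each."""
    counts = Counter()
    first_seen = {}
    for number, line in enumerate(lines, start=1):
        n_tokens = len(line.split())
        if n_tokens == 0:
            continue  # skip blank lines
        counts[n_tokens] += 1
        first_seen.setdefault(n_tokens, number)
    return counts, first_seen


# Small in-memory sample built from lines quoted in this thread.
sample = [
    "ANON136 DX.1.2.840.113564.1722810162.20200405112431725920.1203801020003.png COVID-19 actmed\n",
    "COVID 70 COVID(70).png COVID-19 sirm\n",
    "COVID 72 COVID(72).png COVID-19 sirm\n",
]
counts, first_seen = summarize_layouts(sample)
print(dict(counts), first_seen)
```

For a real label file, the same call would be `summarize_layouts(open(path))`, where `path` points at a local copy of the file.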

haydengunraj commented 3 years ago

Closing this now, as I believe the problem was fixed and we are also now on version 8 of the dataset.