Mismatching file names between image and label files when splitting the data set in 2_split_dataset.py

valentinitnelav commented 2 years ago

Hi @stark-t,

Today I ran, in a JupyterLab notebook, code similar to these lines from 2_split_dataset.py:

image_files = os.listdir(os.path.join(dataset_PATH, 'images'))
label_files = os.listdir(os.path.join(dataset_PATH, 'labels'))

# random split of the images
temp = list(zip(image_files, label_files))
random.Random(555).shuffle(temp)
image_files, label_files = zip(*temp)

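In my case, os.listdir doesn't return the file names in alphabetical order, and the two calls don't return matching orders either, so zip can pair an image with the wrong annotation file. One needs to be extra careful with the correspondence between image and annotation file names.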

For example, if you display the first and last element of temp, you can get

[temp[0], temp[-1]]

[('Lepidoptera_Papilionidae_Parnassius_nordmanni_2838703851_5515648.jpg',
  'Orthoptera_Pamphagidae_Asiotmethis_limbatus_2992697396_344801.txt'),
 ('Araneae_Theridiidae_Pholcomma_gibbum_3031951124_2006549.jpg',
  'Hymenoptera_Formicidae_Formica_rufibarbis_3335448210_2808249.txt')]

I think that, to solve this issue, one needs to sort both lists before zipping and shuffling them:

image_files.sort()
label_files.sort()

Then you get them aligned:

[temp[0], temp[-1]]

[('Araneae_Agelenidae_Agelena_labyrinthica_3079669635_2227294.jpg',
  'Araneae_Agelenidae_Agelena_labyrinthica_3079669635_2227294.txt'),
 ('Orthoptera_Trigonidiidae_Trigonidium_cicindeloides_3355089148_89705.jpg',
  'Orthoptera_Trigonidiidae_Trigonidium_cicindeloides_3355089148_89705.txt')]
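The sort-then-shuffle idea can be sketched roughly as below. This is only an illustration, not the code in 2_split_dataset.py: the helper name paired_shuffle is hypothetical, and it sorts by file name without extension (an assumption, so that the two extensions cannot influence the order) and fails loudly if the names do not line up.

```python
import os
import random


def paired_shuffle(image_files, label_files, seed=555):
    """Sort both lists by stem, check that they pair up, then shuffle the pairs.

    Hypothetical helper for illustration; not part of 2_split_dataset.py.
    """
    def stem(name):
        # File name without its extension
        return os.path.splitext(name)[0]

    image_files = sorted(image_files, key=stem)
    label_files = sorted(label_files, key=stem)

    # Refuse to continue if any image lacks a matching label (or vice versa)
    if [stem(f) for f in image_files] != [stem(f) for f in label_files]:
        raise ValueError("image and label file names do not match")

    pairs = list(zip(image_files, label_files))
    random.Random(seed).shuffle(pairs)  # same seed as in the snippet above
    return pairs
```

Because the pairs are shuffled as tuples, each image stays attached to its own label file regardless of the order in which os.listdir returned the names.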

I'm not sure how this worked in your case, because with mismatched pairs the YOLO model would have given completely random results, since images do not get paired with their actual annotation files. Or maybe this is a behavior that occurs only on Linux? I saw this suggested on SO here:

Note that the order that os.listdir gets the filenames is probably completely dependent on your filesystem

Whatever the case, to be defensive, we should always sort the two lists.

FYI: to check for equality I did this:

image_names = [x.split('.')[0] for x in image_files]
label_names = [x.split('.')[0] for x in label_files]
image_names == label_names # expect True
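Note that x.split('.')[0] cuts at the first dot, which is fine for the file names shown here but would truncate names that contain additional dots. A slightly more defensive variant, sketched below, uses os.path.splitext and also reports which files have no partner. The function name find_unmatched is hypothetical.

```python
import os


def find_unmatched(image_files, label_files):
    """Return (images without a label, labels without an image), compared by stem.

    Hypothetical helper for illustration only.
    """
    image_stems = {os.path.splitext(f)[0] for f in image_files}
    label_stems = {os.path.splitext(f)[0] for f in label_files}
    missing_labels = sorted(image_stems - label_stems)
    missing_images = sorted(label_stems - image_stems)
    return missing_labels, missing_images
```

Two empty lists mean every image has a label and every label has an image.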

valentinitnelav commented 2 years ago

Hm, how do we get rid of the github-actions bot regarding the message above "thank you for your interest in YOLOv5..."? :D

stark-t commented 2 years ago

We should check this for the updated code.

Right now I'm using pandas dataframes and matching images and labels by their file names.

class_df = pd.merge(image_df, label_df, on='file_names', how='outer')

how='outer' will in theory keep the images without labels in the dataframe (their label columns become NaN), but later on all rows with NaNs can be dropped.

if not config.no_labels:
    df = df.dropna()
    df.reset_index(drop=True, inplace=True)
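A small, self-contained sketch of this merge-based matching follows. Only the file_names column and the pd.merge call come from the comment above; the way image_df and label_df are built, the image and label column names, and the example file names are assumptions made for illustration.

```python
import os

import pandas as pd

# Made-up file names; "c.jpg" deliberately has no label file
image_files = ["b.jpg", "a.jpg", "c.jpg"]
label_files = ["a.txt", "b.txt"]

# Assumed layout: one row per file, keyed by the name without extension
image_df = pd.DataFrame({
    "file_names": [os.path.splitext(f)[0] for f in image_files],
    "image": image_files,
})
label_df = pd.DataFrame({
    "file_names": [os.path.splitext(f)[0] for f in label_files],
    "label": label_files,
})

# The outer join keeps images without labels; their "label" value is NaN
class_df = pd.merge(image_df, label_df, on="file_names", how="outer")

# Dropping NaN rows leaves only complete image/label pairs
paired_df = class_df.dropna().reset_index(drop=True)
```

Matching on a key column makes the pairing independent of the order in which the files were listed, so no sorting is needed.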

stark-t / PAI

Mismatching file names between image and label files when splitting the data set in 2_split_dataset.py #19