stark-t / PAI

Pollination_Artificial_Intelligence

Slicing the dataset when splitting in 2_split_dataset.py #22

Closed valentinitnelav closed 2 years ago

valentinitnelav commented 2 years ago

I think there is a minor data loss in 2_split_dataset.py when the train, val, and test lists are sliced at lines 61-77. A few images are missed, but this doesn't impact the quality. It is good to keep this in mind for Paper 1, though.

# Split the images
dataset_length = len(image_files)
train_dataset_length = int(dataset_length * .7)
val_dataset_length = int(dataset_length * .2)
test_dataset_length = int(dataset_length * .1) 
# test_dataset_length should rather be defined like this: 
test_dataset_length = dataset_length - train_dataset_length - val_dataset_length
# otherwise train_dataset_length + val_dataset_length + test_dataset_length is different from dataset_length

#slice training dataset
train_images = image_files[0:train_dataset_length]
train_labels = label_files[0:train_dataset_length]
# All good.

#slice validation dataset
val_images = image_files[train_dataset_length+1:train_dataset_length + val_dataset_length]
val_labels = label_files[train_dataset_length+1:train_dataset_length + val_dataset_length]
# The above should rather be something like this:
val_images = image_files[train_dataset_length:(train_dataset_length + val_dataset_length)]
val_labels = label_files[train_dataset_length:(train_dataset_length + val_dataset_length)]
# There is no need for +1 in `train_dataset_length+1` because that actually skips one element.
# Also, using the brackets helps a bit with reading the slicing operation.

#slice test dataset
test_images = image_files[train_dataset_length + val_dataset_length+1:-1]
test_labels = label_files[train_dataset_length + val_dataset_length+1:-1]
# The above should be:
test_images = image_files[(train_dataset_length + val_dataset_length):]
test_labels = label_files[(train_dataset_length + val_dataset_length):]
# As in the val case, there is no need for +1 in the start index of the slicing operation
# because it skips one element that could be included in the dataset.
# Also, :-1 doesn't take the last element; to include the last element,
# no end index should be given, so just :

FYI, here is a slice minimal example I did so that I understand the issue:

x = [1,2,3,4,5,6,7,8,9,10]
len(x)
# 10

train_dataset_length = 6
val_dataset_length = 2
test_dataset_length = 2

train = x[0:train_dataset_length]
train
# [1, 2, 3, 4, 5, 6]

val = x[train_dataset_length+1:train_dataset_length + val_dataset_length] # current implementation in 2_split_dataset.py
val
# [8]
val = x[train_dataset_length:(train_dataset_length + val_dataset_length)] # proposed correction in slicing
val
# [7, 8]

test = x[train_dataset_length + val_dataset_length+1:-1] # current implementation in 2_split_dataset.py
test
# [] # empty!
test = x[(train_dataset_length + val_dataset_length):] # proposed correction in slicing
test
# [9, 10]

# Should avoid using -1. Check these:
x[0:-1]
# [1, 2, 3, 4, 5, 6, 7, 8, 9]
x[0:-2]
# [1, 2, 3, 4, 5, 6, 7, 8]
# This includes the last element
x[0:]
# [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
x[-1] # selects the last element only, same as x[-1:]
# 10
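Putting the corrections together, here is a minimal sketch of the full corrected split. The file lists are stand-in examples (the real `image_files`/`label_files` come from the dataset on disk); the variable names follow 2_split_dataset.py:

```python
# Hypothetical stand-in data; in 2_split_dataset.py these are the real file lists.
image_files = [f"img_{i}.jpg" for i in range(10)]

dataset_length = len(image_files)
train_dataset_length = int(dataset_length * .7)
val_dataset_length = int(dataset_length * .2)
# Derive the test length from the remainder so the three lengths always sum
# to dataset_length, even when int() truncates.
test_dataset_length = dataset_length - train_dataset_length - val_dataset_length

# Corrected slicing: start indices without +1, open end index for the last slice.
train_images = image_files[0:train_dataset_length]
val_images = image_files[train_dataset_length:(train_dataset_length + val_dataset_length)]
test_images = image_files[(train_dataset_length + val_dataset_length):]

# Sanity check: no image is lost or duplicated.
assert train_images + val_images + test_images == image_files
```

The same slices would be applied to `label_files` so images and labels stay aligned.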
stark-t commented 2 years ago

This should be handled in the new updated code, since data sampling will be done by its ID within a dataframe.
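For reference, a hedged sketch of what ID-based splitting with a dataframe could look like, using pandas `DataFrame.sample` with a fixed seed. The column name and fractions are assumptions for illustration, not the actual updated code:

```python
import pandas as pd

# Hypothetical dataframe of image IDs; the real code would load these from the dataset.
df = pd.DataFrame({"id": range(10)})

# Draw disjoint train/val/test subsets by row, reproducibly via random_state.
train = df.sample(frac=0.7, random_state=42)
remaining = df.drop(train.index)
val = remaining.sample(frac=2/3, random_state=42)  # 2/3 of the remaining 30% = 20% overall
test = remaining.drop(val.index)                   # whatever is left (~10% overall)

# Every ID lands in exactly one split, so nothing is lost or duplicated.
assert len(train) + len(val) + len(test) == len(df)
```

Because each subset is obtained by dropping the previously sampled indices, the splits are disjoint by construction, which avoids the off-by-one slicing issue entirely.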

valentinitnelav commented 2 years ago

Great, I am also trying to make use of a data.frame when sampling the field images. I move slowly with pandas :) I think I will start a separate GitHub repo for P2 and invite you there, as I need feedback on the code / double-checking, or checking whether things make sense.