stark-t / PAI

Pollination_Artificial_Intelligence

Random seed for reproducibility in data split in 2_split_dataset.py #21

Closed valentinitnelav closed 2 years ago

valentinitnelav commented 2 years ago

In the file 2_split_dataset.py, the lines below do not give me full reproducibility: if I run them twice, I get two different results.

# random split of the images
temp = list(zip(image_files, label_files))
random.Random(555).shuffle(temp)
image_files, label_files = zip(*temp)

I had a look at "Shuffle two list at once with same order", but didn't find an answer that actually fixes the issue. None of the suggestions there gives full reproducibility.

I also tried this, but same problem:

random.seed(123)
random.shuffle(temp)
image_files, label_files = zip(*temp)

Do you know how to fix this issue?

valentinitnelav commented 2 years ago

OK, I figured this out. To ensure reproducibility, I must not run those lines of code on their own. I have to run the entire block of code at once, because the shuffle overwrites image_files and label_files. If I run only the shuffling step again, it operates on lists that are already shuffled, so each run gives a different result.
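For anyone hitting the same thing, here is a minimal sketch of one way to make the shuffle safe to re-run. It is not the code from 2_split_dataset.py: the helper name shuffle_pairs and the example file names are made up for illustration, and it assumes the file names are unique. The idea is to sort the pairs into a fixed starting order before shuffling, and to return new lists instead of overwriting the inputs.

```python
import random

# Hypothetical file names standing in for the real image and label lists.
image_files = [f"img_{i}.jpg" for i in range(6)]
label_files = [f"img_{i}.txt" for i in range(6)]


def shuffle_pairs(images, labels, seed=555):
    """Return shuffled copies of both lists, keeping image/label pairs aligned."""
    # Sorting gives a fixed starting order, so the result depends only on
    # the seed and the set of files, not on any earlier shuffle.
    pairs = sorted(zip(images, labels))
    # A dedicated Random instance avoids touching the global random state.
    random.Random(seed).shuffle(pairs)
    shuffled_images, shuffled_labels = zip(*pairs)
    return list(shuffled_images), list(shuffled_labels)


first = shuffle_pairs(image_files, label_files)
second = shuffle_pairs(image_files, label_files)
assert first == second  # same input, same seed, same order

# Feeding in already-shuffled lists still gives the same order,
# because the pairs are sorted before being shuffled.
third = shuffle_pairs(*first)
assert third == first
```

With this shape, running the shuffling step on its own several times should no longer change the result, since the original image_files and label_files are never overwritten.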