Expected behavior
Duplicate images get ignored and only one is stored.
Evidence
Desktop (please complete the following information):
OS: Windows 11
Browser: Chrome
Version 108.0.5359.125 (Official Build) (64-bit)
Additional context
This is a problem because, of the three duplicate images shown, two are in the training set and one is in the validation set. This is going to lead to skewed results.
I checked those three images in the dataset, and they have different hashes. They look similar but have slight differences. We do remove duplicate images, but only when they have the same MD5 hash.
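To illustrate why near-identical images slip through, here is a minimal sketch of exact-match deduplication by MD5. It is not the actual Roboflow implementation; the helper names (`md5_of`, `drop_exact_duplicates`) and the sample byte strings are made up for the example. It only shows that a single changed byte yields a different hash, so the file is kept.

```python
import hashlib


def md5_of(data: bytes) -> str:
    """Return the hex MD5 digest of the raw file bytes."""
    return hashlib.md5(data).hexdigest()


def drop_exact_duplicates(images: dict[str, bytes]) -> dict[str, bytes]:
    """Keep the first file seen for each distinct MD5 digest (hypothetical helper)."""
    seen = set()
    kept = {}
    for name, data in images.items():
        digest = md5_of(data)
        if digest in seen:
            continue  # byte-for-byte duplicate of an earlier file
        seen.add(digest)
        kept[name] = data
    return kept


# Stand-in byte strings instead of real image files (assumption for the demo).
original = bytes(range(256)) * 4
images = {
    "a.jpg": original,                    # first copy
    "b.jpg": original,                    # exact duplicate: same bytes, same hash
    "c.jpg": original[:-1] + b"\x00",     # one byte differs: different hash
}

kept = drop_exact_duplicates(images)
print(list(kept))  # ['a.jpg', 'c.jpg']
```

So `b.jpg` is dropped, but `c.jpg` survives even though it would look the same to a person. Catching that case would need some kind of similarity check rather than an exact hash comparison.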
Describe the bug
After uploading datasets from RoboFlow universe, identical/duplicate images leaked into Source Images.
To Reproduce
Steps to reproduce the behavior: