roboflow / external-bugtracker


Duplicate Images in Source Images #7

Closed shukkkur closed 10 months ago

shukkkur commented 1 year ago

Describe the bug After uploading datasets from Roboflow Universe, identical (duplicate) images ended up in Source Images.

To Reproduce Steps to reproduce the behavior:

  1. Go to 'https://universe.roboflow.com/universidade-de-coimbra-qax7o/volleyball-fvtfx'
  2. Click on 'Download this Dataset'
  3. Scroll down to 'Yolo v7 PyTorch'
  4. Upload to your Workspace

Expected behavior Duplicate images are ignored and only one copy is stored.

Evidence Screenshot_20221228_101906


Additional context This is a problem because, as you can see from the three duplicate images, two are in the training set and one is in the validation set. The same image appearing in both splits will skew the validation results.

SolomonLake commented 10 months ago

I checked those three images in the dataset, and they have different hashes. They look similar but have slight differences. We do remove duplicate images, but only when they have the same MD5 hash.
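
For readers following along, here is a minimal sketch of what hash-based deduplication looks like in general. It is not Roboflow's actual implementation; the function names `md5_of` and `dedupe_by_md5` and the sample byte strings are made up for illustration. It only shows why byte-identical files collapse to one copy while files that differ by even one byte are both kept.

```python
import hashlib
from typing import Dict


def md5_of(data: bytes) -> str:
    """Return the hex MD5 digest of the raw file bytes."""
    return hashlib.md5(data).hexdigest()


def dedupe_by_md5(images: Dict[str, bytes]) -> Dict[str, bytes]:
    """Keep the first file seen for each distinct MD5 digest (hypothetical helper)."""
    seen = set()
    kept = {}
    for name, data in images.items():
        digest = md5_of(data)
        if digest in seen:
            continue  # byte-identical to a file already kept
        seen.add(digest)
        kept[name] = data
    return kept


# Stand-in byte strings, not real image files.
images = {
    "a.jpg": b"\xff\xd8 pixel data",
    "a_copy.jpg": b"\xff\xd8 pixel data",           # exact copy: same hash
    "a_recompressed.jpg": b"\xff\xd8 pixel datb",   # one byte differs: new hash
}

kept = dedupe_by_md5(images)
print(sorted(kept))  # ['a.jpg', 'a_recompressed.jpg']
```

The exact copy is dropped, but the file that differs by a single byte survives, which matches the behavior described above for images that look similar but are not identical.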

shukkkur commented 10 months ago

Okay, thank you :)