Open Jordy-VL opened 1 year ago
RVL_CDIP is large (400K images), and its annotations would need to be converted to COCO format. It would be a great contribution to the document AI community if you could showcase this dataset's quality issues with your tool ;)
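For anyone wondering what the COCO conversion mentioned above might look like, here is a minimal sketch. It assumes each label line has the form `<relative image path> <integer class id>`, and it writes only a classification-style subset of COCO (no `bbox`, no image `width`/`height`, since the images are not opened). The function name `labels_to_coco`, the sample paths, and the shortened class list are hypothetical, and whether fastdup accepts this exact subset is not something this thread confirms.

```python
import json


def labels_to_coco(label_lines, class_names):
    """Convert '<path> <class id>' lines into a minimal COCO-style dict.

    Assumption: one image per line, with the class id as the last
    whitespace-separated token. No bounding boxes are produced.
    """
    categories = [{"id": i, "name": name} for i, name in enumerate(class_names)]
    images, annotations = [], []
    # Skip blank lines so ids stay contiguous.
    cleaned = (line.strip() for line in label_lines)
    for idx, line in enumerate(l for l in cleaned if l):
        path, label = line.rsplit(maxsplit=1)
        class_id = int(label)
        if not 0 <= class_id < len(class_names):
            raise ValueError(f"class id {class_id} out of range in line: {line!r}")
        images.append({"id": idx, "file_name": path})
        annotations.append({"id": idx, "image_id": idx, "category_id": class_id})
    return {"images": images, "annotations": annotations, "categories": categories}


# Hypothetical sample input, not real dataset paths.
sample_lines = ["images/doc_0001.tif 0", "images/doc_0002.tif 2"]
sample_classes = ["letter", "form", "email"]
coco = labels_to_coco(sample_lines, sample_classes)
print(json.dumps(coco, indent=2))
```

For the full dataset you would read the label file line by line and pass the complete class list instead of the three sample names.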
Hi @dnth!
I just wanted to let you know that I was able to run fastdup on RVL-CDIP
with the following results:
2023-06-22 11:56:43 [INFO] Found a total of 35106 fully identical images (d>0.990), which are 4.39 %
2023-06-22 11:56:43 [INFO] Found a total of 188747 nearly identical images(d>0.980), which are 23.59 %
2023-06-22 11:56:43 [INFO] Found a total of 769216 above threshold images (d>0.900), which are 96.15 %
2023-06-22 11:56:43 [INFO] Found a total of 40079 outlier images (d<0.050), which are 5.01 %
2023-06-22 11:56:43 [INFO] Min distance found 0.684 max distance 1.000
Sharing the analysis HTML files here: analysis
I believe this shows how useful your tools are on this dataset, though the results still need further visual inspection with the visual-layer tool :)
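As a quick sanity check on the summary lines quoted above, the sketch below parses them with a regular expression and back-computes the total that each count/percentage pair implies. The regex is written against these four lines only and is an assumption about the log format, not a documented fastdup interface; `parse_summary` is a hypothetical helper name.

```python
import re

# Summary lines copied verbatim from the log quoted in this comment.
LOG = """\
2023-06-22 11:56:43 [INFO] Found a total of 35106 fully identical images (d>0.990), which are 4.39 %
2023-06-22 11:56:43 [INFO] Found a total of 188747 nearly identical images(d>0.980), which are 23.59 %
2023-06-22 11:56:43 [INFO] Found a total of 769216 above threshold images (d>0.900), which are 96.15 %
2023-06-22 11:56:43 [INFO] Found a total of 40079 outlier images (d<0.050), which are 5.01 %
"""

# The space before "(" is optional because one of the lines omits it.
PATTERN = re.compile(
    r"Found a total of (?P<count>\d+) (?P<kind>.+?) images\s*"
    r"\(d(?P<op>[<>])(?P<thr>[\d.]+)\), which are (?P<pct>[\d.]+) %"
)


def parse_summary(log):
    """Return one dict per summary line, including the implied denominator."""
    rows = []
    for m in PATTERN.finditer(log):
        count = int(m["count"])
        pct = float(m["pct"])
        rows.append({
            "kind": m["kind"],
            "count": count,
            "threshold": m["op"] + m["thr"],
            "percent": pct,
            # count = pct/100 * total  =>  total = count / (pct/100)
            "implied_total": round(count / (pct / 100)),
        })
    return rows


rows = parse_summary(LOG)
for row in rows:
    print(row)
```

With the figures above, every implied total comes out at roughly 800K, about twice the 400K image count mentioned in this thread. The log does not say what the denominator counts, so this is something to check rather than a conclusion.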
Hello @Jordy-VL! It's mind-blowing how many duplicates there are in the dataset! I think this would be very helpful to the community that works with this dataset. Thank you for sharing it :)
I would like to use your tool to investigate data noise in https://huggingface.co/datasets/aharley/rvl_cdip and https://ds4sd.github.io/icdar23-doclaynet/
The literature already reports plenty of noise in RVL_CDIP, but your tool could provide more quantitative insight.