visual-layer / visuallayer

Simplify Your Visual Data Ops. Find and visualize issues in your computer vision datasets, such as duplicates, anomalies, data leakage, and mislabels.
https://www.visual-layer.com/
Apache License 2.0

Feature request: RVL_CDIP and DocLayNet #44

Open Jordy-VL opened 1 year ago

Jordy-VL commented 1 year ago

I would like to use your tool to investigate data noise in https://huggingface.co/datasets/aharley/rvl_cdip and https://ds4sd.github.io/icdar23-doclaynet/

It is already known in the literature that there is plenty of noise in RVL_CDIP, yet your tool could provide more quantitative insight.

Jordy-VL commented 1 year ago

RVL_CDIP has the issue of being 400K images, and the annotations would need to be converted to COCO format. It would be a great contribution to the document AI community if you could showcase this dataset's quality issues with your tool ;)
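For reference, a minimal sketch of the conversion mentioned above: RVL-CDIP ships plain-text label files with one `path label` pair per line, which could be mapped to a COCO-style dict roughly like this. The field layout follows the standard COCO JSON structure; the helper name and exact label-file layout are assumptions.

```python
import json

# The 16 RVL-CDIP document categories, in their standard index order
CATEGORIES = [
    "letter", "form", "email", "handwritten", "advertisement",
    "scientific report", "scientific publication", "specification",
    "file folder", "news article", "budget", "invoice",
    "presentation", "questionnaire", "resume", "memo",
]

def labels_to_coco(label_lines):
    """Build a minimal COCO-style dict from 'path label' lines
    (the layout used by RVL-CDIP's train/val/test label files)."""
    coco = {
        "images": [],
        "annotations": [],
        "categories": [
            {"id": i, "name": name} for i, name in enumerate(CATEGORIES)
        ],
    }
    for image_id, line in enumerate(label_lines):
        path, label = line.rsplit(maxsplit=1)
        coco["images"].append({"id": image_id, "file_name": path})
        coco["annotations"].append({
            "id": image_id,
            "image_id": image_id,
            "category_id": int(label),
        })
    return coco

if __name__ == "__main__":
    demo = ["imagesa/a/a/sample_1.tif 15", "imagesb/b/b/sample_2.tif 11"]
    print(json.dumps(labels_to_coco(demo), indent=2))
```

Bounding boxes are omitted since RVL-CDIP is an image-classification dataset; a whole-image annotation per file is enough for most tooling that expects COCO input.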

dnth commented 1 year ago

Hi @Jordy-VL thank you for the comment. We will add this to our roadmap. In the meantime, you can also try it out yourself using our no-code platform here for free.

Or, if you're feeling adventurous enough to run some code, try using fastdup.

Jordy-VL commented 1 year ago

Hi @dnth!

I just wanted to let you know that I was able to run fastdup on RVL-CDIP with the following results:

2023-06-22 11:56:43 [INFO] Found a total of 35106 fully identical images (d>0.990), which are 4.39 %
2023-06-22 11:56:43 [INFO] Found a total of 188747 nearly identical images (d>0.980), which are 23.59 %
2023-06-22 11:56:43 [INFO] Found a total of 769216 above threshold images (d>0.900), which are 96.15 %
2023-06-22 11:56:43 [INFO] Found a total of 40079 outlier images          (d<0.050), which are 5.01 %
2023-06-22 11:56:43 [INFO] Min distance found 0.684 max distance 1.000
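fastdup surfaces near-duplicates via embedding distance; for the "fully identical" bucket alone, a stdlib-only sketch (independent of fastdup, with assumed directory layout) can cross-check the count by grouping byte-identical files:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_exact_duplicates(image_dir):
    """Group byte-identical files under image_dir by SHA-256 digest.
    This catches only exact duplicates, not the near-duplicates
    that fastdup's embedding distance (d > 0.98) also surfaces."""
    groups = defaultdict(list)
    for path in sorted(Path(image_dir).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    # Keep only digests shared by more than one file
    return {d: paths for d, paths in groups.items() if len(paths) > 1}
```

On a dataset the size of RVL-CDIP this is I/O-bound but embarrassingly parallel, and it gives an exact lower bound on the duplicate count that the similarity-based numbers above should never fall below.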

Sharing the analysis HTML reports here: analysis

I believe this shows the usefulness of your tools on this dataset; the results warrant further visual inspection with the visual-layer tool :)

dnth commented 1 year ago

Hello @Jordy-VL! It's mind-blowing how many duplicates are in the dataset! I think this would be very helpful to the community that works with this dataset. Thank you for sharing it :)