openimages / dataset

The Open Images dataset
https://storage.googleapis.com/openimages/web/index.html
Apache License 2.0

How to report invalid/questionable images? #94

Closed: monocongo closed this issue 4 years ago

monocongo commented 4 years ago

In the course of my work with images downloaded from OpenImages, I have come across a number of problematic images/bounding box annotations that I think should be removed or rectified. Is there a mechanism in place for this sort of QA?

A few examples of images from the "Person" class with wonky bounding boxes are attached below for illustration.

invalid_image_6aba33edb9fe95cb invalid_image_bcfc243a325a2ca9 invalid_image_5dc899d00131b7d4 invalid_image_f9c48ae8a9e84897 invalid_image_77f632ec95bf07da invalid_image_5bc24fd1de23195c invalid_image_74bcf3d8c63eae74 invalid_image_1d3cfb4202e92607 invalid_image_d7c08ba3bb124f50 invalid_image_2c3f33943376fff3

monocongo commented 4 years ago

Here are some "Car" images:

invalid_image_48caab780670bb94 invalid_image_fb3a264dee758205 invalid_image_64ed875ee9b3a61c invalid_image_a34cbc1c83c753e9 invalid_image_304ac9e670d86a38

jponttuset commented 4 years ago

Dear @monocongo, we do not have a mechanism in place for incorporating this sort of feedback. However, some of these are correct according to our definition:

monocongo commented 4 years ago

Thanks for your response, @jponttuset.

Before including images from OpenImages in my dataset, I first filter out all images that have any of these attributes marked as true, so none of the images above should be marked as such:

    # filter out images that are occluded, truncated, group, depiction, inside, etc.
    for reject_field in ("IsOccluded", "IsTruncated", "IsGroupOf", "IsDepiction", "IsInside"):
        df_images = df_images[df_images[reject_field] == 0]
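
For anyone who wants to try this without the full download, here is a minimal self-contained sketch of the same filter. It assumes pandas is installed and that the attribute columns use the names above; the toy rows and the helper name `keep_clean_boxes` are made up for illustration. As far as I can tell from the dataset description, these attributes can also be -1 (unknown), and comparing with `== 0` drops those rows too.

```python
import pandas as pd

# Attribute columns to reject on, as named in the snippet above.
REJECT_FIELDS = ["IsOccluded", "IsTruncated", "IsGroupOf", "IsDepiction", "IsInside"]


def keep_clean_boxes(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows where every reject attribute is exactly 0.

    Rows with a value of 1 (true) or -1 (assumed to mean unknown) are dropped.
    """
    mask = (df[REJECT_FIELDS] == 0).all(axis=1)
    return df[mask]


# Hypothetical toy rows standing in for a real annotations file.
df_images = pd.DataFrame({
    "ImageID": ["a", "b", "c"],
    "IsOccluded": [0, 1, 0],
    "IsTruncated": [0, 0, 0],
    "IsGroupOf": [0, 0, -1],
    "IsDepiction": [0, 0, 0],
    "IsInside": [0, 0, 0],
})

clean = keep_clean_boxes(df_images)
print(clean["ImageID"].tolist())
```

Building one boolean mask across all five columns should give the same result as the loop, while avoiding re-slicing the frame once per attribute.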

I realize that this dataset is free and you get what you pay for, and my thought was to help with quality control as a small contribution to the project. I don't get the impression from your response that this is of much interest, nor is there a mechanism in place to facilitate improvements when issues are detected. If this changes, or if I'm mistaken, then please contact me if I can help. I have lists of problematic images from the dataset that I use as an exclusion filter in my own work, in case those would be useful to others. I have not been able to go through more than a couple thousand images so far, but I have found roughly 10% of them to be problematic, so it appears that the dataset could benefit from additional attention to quality control.
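
In case it helps others, here is a rough sketch of how an exclusion list like that can be applied. It uses only the standard library and assumes one image ID per line in the exclusion file and an `ImageID` column in the annotations. The helper names, the `Label` column, and the ID `0000000000000001` are hypothetical; the other two IDs are taken from the attachment names above, on the assumption that the suffix is the image ID.

```python
import csv
import io


def load_exclusions(lines):
    """Return the set of non-empty, stripped image IDs from an iterable of lines."""
    return {line.strip() for line in lines if line.strip()}


def filter_rows(rows, excluded_ids):
    """Yield annotation rows whose ImageID is not in the exclusion set."""
    for row in rows:
        if row["ImageID"] not in excluded_ids:
            yield row


# Hypothetical in-memory stand-ins for an exclusion file and an annotations CSV.
exclusion_text = "6aba33edb9fe95cb\nbcfc243a325a2ca9\n"
annotations_text = (
    "ImageID,Label\n"
    "6aba33edb9fe95cb,Person\n"
    "0000000000000001,Person\n"
    "bcfc243a325a2ca9,Person\n"
)

excluded = load_exclusions(io.StringIO(exclusion_text))
rows = csv.DictReader(io.StringIO(annotations_text))
kept = [row["ImageID"] for row in filter_rows(rows, excluded)]
print(kept)
```

With real files, the two `io.StringIO` objects would be replaced by open file handles; keeping the IDs in a set makes each lookup cheap even for long exclusion lists.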

jponttuset commented 4 years ago

Hi @monocongo,

> I don't get the impression from your response that this is of much interest, nor is there a mechanism in place to facilitate improvements when issues are detected.

At the scale of Open Images, there is no easy way of incorporating this type of feedback, as we would need to verify all of the flagged content, probably with diminishing returns. I understand where you're coming from, but I encourage you to think at the scale of 15 million boxes.

In any case, thank you for your feedback and for sharing the list of problematic images. I hope that Open Images is useful for your work despite its imperfections.

monocongo commented 4 years ago

I'm by no means an expert, but I wonder how useful a dataset is for training object detection models if it's as dubious as Open Images appears to be. My assumption is that this sort of quality consideration should matter more than it appears to, as there seems to be a surprisingly high percentage of low-quality images/boxes in this dataset. It may be that, counterintuitively, the adage "garbage in, garbage out" doesn't apply so much to the areas where this dataset is typically used. I have been assuming that removing questionable images such as the ones shown above will result in better training outcomes when the dataset is used as training input for object detection models. Perhaps I should have run some experiments to verify this assumption before pestering you guys about it.

In any event, thanks for the work you guys do to provide this dataset to the community -- while not perfect, it's nevertheless quite useful. Much appreciated!