gurugaurav / bing_image_downloader

Python library to download bulk of images from Bing.com
https://pypi.org/project/bing-image-downloader/
MIT License

Duplicate Images #9

Open ansariyusuf opened 3 years ago

ansariyusuf commented 3 years ago

I am trying to create a food dataset. However, when I try to scrape from Bing using this library, I am getting a lot of duplicate images. Please assist.

Thank you

NickT5 commented 3 years ago

My first attempt to filter out duplicates would be to subtract two possible duplicated images and check if the difference is close to zero.
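A minimal sketch of that idea, assuming both images have already been decoded to the same size and mode and flattened into sequences of 0-255 channel values (decoding itself needs an image library and is not shown). The function names and the `tolerance` default are hypothetical, chosen only for illustration:

```python
def mean_abs_difference(pixels_a, pixels_b):
    """Mean absolute per-value difference between two decoded images.

    Both arguments are flat sequences of 0-255 channel values (for example
    bytes) of the same length. Returns None when the lengths differ.
    """
    if len(pixels_a) != len(pixels_b):
        return None  # different dimensions: not comparable this way
    if not pixels_a:
        return 0.0
    total = sum(abs(a - b) for a, b in zip(pixels_a, pixels_b))
    return total / len(pixels_a)


def looks_duplicate(pixels_a, pixels_b, tolerance=2.0):
    """Treat two images as duplicates if their mean difference is tiny."""
    diff = mean_abs_difference(pixels_a, pixels_b)
    return diff is not None and diff <= tolerance
```

Comparing every pair this way costs time quadratic in the number of images, so it is probably better suited to a cleanup pass over a finished download than to a check during scraping.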

atsbomb commented 3 years ago

I'm getting the same. I downloaded 10000 pictures and 9789 of them were duplicates. Is this inherent to Bing image search, or specific to this downloader?

jane-cz commented 3 years ago

When I scrape 100 photos, after the first 85 to 90 images, they start to repeat, and the rest are all duplicates. When I scrape 500 photos, 370 of them are duplicates :( Other than this it works great, so I really hope this issue can get fixed.

AbhiDhariwal commented 3 years ago

Yeah, I also faced the same issue. It is due to how the downloader is programmed: there is no "next page" in Bing, so instead of setting first=pagecounter, set first to the total number of URLs visited so far. I also added a check that ignores a URL if it has already been visited. I will also open a pull request with the code, or you can visit https://github.com/AbhiDhariwal/bing_image_downloader
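The approach described in this comment could look roughly like the sketch below. It is not the code from the fork: `collect_unique_links` and `fetch_page` are hypothetical names, and `fetch_page(first)` stands in for whatever function requests a results page at offset `first` and returns the image URLs found on it.

```python
def collect_unique_links(fetch_page, limit):
    """Collect up to `limit` distinct image URLs.

    `fetch_page(first)` is a hypothetical callable that returns the list of
    image URLs found when the search is requested with offset `first`.
    """
    seen = set()   # every URL visited so far
    ordered = []   # the same URLs, in discovery order

    while len(ordered) < limit:
        # Offset by the number of URLs already visited, not a page counter.
        links = fetch_page(len(seen))
        added = 0
        for link in links:
            if link in seen:
                continue  # ignore URLs that were already visited
            seen.add(link)
            ordered.append(link)
            added += 1
            if len(ordered) == limit:
                break
        if added == 0:
            break  # nothing new on this page: stop instead of looping forever

    return ordered
```

Stopping when a page contributes no new URL is an assumption made here so that the loop always terminates; the search may simply have fewer distinct results than requested.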

shoppel commented 3 years ago

I successfully avoided duplicated images with the following code. But now it will search forever. So yeah, maybe we need a next button for more images.

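```python
import imghdr  # note: removed from the standard library in Python 3.13
import urllib.request

# In __init__():
#     self.duplicates = set()  # raw bytes of every image saved so far

def save_image(self, link, file_path):
    request = urllib.request.Request(link, None, self.headers)
    image = urllib.request.urlopen(request, timeout=self.timeout).read()

    # Reject anything that is not a recognised image or was already saved.
    if not imghdr.what(None, image) or image in self.duplicates:
        print('[Error]Invalid image, not saving {}\n'.format(link))
        # A bare `raise` has no active exception here; raise one explicitly.
        raise ValueError('Invalid or duplicate image: {}'.format(link))

    self.duplicates.add(image)

    with open(file_path, 'wb') as f:
        f.write(image)
```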
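One possible way to address the "search forever" problem mentioned above is sketched here, under assumptions: keep a SHA-256 digest of each saved image instead of its full bytes, and give up after a run of consecutive duplicates. `DuplicateFilter` and its default threshold are hypothetical and are not part of the library. Like the snippet above, this only catches byte-identical copies.

```python
import hashlib


class DuplicateFilter:
    """Remember SHA-256 digests of saved images and cap repeated misses."""

    def __init__(self, max_consecutive_duplicates=50):
        self.digests = set()
        self.max_consecutive_duplicates = max_consecutive_duplicates
        self.consecutive_duplicates = 0

    def is_new(self, image_bytes):
        """Return True and record the image if its bytes were not seen."""
        digest = hashlib.sha256(image_bytes).digest()
        if digest in self.digests:
            self.consecutive_duplicates += 1
            return False
        self.digests.add(digest)
        self.consecutive_duplicates = 0
        return True

    def should_stop(self):
        """True once too many duplicates in a row suggest no new results."""
        return self.consecutive_duplicates >= self.max_consecutive_duplicates
```

A download loop could call `is_new(image)` before writing each file and break out once `should_stop()` returns True.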

sid7631 commented 3 years ago

Remove duplicates PR#20

annabaringer commented 2 years ago

Bumping this as an issue. The fix above looks like it works and would be great if merged. Thanks!

sid7631 commented 2 years ago

Please close this issue