scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

ImagePipeline breaks on invalid images #5079

Gallaecio closed this issue 2 years ago

Gallaecio commented 3 years ago

@apalala reported this issue when a site sends some invalid images, and suggested a fix for the spider middleware:

        image_stream = self.get_images(response, request, info, item=item)
        while True:
            try:
                path, image, buf = next(image_stream)
            except StopIteration:
                break
            except Exception:
                continue

I’ve not checked if there’s a cleaner fix possible (i.e. closer to the call to Pillow code), but it should be trivial to fix either way. I suspect writing a test may be the hardest part here.

marlenachatzigrigoriou commented 3 years ago

Hello, I am new here. I would like to contribute to this issue.

apalala commented 3 years ago

Let me gather a few broken images from fandangonow.com on the next run, @Gallaecio.

The exception raised by PIL is OSError.

apalala commented 3 years ago

Examples of broken images:

Gallaecio commented 3 years ago

@marlenachatzigrigoriou Thank you! Please let us know if you have any questions.

It should be possible to reproduce this issue by creating a Scrapy spider that uses ImagesPipeline and yields an item with its image_urls set to ['https://img01.mgo-images.com/image/thumbnail/v2/content/MMV5CEBE0C5DC82A5028E7EA6B91F904DEF3.jpeg'].

marlenachatzigrigoriou commented 3 years ago

I've checked the related code and documentation. I've tried to create a spider as @Gallaecio indicated, using ImagesPipeline, but it doesn't store any of the images. Should @apalala's suggestion be embedded in the image_downloaded method of scrapy.pipelines.images.ImagesPipeline? I think I need a bit more guidance.

Gallaecio commented 3 years ago

I believe the main issue here is that if an item has 1 valid image URL listed after a URL to an image that Pillow fails to parse, neither of the 2 images gets downloaded.

So to reproduce the issue it would be better to use something like ['https://img01.mgo-images.com/image/thumbnail/v2/content/MMV5CEBE0C5DC82A5028E7EA6B91F904DEF3.jpeg', 'https://scrapy.org/img/scrapylogo.png'].

@apalala's suggestion should be embedded in scrapy.pipelines.images.ImagesPipeline image_downloaded method?

I believe so.

You could first try with the current Scrapy implementation and ['https://img01.mgo-images.com/image/thumbnail/v2/content/MMV5CEBE0C5DC82A5028E7EA6B91F904DEF3.jpeg', 'https://scrapy.org/img/scrapylogo.png'], and verify that none of the 2 images get downloaded.

Then try adding @apalala’s code to your local clone of Scrapy, installing that version of Scrapy (i.e. pip install -e <path to your local Scrapy clone>), and running the spider again to see if this time the Scrapy logo gets downloaded.

marlenachatzigrigoriou commented 3 years ago

I have reproduced the issue as @Gallaecio indicated and I am trying to fix the code in my local clone. Is the goal to download both images (valid and invalid), or just the valid one, when image_urls is set to ['https://img01.mgo-images.com/image/thumbnail/v2/content/MMV5CEBE0C5DC82A5028E7EA6B91F904DEF3.jpeg', 'https://scrapy.org/img/scrapylogo.png']?

Gallaecio commented 3 years ago

I believe the goal is for valid images to get downloaded, i.e. to prevent invalid images (which are not expected to be downloaded) from causing valid images to not be downloaded either. For invalid images, as long as there’s some feedback in the logs about them being invalid, I think that’s OK. Users that want invalid images can always switch to the files media pipeline, which downloads all files without any kind of image processing.
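For reference, switching from the images pipeline to the files media pipeline mentioned above is a settings change along these lines (a sketch; the store path is an assumption to adjust for your project):

```python
# settings.py sketch: FilesPipeline downloads every file as-is,
# without the Pillow-based processing that ImagesPipeline performs.
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = '/path/to/downloads'  # assumption: point this at your own directory

# Items then carry 'file_urls' (input) and 'files' (output) fields
# instead of 'image_urls' and 'images'.
```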

apalala commented 3 years ago

Note that this should be replaced by specific exception types (like OSError) and handling to avoid the risk of an infinite loop:

except Exception:
    continue

The sample code relies too much on the iterator eventually reaching StopIteration.

except OSError:
    logger.exception('Could not process image')
except Exception:
    logger.exception('Stopped processing images')
    break

marlenachatzigrigoriou commented 3 years ago

I have combined @apalala's code with the existing one (in the ImagesPipeline image_downloaded method). The valid images are downloaded while the invalid ones are not, and the OSError is now handled rather than raised. Should I continue with the creation of a test in the test/test_pipeline_images.py file?

Gallaecio commented 3 years ago

@marlenachatzigrigoriou That would be great. You could also already create a (draft) pull request with what you have so far, so that we can provide some feedback.

spittieUM commented 3 years ago

Hello,

I know this issue looks to be resolved, but I worked on the bug as well. Will link a pull request that fixes the error and adds a test.

Thanks.

SabbirAhmedAdor629 commented 3 years ago

Hi, I want to work on this. Is it still available?

Gallaecio commented 3 years ago

There are 2 ongoing pull requests to fix it already. Unless you want to review those pull requests and provide feedback, it might be best to work on something else.

Gallaecio commented 3 years ago

@apalala @marlenachatzigrigoriou @spittieUM Looking deeper into the code to provide better feedback for your solutions, it started to seem to me that this issue should not be happening at all, because exceptions from the get_images method should not block the download of other images.

To be sure, I went ahead and tested the issue myself, creating the following file and running it with scrapy runspider in a Python virtual environment with Scrapy and Pillow installed:

from scrapy import Spider

class TestSpider(Spider):
    name = 'test'
    start_urls = ['http://toscrape.com/']

    custom_settings = {
        'IMAGES_STORE': '',  # TODO: Set your own download path
        'ITEM_PIPELINES': {
            'scrapy.pipelines.images.ImagesPipeline': 1,
        }
    }

    def parse(self, response):
        yield {
            'image_urls': [
                'http://toscrape.com/img/zyte.png',
                'https://img01.mgo-images.com/image/thumbnail/v2/content/MMV5CEBE0C5DC82A5028E7EA6B91F904DEF3.jpeg',
                'http://toscrape.com/img/books.png',
            ],
        }

This spider behaved as expected: the 2 good images were downloaded, and the bad one was not.

@apalala Either I’m not understanding the issue correctly, or there is no issue.

marlenachatzigrigoriou commented 3 years ago

@Gallaecio I agree. From the start, without changing the code, the spider was downloading the valid images and not the invalid ones. In my case, the raised OSError exception was visible in the command line but wasn't affecting the download, so I thought the issue was to handle the OSError so that it wouldn't be raised anymore.

apalala commented 3 years ago

@Gallaecio, it may be that the only issue is that having an OSError stack trace escape to the logs is unexpected.

Gallaecio commented 3 years ago

@apalala Shall we close this issue then, or would you want to hide that error or remove the traceback?

marlenachatzigrigoriou commented 3 years ago

@Gallaecio, if @apalala wants to remove the traceback, I think that #5097 solves it. But I guess the tests need more work.

apalala commented 3 years ago

I'd say we can move to close this if the OSError does not escape to the logs.

tonycoldashian commented 2 years ago

Hey! It's my first time working with Open Source and I'm interested in this and would like to help.

jmat94 commented 2 years ago

Hi! I am new to Open Source and I would like to contribute by coming up with a solution to this issue.

Gallaecio commented 2 years ago

I'd say we can move to close this if the OSError does not escape to the logs.

Closing, then.

@marlenachatzigrigoriou I’m really sorry for wasting your time, I should have tested this myself before opening the issue report.

marlenachatzigrigoriou commented 2 years ago

That's fine, @Gallaecio. Thank you for the help with my first attempt to contribute, no matter how this issue ended!