Closed. @Gallaecio closed this issue 2 years ago.
Hello, I am new here. I would like to contribute to this issue.
Let me gather a few broken images from fandangonow.com on the next run, @Gallaecio.
The exception raised by PIL is OSError.
Examples of broken images:
@marlenachatzigrigoriou Thank you! Please, let us know if you have any question.
It should be possible to reproduce this issue by creating a Scrapy spider that uses ImagePipeline and yields an item with its image_urls set to ['https://img01.mgo-images.com/image/thumbnail/v2/content/MMV5CEBE0C5DC82A5028E7EA6B91F904DEF3.jpeg'].
I've checked the related code and documentation. I've tried to recreate a spider as @Gallaecio indicated, using ImagePipeline, but it doesn't store any of the images. Should @apalala's suggestion be embedded in the scrapy.pipelines.images.ImagesPipeline image_downloaded method? I think I need a bit more guidance.
I believe the main issue here is that if an item has a valid image URL after a URL to an image that Pillow fails to parse, neither of the two images gets downloaded.
So to reproduce the issue it would be better to use something like ['https://img01.mgo-images.com/image/thumbnail/v2/content/MMV5CEBE0C5DC82A5028E7EA6B91F904DEF3.jpeg', 'https://scrapy.org/img/scrapylogo.png'].
Should @apalala's suggestion be embedded in the scrapy.pipelines.images.ImagesPipeline image_downloaded method?
I believe so.
You could first try with the current Scrapy implementation and ['https://img01.mgo-images.com/image/thumbnail/v2/content/MMV5CEBE0C5DC82A5028E7EA6B91F904DEF3.jpeg', 'https://scrapy.org/img/scrapylogo.png'], and verify that neither of the two images gets downloaded.
Then try adding @apalala’s code to your local clone of Scrapy, installing that version of Scrapy (i.e. pip install -e <path to your local Scrapy clone>), and running the spider again, to see if this time around the Scrapy logo gets downloaded.
I have reproduced the issue as @Gallaecio indicated and I am trying to fix the code in my local clone. Is the purpose to download both images (valid and invalid), or just the valid one, when image_urls is set to ['https://img01.mgo-images.com/image/thumbnail/v2/content/MMV5CEBE0C5DC82A5028E7EA6B91F904DEF3.jpeg', 'https://scrapy.org/img/scrapylogo.png']?
I believe the goal is for valid images to get downloaded, i.e. to prevent invalid images (which are not expected to be downloaded) from causing valid images to not be downloaded either. For invalid images, as long as there’s some feedback in the logs about them being invalid, I think that’s OK. Users that want invalid images can always switch to the files media pipeline, which downloads all files without any kind of image processing.
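For reference, switching to the files media pipeline mentioned here is a settings change. A sketch of the relevant settings (the store path is a placeholder you would adjust):

```python
# Hypothetical settings fragment: use FilesPipeline instead of ImagesPipeline,
# so all files are downloaded without any image processing.
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = 'downloaded_files'  # placeholder path
# Items then carry the URLs in a file_urls field (results land in files),
# instead of the image_urls / images fields used by ImagesPipeline.
```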
Note that this should be replaced by specific exception types (like OSError) and handling, to avoid the risk of an infinite loop:

```python
except Exception:
    continue
```
The sample code relies too much on the iterator eventually reaching StopIteration.

```python
except OSError as e:
    logger.exception('Could not process image')
except Exception as e:
    logger.exception('Stopped processing images')
    break
```
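Combining the clauses above with the loop from the issue description, the control flow could look roughly like the following standalone sketch. FakeStream and process_images are illustrative stand-ins I made up, not Scrapy code; FakeStream plays the role of the iterator returned by get_images, with the item 'bad' simulating an image Pillow fails to parse:

```python
import logging

logger = logging.getLogger(__name__)


def process_images(image_stream):
    """Consume an image iterator: skip items that fail with OSError,
    stop on any other error to avoid the risk of an infinite loop."""
    results = []
    while True:
        try:
            item = next(image_stream)
        except StopIteration:
            break  # iterator exhausted: normal termination
        except OSError:
            logger.exception('Could not process image')
            continue  # skip the broken image, keep going
        except Exception:
            logger.exception('Stopped processing images')
            break  # unexpected failure: bail out rather than loop forever
        results.append(item)
    return results


class FakeStream:
    """Stand-in for the image iterator; 'bad' simulates a parse failure."""

    def __init__(self, items):
        self._items = iter(items)

    def __iter__(self):
        return self

    def __next__(self):
        item = next(self._items)
        if item == 'bad':
            raise OSError('cannot identify image file')
        return item


print(process_images(FakeStream(['a.png', 'bad', 'b.png'])))
# prints ['a.png', 'b.png']
```

Using a dedicated iterator class rather than a generator matters for this demonstration: a plain generator that raises an exception is finished for good and will only raise StopIteration afterwards.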
I have combined @apalala's code with the existing one (in the ImagesPipeline image_downloaded method). The valid images are downloaded while the invalid ones are not, and the OSError is now handled rather than raised. Should I continue with the creation of a test in the test/test_pipeline_images.py file?
@marlenachatzigrigoriou That would be great. You could also already create a (draft) pull request with what you have so far, so that we can provide some feedback.
Hello,
I know this issue looks to be resolved, but I worked on the bug as well. Will link a pull request that fixes the error and adds a test.
Thanks.
@apalala reported this issue when a site sends some invalid images, and suggested a fix for the spider middleware:
```python
image_stream = self.get_images(response, request, info, item=item)
while True:
    try:
        path, image, buf = next(image_stream)
    except StopIteration:
        break
    except Exception:
        continue
```
I’ve not checked if there’s a cleaner fix possible (i.e. closer to the call to Pillow code), but it should be trivial to fix either way. I suspect writing a test may be the hardest part here.
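One subtlety worth keeping in mind here (an illustration of standard Python generator semantics, not code from the thread): if get_images is a plain generator, any exception that escapes it finishes the generator, so an except Exception: continue branch would hit StopIteration on the very next iteration rather than skipping past the failure to later images. A custom iterator that keeps raising, on the other hand, could make such a loop spin forever.

```python
def gen():
    yield 1
    raise ValueError('boom')
    yield 2  # never reached: the generator dies with the exception


it = gen()
print(next(it))  # prints 1
try:
    next(it)
except ValueError:
    pass
# Once an exception has escaped, the generator is exhausted:
try:
    next(it)
    print('still alive')
except StopIteration:
    print('exhausted')  # this is what gets printed
```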
Hi, I want to work on this is it still available?
There are already two ongoing pull requests to fix it. Unless you want to review and provide feedback on those pull requests, it might be best to work on something else.
@apalala @marlenachatzigrigoriou @spittieUM Looking deeper into the code to provide better feedback for your solutions, it started to seem to me that this issue should not be happening at all, because exceptions from the get_images method should not block the download of other images.
To be sure, I went ahead and tested the issue myself, creating the following file and running it with scrapy runspider in a Python virtual environment with Scrapy and Pillow installed:
```python
from scrapy import Spider


class TestSpider(Spider):
    name = 'test'
    start_urls = ['http://toscrape.com/']
    custom_settings = {
        'IMAGES_STORE': '',  # TODO: Set your own download path
        'ITEM_PIPELINES': {
            'scrapy.pipelines.images.ImagesPipeline': 1,
        },
    }

    def parse(self, response):
        yield {
            'image_urls': [
                'http://toscrape.com/img/zyte.png',
                'https://img01.mgo-images.com/image/thumbnail/v2/content/MMV5CEBE0C5DC82A5028E7EA6B91F904DEF3.jpeg',
                'http://toscrape.com/img/books.png',
            ],
        }
```
This spider behaved as expected: the two good images were downloaded, and the bad one was not.
@apalala Either I’m not understanding the issue correctly, or there is no issue.
@Gallaecio I agree. From the first run, without changing the code, the spider was downloading the valid images and not the invalid ones. In my case, the raised OSError exception was visible in the command line but wasn't affecting the downloads. That made me think the goal of the issue was to handle the OSError so it wouldn't be raised anymore.
@Gallaecio, it may be that the only issue is that having an OSError stack trace escape to the logs is unexpected.
@apalala Shall we close this issue then, or would you want to hide that error or remove the traceback?
@Gallaecio, if @apalala wants to remove the traceback, I think that #5097 solves it. But I guess the tests need more work.
I'd say we can move to close it if the OSError does not escape to the logs.
Hey! It's my first time working with Open Source and I'm interested in this and would like to help.
Hi! I am new to Open Source and I would like to contribute by coming up with a solution to this issue.
> I'd say we can move to close it if the OSError does not escape to the logs.
Closing, then.
@marlenachatzigrigoriou I’m really sorry for wasting your time, I should have tested this myself before opening the issue report.
That's fine @Gallaecio. Thank you for the help in my first effort to contribute, no matter the ending of this issue!