scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.

https://scrapy.org

BSD 3-Clause "New" or "Revised" License

50.99k stars 10.34k forks source link

Edits to media_downloaded in files.py to handle 201 response status (#1615 and #1806) #6309

Open zoemyatt opened 4 weeks ago

zoemyatt commented 4 weeks ago

1615 and #1806 discussed the importance of handling a 201 response status in `pipelines/files.py`. We have attempted to correct this error by not only checking for 200 response status but also checking for 201 response status and location headers in the newly created image. If the new image has a 201 status and a location header, the new program schedules an extra request for the file data.

We then added a unit test to scrape a URL we knew would return a 201 status code and ran tox with the test_pipeline_media.py file.

zoemyatt commented 3 weeks ago

@wRAR let me know if these changes look better. Sorry about the extra file in there!

We improved the test case to check for a 201 status code without using a real URL and deleted the extra file.

Gallaecio commented 3 weeks ago

Hey, you might want to install pre-commit.

Gallaecio commented 2 weeks ago

Have you tried this fix in a real scenario?

The method you modified is meant to return a dict. You have modified it to return a Request for 201, but I don’t imagine the calling method that expects a dict will handle a Request. And your test verifies that the modified method returns a Request, it does not verify that the caller handles that request as needed.

I wonder if it would not be better to have some more general feature to follow redirects on empty 201 responses. Maybe modify the redirect middleware to do that if some new setting is enabled.