scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

Handle robots.txt files not utf-8 encoded #6292

Closed fkromer closed 1 month ago

fkromer commented 1 month ago

Summary

robots.txt files that are not UTF-8 encoded currently make Scrapy raise a UnicodeDecodeError.

.venv/lib/python3.11/site-packages/scrapy/robotstxt.py", line 15, in decode_robotstxt
    robotstxt_body = robotstxt_body.decode("utf-8")
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdf in position 46: invalid continuation byte
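
For reference, a minimal reproduction outside Scrapy (the body bytes are illustrative; 0xdf is "ß" in Latin-1 but an invalid continuation byte in UTF-8, which matches the error above):

    # Illustrative robots.txt body: valid Latin-1, invalid UTF-8.
    body = b"User-agent: *\nDisallow: /stra\xdfe\n"
    body.decode("utf-8")  # raises UnicodeDecodeError: invalid continuation byte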

Motivation

This prevents managing all scrapers in an automated fashion, as described in "Describe alternatives you've considered".

Describe alternatives you've considered

Checking the robots.txt file manually for the scraper in question and disabling robots.txt parsing globally via ROBOTSTXT_OBEY (see the sketch below). However, this prevents processing all scrapers in an automated fashion.
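
For reference, the global workaround is a one-line settings change; note that it disables robots.txt handling for the whole project, not just for the affected site:

    # settings.py -- disables robots.txt fetching and parsing project-wide
    ROBOTSTXT_OBEY = False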


Gallaecio commented 1 month ago

We can probably follow Google’s approach:

Similarly, if the character encoding of the robots.txt file isn't UTF-8, Google may ignore characters that are not part of the UTF-8 range, potentially rendering robots.txt rules invalid.

So fixing this requires changing the offending line to:

robotstxt_body = robotstxt_body.decode("utf-8", errors="ignore")

And implementing a test for it.
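
A sketch of what such a test could look like (the test name and sample bytes are illustrative, and decode_robotstxt's exact signature should be checked against scrapy/robotstxt.py):

    from scrapy.robotstxt import decode_robotstxt

    def test_decode_robotstxt_ignores_non_utf8_bytes():
        # 0xdf ("ß" in Latin-1) is not valid UTF-8; with errors="ignore"
        # the undecodable byte is dropped instead of raising.
        body = b"User-agent: *\nDisallow: /stra\xdfe\n"
        decoded = decode_robotstxt(body, spider=None)
        assert "User-agent: *" in decoded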

HenrySchwerdt commented 1 month ago

@Gallaecio I would like to work on this.

Gallaecio commented 1 month ago

Please, feel free to open a PR. Thanks!