scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

Handle robots.txt files not utf-8 encoded #6292

Closed fkromer closed 1 month ago

fkromer commented 1 month ago

Summary

robots.txt files that are not UTF-8 encoded currently make Scrapy raise a UnicodeDecodeError.

.venv/lib/python3.11/site-packages/scrapy/robotstxt.py", line 15, in decode_robotstxt
    robotstxt_body = robotstxt_body.decode("utf-8")
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdf in position 46: invalid continuation byte
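
For reference, a minimal reproduction outside Scrapy (the body bytes are illustrative; 0xdf is "ß" in Latin-1 but an invalid continuation byte in UTF-8, which matches the error above):

    # Illustrative robots.txt body: valid Latin-1, invalid UTF-8.
    body = b"User-agent: *\nDisallow: /stra\xdfe\n"
    body.decode("utf-8")  # raises UnicodeDecodeError: invalid continuation byte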

Motivation

This prevents managing all scrapers in an automated fashion, as described in "Describe alternatives you've considered".

Describe alternatives you've considered

Checking the robots.txt file manually for the scraper in question and disabling robots.txt parsing globally via ROBOTSTXT_OBEY (see the sketch below). However, this prevents processing all scrapers in an automated fashion.
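
For reference, the global workaround is a one-line settings change; note that it disables robots.txt handling for the whole project, not just for the affected site:

    # settings.py -- disables robots.txt fetching and parsing project-wide
    ROBOTSTXT_OBEY = False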


Gallaecio commented 1 month ago

We can probably follow Google’s approach:

Similarly, if the character encoding of the robots.txt file isn't UTF-8, Google may ignore characters that are not part of the UTF-8 range, potentially rendering robots.txt rules invalid.

So fixing this requires changing the offending line to:

robotstxt_body = robotstxt_body.decode("utf-8", errors="ignore")

And implementing a test for it.
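
A sketch of what such a test could look like (the test name and sample bytes are illustrative, and decode_robotstxt's exact signature should be checked against scrapy/robotstxt.py):

    from scrapy.robotstxt import decode_robotstxt

    def test_decode_robotstxt_ignores_non_utf8_bytes():
        # 0xdf ("ß" in Latin-1) is not valid UTF-8; with errors="ignore"
        # the undecodable byte is dropped instead of raising.
        body = b"User-agent: *\nDisallow: /stra\xdfe\n"
        decoded = decode_robotstxt(body, spider=None)
        assert "User-agent: *" in decoded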

HenrySchwerdt commented 1 month ago

@Gallaecio I would like to work on this.

Gallaecio commented 1 month ago

Please, feel free to open a PR. Thanks!