We can probably follow Google’s approach:
> Similarly, if the character encoding of the robots.txt file isn't UTF-8, Google may ignore characters that are not part of the UTF-8 range, potentially rendering robots.txt rules invalid.
So fixing this requires changing the offending line to:
```python
robotstxt_body.decode("utf-8", errors="ignore")
```
And implementing a test for it.
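A test along these lines could cover the fix. Here `decode_robotstxt` is a hypothetical stand-in for the middleware code that currently does the strict decode, so treat this as a sketch rather than the actual scrapy test:

```python
import pytest


def decode_robotstxt(robotstxt_body: bytes) -> str:
    # Hypothetical stand-in for the offending line, with the proposed
    # fix applied: bytes outside valid UTF-8 are dropped instead of
    # raising UnicodeDecodeError.
    return robotstxt_body.decode("utf-8", errors="ignore")


def test_strict_decode_raises_on_non_utf8():
    # b"\xff" can never appear in valid UTF-8, so the current strict
    # decode fails on bodies like this one.
    with pytest.raises(UnicodeDecodeError):
        b"User-agent: *\n\xffDisallow: /\n".decode("utf-8")


def test_lenient_decode_keeps_valid_rules():
    # With errors="ignore", the invalid byte is dropped and the
    # surrounding rules survive intact.
    decoded = decode_robotstxt(b"User-agent: *\n\xffDisallow: /\n")
    assert "User-agent: *" in decoded
    assert "Disallow: /" in decoded
```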
@Gallaecio I would like to work on this.
Please, feel free to open a PR. Thanks!
Summary

`robots.txt` files which are not `utf-8` encoded currently make `scrapy` raise a `UnicodeDecodeError`.
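As a minimal illustration of the failure mode (assuming, hypothetically, a robots.txt body that was saved as Latin-1):

```python
# A robots.txt body that is valid Latin-1 but not valid UTF-8.
robotstxt_body = "User-agent: *\n# café\n".encode("latin-1")

robotstxt_body.decode("utf-8")  # raises UnicodeDecodeError
```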
Motivation
This prevents managing all scrapers in an automated fashion, as described in "Describe alternatives you've considered".
Describe alternatives you've considered

Check the `robots.txt` file manually for the scraper in question, or disable `robots.txt` parsing globally using `ROBOTSTXT_OBEY` (sketched below). However, this prevents processing all scrapers in an automated fashion.
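For reference, the global workaround is a one-line project setting; it disables robots.txt handling for every spider in the project, which is what makes it unattractive here:

```python
# settings.py
ROBOTSTXT_OBEY = False
```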
Additional context