Closed: Gallaecio closed this issue 4 years ago
https://www.walkerplus.com/robots.txt:
user-agent: *
disallow: http://ms-web00.walkerplus.com/
disallow: http://www-origin.walkerplus.com/
disallow: http://walkerplus.jp/
disallow: http://walkerplus.net/
disallow: https://ms.walkerplus.com/

user-agent: twitterbot
disallow:
Unexpectedly:
>>> rp.can_fetch("https://www.walkerplus.com/", "mybot")
False
Originally reported at https://github.com/scrapy/scrapy/issues/4145
The content of this robots.txt file is incorrect: disallow values should be paths, not absolute URLs. We should support it nonetheless, since real websites use it.
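One possible way to tolerate such rules, sketched below, is to normalize each rule value before matching: keep ordinary path patterns as they are, reduce an absolute URL on the same host to its path, and skip an absolute URL that points at a different host. This is only an illustration under stated assumptions, not necessarily the approach the parser should take; the helper name `rule_path_for_host` and the "skip other hosts" policy are hypothetical.

```python
from urllib.parse import urlsplit


def rule_path_for_host(value, host):
    """Return the path pattern a rule contributes for `host`.

    Hypothetical helper: returns None when the rule is an absolute URL
    for a different host and should therefore be ignored.
    """
    value = value.strip()
    parts = urlsplit(value)

    # Ordinary robots.txt value (a path pattern or an empty string).
    if not (parts.scheme and parts.netloc):
        return value

    # Absolute URL for another host: assumed not to apply to `host`.
    if parts.hostname != host.lower():
        return None

    # Absolute URL for this host: keep only the path and query.
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return path


rules = [
    "http://ms-web00.walkerplus.com/",
    "http://walkerplus.jp/",
    "https://ms.walkerplus.com/",
]
# Paths that would still apply to www.walkerplus.com under this policy.
applicable = [
    p for p in (rule_path_for_host(r, "www.walkerplus.com") for r in rules)
    if p is not None
]
```

Under this policy none of the rules above apply to www.walkerplus.com, so fetching https://www.walkerplus.com/ would be allowed for a generic user agent.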