scrapy / protego

A pure-Python robots.txt parser with support for modern conventions.
BSD 3-Clause "New" or "Revised" License

Cannot fetch non-disallowed domain #4

Closed: Gallaecio closed this issue 4 years ago

Gallaecio commented 4 years ago

https://www.walkerplus.com/robots.txt:

user-agent: *
disallow: http://ms-web00.walkerplus.com/
disallow: http://www-origin.walkerplus.com/
disallow: http://walkerplus.jp/
disallow: http://walkerplus.net/
disallow: https://ms.walkerplus.com/

user-agent: twitterbot
disallow:

Unexpectedly:

>>> rp.can_fetch("https://www.walkerplus.com/", "mybot")
False
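
For context, rp above can be built with Protego's parse API; a minimal, self-contained reproduction of the reported behavior (on the Protego version current at the time of the report), using the file body quoted above:

>>> from protego import Protego
>>> robotstxt = """\
... user-agent: *
... disallow: http://ms-web00.walkerplus.com/
... disallow: http://www-origin.walkerplus.com/
... disallow: http://walkerplus.jp/
... disallow: http://walkerplus.net/
... disallow: https://ms.walkerplus.com/
...
... user-agent: twitterbot
... disallow:
... """
>>> rp = Protego.parse(robotstxt)
>>> rp.can_fetch("https://www.walkerplus.com/", "mybot")
False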

Originally reported at https://github.com/scrapy/scrapy/issues/4145

Gallaecio commented 4 years ago

The content of the robots.txt file is incorrect: Disallow rules should contain paths, not absolute URLs. But since real websites use them, we should support them nonetheless.
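
One possible lenient interpretation, sketched below as an assumption about how such rules could be tolerated (not necessarily how Protego implements it): when a Disallow value is an absolute URL, apply it only if its host matches the host being checked, falling back to the URL's path component; rules naming a different host are ignored, so https://www.walkerplus.com/ stays fetchable.

from urllib.parse import urlparse

def effective_rule_path(rule_value, target_url):
    # Hypothetical helper, not Protego's actual code.
    # Plain path rules pass through unchanged.
    parsed = urlparse(rule_value)
    if not (parsed.scheme or parsed.netloc):
        return rule_value
    # Absolute-URL rule: only meaningful for the same host;
    # use its path component there, ignore it elsewhere.
    if parsed.netloc == urlparse(target_url).netloc:
        return parsed.path or "/"
    return None

# The rules in this robots.txt all name other hosts, so none of
# them should restrict https://www.walkerplus.com/:
print(effective_rule_path("http://ms-web00.walkerplus.com/",
                          "https://www.walkerplus.com/"))  # None
print(effective_rule_path("/private/",
                          "https://www.walkerplus.com/"))  # /private/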