robotparser reads empty robots.txt file as "all denied" #79638

Open e9cbf7f3-bb18-4403-9273-9b5b739582ba opened 5 years ago

e9cbf7f3-bb18-4403-9273-9b5b739582ba commented 5 years ago
BPO 35457
Nosy @terryjreedy, @berkerpeksag, @tirkarthi, @andreburgaud

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

```python
assignee = None
closed_at = None
created_at =
labels = ['3.8', 'type-feature', 'library']
title = 'robotparser reads empty robots.txt file as "all denied"'
updated_at =
user = 'https://bugs.python.org/larsfuse'
```

bugs.python.org fields:

```python
activity =
actor = 'gallicrooster'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation =
creator = 'larsfuse'
dependencies = []
files = []
hgrepos = []
issue_num = 35457
keywords = []
message_count = 6.0
messages = ['331595', '331870', '331963', '359180', '359185', '359202']
nosy_count = 5.0
nosy_names = ['terry.reedy', 'berker.peksag', 'xtreak', 'larsfuse', 'gallicrooster']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'test needed'
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue35457'
versions = ['Python 3.8']
```

e9cbf7f3-bb18-4403-9273-9b5b739582ba commented 5 years ago

The standard (http://www.robotstxt.org/robotstxt.html) says:

To allow all robots complete access:

User-agent: *
Disallow:

(or just create an empty "/robots.txt" file, or don't use one at all)
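For illustration (not part of the original report): the explicit allow-all rule set quoted above can be fed straight to Python's parser, and per the spec an empty file should behave identically. A minimal Python 3 sketch:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# The spec's explicit "allow all" form: an empty Disallow matches nothing.
rp.parse(["User-agent: *", "Disallow:"])
print(rp.can_fetch("*", "/"))       # True
print(rp.can_fetch("*", "/admin"))  # True
```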

Here I give Python an empty file:

$ curl http://10.223.68.186/robots.txt
$

Code:

import robotparser  # Python 2 module; renamed urllib.robotparser in Python 3

# robotsurl is defined earlier in the reporter's script
rp = robotparser.RobotFileParser()
print(robotsurl)
rp.set_url(robotsurl)
rp.read()
print("fetch /", rp.can_fetch(useragent="*", url="/"))
print("fetch /admin", rp.can_fetch(useragent="*", url="/admin"))

Result:

$ ./test.py
http://10.223.68.186/robots.txt
('fetch /', False)
('fetch /admin', False)

So with an empty robots.txt file, robotparser concludes that the whole site is blocked.

terryjreedy commented 5 years ago

https://docs.python.org/2.7/library/robotparser.html#module-robotparser and https://docs.python.org/3/library/urllib.robotparser.html#module-urllib.robotparser refer users, for the file structure, to http://www.robotstxt.org/orig.html. That page says nothing about the effect of an empty file, so I don't see this as a bug. Even if it were, I would be dubious about reversing the behavior without a deprecation notice first, and definitely not in 2.7.

I would propose instead that the docs be changed to refer to the newer page, which has more and better examples, but with an added note that robotparser interprets an empty file as 'block all' rather than 'allow all'.

Try bringing this up on python-ideas.

e9cbf7f3-bb18-4403-9273-9b5b739582ba commented 5 years ago

(...) refers users, for file structure, to http://www.robotstxt.org/orig.html. This says nothing about the effect of an empty file, so I don't see this as a bug.

That is incorrect. That very URL says:

The presence of an empty "/robots.txt" file has no explicit associated semantics, it will be treated as if it was not present, i.e. all robots will consider themselves welcome.

So this is definitely a bug.

cc4d6bfd-ed6f-418e-9489-8f20fba59555 commented 4 years ago

Hi,

Is this ticket still relevant for Python 3.8?

While running some tests with an empty robots.txt file, I realized that it returns "ALLOWED" for any path, in line with the current draft of the Robots Exclusion Protocol: https://tools.ietf.org/html/draft-koster-rep-00#section-2.2.1

Code:

from urllib import robotparser

robots_url = "file:///tmp/empty.txt"

rp = robotparser.RobotFileParser()
print(robots_url)
rp.set_url(robots_url)
rp.read()
print( "fetch /", rp.can_fetch(useragent = "*", url = "/"))
print( "fetch /admin", rp.can_fetch(useragent = "*", url = "/admin"))

Output:

$ cat /tmp/empty.txt
$ python -V
Python 3.8.1
$ python test_robot3.py
file:///tmp/empty.txt
fetch / True
fetch /admin True

tirkarthi commented 4 years ago

There is a behavior change. parse() sets the modified time, and until the modified time is set, can_fetch() returns False. In Python 2 the parse method was called only when the file was non-empty [0], but in Python 3 it is always called, even when the file is empty [1]. The change came with 1afc1696167547a5fa101c53e5a3ab4717f8852c, which made read() always parse the response, and then with 122541beceeccce4ef8a9bf739c727ccdcbf2f28, which made parse() always call modified(); that sets the modified time, so can_fetch() ends up returning True.

I think the behavior of robotparser for an empty file was undefined, which allowed these changes; it would be good to have a test for this behavior.

[0] https://github.com/python/cpython/blob/f82e59ac4020a64c262a925230a8eb190b652e87/Lib/robotparser.py#L66-L67
[1] https://github.com/python/cpython/blob/149175c6dfc8455023e4335575f3fe3d606729f9/Lib/urllib/robotparser.py#L69-L70
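A minimal sketch (Python 3.8 semantics, using only documented RobotFileParser methods) that makes the dependence on the modified time visible:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Nothing has been parsed yet: mtime() is 0 and can_fetch() denies everything.
print(rp.mtime())              # 0
print(rp.can_fetch("*", "/"))  # False

# parse() calls modified() even for an empty rule set, stamping the
# modified time; with no rules present, can_fetch() then falls through
# to its permissive default.
rp.parse([])
print(rp.mtime() != 0)         # True
print(rp.can_fetch("*", "/"))  # True
```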

cc4d6bfd-ed6f-418e-9489-8f20fba59555 commented 4 years ago

Thanks @xtreak for providing some clarification on this behavior! I can write some tests to cover it, assuming we agree that an empty file means "unlimited access". That is how the old Internet Draft from 1996 words it (section 3.2.1 of https://www.robotstxt.org/norobots-rfc.txt). The current draft is more ambiguous: "If no group satisfies either condition, or no groups are present at all, no rules apply." (https://tools.ietf.org/html/draft-koster-rep-00#section-2.2.1)

https://www.robotstxt.org/robotstxt.html clearly states that an empty file gives full access, but I'm getting lost trying to figure out which of these documents counts as the official spec at the moment :-)
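A sketch of what such a test might look like, assuming the "empty file means unlimited access" reading is the one we standardize on; the test class and method names here are made up, and a real patch would go into Lib/test/test_robotparser.py:

```python
import unittest
from urllib import robotparser

class EmptyRobotsTxtTestCase(unittest.TestCase):
    """Hypothetical regression test: an empty robots.txt allows everything."""

    def test_empty_file_allows_all(self):
        rp = robotparser.RobotFileParser()
        rp.parse([])  # simulate reading an empty robots.txt
        self.assertTrue(rp.can_fetch("*", "/"))
        self.assertTrue(rp.can_fetch("*", "/admin"))

if __name__ == "__main__":
    unittest.main()
```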