Open tognee opened 7 months ago
According to the spec referenced in the documentation, a robots.txt file does not have wildcards in the path.
Could you expand a little on what exactly doesn't work for you, for example:
I was trying to check if a website is indexable by Google ("Googlebot" User-agent).
I found out about this issue while checking a GlobaLeaks instance after disabling the option to index the website on search engines.
Here's an example:
>>> from urllib.robotparser import RobotFileParser
>>> lines = [
... "User-agent: *",
... "Disallow: *"
... ]
>>> robot_parser = RobotFileParser()
>>> robot_parser.parse(lines)
>>> print(robot_parser.default_entry)
User-agent: *
Disallow: %2A
>>> robot_parser.can_fetch("Googlebot", "https://example.com/")
True
The expected result should be False, as the robots.txt should block all routes for all user agents.
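For contrast, a quick check (mine, not from the report) shows that the spec-compliant "Disallow: /" form is handled correctly, so the bug is specific to the bare "*" rule:

```python
from urllib.robotparser import RobotFileParser

# "Disallow: /" is the spec-compliant way to block every path;
# robotparser handles this form as expected.
parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /",
])

print(parser.can_fetch("Googlebot", "https://example.com/"))      # False
print(parser.can_fetch("Googlebot", "https://example.com/page"))  # False
```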
Checking the code of urllib.robotparser, there is a check whether the path after Allow or Disallow is "*", but the rule line's path gets percent-encoded when parsing the file. So the check `line.path == "*"` in `applies_to` can never be True.
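The encoding step can be seen directly with urllib.parse.quote, which is what the parser applies to each rule path (my illustration):

```python
from urllib.parse import quote

# quote() leaves "/" alone by default but percent-encodes "*",
# which is why a bare "*" rule never survives parsing intact.
print(quote("*"))               # %2A
print(quote("/*/attachment/"))  # /%2A/attachment/
```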
The solution should be: change the comparison in `applies_to` to "%2A".
The spec I linked to says:
Disallow: The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html. Any empty value, indicates that all URLs can be retrieved. At least one Disallow field needs to be present in a record.
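The prefix semantics described in that paragraph can be reproduced with robotparser (my own check, not from the thread):

```python
from urllib.robotparser import RobotFileParser

# "Disallow: /help" blocks anything whose path starts with /help,
# including both /help.html and /help/index.html.
p1 = RobotFileParser()
p1.parse(["User-agent: *", "Disallow: /help"])
print(p1.can_fetch("*", "https://example.com/help.html"))        # False
print(p1.can_fetch("*", "https://example.com/help/index.html"))  # False

# "Disallow: /help/" blocks only paths under /help/.
p2 = RobotFileParser()
p2.parse(["User-agent: *", "Disallow: /help/"])
print(p2.can_fetch("*", "https://example.com/help.html"))        # True
print(p2.can_fetch("*", "https://example.com/help/index.html"))  # False
```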
Google's documentation on robots.txt says the same at: https://developers.google.com/search/docs/crawling-indexing/robots/create-robots-txt
A "disallow: *" line seems incorrect to me; the value after "disallow:" should be a full path starting with a "/". To disallow everything, use "disallow: /".
From what I've read so far, robotparser does the correct thing here and the site is configured incorrectly.
So what's the check in applies_to actually doing? If it's not needed, I think it can be removed.
It's been there since the last rewrite; could this be a regression?
https://github.com/python/cpython/commit/663f6c2ad288b2e8bc1a1a50d29e12df4f755d5b
So what's the check in applies_to actually doing?
TBH, I don't know for sure.
The check for '*' seems to be not necessary, but that's purely based on reading the documentation, I haven't done enough with robots.txt files to be sure that the spec completely matches real-world behaviour. That said, the Google document I linked to also doesn't mention wildcards as an option.
The Wikipedia page also states that "Disallow: *" isn't mentioned in the spec, and links to RFC 9309, which has a formal specification for robots.txt. That RFC says that the paths in rules start with a "/" and can contain "*" wildcards (e.g. "Allow: /foo/*/bar").
I haven't compared the implementation with the spec, and won't have time to do so myself for quite some time.
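For illustration only: RFC 9309-style matching, where "*" matches any run of characters and a trailing "$" anchors the rule at the end of the path, can be sketched by translating a rule into a regular expression. The function name rule_matches is mine; this is not what robotparser currently does.

```python
import re

def rule_matches(rule_path: str, url_path: str) -> bool:
    """RFC 9309-style match: '*' matches any character sequence and a
    trailing '$' anchors the rule at the end of the path."""
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    # Escape every literal part, turning each '*' into '.*'.
    pattern = "^" + ".*".join(re.escape(part) for part in rule_path.split("*"))
    if anchored:
        pattern += "$"
    return re.search(pattern, url_path) is not None

print(rule_matches("/foo/*/bar", "/foo/anything/bar"))  # True
print(rule_matches("/*.css$", "/static/site.css"))      # True
print(rule_matches("/*.css$", "/static/site.css.map"))  # False
```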
Here is a more complex example that doesn't work because of this URL encoding.
>>> from urllib.robotparser import RobotFileParser
>>> rtxt = """
... User-agent: *
... Allow: /*.css$
... Allow: /wp-admin/admin-ajax.php
... Disallow: /wp-admin/
... Disallow: /*/attachment/
... """
>>> p = RobotFileParser()
>>> p.parse(rtxt.split('\n'))
>>> print(p)
User-agent: *
Allow: /%2A.css%24
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
Disallow: /%2A/attachment/
>>> p.can_fetch('*', 'https://www.example.com/hi/attachment/') # <-- shouldn't allow
True
>>> p.can_fetch('*', 'https://www.example.com/%2A/attachment/') # <-- ok but that means it matches "%2A" literally
False
>>>
>>> # the pre-encoded version gives the same results
>>> rtxt2 = """
... User-agent: *
... Allow: /%2A.css%24
... Allow: /wp-admin/admin-ajax.php
... Disallow: /wp-admin/
... Disallow: /%2A/attachment/
... """
>>> p2 = RobotFileParser()
>>> p2.parse(rtxt2.split('\n'))
>>> print(p2)
User-agent: *
Allow: /%2A.css%24
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
Disallow: /%2A/attachment/
>>> p2.can_fetch('*', 'https://www.example.com/hi/attachment/')
True
>>> p2.can_fetch('*', 'https://www.example.com/%2A/attachment/')
False
Using these wildcards is quite common, example: https://www.last.fm/robots.txt .
Thanks, that shows that the spec we link to is no longer the best spec and that the robotparser needs some love.
Note that adding support for these is a new feature (robotparser currently does not support lines with wildcards) and not a bug fix.
Leaving the "bug" tag as well because "*" support seems to be intentional and is broken. It is also not specified in any spec I have found, it might be better to just drop the feature unless someone finds real usage of these and some kind of specification.
I think it somewhat depends what is a bug and what is a feature.
I agree that support for the "*" wildcard is a feature request, given what the parser is based on.
But the handling of "*" when it is treated not as a wildcard but verbatim (I suppose; I can't really tell) seems like a bug. I know that if I percent-encode the URL before passing it to can_fetch, it gives the expected result. However, this is not an obvious thing to do, for me at least. Can you elaborate on that? Does this need a separate issue?
Bug report
Bug description:
https://github.com/python/cpython/blob/3.12/Lib/urllib/robotparser.py#L227
The check `self.path == "*"` will never be true because of this line: https://github.com/python/cpython/blob/3.12/Lib/urllib/robotparser.py#L114
That line converts the "*" character to "%2A".
Proposed solution
Change line 227 from `self.path == "*"` to `self.path == "%2A"`.
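The proposed one-character fix can be tried without patching the stdlib by overriding RuleLine.applies_to. This monkeypatch is just a sketch to confirm that comparing against "%2A" restores the intended behaviour:

```python
import urllib.robotparser as robotparser

# Sketch of the proposed fix: the parser stores each rule path
# percent-encoded, so the wildcard check must compare against "%2A".
_original_applies_to = robotparser.RuleLine.applies_to

def applies_to(self, filename):
    return self.path == "%2A" or _original_applies_to(self, filename)

robotparser.RuleLine.applies_to = applies_to

p = robotparser.RobotFileParser()
p.parse(["User-agent: *", "Disallow: *"])
print(p.can_fetch("Googlebot", "https://example.com/"))  # False, as expected
```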
CPython versions tested on:
3.12, 3.13, CPython main branch
Operating systems tested on:
Linux