python / cpython

The Python programming language
https://www.python.org

urllib.robotparser doesn't treat the "*" path correctly #114310

Open tognee opened 7 months ago

tognee commented 7 months ago

Bug report

Bug description:

https://github.com/python/cpython/blob/3.12/Lib/urllib/robotparser.py#L227

self.path == "*" will never be true because of this line:

https://github.com/python/cpython/blob/3.12/Lib/urllib/robotparser.py#L114

That line converts the * character to %2A.
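For reference, this is easy to confirm in a REPL (the behaviour below assumes the default safe characters of urllib.parse.quote):

>>> import urllib.parse
>>> urllib.parse.quote("*")  # "*" is not in quote()'s default safe set
'%2A'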

Proposed solution

On line 227, change self.path == "*" to self.path == "%2A".
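A minimal sketch of the patched check, assuming RuleLine.applies_to currently reads roughly return self.path == "*" or filename.startswith(self.path):

def applies_to(self, filename):
    # "%2A" is what the "*" path looks like after quoting in __init__;
    # an alternative fix would be to not quote a bare "*" in the first place.
    return self.path == "%2A" or filename.startswith(self.path)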

CPython versions tested on:

3.12, 3.13, CPython main branch

Operating systems tested on:

Linux

ronaldoussoren commented 7 months ago

According to the spec referenced in the documentation, a robots.txt file does not have wildcards in the path.

Could you expand a little on what exactly doesn't work for you?

tognee commented 7 months ago

I was trying to check if a website is indexable by Google ("Googlebot" User-agent).

I found out about this issue while checking a GlobaLeaks instance after disabling the option to index the website on search engines.

Here's an example:

>>> from urllib.robotparser import RobotFileParser
>>> lines = [
...     "User-agent: *",
...     "Disallow: *"
... ]
>>> robot_parser = RobotFileParser()
>>> robot_parser.parse(lines)
>>> print(robot_parser.default_entry)
User-agent: *
Disallow: %2A
>>> robot_parser.can_fetch("Googlebot", "https://example.com/")
True

The expected result should be False, as the robots.txt should block all routes for all user agents.

Checking the code of urllib.robotparser, there is a check for whether the path after Allow or Disallow is "*", but the rule line's path gets encoded when parsing the file. So the check `line.path == "*"` in applies_to can never be True.
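Paraphrasing the two relevant spots in Lib/urllib/robotparser.py (see the lines linked above for the exact code; this is a simplified excerpt):

class RuleLine:
    def __init__(self, path, allowance):
        ...
        self.path = urllib.parse.quote(path)   # around line 114: "*" becomes "%2A" here
        self.allowance = allowance

    def applies_to(self, filename):
        # around line 227: this comparison can never be True, because
        # self.path was already quoted to "%2A" in __init__
        return self.path == "*" or filename.startswith(self.path)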

The solution should be to compare against "%2A" instead, as proposed above.

ronaldoussoren commented 7 months ago

The spec I linked to says:

Disallow: The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html. Any empty value, indicates that all URLs can be retrieved. At least one Disallow field needs to be present in a record.

Google's documentation on robots.txt says the same at: https://developers.google.com/search/docs/crawling-indexing/robots/create-robots-txt

A "disallow: *" line seems incorrect to me, the value after "disallow:" should be a full path, starting with a "/". To disallow anything use "disallow: /".

From what I've read so far robotparser does the correct thing here and the site is configured incorrectly.
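For comparison, the spec-compliant form behaves as expected (the variable names here are mine):

>>> from urllib.robotparser import RobotFileParser
>>> rp = RobotFileParser()
>>> rp.parse(["User-agent: *", "Disallow: /"])
>>> rp.can_fetch("Googlebot", "https://example.com/")
False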

tognee commented 7 months ago

So what's the check in applies_to actually doing?

If it's not needed, I think it can be removed.

tognee commented 7 months ago

It's been there since the last rewrite; could this be a regression?

https://github.com/python/cpython/commit/663f6c2ad288b2e8bc1a1a50d29e12df4f755d5b

ronaldoussoren commented 7 months ago

So what's the check in applies_to actually doing?

TBH, I don't know for sure.

The check for '*' seems unnecessary, but that's purely based on reading the documentation; I haven't done enough with robots.txt files to be sure that the spec completely matches real-world behaviour. That said, the Google document I linked to also doesn't mention wildcards as an option.

The Wikipedia page also states that "Disallow: *" isn't mentioned in the spec, and links to RFC 9309, which has a formal specification for robots.txt. That RFC says that the path in a rule starts with a "/" and can contain "*" wildcards (e.g. "Allow: /foo/*/bar").
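For illustration only (robotparser does not currently do this), a rough sketch of RFC 9309-style matching, where "*" matches any run of characters and a trailing "$" anchors the end of the path; the helper name is hypothetical:

import re

def rule_matches(rule_path, url_path):
    # Hypothetical helper, not part of urllib.robotparser:
    # translate an RFC 9309 rule value into a regular expression.
    pattern = re.escape(rule_path).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"   # a trailing "$" anchors the match at the end
    return re.match(pattern, url_path) is not None

# e.g. rule_matches("/*/attachment/", "/hi/attachment/") -> True
#      rule_matches("/*.css$", "/static/site.css") -> True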

I haven't compared the implementation with the spec, and won't have time to do so myself for quite some time.

rtb-zla-karma commented 7 months ago

I'll give a more complex example that doesn't work because of this URL encoding.

>>> from urllib.robotparser import RobotFileParser
>>> rtxt = """
... User-agent: *
... Allow: /*.css$
... Allow: /wp-admin/admin-ajax.php
... Disallow: /wp-admin/
... Disallow: /*/attachment/
... """
>>> p = RobotFileParser()
>>> p.parse(rtxt.split('\n'))
>>> print(p)
User-agent: *
Allow: /%2A.css%24
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
Disallow: /%2A/attachment/
>>> p.can_fetch('*', 'https://www.example.com/hi/attachment/')  # <-- shouldn't allow
True
>>> p.can_fetch('*', 'https://www.example.com/%2A/attachment/')  # <-- ok, but that means it matches "%2A" literally
False
>>>
>>> rtxt2 = """  # encoded version gives the same results
... User-agent: *
... Allow: /%2A.css%24
... Allow: /wp-admin/admin-ajax.php
... Disallow: /wp-admin/
... Disallow: /%2A/attachment/
... """
>>> p2 = RobotFileParser()
>>> p2.parse(rtxt2.split('\n'))
>>> print(p2)
User-agent: *
Allow: /%2A.css%24
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
Disallow: /%2A/attachment/
>>> p2.can_fetch('*', 'https://www.example.com/hi/attachment/')
True
>>> p2.can_fetch('*', 'https://www.example.com/%2A/attachment/')
False

Using these wildcards is quite common; see for example https://www.last.fm/robots.txt.
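For context, checking against a live robots.txt uses the same parser via set_url/read; the path below is made up and the result depends on the live rules:

from urllib.robotparser import RobotFileParser

p = RobotFileParser()
p.set_url("https://www.last.fm/robots.txt")
p.read()
# "/some/path" is a hypothetical example; the live file uses "*" wildcard
# rules, which robotparser currently matches literally.
print(p.can_fetch("*", "https://www.last.fm/some/path"))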

ronaldoussoren commented 7 months ago

Using these wildcards is quite common; see for example https://www.last.fm/robots.txt.

Thanks, that shows that the spec we link to is no longer the best spec and that the robotparser needs some love.

Note that adding support for these is a new feature (robotparser currently does not support lines with wildcards) and not a bug fix.

Leaving the "bug" tag as well because "*" support seems to be intentional and is broken. It is also not specified in any spec I have found, it might be better to just drop the feature unless someone finds real usage of these and some kind of specification.

rtb-zla-karma commented 7 months ago

I think it somewhat depends on what counts as a bug and what counts as a feature.

I agree that support for the * wildcard is a feature request, given what the parser is based on.

But the handling of "*", given that you treat it verbatim rather than as a wildcard (I suppose; I can't really tell), seems like a bug. I know that if I encode the URL before passing it to can_fetch it gives the expected result, but that is not an obvious thing to do, for me at least. Can you elaborate on that? Does it need a separate issue?