scrapy / protego

A pure-Python robots.txt parser with support for modern conventions.
BSD 3-Clause "New" or "Revised" License
54 stars 28 forks source link

URL starting with double-slash are misinterpreted #44

Open s-ferri-fortop opened 11 months ago

s-ferri-fortop commented 11 months ago

When analyzing the following robots.txt, Protego parses the directive Disallow: //debug/ as if it was /

User-agent: *
Disallow: //debug/*

This is due to the following line of code: https://github.com/scrapy/protego/blob/45e1948702c52d82347755b593f6884f844b8917/src/protego.py#L185

The problem is that urlparse does not parse the URL as expected (i.e. as a path) and takes "debug" as the hostname:

from urllib.parse import urlparse
print(urlparse("//debug/*"))
### result: ParseResult(scheme='', netloc='debug', path='/*', params='', query='', fragment='')

According to Google's official documentation, the Allow and Disallow directives must be followed by relative paths starting with a / character.

Therefore, I see two possible solutions:

  1. avoid using urlparse on directives' patterns
  2. replace the starting double initial slashes with a single slash

Option 1 As is: https://github.com/scrapy/protego/blob/45e1948702c52d82347755b593f6884f844b8917/src/protego.py#L185-L186

To be:

pattern = self._unquote(pattern, ignore="/*$%")

Option 2 Add a re.sub at the beginning of the following method: https://github.com/scrapy/protego/blob/45e1948702c52d82347755b593f6884f844b8917/src/protego.py#L90-L93

pattern = re.sub(r"^[/]{2,}", "*", pattern)
s-ferri-fortop commented 11 months ago

While working on a fork, I have found another solution:

Adding two additional leading slashes if the pattern starts with "//" ensures that urlparse does not confuse the first folder with the hostname (netloc). At the same time, path is as expected (e.g.):

def _quote_pattern(self, pattern):
        if pattern.startswith("https://") or pattern.startswith("http://") :
            pattern = "/" + pattern
        elif pattern.startswith("//") :
            pattern = "//" + pattern

Urlparse will behave as follow:

input pattern: //debug/*
modified pattern: ////debug/*
ParseResult(scheme='', netloc='', path='//debug/*', params='', query='', fragment='')

I do not have experience with testing, so any help is appreciated, but I keep working on the pull request :)