Open s-ferri-fortop opened 11 months ago
While working on a fork, I have found another solution:
Adding two additional leading slashes when the pattern starts with "//" ensures that urlparse does not mistake the first folder for the hostname (netloc), while the resulting path is as expected:
def _quote_pattern(self, pattern):
    if pattern.startswith("https://") or pattern.startswith("http://"):
        pattern = "/" + pattern
    elif pattern.startswith("//"):
        pattern = "//" + pattern
urlparse will then behave as follows:
input pattern: //debug/*
modified pattern: ////debug/*
ParseResult(scheme='', netloc='', path='//debug/*', params='', query='', fragment='')
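To check this outside of Protego, here is a minimal sketch that pulls the prefixing step into a standalone function. The name protect_leading_slashes is hypothetical and is not part of Protego; only urllib.parse from the standard library is used.

```python
from urllib.parse import urlparse


def protect_leading_slashes(pattern: str) -> str:
    """Hypothetical standalone version of the prefixing step (not Protego API)."""
    if pattern.startswith(("https://", "http://")):
        # Prefix a slash so the absolute URL is treated as a literal path.
        return "/" + pattern
    if pattern.startswith("//"):
        # The two extra slashes give urlparse an empty netloc to consume.
        return "//" + pattern
    return pattern


before = urlparse("//debug/*")
after = urlparse(protect_leading_slashes("//debug/*"))
print(before.netloc, before.path)      # debug /*
print(repr(after.netloc), after.path)  # '' //debug/*
```

Patterns that start with a single slash are returned unchanged, so ordinary directives should not be affected.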
I do not have experience with testing, so any help is appreciated. In the meantime, I will keep working on the pull request :)
When analyzing the following robots.txt, Protego parses the directive Disallow: //debug/ as if it were Disallow: /
This is due to the following line of code: https://github.com/scrapy/protego/blob/45e1948702c52d82347755b593f6884f844b8917/src/protego.py#L185
The problem is that urlparse does not parse the URL as expected (i.e. as a path) and takes "debug" as the hostname:
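The behaviour can be reproduced with the standard library alone. A minimal sketch, using the path from the directive above:

```python
from urllib.parse import urlparse

# A string that starts with "//" is read as a scheme-relative URL,
# so the first path segment ends up in netloc instead of path.
parsed = urlparse("//debug/")
print(parsed.netloc)  # debug
print(parsed.path)    # /
```

Only "/" is left as the path, which is consistent with the rule being treated as Disallow: /.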
According to Google's official documentation, the Allow and Disallow directives must be followed by relative paths starting with a / character.
Therefore, I see two possible solutions:
Option 1 As is: https://github.com/scrapy/protego/blob/45e1948702c52d82347755b593f6884f844b8917/src/protego.py#L185-L186
To be:
Option 2 Add a re.sub at the beginning of the following method: https://github.com/scrapy/protego/blob/45e1948702c52d82347755b593f6884f844b8917/src/protego.py#L90-L93
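The exact substitution for Option 2 is not shown here, so the following is only a guess at what it could look like: a sketch that assumes the intent is to collapse a run of leading slashes into one. The function name and the regular expression are both hypothetical.

```python
import re


def collapse_leading_slashes(pattern: str) -> str:
    # Hypothetical helper: replace two or more leading slashes with a single one.
    return re.sub(r"^/{2,}", "/", pattern)


print(collapse_leading_slashes("//debug/"))  # /debug/
print(collapse_leading_slashes("/debug/"))   # /debug/
```

Note that this variant rewrites the rule itself (//debug/ becomes /debug/), so it may match different URLs than a fix that keeps the double slash in the path.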