scrapy / protego

A pure-Python robots.txt parser with support for modern conventions.
BSD 3-Clause "New" or "Revised" License

Select applied rule by longest pattern length #24

Closed sseveran closed 1 year ago

sseveran commented 2 years ago

Implements selecting the applied rule by the longest matching pattern. This matches Google's described logic.
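For readers unfamiliar with the rule, here is a minimal sketch of longest-match selection. It is not the code from this PR and not Protego's API: the helper names `pattern_to_regex` and `is_allowed` and the `(allow, pattern)` rule tuples are hypothetical, made up for illustration. It assumes `*` matches any run of characters, a trailing `$` anchors the end of the path, and a tie between equally long patterns resolves in favor of allow.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex (illustrative only)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal segments and join them with ".*" in place of each "*".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_allowed(rules, path):
    """Decide access for `path` from a list of (allow, pattern) tuples.

    The matching rule with the longest pattern wins. Empty patterns are
    skipped, and a path with no matching rule is treated as allowed.
    """
    matching = [
        (len(pattern), allow)
        for allow, pattern in rules
        if pattern and pattern_to_regex(pattern).match(path)
    ]
    if not matching:
        return True
    # Tuples compare by length first; on equal length True sorts above False,
    # so an allow rule beats a disallow rule of the same length.
    return max(matching)[1]


rules = [(False, "/folder"), (True, "/folder/page")]
print(is_allowed(rules, "/folder/page.html"))
print(is_allowed(rules, "/folder/other"))
```

Measuring specificity by raw pattern length keeps the comparison simple, but it also means wildcard characters count toward the length, which is one reason patterns with a leading asterisk deserve separate attention.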

Gallaecio commented 1 year ago

The problem here is not the pattern length; it is the leading asterisk, which does not appear in any of the Google examples, ~making me wonder whether it is something valid in the first place~.

Gallaecio commented 1 year ago

https://github.com/scrapy/protego/pull/34 looks like the best fix here, and it should have no performance impact.

ghost commented 1 year ago

> #34 looks like the best fix here, and it should have no performance impact.

Maybe not a performance impact per se on the inner logic, but the outer user interface would be more restrictive. This isn't my project; I'm just trying to help.

sseveran commented 1 year ago

@starrtennis I made this PR to solve a specific problem I was having. I have not considered making it more general. If I have time, I can revisit it in the future, but this has worked well enough for the crawling I am doing for now.

I am also fine with closing this PR if that is desired and maintaining my own fork with this functionality.

Gallaecio commented 1 year ago

@sseveran Could you check if https://github.com/scrapy/protego/pull/34 works for you?