temoto / robotstxt

A robots.txt exclusion protocol implementation for the Go language
MIT License

Different behavior on Google Webmaster Tools robots.txt checker and robotstxt-go #15

Closed: uforic closed this issue 8 years ago

uforic commented 8 years ago

I noticed that on Google Webmaster Tools robots.txt checker, the following robots.txt:

User-agent: *
Allow: /
Allow: /blog/*
Disallow: /*/*

will allow website.com/blog/article, as well as website.com/blog/article/.

However, when tested against robotstxt-go, only website.com/blog/article is allowed through; website.com/blog/article/ is not. To make robotstxt-go allow the second URL, I must add an extra line, so my robots.txt looks more like:

User-agent: *
Allow: /
Allow: /blog/*
Allow: /blog/*/
Disallow: /*/*

I'm running robotstxt-go with the GoogleBot user-agent. Any thoughts on whether this is expected behavior, or why this might be happening?
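For context, Google's documented matching rules say that `*` matches any sequence of characters (including `/`), and that when several rules match a path, the most specific (longest) pattern wins, with Allow winning ties. Under those rules `/blog/*` (7 characters) beats `/*/*` (4 characters) for both URLs, which would explain the Webmaster Tools result. A minimal stdlib-only sketch of that matching logic (illustrative only, not the library's actual implementation; all names here are made up):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// rule is one Allow/Disallow line from a robots.txt group (illustrative type).
type rule struct {
	allow   bool
	pattern string
}

// sampleRules mirrors the robots.txt from this issue.
var sampleRules = []rule{
	{true, "/"},
	{true, "/blog/*"},
	{false, "/*/*"},
}

// patternToRegexp compiles a robots.txt path pattern into an anchored regexp:
// `*` matches any character sequence (including `/`), and a trailing `$`
// anchors the pattern to the end of the path.
func patternToRegexp(pattern string) *regexp.Regexp {
	quoted := regexp.QuoteMeta(pattern)
	quoted = strings.ReplaceAll(quoted, `\*`, `.*`)
	if strings.HasSuffix(quoted, `\$`) {
		quoted = strings.TrimSuffix(quoted, `\$`) + "$"
	}
	return regexp.MustCompile("^" + quoted)
}

// allowed applies longest-match-wins: the matching rule with the longest
// pattern decides; on a tie, Allow wins. If nothing matches, the path is allowed.
func allowed(rules []rule, path string) bool {
	bestLen, bestAllow := -1, true
	for _, r := range rules {
		if patternToRegexp(r.pattern).MatchString(path) {
			n := len(r.pattern)
			if n > bestLen || (n == bestLen && r.allow) {
				bestLen, bestAllow = n, r.allow
			}
		}
	}
	return bestAllow
}

func main() {
	for _, path := range []string{"/blog/article", "/blog/article/", "/other/page"} {
		fmt.Printf("%s allowed=%v\n", path, allowed(sampleRules, path))
	}
	// Prints:
	// /blog/article allowed=true
	// /blog/article/ allowed=true
	// /other/page allowed=false
}
```

Under this reading, no extra `Allow: /blog/*/` line should be needed, since `/blog/*` already matches the trailing-slash URL.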

Thanks!

temoto commented 8 years ago

This seems like a bug in the parser; please wait.

temoto commented 8 years ago

@uforic please see the attached commit: there's a new test for a wildcard suffix, but it passes without changing any code. Maybe the robots.txt where it fails is a bit more complicated?

uforic commented 8 years ago

Apologies, I realize it had to do with some conflicting rules in my robots.txt. Sorry!