That's a good one. Let me see what we can do ...
I have reviewed the example you provided and I am now confused :)
According to Google, the trailing wildcard after "/" will be ignored.
"/*" => "/" == "/$" (?)
Which means that you are trying to forbid the home page and allow it at the same time. If I get it right, you are trying to disallow all pages except for the home page, right?
I will do some coding to fix this, but probably later. If you have a working code snippet, please consider creating a pull request.
Thanks.
Interesting. I can confirm that Google interprets that file as allowing the homepage (/) only. You can test this with the Webmaster Tools Robots.txt tester.
I believe this is because Google interprets the more specific allow as taking precedence.
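As I understand Google's documented behaviour, the longest (most specific) matching rule wins, and Allow wins a tie. Just to illustrate that precedence idea on its own, here is a rough standalone sketch (a hypothetical decide() helper with made-up example paths, not this library's API, and it only does plain prefix matching):

// Standalone sketch of "longest matching rule wins, Allow wins ties".
// Wildcard (*) and end-anchor ($) handling are deliberately left out here.
function decide(array $rules, $path)
{
    $bestLength = -1;
    $allowed = true; // no matching rule means the path is allowed

    foreach ($rules as list($type, $rulePath)) {
        if (strpos($path, $rulePath) !== 0) {
            continue; // rule does not even match as a prefix
        }
        $length = strlen($rulePath);
        if ($length > $bestLength || ($length === $bestLength && $type === 'allow')) {
            $bestLength = $length;
            $allowed = ($type === 'allow');
        }
    }

    return $allowed;
}

// Hypothetical rules: the longer Allow overrides the shorter Disallow.
var_dump(decide(array(array('disallow', '/shop/'), array('allow', '/shop/sale')), '/shop/sale/item')); // bool(true)
var_dump(decide(array(array('disallow', '/shop/'), array('allow', '/shop/sale')), '/shop/basket'));    // bool(false)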
To add, I think my syntax may be a bit iffy. This also works and I think is more standard:
User-Agent: *
Allow: /$
Disallow: /
The $ symbol, incidentally, is an end anchor (here used to denote the homepage). Put another way, to deny the homepage only, you would use:
Disallow: /$
This would allow everything but the homepage, and works currently with your script.
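Expressed as a regex (just to illustrate the anchoring, not how the class actually builds its patterns), Disallow: /$ corresponds to a pattern anchored at both ends, so only the bare homepage path can match:

var_dump((bool) preg_match('@^/$@', '/'));     // true  - only the homepage is denied
var_dump((bool) preg_match('@^/$@', '/asd'));  // false - everything else stays allowed
var_dump((bool) preg_match('@^/$@', '/asd/')); // false - a trailing slash alone is not enough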
I appreciate your help, incidentally! There don't seem to be any good libraries out there that handle robots exclusion the way Google does, so I think this is a valuable project.
@SpiderBro, thanks for the comments. I'll try to write some code to fix this shortly, but I'm not sure when exactly, unfortunately :(
This is a hard one for sure!
$parser = new RobotsTxtParser('
User-Agent: *
Allow: /$
Disallow: /
');
var_dump($parser->isAllowed("/")); // bool(true)
var_dump($parser->isAllowed("/asd")); // bool(false)
var_dump($parser->isAllowed("/asd/")); // bool(true) [BUG]
$parser = new RobotsTxtParser('
User-Agent: *
Disallow: /$
');
var_dump($parser->isAllowed("/")); // bool(false)
var_dump($parser->isAllowed("/asd")); // bool(true)
var_dump($parser->isAllowed("/asd/")); // bool(false) [BUG]
From the checkRule() function:
$directives = array(self::DIRECTIVE_DISALLOW, self::DIRECTIVE_ALLOW);
foreach ($directives as $directive) {
    if (isset($this->rules[$userAgent][$directive])) {
        foreach ($this->rules[$userAgent][$directive] as $robotRule) {
            // change @ for \@
            $escaped = strtr($robotRule, array("@" => "\@"));
            // match result
            if (preg_match('@' . $escaped . '@', $value)) {
                $result = ($rule === $directive);
            }
        }
    }
}
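If I read that right, the rule ends up in the regex almost verbatim: the $ in /$ becomes a regex end anchor, and with no ^ at the start a rule like /$ matches any path that merely ends in a slash, which is exactly the /asd/ results above. One possible direction (sketched with a hypothetical ruleToPattern() helper, not actual library code, and I may well be missing edge cases) would be to escape the whole rule first and then re-introduce only the characters robots.txt treats specially:

// Sketch: turn a robots.txt path rule into a safely anchored regex.
function ruleToPattern($robotRule)
{
    $escaped = preg_quote($robotRule, '@');        // escape regex metacharacters and the @ delimiter
    $escaped = str_replace('\*', '.*', $escaped);  // * in robots.txt matches any sequence
    if (substr($escaped, -2) === '\$') {
        $escaped = substr($escaped, 0, -2) . '$';  // a trailing $ anchors the end of the path
    }
    return '@^' . $escaped . '@';                  // rules always match from the start of the path
}

var_dump((bool) preg_match(ruleToPattern('/$'), '/asd/')); // false - no longer matches
var_dump((bool) preg_match(ruleToPattern('/'), '/asd/'));  // true  - a plain / still matches everything

The longest-match precedence between Allow and Disallow would still need handling separately; this only concerns how a single rule is matched.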
The checkRule() function, and regex in general, is way out of my comfort zone, but I'll look into it. No guarantees!
EDIT after a lot of failures:
Could all of this be as simple as implementing a string length check? If a $ is found, make sure the path length is equal to the rule length (minus the $)??
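For what it's worth, here is roughly what I mean as a bare sketch (a hypothetical matchesRule() helper, untested against the class, and it ignores * wildcards entirely):

// Rough idea: a rule ending in $ only matches when the path is exactly
// the rule without the trailing $ (which is effectively the length check).
function matchesRule($robotRule, $path)
{
    if (substr($robotRule, -1) === '$') {
        return $path === substr($robotRule, 0, -1);
    }
    return strpos($path, $robotRule) === 0; // otherwise a plain prefix match
}

var_dump(matchesRule('/$', '/'));     // bool(true)
var_dump(matchesRule('/$', '/asd/')); // bool(false)
var_dump(matchesRule('/', '/asd/'));  // bool(true)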
The following robots.txt will be interpreted by the class as denying everything:
User-Agent: *
Disallow: /
Allow: /$
However, Google will interpret this (correctly) as allowing the homepage only.
I wonder if it's possible to update the class to correctly interpret these rules?
Thanks!