t1gor / Robots.txt-Parser-Class

PHP class for parsing robots.txt
MIT License

Issue with robots.txt wildcards/end anchors #10

Closed: SpiderBro closed this issue 8 years ago

SpiderBro commented 10 years ago

The following robots.txt will be interpreted by the class as denying everything:

User-Agent:
Disallow: /
Allow: /$

However, Google will interpret this (correctly) as allowing the homepage only.

I wonder if it's possible to update the class to correctly interpret these rules?
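
For illustration, here is the expected outcome expressed with the library's isAllowed() check (the same method used in the tests further down this thread). The robots.txt is quoted as reported above, and the expected values describe Google's interpretation rather than what the class currently returns (which, as reported, denies everything):

$parser = new RobotsTxtParser('
User-Agent:
Disallow: /
Allow: /$
');

// Expected results under Google's interpretation (not the class's current output):
var_dump($parser->isAllowed('/'));     // should be bool(true)  - homepage allowed
var_dump($parser->isAllowed('/page')); // should be bool(false) - everything else disallowed
var_dump($parser->isAllowed('/a/b'));  // should be bool(false)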

Thanks!

t1gor commented 10 years ago

That's a good one. Let me see what we can do ...

t1gor commented 10 years ago

I have reviewed the example you provided and I am now confused :)

According to Google, the trailing wildcard after "/" is ignored (see the robots.txt specifications).

"/*" => "/" == "/$" (?)

Which means that you are trying to forbid the home page and allow it at the same time. If I get it right, you are trying to disallow all pages except for the home page, right?
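
For reference, under Google's matching rules "/" and "/*" are indeed equivalent (both match every path), but "/$" is not: the $ anchors the end of the URL, so it matches the homepage only. A rough illustration using approximate regex equivalents (not code from this library):

// Illustration only: approximate regex equivalents of the three patterns.
$patterns = array(
    '/'  => '@^/@',    // matches every path
    '/*' => '@^/.*@',  // same effect as "/"
    '/$' => '@^/$@',   // matches "/" only
);

foreach ($patterns as $rule => $regex) {
    foreach (array('/', '/page', '/dir/') as $path) {
        printf("%-3s vs %-6s => %s\n", $rule, $path,
            preg_match($regex, $path) ? 'match' : 'no match');
    }
}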

I will do some coding to prevent this, but probably later. If you have a working code snippet, please consider creating a pull-request.

Thanks.

SpiderBro commented 10 years ago

Interesting. I can confirm that Google interprets that file as allowing the homepage (/) only. You can test this with the Webmaster Tools Robots.txt tester:

(screenshots from the tester: "/" reported as allowed, other URLs reported as disallowed)

I believe this is because Google interprets the more specific allow as taking precedence.
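
That matches Google's documented conflict resolution: among all Allow and Disallow rules that match a URL, the one with the longest (most specific) path wins, and when an Allow and a Disallow tie, the Allow wins. A minimal sketch of that precedence logic (the function names here are hypothetical, not this library's API):

// Sketch of Google-style precedence; isAllowedByLongestMatch() and
// ruleMatches() are hypothetical helpers, not part of this library.
function ruleMatches($rule, $path) {
    // Treat a trailing "$" as an end-of-URL anchor and "*" as a wildcard.
    $anchorEnd = substr($rule, -1) === '$';
    if ($anchorEnd) {
        $rule = substr($rule, 0, -1);
    }
    $regex = '@^' . str_replace('\*', '.*', preg_quote($rule, '@')) . ($anchorEnd ? '$' : '') . '@';
    return (bool) preg_match($regex, $path);
}

function isAllowedByLongestMatch(array $allowRules, array $disallowRules, $path) {
    $longest = function (array $rules) use ($path) {
        $best = -1;
        foreach ($rules as $rule) {
            if (ruleMatches($rule, $path)) {
                $best = max($best, strlen($rule));
            }
        }
        return $best;
    };

    // Longest matching rule wins; on a tie, Allow wins; no match at all => allowed.
    return $longest($allowRules) >= $longest($disallowRules);
}

// With "Allow: /$" and "Disallow: /":
//   isAllowedByLongestMatch(array('/$'), array('/'), '/')     => true  (homepage allowed)
//   isAllowedByLongestMatch(array('/$'), array('/'), '/page') => false (everything else disallowed)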

SpiderBro commented 10 years ago

To add, I think my syntax may be a bit iffy. This also works and I think is more standard:

User-Agent: *
Allow: /$
Disallow: /

The $ symbol, incidentally, is an end anchor (it marks the end of the URL, so "/$" denotes the homepage). Put another way, to deny the homepage only, you would use:

Disallow: /$

This would allow everything but the homepage, and works currently with your script.

I appreciate your help, incidentally! There don't seem to be any good libraries out there that handle robots exclusion the way Google does, so I think this is a valuable project.

t1gor commented 10 years ago

@SpiderBro, thanks for the comments. I'll try to write some code to fix this shortly, but I'm not sure when exactly, unfortunately :(

JanPetterMG commented 8 years ago

This is a hard one for sure!

$parser = new RobotsTxtParser('
User-Agent: *
Allow: /$
Disallow: /
');
var_dump($parser->isAllowed("/"));     // bool(true)
var_dump($parser->isAllowed("/asd"));  // bool(false)
var_dump($parser->isAllowed("/asd/")); // bool(true) [BUG]

$parser = new RobotsTxtParser('
User-Agent: *
Disallow: /$
');
var_dump($parser->isAllowed("/"));     // bool(false)
var_dump($parser->isAllowed("/asd"));  // bool(true)
var_dump($parser->isAllowed("/asd/")); // bool(false) [BUG]

From the checkRule() function:

        $directives = array(self::DIRECTIVE_DISALLOW, self::DIRECTIVE_ALLOW);
        foreach ($directives as $directive) {
            if (isset($this->rules[$userAgent][$directive])) {
                foreach ($this->rules[$userAgent][$directive] as $robotRule) {
                    // change @ for \@
                    $escaped = strtr($robotRule, array("@" => "\@"));

                    // match result
                    if (preg_match('@' . $escaped . '@', $value)) {
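                        // note: no break and no specificity comparison here,
                        // so whichever matching rule is checked last (Allow
                        // rules come after Disallow rules) sets the result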
                        $result = ($rule === $directive);
                    }
                }
            }
        }

The checkRule() function, and regex in general, are way out of my comfort zone, but I'll look into it. No guarantees!

EDIT, after a lot of failures: could all of this be as simple as adding a string length check? If a $ is found, make sure the path length equals the rule length?
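
For what it's worth, both [BUG] results above come from the same place: the rule "/$" reaches preg_match() almost verbatim as the pattern @/$@. It isn't anchored at the start, and its trailing $ acts as a regex end-of-subject anchor, so any path ending in a slash (like "/asd/") matches it. The string-length idea would work for literal rules, but it gets tricky once a rule also contains "*". A sketch of an alternative, assuming only "*" and a trailing "$" carry special meaning in a rule: translate the rule into a properly anchored regex before matching.

// Sketch only, not a tested patch: convert one robots.txt rule into an
// anchored regex, then use it in place of the raw $escaped pattern in checkRule().
function robotsRuleToRegex($robotRule) {
    // A trailing "$" in robots.txt means "end of URL": remember it and strip it,
    // so it doesn't get escaped as a literal character below.
    $anchorEnd = substr($robotRule, -1) === '$';
    if ($anchorEnd) {
        $robotRule = substr($robotRule, 0, -1);
    }

    // Escape regex metacharacters, then restore the robots.txt wildcard "*" as ".*".
    $escaped = str_replace('\*', '.*', preg_quote($robotRule, '@'));

    // Anchor at the start of the path; add the end anchor only when the rule asked for it.
    return '@^' . $escaped . ($anchorEnd ? '$' : '') . '@';
}

// With this translation:
//   "/$"  => "@^/$@"    matches "/" but not "/asd/"
//   "/"   => "@^/@"     matches every path
//   "/a*" => "@^/a.*@"  matches "/a", "/asd", ...

Matching would still need the longest-match precedence described earlier in the thread; checkRule() currently just lets the last matching directive win.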