t1gor / Robots.txt-Parser-Class

PHP class for robots.txt parsing
MIT License

Allow/Disallow rules not handled correctly #76

Open ogolovanov opened 8 years ago

ogolovanov commented 8 years ago

From https://yandex.com/support/webmaster/controlling-robot/robots-txt.xml?lang=ru#simultaneous

The Allow and Disallow directives from the corresponding User-agent block are sorted according to URL prefix length (from shortest to longest) and applied in order. If several directives match a particular site page, the robot selects the last one in the sorted list. This way the order of directives in the robots.txt file doesn't affect how they are used by the robot.

Source robots.txt:

```
User-agent: Yandex
Allow: /
Allow: /catalog/auto
Disallow: /catalog
```

Sorted robots.txt:

```
User-agent: Yandex
Allow: /
Disallow: /catalog
Allow: /catalog/auto
```

```php
$c = <<<ROBOTS
User-agent: *
Allow: /
Allow: /catalog/auto
Disallow: /catalog
ROBOTS;

$r = new RobotsTxtParser($c);
$url = 'http://test.ru/catalog/';
var_dump($r->isDisallowed($url));
```

Result: false
Expected result: true (both `Allow: /` and `Disallow: /catalog` match `/catalog/`, and `Disallow: /catalog` has the longer prefix, so the URL should be disallowed)
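
For reference, here is a minimal standalone sketch of the Yandex-style resolution quoted above. It is not this library's code; the rule array and the `isDisallowedYandexStyle()` helper are made up for illustration, and the equal-length tie case (where Yandex gives `Allow` priority) is not handled.

```php
<?php
// Sketch of Yandex-style precedence: sort rules by path length ascending,
// apply them in that order, and let the last matching rule win.
function isDisallowedYandexStyle(array $rules, string $path): bool
{
    // $rules: list of ['type' => 'allow'|'disallow', 'path' => '/...']
    usort($rules, function ($a, $b) {
        return strlen($a['path']) <=> strlen($b['path']);
    });

    $verdict = 'allow'; // no matching rule means the URL is allowed
    foreach ($rules as $rule) {
        if (strpos($path, $rule['path']) === 0) {
            $verdict = $rule['type']; // later (longer) matches override earlier ones
        }
    }
    return $verdict === 'disallow';
}

$rules = [
    ['type' => 'allow',    'path' => '/'],
    ['type' => 'allow',    'path' => '/catalog/auto'],
    ['type' => 'disallow', 'path' => '/catalog'],
];

var_dump(isDisallowedYandexStyle($rules, '/catalog/')); // bool(true)
```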

LeMoussel commented 7 years ago

For Google this is handled differently:

At a group-member level, in particular for allow and disallow directives, the most specific rule based on the length of the [path] entry will trump the less specific (shorter) rule. The order of precedence for rules with wildcards is undefined.

https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt?hl=en#google-supported-non-group-member-records
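
For comparison, a rough sketch of Google's longest-match rule under the same assumptions (hypothetical helper, wildcards ignored since their precedence is undefined). For the example above it reaches the same verdict as the Yandex sorting, because `/catalog` is the longest matching path for `/catalog/`.

```php
<?php
// Sketch of Google-style precedence: among all matching rules, the one with
// the longest path wins, regardless of its position in robots.txt.
function isDisallowedGoogleStyle(array $rules, string $path): bool
{
    $bestLength = -1;
    $verdict = 'allow'; // no matching rule means the URL is allowed
    foreach ($rules as $rule) {
        if (strpos($path, $rule['path']) === 0 && strlen($rule['path']) > $bestLength) {
            $bestLength = strlen($rule['path']);
            $verdict = $rule['type'];
        }
    }
    return $verdict === 'disallow';
}

$rules = [
    ['type' => 'allow',    'path' => '/'],
    ['type' => 'allow',    'path' => '/catalog/auto'],
    ['type' => 'disallow', 'path' => '/catalog'],
];

var_dump(isDisallowedGoogleStyle($rules, '/catalog/')); // bool(true): /catalog is the longest match
```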