t1gor / Robots.txt-Parser-Class

PHP class for robots.txt parsing
MIT License

Fail for User-agent rule #64

Closed LeMoussel closed 8 years ago

LeMoussel commented 8 years ago

Fail for User-agent rule.

To produce the bug with Googlebot user agent:

    $robotsTxtContent = "
User-agent: Googlebot
Disallow: /deny_googlebot/$";
    $robotsTxtParser = new RobotsTxtParser($robotsTxtContent);
    var_dump($robotsTxtParser->isAllowed("http://mysite.com/deny_googlebot/", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0") == false);
    // bool(false)
    // BUG: Should return bool(true)
JanPetterMG commented 8 years ago

Sure about that?

If I turn it the other way, removing the == false and switching to isDisallowed, it does the exact same check. The only difference is the human readability.

var_dump($robotsTxtParser->isDisallowed("http://mysite.com/deny_googlebot/", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0"));
//returns bool(false)

I see no reason for Mozilla Firefox to be denied when there is no matching user-agent and no generic rule covering non-listed user agents either.

From Google's Robots.txt Specifications:

By default, there are no restrictions for crawling
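
A minimal sketch of that default, reusing the RobotsTxtParser calls from the report above; passing the short bot name as the second argument is an assumption here, based on the usage discussed further down in this thread:

    // Sketch only, not a definitive test case.
    $robotsTxtContent = "User-agent: Googlebot\nDisallow: /deny_googlebot/$";
    $parser = new RobotsTxtParser($robotsTxtContent);

    // No "User-agent: *" group and no group matching Firefox, so an
    // unlisted user agent falls back to "no restrictions":
    var_dump($parser->isDisallowed("http://mysite.com/deny_googlebot/", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0"));
    // bool(false)

    // Checking with the bot name instead should hit the Googlebot group
    // (assuming the trailing $ end-of-URL anchor is honoured):
    var_dump($parser->isDisallowed("http://mysite.com/deny_googlebot/", "googlebot"));
    // expected: bool(true)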

LeMoussel commented 8 years ago

This is the Googlebot user agent. From https://support.google.com/webmasters/answer/1061943

JanPetterMG commented 8 years ago

Googlebot looks like this: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html), or even Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html). All of them contain the word Googlebot...

Are you trying to catch the requests, e.g. blocking them or similar? When Googlebot wants to access your site, it caches the robots.txt (for up to 24h) on Google's servers and checks for user-agents listed as googlebot, or maybe googlebot-news. Those are the selected rules.

I have never heard of any bot/crawler that exclusively checks the full user-agent and not the bot's name... Remember that the parsing process is always done remotely, by the crawler, while all you do is set the rules. If the rules aren't followed, I would generally recommend blocking that user-agent, preventing it from accessing your site at all...

For example, SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html) always matches one of the googlebot user-agent groups when parsing; a rough sketch of that kind of name extraction follows below.
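
Purely as an illustration (none of this is the library's code), this is roughly how a crawler could reduce its full User-Agent string to the short bot name that robots.txt groups are matched against; the helper name and regexes are hypothetical:

    // Hypothetical helper: pull the bot name out of a full User-Agent string.
    function extractBotName($userAgent)
    {
        // Most bots follow the "(compatible; BotName/1.0; +http://...)" convention.
        if (preg_match('/\(compatible;\s*([a-z0-9._-]+)/i', $userAgent, $match)) {
            return strtolower($match[1]);
        }
        // Otherwise fall back to the first product token, e.g. "SomeBot/2.0".
        if (preg_match('/^([a-z0-9._-]+)/i', $userAgent, $match)) {
            return strtolower($match[1]);
        }
        return '*';
    }

    echo extractBotName('Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)') . PHP_EOL;
    // googlebot
    echo extractBotName('SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)') . PHP_EOL;
    // googlebot-mobile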

Maybe you are thinking of anything I didn't cover in this post?

JanPetterMG commented 8 years ago

On the client side (the crawler), there is no problem adding a user-agent parser that checks either its own full user-agent string or one or more parts of it, but that doesn't really make any difference, since no one (on the server side) lists any rules in their robots.txt files that would eventually match.

That means you could deny any crawler identifying itself as like Gecko, or even AppleWebKit, but I doubt most crawlers would honor these rules at all. Anyway, it's an improvement.

On the other hand, this is still an improvement worth adding: if anyone checks the rules against the full user-agent string, they will never find any matching rule, which means crawling may be allowed (which it really shouldn't be).
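
As a rough sketch of that client-side idea (assuming the same RobotsTxtParser API as above; the crawler UA, URL and matching policy are made up for illustration), a crawler could check every product token it identifies itself with:

    // Illustrative only: check each product token from the crawler's own UA
    // string against the rules, not just its primary name.
    // $parser is a RobotsTxtParser instance, as in the earlier snippets.
    $url    = 'http://mysite.com/some/path';                          // hypothetical URL
    $fullUa = 'MyCrawler/1.0 AppleWebKit/537.36 (KHTML, like Gecko)'; // hypothetical UA
    preg_match_all('/([a-z0-9._-]+)\/[0-9.]+/i', $fullUa, $matches);
    $tokens  = array_unique(array_map('strtolower', $matches[1]));    // mycrawler, applewebkit
    $blocked = false;
    foreach ($tokens as $token) {
        if ($parser->isDisallowed($url, $token)) {
            $blocked = true; // some group (e.g. "User-agent: AppleWebKit") disallows this URL
            break;
        }
    }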

JanPetterMG commented 8 years ago

Actually, I think this link clarifies the issue: https://support.google.com/webmasters/answer/1061943

The full User-Agent string which is used when crawling is not the same as the one used in robots.txt, the X-Robots-Tag header and the robots meta tag.

After some research about user-agents in general: the full user-agent string from a bot may often vary in order to force the webpage to serve different layouts and pages (e.g. mobile or desktop). This should under no circumstances affect the rules, as the rules should be applied to the content only, not to what device the content is optimized for.

Also, parsing the full string may be an incredibly painful process.

If anyone needs to deny a specific device (or e.g. a browser technology) from accessing their site, they need to parse the user-agent string (e.g. by searching for a specific word) and either block or redirect the request to a different page. The robots.txt directives are both made and intended for robots only, not for device restrictions! In addition to that, robots.txt rules only apply when the request is not on behalf of a user...

All bots should have a name, identifiable from the User-Agent string, which is used to determine what rules to apply. If your bot has multiple names (which it shouldn't), you should check each of them. By multiple, I mean completely different names, not e.g. googlebot-news and googlebot-images (both of which are matched as googlebot if no other, more specific user-agent group is defined; this is a supported feature in this parser library).
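
A small sketch of that fallback, assuming it behaves as described (the URLs and paths are made up):

    // googlebot-news has no group of its own here, so per the fallback
    // described above it should be matched against the googlebot group.
    $parser = new RobotsTxtParser("User-agent: googlebot\nDisallow: /private/");
    var_dump($parser->isDisallowed("http://mysite.com/private/page.html", "googlebot-news"));
    // expected: bool(true)
    var_dump($parser->isDisallowed("http://mysite.com/public/page.html", "googlebot-news"));
    // expected: bool(false)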

The Enhancement label: some sort of warning, or even an error message, should be generated if the user tries to check for rules using the full user-agent string! (@todo)

Wrong usage may lead to checking against the wrong rules, which in the long run may lead to blocking of the IP addresses used. The worst-case scenario would be a listing on public blacklists on the internet, preventing you from accessing any site at all...

SpiderBro commented 8 years ago

User-agent matching hasn't really changed since the original spec:

The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended. http://www.robotstxt.org/orig.html
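
For reference, a standalone illustration of that recommendation (not this library's code; the helper name is hypothetical):

    // Liberal, case-insensitive substring match of the robot name against a
    // robots.txt group name, with version information stripped first.
    function uaMatchesGroup($robotName, $groupName)
    {
        $robotName = strtolower(explode('/', $robotName, 2)[0]); // "Googlebot/2.1" -> "googlebot"
        $groupName = strtolower(explode('/', $groupName, 2)[0]);
        if ($groupName === '*') {
            return true; // the wildcard group matches every robot
        }
        return strpos($robotName, $groupName) !== false;
    }

    var_dump(uaMatchesGroup('Googlebot-News/2.1', 'googlebot')); // bool(true)
    var_dump(uaMatchesGroup('Googlebot/2.1', 'bingbot'));        // bool(false)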

JanPetterMG commented 8 years ago

Just want to add a few things: providing the version along with the user-agent name is supported by this library, e.g. googlebot/2.1. The same goes for tags, which are also supported, e.g. googlebot-news.

  1. Check for exact match: googlebot-images/1.0
  2. Strip version number: googlebot-images
  3. Remove tags (if multiple, remove one after one): googlebot
  4. Fallback to default: *

https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt#order-of-precedence-for-user-agents
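
A rough sketch of those precedence steps (illustrative only, not the library's actual implementation); $groups stands for the user-agent names that actually appear in the robots.txt file:

    function selectUserAgentGroup($userAgent, array $groups)
    {
        $userAgent = strtolower($userAgent);
        $groups    = array_map('strtolower', $groups);

        // 1. Exact match, e.g. "googlebot-images/1.0".
        if (in_array($userAgent, $groups, true)) {
            return $userAgent;
        }
        // 2. Strip the version number: "googlebot-images/1.0" -> "googlebot-images".
        $userAgent = explode('/', $userAgent, 2)[0];
        if (in_array($userAgent, $groups, true)) {
            return $userAgent;
        }
        // 3. Remove tags one by one: "googlebot-images" -> "googlebot".
        while (($pos = strrpos($userAgent, '-')) !== false) {
            $userAgent = substr($userAgent, 0, $pos);
            if (in_array($userAgent, $groups, true)) {
                return $userAgent;
            }
        }
        // 4. Fall back to the default group.
        return '*';
    }

    echo selectUserAgentGroup('googlebot-images/1.0', ['googlebot', '*']); // googlebot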

JanPetterMG commented 8 years ago

Version 0.2.2 now generates a warning if the user-agent format is unsupported. It also tells you what format to use.
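
A hedged usage note on what that looks like in practice, reusing the calls from earlier in this thread (the exact warning text is not reproduced here):

    $parser = new RobotsTxtParser("User-agent: googlebot\nDisallow: /deny_googlebot/");

    // Supported format: the short robot name, optionally with a version.
    $parser->isDisallowed("http://mysite.com/deny_googlebot/", "googlebot");

    // Unsupported format: a full browser UA string should now trigger the warning.
    $parser->isDisallowed("http://mysite.com/deny_googlebot/", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0");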