spatie / robots-txt

Determine if a page may be crawled from robots.txt, robots meta tags and robot headers
https://spatie.be/en/opensource/php
MIT License

Custom UserAgent mismatches due to parseUserAgent() #25

Closed · muhci closed 4 years ago

muhci commented 4 years ago

When using a custom user agent:

```php
$robots = Robots::create('UserAgent007');
```

with a robots.txt as follows:

```
User-agent: UserAgent007
Disallow: /
```

These two never match, because the robots.txt `User-agent` line is lowercased during parsing, while the user agent supplied by the caller is left as-is. Here is where the lowercase conversion happens, before the array-key check:

```php
protected function parseUserAgent(string $line): string
{
    return trim(str_replace('user-agent', '', strtolower(trim($line))), ': ');
}
```
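For illustration, tracing that method on the example group line shows the lowercased key it ends up storing:

```php
// trim('User-agent: UserAgent007')     => 'User-agent: UserAgent007'
// strtolower(...)                      => 'user-agent: useragent007'
// str_replace('user-agent', '', ...)   => ': useragent007'
// trim(..., ': ')                      => 'useragent007'
$key = trim(str_replace('user-agent', '', strtolower(trim('User-agent: UserAgent007'))), ': ');
// $key === 'useragent007'
```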

The lookup below still expects the user agent as defined (`UserAgent007`) to match the parsed, lowercased key (`useragent007`); the keys never match, so the rules end up being ignored:

```php
$disallows = $this->disallowsPerUserAgent[$userAgent] ?? $this->disallowsPerUserAgent['*'] ?? [];
```
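A minimal fix would be to lowercase the caller-supplied user agent before the lookup, so both sides of the key comparison are normalized the same way (a sketch; not necessarily how PR #26 actually resolved it):

```php
// Normalize the lookup key to match the lowercased keys
// produced by parseUserAgent().
$disallows = $this->disallowsPerUserAgent[strtolower($userAgent)]
    ?? $this->disallowsPerUserAgent['*']
    ?? [];
```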

freekmurze commented 4 years ago

Could you PR a fix for this? Make sure to include tests.
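For illustration, a regression test could look roughly like this (a sketch, assuming the package's `RobotsTxt::create()` and `allows()` methods; the test class and method names are hypothetical):

```php
use PHPUnit\Framework\TestCase;
use Spatie\Robots\RobotsTxt;

class CustomUserAgentTest extends TestCase
{
    /** @test */
    public function it_matches_a_mixed_case_user_agent()
    {
        $robotsTxt = RobotsTxt::create("User-agent: UserAgent007\nDisallow: /");

        // With the fix, the mixed-case agent matches the (lowercased)
        // parsed group, so the Disallow rule applies.
        $this->assertFalse($robotsTxt->allows('/', 'UserAgent007'));
    }
}
```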

BinaryKitten commented 4 years ago

Has this been fixed with #26?
(came here from "good first issue")

muhci commented 4 years ago

> Has this been fixed with #26? (came here from "good first issue")

Yes, it's fixed.

BinaryKitten commented 4 years ago

cool - @freekmurze, can we get this issue closed? I'd love to contribute, so I was looking for a "good first issue", but it seems this has already been fixed.