spatie / robots-txt

Determine if a page may be crawled from robots.txt, robots meta tags and robot headers
https://spatie.be/en/opensource/php
MIT License

False positives on bare domains with no trailing slash #15

Closed mikemike closed 5 years ago

mikemike commented 5 years ago

After some debugging, I've discovered that providing a URL without a trailing slash (https://example.com rather than https://example.com/) fails certain checks, notably the robots.txt mayIndex() check.

This makes sense: when a bare URL is parsed, no path is returned. If the robots.txt file also contains a blank Disallow: rule (which a lot do), the empty rule is matched against the empty path, the two compare as equal, and mayIndex() responds false.
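To make the mechanism concrete, here is a minimal sketch in Python rather than the library's PHP. The function `naive_may_index` and its matching logic are hypothetical, written only to reproduce the behaviour described above; they are not the actual code in RobotsTxt.php.

```python
from urllib.parse import urlparse


def naive_may_index(url: str, disallows: list[str]) -> bool:
    """Hypothetical matcher that reproduces the reported behaviour.

    This is an illustration only, not the library's implementation.
    """
    # A bare domain parses to an empty path; with a trailing slash it is "/".
    path = urlparse(url).path
    for rule in disallows:
        # Exact match: a blank rule equals the blank path of a bare domain.
        if path == rule:
            return False
        # Directory rule: anything under "/dir/" is disallowed.
        if rule.endswith("/") and path.startswith(rule):
            return False
    return True


# A robots.txt with a blank "Disallow:" line yields one empty rule.
blank_rule = [""]
bare = naive_may_index("https://example.com", blank_rule)
with_slash = naive_may_index("https://example.com/", blank_rule)
```

Under these assumptions `bare` is `False` while `with_slash` is `True`, even though a blank Disallow: conventionally means that nothing is disallowed.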

Three possible fixes:

  1. Update the docs to be clearer. A simple note saying that a trailing slash is required for bare domains would do.
  2. Add a slash if a bare URL is provided without one.
  3. When looping through the URLs to check against (line 49 in RobotsTxt.php), check whether the left side is an empty string and, if so, ignore it.
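Fixes 2 and 3 could look roughly like the following. This is again a Python sketch with hypothetical names (`normalized_path`, `may_index`), not a patch against the library; it assumes the same simplified matching as a stand-in for the real check.

```python
from urllib.parse import urlparse


def normalized_path(url: str) -> str:
    # Fix 2: treat a bare domain as the root path "/".
    return urlparse(url).path or "/"


def may_index(url: str, disallows: list[str]) -> bool:
    """Hypothetical matcher combining fixes 2 and 3 (illustration only)."""
    path = normalized_path(url)
    for rule in disallows:
        # Fix 3: a blank Disallow rule forbids nothing, so skip it.
        if rule == "":
            continue
        # Exact match, or a directory rule that prefixes the path.
        if path == rule or (rule.endswith("/") and path.startswith(rule)):
            return False
    return True
```

Either fix alone is enough to resolve the bare-domain case in this sketch. Applying both also keeps `Disallow: /` working for a bare domain, because the normalized path "/" then matches the rule exactly.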

spatie-bot commented 5 years ago

Dear contributor,

because this issue seems to have been inactive for quite some time now, I've automatically closed it. If you feel this issue deserves some attention from my human colleagues, feel free to reopen it.