sjdirect / nrobots

The Robots Exclusion Protocol, or robots.txt protocol, is a convention used to prevent cooperating web spiders and other web robots from accessing all or part of a website that is otherwise publicly viewable. This project provides an easy-to-use class, implemented in C#, for working with robots.txt files.
Microsoft Public License

nrobots doesn't handle entries like /?/ properly #3

Open · sirjimjones opened this issue 9 years ago

sirjimjones commented 9 years ago

nrobots doesn't handle entries like "Disallow: /?/" properly.
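
For context, a repro sketch of the report. The `Robots.Robots` class and its `LoadContent`/`IsPathAllowed` members are assumptions based on how consumers such as abot typically use this library; the exact names in this repo may differ.

```csharp
using System;

class Repro
{
    static void Main()
    {
        // Assumed NRobots API surface (Robots.Robots, LoadContent, IsPathAllowed);
        // names may differ in this repo.
        var robots = new Robots.Robots();
        robots.LoadContent("User-agent: *\nDisallow: /?/\n", "http://example.com/");

        // Expected: only paths starting with "/?/" are disallowed.
        Console.WriteLine(robots.IsPathAllowed("*", "/?/page"));  // expected: False
        Console.WriteLine(robots.IsPathAllowed("*", "/allowed")); // expected: True, but the
        // rule is parsed as "Disallow: /", so this returns False as well.
    }
}
```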

sjdirect commented 9 years ago

Looks like NRobots is converting "Disallow: /?/" into "Disallow: /". The last check-in to that lib fixed a similar issue, but this obviously introduces other problems.
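
One hypothetical explanation (not verified against the NRobots source): a sanitizer that treats everything from `?` onward as a query string and strips it would collapse the pattern exactly this way. A minimal sketch of that failure mode, with both method names invented for illustration:

```csharp
// Hypothetical, for illustration only; this is not NRobots' actual code.
// Stripping a presumed query string from the rule collapses "/?/" into "/".
static string NormalizeDisallowValueBuggy(string value)
{
    int q = value.IndexOf('?');
    return q >= 0 ? value.Substring(0, q) : value; // "/?/" becomes "/"
}

// A possible fix: leave the '?' in place, since robots.txt rules are
// prefix-matched against the page's full path-and-query.
static string NormalizeDisallowValueFixed(string value)
{
    return value.Trim();
}
```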

sjdirect commented 9 years ago

Even though this bug still exists in the nrobots lib, abot provides a workaround for this and similar issues where robots.txt prevents the crawl. See https://github.com/sjdirect/abot/commit/9bd3d7d91ebefb6e03ee2c2a1b5140cc4020073c for details. There is now an isIgnoreRobotsDotTextIfRootDisallowedEnabled config value that, if set to true, ignores the robots.txt file when the root URI of the crawl is disallowed.
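
For reference, a minimal sketch of enabling that workaround programmatically. It assumes Abot's `CrawlConfiguration`/`PoliteWebCrawler` types, where the PascalCase property corresponds to the camelCase app.config attribute named above; constructor overloads vary across Abot versions.

```csharp
using System;
using Abot.Crawler;
using Abot.Poco;

class Program
{
    static void Main()
    {
        var config = new CrawlConfiguration
        {
            IsRespectRobotsDotTextEnabled = true,
            // Skip robots.txt entirely when it disallows the root URI of the crawl.
            IsIgnoreRobotsDotTextIfRootDisallowedEnabled = true
        };

        // Passing only the configuration is assumed here; some Abot versions
        // require the longer constructor with null'd-out dependencies.
        var crawler = new PoliteWebCrawler(config);
        CrawlResult result = crawler.Crawl(new Uri("http://example.com/"));
        Console.WriteLine(result.ErrorOccurred ? "Crawl failed" : "Crawl completed");
    }
}
```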