Crawler ignores robots meta-tag from the page

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Crawl a page with <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"> between the <HEAD> tags.

What is the expected output? What do you see instead?
Expected: Outgoing URLs are not listed if content is set to "NOFOLLOW".
Instead: The crawler does not pay any attention to the meta tag.

What version of the product are you using? On what operating system?
2.6.1

Please provide any additional information below.
It should be possible to stop the crawler from following links on a page or indexing the page.
http://www.robotstxt.org/meta.html
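
To illustrate the requested behaviour, here is a minimal sketch of how the robots meta tag could be read from a fetched page. `RobotsMetaDirectives` and its `parse` method are hypothetical names, not part of the crawler's API, and the regex-based matching is an assumption made to keep the example short; a real implementation would more likely read the tag from the crawler's own HTML parser.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of the crawler): reads robots meta directives from raw HTML. */
final class RobotsMetaDirectives {
    // Matches any <meta ...> tag; attributes are checked separately so their order does not matter.
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // Matches name=robots with or without quotes, case-insensitively.
    private static final Pattern NAME_ATTR =
            Pattern.compile("\\bname\\s*=\\s*[\"']?robots[\"']?(?=[\\s/>])", Pattern.CASE_INSENSITIVE);
    // Captures a quoted content="..." value (unquoted values are not handled in this sketch).
    private static final Pattern CONTENT_ATTR =
            Pattern.compile("\\bcontent\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    final boolean noIndex;
    final boolean noFollow;

    private RobotsMetaDirectives(boolean noIndex, boolean noFollow) {
        this.noIndex = noIndex;
        this.noFollow = noFollow;
    }

    /** Scans the HTML for robots meta tags and collects the NOINDEX / NOFOLLOW directives. */
    static RobotsMetaDirectives parse(String html) {
        boolean noIndex = false;
        boolean noFollow = false;
        Matcher tag = META_TAG.matcher(html);
        while (tag.find()) {
            String metaTag = tag.group();
            if (!NAME_ATTR.matcher(metaTag).find()) {
                continue; // some other meta tag, e.g. description
            }
            Matcher content = CONTENT_ATTR.matcher(metaTag);
            if (!content.find()) {
                continue;
            }
            // Directives are comma-separated and case-insensitive.
            for (String raw : content.group(1).toLowerCase(Locale.ROOT).split(",")) {
                String directive = raw.trim();
                if (directive.equals("noindex")) {
                    noIndex = true;
                } else if (directive.equals("nofollow")) {
                    noFollow = true;
                }
            }
        }
        return new RobotsMetaDirectives(noIndex, noFollow);
    }
}
```

With something like this in place, the crawler could skip scheduling a page's outgoing links when `noFollow` is true and skip handing the page to the indexing step when `noIndex` is true.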

Original issue reported on code.google.com by janne.pa...@documill.com on 20 Jul 2011 at 7:49

GoogleCodeExporter commented 9 years ago

Hi, can anyone please let me know if this issue has been solved in any version? I think a good crawler must support this.

Thanks,
Naveen

Original comment by naveensh...@gmail.com on 5 May 2013 at 3:45

GoogleCodeExporter commented 9 years ago

Original comment by avrah...@gmail.com on 18 Aug 2014 at 3:08

Added labels: Type-Enhancement
Removed labels: Type-Defect

GoogleCodeExporter commented 9 years ago

Original comment by avrah...@gmail.com on 18 Aug 2014 at 3:10

Added labels: Priority-Low
Removed labels: Priority-Medium

GoogleCodeExporter commented 9 years ago

Original comment by avrah...@gmail.com on 18 Aug 2014 at 3:12

Changed state: Accepted

lidoapps / crawler4j

Crawler ignores robots meta-tag from the page #59