lidoapps / crawler4j

Automatically exported from code.google.com/p/crawler4j

Crawler ignores robots meta-tag from the page #59

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Crawl a page with <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"> between the <HEAD> tags

What is the expected output? What do you see instead?
Expected: Outgoing URLs are not listed if content is set to "NOFOLLOW".
Instead: The crawler does not pay any attention to the meta tag.

What version of the product are you using? On what operating system?
2.6.1

Please provide any additional information below.
It should be possible to stop the crawler from following links on a page or 
indexing the page.
http://www.robotstxt.org/meta.html
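
To illustrate the requested behaviour, here is a minimal sketch of how the robots meta tag could be read before outgoing links are collected. It is not crawler4j code: the class `RobotsMetaSketch` and the methods `parseDirectives`, `mayIndex` and `mayFollowLinks` are hypothetical names, and the regex-based parsing is an assumption made to keep the example dependency-free (a real crawler would more likely read the tag from its HTML parser).

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of crawler4j.
final class RobotsMetaSketch {
    // Any <meta ...> tag; its attributes are inspected separately.
    private static final Pattern META_TAG =
        Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // name="robots" (quotes optional, case-insensitive).
    private static final Pattern NAME_ATTR =
        Pattern.compile("\\bname\\s*=\\s*[\"']?robots\\b", Pattern.CASE_INSENSITIVE);
    // content="..." with the quoted value captured in group 1.
    private static final Pattern CONTENT_ATTR =
        Pattern.compile("\\bcontent\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    // Collects the lower-cased directives of every robots meta tag in the page.
    static Set<String> parseDirectives(String html) {
        Set<String> directives = new HashSet<>();
        Matcher tags = META_TAG.matcher(html);
        while (tags.find()) {
            String tag = tags.group();
            if (!NAME_ATTR.matcher(tag).find()) {
                continue; // some other meta tag, e.g. description
            }
            Matcher content = CONTENT_ATTR.matcher(tag);
            if (content.find()) {
                for (String token : content.group(1).split(",")) {
                    String directive = token.trim().toLowerCase(Locale.ROOT);
                    if (!directive.isEmpty()) {
                        directives.add(directive);
                    }
                }
            }
        }
        return directives;
    }

    // False when the page asks not to be indexed.
    static boolean mayIndex(String html) {
        return !parseDirectives(html).contains("noindex");
    }

    // False when the page asks that its links not be followed.
    static boolean mayFollowLinks(String html) {
        return !parseDirectives(html).contains("nofollow");
    }

    public static void main(String[] args) {
        String html = "<html><HEAD><META NAME=\"ROBOTS\" CONTENT=\"NOINDEX, NOFOLLOW\">"
            + "</HEAD><body><a href=\"/next\">next</a></body></html>";
        System.out.println("index:  " + mayIndex(html));
        System.out.println("follow: " + mayFollowLinks(html));
    }
}
```

With a check like this, the crawler could skip listing outgoing URLs whenever `mayFollowLinks` returns false, and skip handing the page to the indexing step whenever `mayIndex` returns false.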

Original issue reported on code.google.com by janne.pa...@documill.com on 20 Jul 2011 at 7:49

GoogleCodeExporter commented 9 years ago
Hi, can anyone please let me know whether this issue has been solved in any 
version? I think a good crawler must support this.

Thanks,
Naveen 

Original comment by naveensh...@gmail.com on 5 May 2013 at 3:45

GoogleCodeExporter commented 9 years ago

Original comment by avrah...@gmail.com on 18 Aug 2014 at 3:08

GoogleCodeExporter commented 9 years ago

Original comment by avrah...@gmail.com on 18 Aug 2014 at 3:10

GoogleCodeExporter commented 9 years ago

Original comment by avrah...@gmail.com on 18 Aug 2014 at 3:12