mohankreddy / crawler4j

Automatically exported from code.google.com/p/crawler4j

How to get the content type and prevent crawling for example feeds? #133

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
Hello Yassir,

I really like your crawler and use it for my project. I have now run into a 
problem: the crawler does not stop at XML feeds and crawls them as well. I am 
looking for an easy way to prevent that. The crawler should only crawl HTML 
pages.

My last question was answered really quickly some time ago, so I thought I 
would ask you again. 

How can I get the MIME type, for example in the shouldVisit() function or, if 
that is not possible, in the visit(Page page) function? Can you give me any 
hints on how to solve my issue?

Best regards 
Frank

Original issue reported on code.google.com by frank.ro...@gmail.com on 5 Mar 2012 at 4:57

GoogleCodeExporter commented 9 years ago
In the shouldVisit function the page has not been downloaded yet, so its MIME 
type is not known. The MIME type is available in the visit function, but that 
is too late to avoid the download, because by then the page has already been 
fetched. You can at least skip processing it there. I will add this to the 
feature requests so that this case is handled better.

-Yasser

Original comment by ganjisaffar@gmail.com on 12 Mar 2012 at 3:14
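
The approach described above could be sketched roughly as follows. This is only an illustration, not code from crawler4j: the class `ContentTypeFilter` and its methods `isHtml` and `looksLikeFeed` are hypothetical names, and the list of feed-like URL endings is an assumption that will not catch every feed. The URL check is a guess that can run in shouldVisit() before anything is downloaded; the content-type check is meant for visit(), assuming the crawler4j version in use exposes the response content type on `Page` (check your version for the exact accessor).

```java
import java.util.Locale;

/** Hypothetical helper, not part of crawler4j. */
final class ContentTypeFilter {

    private ContentTypeFilter() {}

    /** Returns true when a Content-Type header value denotes an HTML page. */
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" and normalise the case.
        String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mime.equals("text/html") || mime.equals("application/xhtml+xml");
    }

    /**
     * Heuristic for use before download: guesses from the URL alone whether
     * it points at a feed. The endings below are assumptions, not a standard.
     */
    static boolean looksLikeFeed(String url) {
        if (url == null) {
            return false;
        }
        // Ignore the query string and fragment, then compare the path ending.
        String path = url.toLowerCase(Locale.ROOT).split("[?#]", 2)[0];
        return path.endsWith(".xml")
                || path.endsWith(".rss")
                || path.endsWith(".atom")
                || path.endsWith("/feed")
                || path.endsWith("/feed/");
    }
}
```

In a crawler subclass, shouldVisit() could return false when `looksLikeFeed` matches the URL, and visit() could return early when `isHtml` rejects the page's content type. The second check does not save the download; it only skips the processing.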

GoogleCodeExporter commented 9 years ago
Fixed in revision: c874761011d6

Original comment by avrah...@gmail.com on 22 Aug 2014 at 1:16