How to get the content type and prevent crawling for example feeds?

GoogleCodeExporter commented 9 years ago

Hello Yasser,

I really like your crawler and use it for my project. I have now run into a 
problem: the crawler does not stop at XML feeds and crawls them as well. I am 
looking for an easy way to keep it from crawling XML feeds. The crawler should 
only crawl HTML pages.

My last question was answered really quickly some time ago, so I thought I 
would ask you again.

How can I get the MIME type, for example in the shouldVisit() function, or 
possibly in the visit(Page page) function? Can you give me any hints on how to 
solve my issue?

Best regards 
Frank

Original issue reported on code.google.com by frank.ro...@gmail.com on 5 Mar 2012 at 4:57

GoogleCodeExporter commented 9 years ago

In the shouldVisit function, the page has not been downloaded yet, so the MIME 
type is not known. It is available in the visit function, but that is also too 
late, because by that time the page has already been downloaded. You can at 
least skip processing it, though. I will add this to the feature requests so 
that this case is handled better.
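As a rough illustration of the two checkpoints described here, the following is a minimal sketch, not part of crawler4j itself. The class name `FeedFilter`, both method names, and the URL pattern are hypothetical and would need adjusting to the sites being crawled. The first check looks only at the URL, so it can run before the download; the second inspects a Content-Type header value, so it can only run afterwards.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper; not part of crawler4j.
final class FeedFilter {

    // Assumed pattern: URL endings that often indicate a feed rather than an HTML page.
    private static final Pattern FEED_URL =
            Pattern.compile(".*(\\.(xml|rss|atom)|/feed/?)(\\?.*)?$");

    private FeedFilter() {
    }

    /** Pre-download heuristic: decides from the URL alone, so it may miss or misjudge some feeds. */
    static boolean looksLikeFeedUrl(String url) {
        return FEED_URL.matcher(url.toLowerCase(Locale.ROOT)).matches();
    }

    /** Post-download check: true only if the Content-Type header value names an HTML type. */
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mime.equals("text/html") || mime.equals("application/xhtml+xml");
    }
}
```

In a crawler subclass, `looksLikeFeedUrl(url.getURL())` could be called from shouldVisit to skip obvious feed URLs, and `isHtml(...)` could be called at the top of visit to return early for non-HTML responses. This assumes the crawler4j version in use exposes the response content type on `Page` (later releases have a `getContentType()` accessor; check the version you are running).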

-Yasser

Original comment by ganjisaffar@gmail.com on 12 Mar 2012 at 3:14

Changed state: Accepted
Added labels: Type-Enhancement
Removed labels: Type-Defect

GoogleCodeExporter commented 9 years ago

Fixed in revision: c874761011d6

Original comment by avrah...@gmail.com on 22 Aug 2014 at 1:16

Changed state: Fixed

mohankreddy / crawler4j

How to get the content type and prevent crawling for example feeds? #133