As I mentioned on http://code.google.com/p/crawler4j/, there was a bug in 3.0, and that version should not be used. I have provided the fix in 3.1.
Thanks,
Yasser
Original comment by ganjisaffar@gmail.com
on 18 Jan 2012 at 6:26
Thanks a lot for the information!
Original comment by mansur.u...@gmail.com
on 18 Jan 2012 at 7:36
Hi Yasser,
As I understood it, you fixed concurrency issues (for multiple crawlers) in PageFetcher.
However, my problem was different, and I was using only one crawler. As I mentioned above, at the beginning it fetches some links and then starts giving the error message
"ERROR [Crawler 6] Fatal transport error: Read timed out while fetching http://..." for almost all links.
If you test "http://www.ics.uci.edu/" with the 2.6 and 3.1 versions, you will see a significant difference.
That means the issue still exists in 3.1.
Thanks & Regards,
Mansur
Original comment by mansur.u...@gmail.com
on 18 Jan 2012 at 9:11
This doesn't reproduce for me. Can you share your controller code? Note that when you get timeouts after crawling for a while, it might be because the target server is blocking your IP for flooding it with requests.
-Yasser
Original comment by ganjisaffar@gmail.com
on 18 Jan 2012 at 10:53
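If flooding is the cause, slowing the crawl down is the usual mitigation. A minimal sketch of the relevant CrawlConfig settings (the values here are arbitrary illustrations, not recommendations):

CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder("/data/crawl/root");
// Delay between requests to the same host, in milliseconds; larger values
// make the crawler less likely to be throttled or blocked.
config.setPolitenessDelay(1000);
// Keeping the crawl small while debugging makes timeouts easier to isolate.
config.setMaxDepthOfCrawling(2);
config.setMaxPagesToFetch(500);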
[deleted comment]
Hi Yasser,
Here are my controller and crawler classes:
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {
    public static void main(String[] args) throws Exception {
        String crawlStorageFolder = "/data/crawl/root";
        int numberOfCrawlers = 7;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);
        config.setProxyHost("proxy");
        config.setProxyPort(8080);

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

        try {
            CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
            controller.addSeed("http://www.ics.uci.edu/");
            controller.start(MyCrawler.class, numberOfCrawlers);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
and my crawler class:
import java.util.logging.Logger;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetchResult;
import edu.uci.ics.crawler4j.fetcher.PageFetchStatus;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    private static final Logger logger = Logger.getLogger(MyCrawler.class.getName());

    private final static Pattern FILTERS = Pattern
            .compile(".*(\\.(css|js|bmp|gif|jpe?g"
                    + "|png|tiff?|mid|mp2|mp3|mp4"
                    + "|wav|avi|mov|mpeg|ram|m4v|pdf"
                    + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    /**
     * You should implement this function to specify whether the given URL
     * should be crawled or not (based on your crawling logic).
     */
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        /*return !FILTERS.matcher(href).matches()
                && href.startsWith("http://www.ics.uci.edu/");*/

        // If the following lines are deleted, the issue does not appear
        PageFetchResult fetchResult = getMyController().getPageFetcher().fetchHeader(url);
        int statusCode = fetchResult.getStatusCode();
        if (statusCode != PageFetchStatus.OK && statusCode != PageFetchStatus.Moved
                && statusCode != PageFetchStatus.PageTooBig) { // && statusCode != PageFetchStatus.PageIsBinary) { // if not fetched nor moved
            logger.info("Broken URL: " + url.getURL());
            return false;
        }
        // logger.info("Fetched URL: " + url.getURL());
        return true;
    }

    /**
     * This function is called when a page is fetched and ready to be
     * processed by your program.
     */
    @Override
    public void visit(Page page) {
        logger.info("Fetched URL: " + page.getWebURL().getURL());
    }
}
If I don't call the page fetcher in shouldVisit(), everything works fine; otherwise the previous "Fatal transport error" issue appears.
My assumption is that there is something wrong with the "mutex" of PageFetcher, and the next request to the target server is carried out before the timeout finishes.
If I use the old PageFetcher with static methods and fields, everything works fine.
Thanks & Regards,
Mansur
Original comment by mansur.u...@gmail.com
on 19 Jan 2012 at 2:17
Hi Mansur
I am also looking to get info about 404/301/302 responses. Based on your code, I understand that you are re-fetching the headers to check for broken links, which means one fetch by crawler4j and another by your code. I feel this is an expensive operation.
I have raised a suggestion to Yasser to provide a method in WebURL for getting the status code, which makes more sense. What do you say?
Regs
Original comment by w3engine...@gmail.com
on 19 Jan 2012 at 3:10
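For what it's worth, later crawler4j releases added an overridable status-code hook on WebCrawler, which avoids the second header fetch entirely. A rough sketch, assuming that hook; it is not part of 3.1, and the exact name and signature should be checked against the WebCrawler source of the version in use:

import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class StatusAwareCrawler extends WebCrawler {
    // Invoked once per fetched URL with the HTTP status code, so broken links
    // can be logged without issuing an extra fetchHeader() call.
    @Override
    protected void handlePageStatusCode(WebURL webUrl, int statusCode, String statusDescription) {
        if (statusCode >= 400) {
            System.out.println("Broken URL: " + webUrl.getURL() + " (" + statusCode + ")");
        }
    }
}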
Hi,
That would be a really good idea. It would probably solve my problem.
Tkx & Regs
Original comment by mansur.u...@gmail.com
on 20 Jan 2012 at 8:23
The problem happens when you repeatedly fetch headers via getPageFetcher() and do not make a subsequent fetchResult.discardContentIfNotConsumed() call when finished with the result. The connections keep building up until there is no more capacity for HttpClient to create new connections, and that causes the timeouts.
Original comment by jtgeo...@gmail.com
on 16 Jan 2013 at 1:44
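Applied to the shouldVisit() snippet above, that fix would look roughly like the following. This is a sketch only, assuming the same crawler4j 3.x API used in that snippet; the try/finally is added for illustration:

PageFetchResult fetchResult = null;
try {
    fetchResult = getMyController().getPageFetcher().fetchHeader(url);
    int statusCode = fetchResult.getStatusCode();
    if (statusCode != PageFetchStatus.OK && statusCode != PageFetchStatus.Moved
            && statusCode != PageFetchStatus.PageTooBig) {
        logger.info("Broken URL: " + url.getURL());
        return false;
    }
    return true;
} finally {
    // Release the connection back to HttpClient's pool so repeated header
    // fetches do not exhaust it and start timing out.
    if (fetchResult != null) {
        fetchResult.discardContentIfNotConsumed();
    }
}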
Hi,
I am using crawler4j version 3.1 with the basic example. I check the link with the following code:
controller.addSeed("http://muaban.net/ho-chi-minh/raovat/43/ban-nha/index.html");
The result is always:
INFO [Crawler 1] Exception while fetching content for: http://muaban.net/ho-chi-minh/raovat/43/ban-nha/index.html [Unexpected end of ZLIB input stream]
How can I fix this problem? Please help me in this situation.
Original comment by lethanht...@gmail.com
on 18 Feb 2013 at 4:29
I have the same problem as you. Please help me! Thanks a lot.
http://www.biframework.com.vn/page/products/id/9/phan-mem-quan-ly-doanh-nghiep.html
Original comment by phamda...@gmail.com
on 13 Sep 2013 at 11:25
I got it, thanks for your help.
Br,
Original comment by tt44...@gmail.com
on 13 Sep 2013 at 2:58
Help me!!!
Dear brother/sister: I have the import file list from the software vendor Biframework, but when I import it, the tax code (MST) column comes out blank, even though it is fully filled in in the external Excel file. Please help me with this.
http://biframeworks.com/phan-mem/phan-mem-quan-ly-doanh-nghiep-erp/4
Original comment by nhanhpt...@gmail.com
on 4 Mar 2014 at 6:25
Original issue reported on code.google.com by
mansur.u...@gmail.com
on 18 Jan 2012 at 2:24