jesbin / crawler4j

Automatically exported from code.google.com/p/crawler4j

Unable to parse the entire structure of a website (hitam.org) using BaseCrawler code #211

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Use the sample BasicCrawler code from crawler4j version 3.5.
2. Change the seed to controller.addSeed("http://hitam.org/");
3. Run the code.

What is the expected output? What do you see instead?

I expected the crawler to visit the home page and then follow every outgoing 
link it finds there, crawling each linked page in turn.

In short, I expected BasicCrawler.java to be called for every link identified 
on the home page.

What version of the product are you using?
3.5

Please provide any additional information below.
I tried the same thing on other sites, such as bestbuy, and it did not work 
either. The same code works fine for the www.ics.uci.edu site.
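
One possible explanation, offered as an assumption and not confirmed anywhere 
in this thread: the sample BasicCrawler decides which links to follow in its 
shouldVisit method, and the shipped sample only accepts URLs that start with 
http://www.ics.uci.edu/. If only the seed is changed, links on any other site 
would be rejected by that check. The sketch below reproduces the filter logic 
in plain Java without the crawler4j classes. The class name SeedScopeSketch, 
the allowedPrefix parameter, the shortened extension list, and the example 
URLs are hypothetical.

```java
import java.util.regex.Pattern;

public class SeedScopeSketch {
    // Skip common static and binary resources (shortened, illustrative list).
    private static final Pattern FILTERS =
        Pattern.compile(".*\\.(css|js|gif|jpe?g|png|pdf|zip|gz)$");

    // Mirrors the shape of a shouldVisit check: reject filtered file types,
    // then accept only URLs that start with the allowed prefix.
    static boolean shouldVisit(String url, String allowedPrefix) {
        String href = url.toLowerCase();
        return !FILTERS.matcher(href).matches() && href.startsWith(allowedPrefix);
    }

    public static void main(String[] args) {
        String link = "http://hitam.org/about.html"; // hypothetical page URL

        // Prefix left at the sample's original site: the link is rejected.
        System.out.println(shouldVisit(link, "http://www.ics.uci.edu/"));

        // Prefix changed to match the new seed: the link is accepted.
        System.out.println(shouldVisit(link, "http://hitam.org/"));
    }
}
```

If this is the cause, changing the prefix inside shouldVisit to match the new 
seed should let the crawler follow links on that site.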

Original issue reported on code.google.com by susheel....@gmail.com on 26 Mar 2013 at 8:02

GoogleCodeExporter commented 8 years ago
Hi,

It would be really helpful if you could point me to the right documentation or 
let me know what the problem is.

We are in the process of finalizing which crawler to use.

Thanks,
Susheel

Original comment by susheel....@corp.247customer.com on 1 Apr 2013 at 7:26

GoogleCodeExporter commented 8 years ago

Original comment by avrah...@gmail.com on 18 Aug 2014 at 3:36

GoogleCodeExporter commented 8 years ago
I tested Bestbuy and hitam, and both work OK.

Please try again and find a specific page that doesn't get crawled so we can 
investigate.

Original comment by avrah...@gmail.com on 20 Aug 2014 at 1:44

GoogleCodeExporter commented 8 years ago
Works for me

Original comment by avrah...@gmail.com on 23 Sep 2014 at 2:07