jiraken / crawler4j

Automatically exported from code.google.com/p/crawler4j

Make crawler4j distributed #61

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1.
2.
3.

What is the expected output? What do you see instead?

What version of the product are you using? On what operating system?
2.6.1

Please provide any additional information below.
There is support for multithreaded crawling, but no support for splitting the 
crawling load across multiple processes. A distributed crawler would be a great 
feature to have.

Original issue reported on code.google.com by nishant....@gmail.com on 26 Jul 2011 at 5:44

GoogleCodeExporter commented 8 years ago
I think it should be enough to change where crawler4j stores the queue and to 
offer atomic access to the stored queue.
Currently it stores the queue on the local disk. If the queue were stored in a 
remote folder, the URLs it contains would be partitioned among the several 
crawler4j instances.
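
A minimal sketch of that idea, to make "atomic access" concrete. This is not crawler4j code: the class name `SharedDirQueue`, the `pending`/`claimed` directory layout, and the one-file-per-URL format are all assumptions made for illustration. Each instance claims a URL by atomically renaming its file, so at most one instance wins each URL. Whether a rename is truly atomic on a given remote or network filesystem would need to be checked.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Optional;
import java.util.UUID;

/** Hypothetical shared queue: one file per URL in a directory every instance can see. */
class SharedDirQueue {
    private final Path root;
    private final Path pending;
    private final Path claimed;

    SharedDirQueue(Path root, String workerId) throws IOException {
        this.root = root;
        this.pending = Files.createDirectories(root.resolve("pending"));
        this.claimed = Files.createDirectories(root.resolve("claimed").resolve(workerId));
    }

    /** Writes the URL to a temp file first, then renames it into pending/ so readers never see a partial file. */
    void offer(String url) throws IOException {
        String name = UUID.randomUUID().toString();
        Path tmp = root.resolve(name + ".tmp");
        Files.write(tmp, url.getBytes(StandardCharsets.UTF_8));
        Files.move(tmp, pending.resolve(name), StandardCopyOption.ATOMIC_MOVE);
    }

    /** Claims one URL, or returns empty if nothing is pending. */
    Optional<String> poll() throws IOException {
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(pending)) {
            for (Path entry : entries) {
                Path mine = claimed.resolve(entry.getFileName());
                try {
                    // The rename is the claim: only one instance can move a given file.
                    Files.move(entry, mine, StandardCopyOption.ATOMIC_MOVE);
                } catch (NoSuchFileException lostRace) {
                    continue; // another instance claimed this URL first
                }
                return Optional.of(new String(Files.readAllBytes(mine), StandardCharsets.UTF_8));
            }
        }
        return Optional.empty();
    }
}
```

Each crawler instance would construct its own `SharedDirQueue` on the same root with a distinct worker id. The sketch does not deduplicate URLs or recover URLs claimed by an instance that later crashes; both would be needed in practice.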

Original comment by giovanni...@gmail.com on 7 Oct 2013 at 4:56

GoogleCodeExporter commented 8 years ago

Original comment by avrah...@gmail.com on 18 Aug 2014 at 3:08

GoogleCodeExporter commented 8 years ago

Original comment by avrah...@gmail.com on 18 Aug 2014 at 3:10

GoogleCodeExporter commented 8 years ago
Is anybody interested in pursuing this with me?

Original comment by r.hamn...@gmail.com on 27 Aug 2014 at 10:19

GoogleCodeExporter commented 8 years ago
I am not there yet.

I still have a lot on my plate to make crawler4j more stable before optimizing it.

But please go ahead; I am willing to help as much as I can.

Please note that I have planned this:
https://code.google.com/p/crawler4j/issues/detail?id=271

I planned it with the original author (Yasser). It will change the architecture 
quite a bit and is supposed to give the crawler a huge boost.

Original comment by avrah...@gmail.com on 28 Aug 2014 at 5:13