Scale crawler to a client/server design aiming for full distributed system support

spritt82 / harvestman-crawler

Automatically exported from code.google.com/p/harvestman-crawler


Scale crawler to a client/server design aiming for full distributed system support #18

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago

Current behaviour:

The crawler currently runs only on a single machine.

Desired behaviour:

The crawler must be able to run on multiple machines in parallel,
transparently to the user (enabled, disabled, and configured in the XML
config file). One possible solution is to use a library such as Pyro:
http://pyro.sourceforge.net/. Staying with Python is desirable for consistency.
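
As a rough illustration of the "enabled/disabled/set in the XML config file" idea, here is a minimal sketch of reading such a switch with the standard library. The `distributed` and `worker` elements, their attributes, and the function name are hypothetical, chosen only for this example; they are not taken from HarvestMan's actual config.xml schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical config fragment: element and attribute names are
# illustrative only, not HarvestMan's actual config.xml schema.
SAMPLE_CONFIG = """
<config>
  <distributed enabled="1">
    <worker host="192.168.0.10" port="9090"/>
    <worker host="192.168.0.11" port="9090"/>
  </distributed>
</config>
"""


def read_distributed_settings(xml_text):
    """Return (enabled, [(host, port), ...]) parsed from the config text."""
    root = ET.fromstring(xml_text)
    node = root.find("distributed")
    if node is None or node.get("enabled", "0") != "1":
        # Section missing or switched off: stay in single-machine mode.
        return False, []
    workers = [(w.get("host"), int(w.get("port")))
               for w in node.findall("worker")]
    return True, workers
```

With a layout like this, a user who never adds the section keeps today's single-machine behaviour, which is one way to keep the feature transparent.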

Original issue reported on code.google.com by andrei.p...@gmail.com on 17 Jul 2008 at 9:33

GoogleCodeExporter commented 8 years ago

I am changing the title to "Scale crawler to a client/server design aiming for
full distributed system support". The client/server split is more important
than any full-fledged distributed design using Pyro, since it already allows
the crawler to scale to 2 machines.

I will start working on the client/server design soon, splitting the
application classes into client and server code.

Original comment by abpil...@gmail.com on 6 Oct 2008 at 11:21

Changed title: Scale crawler to a client/server design aiming for full distributed system support

GoogleCodeExporter commented 8 years ago

Original comment by abpil...@gmail.com on 6 Oct 2008 at 11:21

Added labels: Type-Task
Removed labels: Type-Defect

GoogleCodeExporter commented 8 years ago

Original comment by abpil...@gmail.com on 6 Oct 2008 at 11:21

Added labels: Priority-Critical
Removed labels: Priority-Medium

GoogleCodeExporter commented 8 years ago

How would this look?
Would the server parse config.xml, split the crawling job into sets of
subpages, and distribute those sets to its slave computers?
Or something else?
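
To make the question concrete, the "split and distribute" step could be sketched as below. This is only an illustration under the assumption that the list of subpage URLs is known before the split; `partition` is a hypothetical helper, not an existing HarvestMan function.

```python
def partition(urls, n_workers):
    """Split urls into n_workers lists, assigning them round-robin."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    buckets = [[] for _ in range(n_workers)]
    for i, url in enumerate(urls):
        # URL number i goes to worker i modulo n_workers.
        buckets[i % n_workers].append(url)
    return buckets
```

A static split like this only covers URLs known up front. Since a crawler discovers new links as it runs, a queue held by the server that workers pull from might fit better.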

Original comment by szybal...@gmail.com on 12 Oct 2008 at 5:25

GoogleCodeExporter commented 8 years ago

Why not use the multiprocessing module from the Python 2.6 standard library?
http://docs.python.org/library/multiprocessing.html#module-multiprocessing

I quote:
"multiprocessing is a package that supports spawning processes using an API
similar to the threading module. The multiprocessing package offers both local
and remote concurrency, effectively side-stepping the Global Interpreter Lock
by using subprocesses instead of threads. Due to this, the multiprocessing
module allows the programmer to fully leverage multiple processors on a given
machine. It runs on both Unix and Windows."
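
For the local case, a minimal sketch of what the quote describes could look like this. `clean_url` is a hypothetical stand-in for the real per-URL work (fetching and parsing), and `process_all` is an illustrative name; neither comes from the HarvestMan code base.

```python
import multiprocessing


def clean_url(url):
    """Stand-in for real per-URL work: trim whitespace and a trailing slash."""
    return url.strip().rstrip("/")


def process_all(urls, workers=2):
    """Apply clean_url to every URL using a pool of worker processes."""
    pool = multiprocessing.Pool(processes=workers)
    try:
        # map() splits the list among the workers and keeps input order.
        return pool.map(clean_url, urls)
    finally:
        pool.close()
        pool.join()
```

When used from a script, the call to `process_all` should sit under an `if __name__ == "__main__":` guard, which the multiprocessing documentation requires on Windows. This covers several processes on one machine only; spreading work over several machines would need the remote facilities of the package or something like Pyro.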

Original comment by andrei.p...@gmail.com on 12 Oct 2008 at 4:04

GoogleCodeExporter commented 8 years ago

This is an interesting angle. I had not thought of using something from the
standard Python library so far.

Btw, this is only available from Python 2.6 onwards, so this feature won't
work with 2.4 <= Python < 2.6. Still, I think it is a great suggestion. I will
read the docs and update.

Original comment by abpil...@gmail.com on 13 Oct 2008 at 5:23

GoogleCodeExporter commented 8 years ago

The library is also available for older Python versions. For Python 2.6 it was
renamed and had some bugs fixed:

http://pyinsci.blogspot.com/2008/09/python-processing.html

http://pypi.python.org/pypi/processing
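
If both cases have to be supported, one possible approach is a fallback import. This is a sketch that assumes the older `processing` package from the PyPI link above is installed on interpreters before 2.6; the alias `mp` is just a name picked for this example.

```python
try:
    import multiprocessing as mp   # standard library from Python 2.6 onwards
except ImportError:
    import processing as mp        # older interpreters: the package from PyPI
```

The two packages may not be drop-in compatible for every call, since the library changed when it was renamed, so anything beyond the basics would need to be checked against both.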

Original comment by andrei.p...@gmail.com on 13 Oct 2008 at 7:43