pgiri / dispy

Distributed and Parallel Computing Framework with / for Python
https://dispy.org

JobCluster and/or SharedJobCluster #112

Open alfoa opened 6 years ago

alfoa commented 6 years ago

I was wondering if there is a way to change the "computation" function each time I invoke a submit. In my project (https://github.com/idaholab/raven) we heavily use parallel computation for several tasks during the same analysis. To do so, we need to switch from one "computation" to another relatively often. I would like to avoid having to "reinitialize" a SharedJobCluster/JobCluster every time I need to evaluate a new function (to avoid the overhead of node discovery, etc.).

Do you have any suggestions?

pgiri commented 6 years ago

See the attached program, modified from the 'sample.py' example. It submits each job with a new function. Note that dependencies sent with 'dispy_job_depends' are transient: if another job needs a function already sent over, it won't be available, so the function (in this case) needs to be sent again. If you have a bunch of functions that are needed again, they must be sent in 'depends' of the computation; those will be available for the duration of the cluster's life.

compute_py.txt
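For illustration, here is a minimal sketch of that pattern (this is not the attached file; 'dispatch', 'func1' and 'func2' are made-up names, and it assumes a dispynode is running and reachable):

import random
import dispy

def func1(n):
    import time, socket
    time.sleep(n)
    return (socket.gethostname(), 'func1', n)

def func2(n):
    import time, socket
    time.sleep(n)
    return (socket.gethostname(), 'func2', n)

def dispatch(func_name, n):
    # Functions shipped with 'dispy_job_depends' are defined in this
    # computation's globals on the node; look up the one this job needs.
    return globals()[func_name](n)

if __name__ == '__main__':
    cluster = dispy.JobCluster(dispatch)
    jobs = []
    for i in range(4):
        func = func1 if i % 2 == 0 else func2
        # Per-job dependency: 'func' is sent along with this job only and
        # is discarded afterwards, so it is re-sent for every submit.
        job = cluster.submit(func.__name__, random.randint(5, 10),
                             dispy_job_depends=[func])
        job.id = i
        jobs.append(job)
    for job in jobs:
        host, name, n = job()  # waits for job to finish and returns result
        print('%s executed job %s (%s) with %s' % (host, job.id, name, n))
    cluster.print_status()
    cluster.close()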

Alternatively, you can use pycos, which allows you to run computations with different functions; see 'dispycos_client8.py' for an example.

alfoa commented 6 years ago

Hi. Thank you very much. Yes, I was actually implementing something similar. I tried to run your modified script, but dispy just hangs without executing the jobs (I think it hangs waiting for the jobs to return:

host, n = job() # waits for job to finish and returns results

) Probably something went wrong with the setup. I installed dispy and pycos from source (cloned from GitHub) using the setup.py script. My machine is a Mac running macOS Sierra (10.12.6 (16G1036)).

Did you experience anything like this?

I tried to reinstall, but I experience the same problem.

alfoa commented 6 years ago

It hangs with this message:

2018-03-19 17:14:31 pycos - version 4.6.5 with kqueue I/O notifier
2018-03-19 17:14:31 dispy - dispy client version: 4.8.5
2018-03-19 17:14:31 dispy - Storing fault recovery information in "_dispy_20180319171431"
2018-03-19 17:14:31 dispy - dispy client at 134.20.218.80:51347

If I ask for the cluster status before trying to retrieve the jobs, I get:

                           Node |  CPUs |    Jobs |    Sec/Job | Node Time Sec
------------------------------------------------------------------------------

Jobs pending: 4
Total job time: 0.000 sec, wall time: 58.311 sec, speedup: 0.000

It looks like it does not identify the resources?

pgiri commented 6 years ago

I tested the above program with both Linux and OS X.

It seems the client is not detecting node(s). You may want to try adding the node's address or name in 'nodes', e.g., with

cluster = dispy.JobCluster(compute, nodes=['134.20.218.x'], loglevel=dispy.logger.DEBUG)

(Replace 'x' with the appropriate octet for your node.) If an IP address / name is not given, dispy uses UDP to discover nodes, and due to firewalls and/or loss of UDP data, it may fail to find them.

pgiri commented 6 years ago

If you tried the above with the GitHub clone, then please try the latest release 4.8.4 (from 'pip' or SourceForge). Apparently the GitHub version is currently not working for UDP broadcast, which is probably why the client is not detecting nodes.

alfoa commented 6 years ago

Thanks. I will try the SourceForge one and let you know.

Thanks again

alfoa commented 6 years ago

Unfortunately, it looks like the behavior does not change (pycos-4.6.5 and dispy-4.8.4, both from SourceForge). I explicitly provided the localhost hostname. It looks like the socket does not detect the processors/nodes on the local machine.

2018-03-20 11:18:21 pycos - version 4.6.5 with kqueue I/O notifier
2018-03-20 11:18:21 dispy - dispy client version: 4.8.4
2018-03-20 11:18:21 dispy - Storing fault recovery information in "_dispy_20180320111821"
2018-03-20 11:18:21 dispy - dispy client at 134.20.218.80:51347

If I try with a SharedJobCluster I get a connection refused socket error message.

I tried with a simple socket client/server program, and it works fine on localhost with the same port (attached files).

server_ok.py.txt client_ok.py.txt

I also tried to completely disable the firewall. No success.

Thank you very much for your help.

Andrea

pgiri commented 6 years ago

Is it possible another dispy client is running? Check processes, kill any left behind, and try again. You can also check whether any other process is taking up the port with 'netstat'.
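For example, something like (on macOS / Linux):

netstat -an | grep 51347

('lsof -i :51347' would also show which process, if any, owns the port.)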

pgiri commented 6 years ago

You can also try giving a different port with

cluster = dispy.JobCluster(compute, port=4567)

alfoa commented 6 years ago

I checked. No other dispy instances are running. If I check with netstat, I see that port 51347 is not taken by any other process:

Active Internet connections
Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)   
udp4 0           0    *.51347                     *.*    

Using another port (e.g., 4567) does not solve the problem.

pgiri commented 6 years ago

Hmm, I think I am a bit confused. Is the error

2018-03-20 10:58:48 dispy - Could not bind TCP server to 134.20.218.80:51347

in the email I received? The messages above seem to indicate the problem is that the client is not detecting nodes instead. If the client is not detecting nodes, make sure the 'dispynode.py' program is running on a node (e.g., on the same computer running the dispy client). It should report the number of CPUs available for dispy. Then start the client. If it is still not detecting the node, try giving '127.0.0.1' in 'nodes' as mentioned before.
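As a sketch of that workflow (the compute function here is just an illustrative placeholder):

# In one terminal, start a node on the same machine:
#   python dispynode.py
# It should report the number of CPUs it serves. Then run this client:

import dispy

def compute(n):
    import socket
    return (socket.gethostname(), n)

if __name__ == '__main__':
    cluster = dispy.JobCluster(compute, nodes=['127.0.0.1'])
    job = cluster.submit(1)
    print(job())  # blocks until the job finishes
    cluster.close()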

Below is the output of running the example attached above (compute_py.txt), for reference:

dispynode.py output:

2018-03-20 14:11:21 dispynode - dispynode version: 4.8.5, PID: 40904
2018-03-20 14:11:21 pycos - version 4.6.5 with kqueue I/O notifier
2018-03-20 14:11:21 dispynode - "Giri-Mac.lan" serving 2 cpus
2018-03-20 14:11:21 dispynode - TCP server at 192.168.1.102:51348

Enter "quit" or "exit" to terminate dispynode,
  "stop" to stop service, "start" to restart service,
  "cpus" to change CPUs used, anything else to get status: 2018-03-20 14:11:33 dispynode - New computation "06d4e0dd69cc3d0fcac9da9b3b6f32330fe4340e" from 192.168.1.102
2018-03-20 14:11:33 dispynode - New job id 4322798416 from 192.168.1.102/192.168.1.102
2018-03-20 14:11:33 dispynode - New job id 4323515088 from 192.168.1.102/192.168.1.102
2018-03-20 14:11:42 dispynode - Sending result for job 4323515088 (11) to ('192.168.1.102', 51347)
2018-03-20 14:11:42 dispynode - New job id 4323514960 from 192.168.1.102/192.168.1.102
2018-03-20 14:11:43 dispynode - Sending result for job 4322798416 (11) to ('192.168.1.102', 51347)
2018-03-20 14:11:43 dispynode - New job id 4323515600 from 192.168.1.102/192.168.1.102
2018-03-20 14:11:47 dispynode - Sending result for job 4323514960 (11) to ('192.168.1.102', 51347)
2018-03-20 14:11:52 dispynode - Sending result for job 4323515600 (11) to ('192.168.1.102', 51347)
2018-03-20 14:11:52 dispynode - Computation "06d4e0dd69cc3d0fcac9da9b3b6f32330fe4340e" from 192.168.1.102 done

and the output from client:

2018-03-20 14:11:31 pycos - version 4.6.5 with kqueue I/O notifier
2018-03-20 14:11:31 dispy - dispy client version: 4.8.5
2018-03-20 14:11:32 dispy - Storing fault recovery information in "_dispy_20180320141131"
2018-03-20 14:11:32 dispy - dispy client at 192.168.1.102:51347
2018-03-20 14:11:32 dispy - Discovered 192.168.1.102:51348 (Giri-Mac.lan) with 2 cpus
2018-03-20 14:11:33 dispy - Running job 4322798416 on 192.168.1.102
2018-03-20 14:11:33 dispy - Running job 4323515088 on 192.168.1.102
2018-03-20 14:11:33 dispy - Running job 0 / 4322798416 on 192.168.1.102 (busy: 2 / 2)
2018-03-20 14:11:33 dispy - Running job 1 / 4323515088 on 192.168.1.102 (busy: 2 / 2)
2018-03-20 14:11:42 dispy - Received reply for job 1 / 4323515088 from 192.168.1.102
2018-03-20 14:11:42 dispy - Running job 4323514960 on 192.168.1.102
2018-03-20 14:11:42 dispy - Running job 2 / 4323514960 on 192.168.1.102 (busy: 2 / 2)
2018-03-20 14:11:43 dispy - Received reply for job 0 / 4322798416 from 192.168.1.102
Giri-Mac.lan: func1 (192.168.1.102) executed job 0 at 1521569493.09 with 10
Giri-Mac.lan: func2 (192.168.1.102) executed job 1 at 1521569493.09 with 9
2018-03-20 14:11:43 dispy - Running job 4323515600 on 192.168.1.102
2018-03-20 14:11:43 dispy - Running job 3 / 4323515600 on 192.168.1.102 (busy: 2 / 2)
2018-03-20 14:11:47 dispy - Received reply for job 2 / 4323514960 from 192.168.1.102
Giri-Mac.lan: func1 (192.168.1.102) executed job 2 at 1521569502.11 with 5
2018-03-20 14:11:52 dispy - Received reply for job 3 / 4323515600 from 192.168.1.102
Giri-Mac.lan: func2 (192.168.1.102) executed job 3 at 1521569503.11 with 9

                           Node |  CPUs |    Jobs |    Sec/Job | Node Time Sec
------------------------------------------------------------------------------
 192.168.1.102 (Giri-Mac.lan)   |     2 |       4 |      8.260 |        33.040

Total job time: 33.040 sec, wall time: 19.605 sec, speedup: 1.685

2018-03-20 14:11:52 dispy - jobs run: 4
2018-03-20 14:11:52 dispy - Closing node 192.168.1.102 for compute / 1521569492515
2018-03-20 14:11:52 dispy - Shutting down scheduler ...
2018-03-20 14:11:52 dispy - Scheduler quitting: 0
2018-03-20 14:11:52 dispy - Scheduler quit
2018-03-20 14:11:52 pycos - terminating task !timer_proc/4321344864 (daemon)
2018-03-20 14:11:52 pycos - terminating task !udp_server/4325859408 (daemon)
2018-03-20 14:11:52 pycos - terminating task !tcp_server/4322824048 (daemon)
2018-03-20 14:11:52 pycos - pycos terminated
alfoa commented 6 years ago

I really want to thank you for all this effort.

The problem was not the client but the server: the dispynode.py process was being blocked and killed by my company firewall. I needed to add an exception for it, and now it works just fine.

Thank you very much.

If you are interested, I will keep you informed about our progress in moving from ParallelPython to dispy/pycos.

Thanks again.

pgiri commented 6 years ago

Great!

Yes, your progress updates / comments will be appreciated and helpful.

alfoa commented 6 years ago

Sorry to bother you again.

With the JobCluster and dispynode combination (a different type of combination), everything looks good.

I was now trying to test the SharedJobCluster and dispyscheduler.py option. Unfortunately, I get the following error when I try to run dispyscheduler.py:

Traceback (most recent call last):
  File "/Users/alfoa/miniconda2/envs/raven_libraries/bin/dispyscheduler.py", line 4, in <module>
    __import__('pkg_resources').run_script('dispy==4.8.4', 'dispyscheduler.py')
  File "/Users/alfoa/miniconda2/envs/raven_libraries/lib/python2.7/site-packages/pkg_resources/__init__.py", line 750, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/Users/alfoa/miniconda2/envs/raven_libraries/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1527, in run_script
    exec(code, namespace, namespace)
  File "/Users/alfoa/miniconda2/envs/raven_libraries/lib/python2.7/site-packages/dispy-4.8.4-py2.7.egg/EGG-INFO/scripts/dispyscheduler.py", line 2172, in <module>
    scheduler.scheduler_task.value()
AttributeError: '_Scheduler' object has no attribute 'scheduler_task'

Thanks again

pgiri commented 6 years ago

I guess you are using it as a daemon. Can you replace

scheduler.scheduler_task.value()

with

scheduler.job_scheduler_task.value()

on line 2170 in dispyscheduler.py? I will commit a fix later, but the GitHub version is not stable yet.

alfoa commented 6 years ago

Yep. Thanks. I did it. There is no crash anymore, but now there is a problem with binding:

2018-03-20 15:40:47 dispyscheduler - dispyscheduler version 4.8.4
2018-03-20 15:40:47 pycos - version 4.6.5 with kqueue I/O notifier
2018-03-20 15:40:47 dispyscheduler - Could not bind scheduler server to 134.20.218.80:51349

I will work with dispynode.py and JobCluster for now to create the infrastructure, and in case you find a fix I will grab it later.

Thanks in advance.

Edit: PS. I checked that no other process is using that port.

pgiri commented 6 years ago

Have you managed to get it working? If this issue is resolved, could you close it?

alfoa commented 6 years ago

@pgiri Hi, unfortunately I had to stop for a while. I will get back to this ASAP. If you want, you can close the issue and I will reopen it (or email you directly) in case I experience the same problems.

Thanks.

pgiri commented 6 years ago

You can email me directly, of course. I was wondering if you still have issue(s) with the above.

BTW, the current release has a 'delegate.py' example for computing different computations in each job, so I will close this issue. You can open a new one in case you have other issues.