alfoa opened this issue 6 years ago (status: Open)

I was wondering if there is a way to change the "computation" function each time I invoke submit. In my project (https://github.com/idaholab/raven) we make heavy use of parallel computation for several tasks during the same analysis. To do so, we need to switch from one "computation" to another relatively often. I would like to avoid reinitializing a SharedJobCluster/JobCluster every time I need to evaluate a new function, so as to avoid the overhead of node discovery, etc.
Do you have any suggestion?
See the attached program, modified from the 'sample.py' example. This submits each job with a new function. Note that dependencies sent with 'dispy_job_depends' are transient: if another job needs a function already sent over, it won't be available, so the function (in this case) needs to be sent again. If you have a bunch of functions that are needed again, those must be sent in 'depends' of the computation; they will be available for the duration of the cluster's life.
Alternately, you can use pycos, which allows you to run computations with different functions; see 'dispycos_client8.py' for an example.
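A minimal sketch of this pattern (assumed names: 'dispatch' as the fixed computation, 'func1' / 'func2' as per-job functions, 'shared_helper' as a function every job needs):

import dispy

def shared_helper(x):
    # sent once via 'depends'; available for the cluster's lifetime
    return x * x

def func1(n):
    return shared_helper(n) + 1

def func2(n):
    return shared_helper(n) - 1

def dispatch(func_name, n):
    # fixed computation: looks up the function shipped with this
    # particular job via 'dispy_job_depends' and calls it
    return globals()[func_name](n)

if __name__ == '__main__':
    cluster = dispy.JobCluster(dispatch, depends=[shared_helper])
    jobs = []
    for func in (func1, func2):
        # each submit ships its own function along with the job
        jobs.append(cluster.submit(func.__name__, 10,
                                   dispy_job_depends=[func]))
    for job in jobs:
        print(job())  # waits for the job and prints its result
    cluster.close()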
Hi. Thank you very much. Yes, I was actually implementing something similar. I tried to run your modified script, but dispy just hangs without executing the jobs. I think it hangs while waiting for the jobs to return:
host, n = job() # waits for job to finish and returns results
Probably something went wrong with my setup.
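For reference, my driver follows the submit/collect pattern from dispy's 'sample.py', roughly like this (a minimal sketch; my actual compute function differs):

import dispy

def compute(n):
    # toy computation from sample.py: sleep, then report host and n
    import time, socket
    time.sleep(n)
    return (socket.gethostname(), n)

if __name__ == '__main__':
    cluster = dispy.JobCluster(compute)
    jobs = [cluster.submit(i) for i in range(4)]
    for job in jobs:
        host, n = job()  # this is the call that never returns for me
        print('%s executed job with %s' % (host, n))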
I installed dispy and pycos from source (cloning from github) and using the setup.py script.
My machine is a Mac, macOS Sierra (10.12.6 (16G1036)).
Did you experience anything like this?
I tried to re-install, but I experience the same problem.
It hangs with this message:
2018-03-19 17:14:31 pycos - version 4.6.5 with kqueue I/O notifier
2018-03-19 17:14:31 dispy - dispy client version: 4.8.5
2018-03-19 17:14:31 dispy - Storing fault recovery information in "_dispy_20180319171431"
2018-03-19 17:14:31 dispy - dispy client at 134.20.218.80:51347
If I ask for the cluster status before trying to retrieve the jobs, I get:
Node | CPUs | Jobs | Sec/Job | Node Time Sec
------------------------------------------------------------------------------
Jobs pending: 4
Total job time: 0.000 sec, wall time: 58.311 sec, speedup: 0.000
It looks like it does not identify the resources?
I tested the above program with both Linux and OS X.
It seems the client is not detecting node(s). You may want to try adding the node's address or name in 'nodes', e.g., with
cluster = dispy.JobCluster(compute, nodes=['134.20.218.x'], loglevel=dispy.logger.DEBUG)
(Replace 'x' with the appropriate octet for your node.) If an IP address / name is not given, dispy uses UDP to discover nodes, and due to firewalls and / or loss of UDP data it may fail to find them.
If you tried the above with the github clone, then please try with the latest release 4.8.4 (from 'pip' or sourceforge). Apparently the github version is currently not working for UDP broadcast, which is probably why the client is not detecting nodes.
Thanks. I will try to use the sourceforge one and let you know.
Thanks again.
Unfortunately it looks like the behavior does not change (pycos-4.6.5 and dispy-4.8.4, both from sourceforge). I explicitly provided the localhost hostname. It looks like the socket does not detect the processors/nodes on the local machine.
2018-03-20 11:18:21 pycos - version 4.6.5 with kqueue I/O notifier
2018-03-20 11:18:21 dispy - dispy client version: 4.8.4
2018-03-20 11:18:21 dispy - Storing fault recovery information in "_dispy_20180320111821"
2018-03-20 11:18:21 dispy - dispy client at 134.20.218.80:51347
If I try with a SharedJobCluster I get a connection refused socket error message.
I tried with a simple socket client/server program and it works fine in the localhost and same port (attached files).
server_ok.py.txt client_ok.py.txt
I also tried to completely disable the firewall. No success.
Thank you very much for your help.
Andrea
Is it possible another dispy client is running? Check processes, kill any left behind, and try again. You can also check whether any other process is taking up the port with 'netstat'.
You can also try giving different port with
cluster = dispy.JobCluster(compute, port=4567)
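If the netstat output is hard to interpret, a quick way to check from Python whether a port is free (a minimal sketch, independent of dispy; adjust the address to the one the client reports):

import socket

# try to bind the port dispy would use; an error means something holds it
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    sock.bind(('127.0.0.1', 51347))
    print('port 51347 is free')
except socket.error as exc:
    print('port 51347 is taken: %s' % exc)
finally:
    sock.close()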
I checked. No other dispy instances are running.
If I check with netstat, I see that port 51347 is taken by no other process:
Active Internet connections
Proto Recv-Q Send-Q Local Address Foreign Address (state)
udp4 0 0 *.51347 *.*
Using another port (e.g. 4567) does not solve the problem.
Hmm, I think I am a bit confused. Is the error
2018-03-20 10:58:48 dispy - Could not bind TCP server to 134.20.218.80:51347
in the email I received? It seems the messages above indicate the problem is that the client is not detecting nodes instead. If the client is not detecting nodes, make sure the 'dispynode.py' program is running on a node (e.g., on the same computer running the dispy client). It should report the number of CPUs available for dispy. Then start the client. If it is still not detecting the node, try giving '127.0.0.1' in 'nodes' as mentioned before.
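For example (a minimal sketch; start dispynode.py on the same machine first):

import dispy

def compute(n):
    return n * 2

if __name__ == '__main__':
    cluster = dispy.JobCluster(compute, nodes=['127.0.0.1'],
                               loglevel=dispy.logger.DEBUG)
    job = cluster.submit(21)
    print(job())  # prints 42 if the node was detected and ran the job
    cluster.close()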
Below is the output of running the example attached above (for reference):
dispynode.py output:
2018-03-20 14:11:21 dispynode - dispynode version: 4.8.5, PID: 40904
2018-03-20 14:11:21 pycos - version 4.6.5 with kqueue I/O notifier
2018-03-20 14:11:21 dispynode - "Giri-Mac.lan" serving 2 cpus
2018-03-20 14:11:21 dispynode - TCP server at 192.168.1.102:51348
Enter "quit" or "exit" to terminate dispynode,
"stop" to stop service, "start" to restart service,
"cpus" to change CPUs used, anything else to get status: 2018-03-20 14:11:33 dispynode - New computation "06d4e0dd69cc3d0fcac9da9b3b6f32330fe4340e" from 192.168.1.102
2018-03-20 14:11:33 dispynode - New job id 4322798416 from 192.168.1.102/192.168.1.102
2018-03-20 14:11:33 dispynode - New job id 4323515088 from 192.168.1.102/192.168.1.102
2018-03-20 14:11:42 dispynode - Sending result for job 4323515088 (11) to ('192.168.1.102', 51347)
2018-03-20 14:11:42 dispynode - New job id 4323514960 from 192.168.1.102/192.168.1.102
2018-03-20 14:11:43 dispynode - Sending result for job 4322798416 (11) to ('192.168.1.102', 51347)
2018-03-20 14:11:43 dispynode - New job id 4323515600 from 192.168.1.102/192.168.1.102
2018-03-20 14:11:47 dispynode - Sending result for job 4323514960 (11) to ('192.168.1.102', 51347)
2018-03-20 14:11:52 dispynode - Sending result for job 4323515600 (11) to ('192.168.1.102', 51347)
2018-03-20 14:11:52 dispynode - Computation "06d4e0dd69cc3d0fcac9da9b3b6f32330fe4340e" from 192.168.1.102 done
and the output from the client:
2018-03-20 14:11:31 pycos - version 4.6.5 with kqueue I/O notifier
2018-03-20 14:11:31 dispy - dispy client version: 4.8.5
2018-03-20 14:11:32 dispy - Storing fault recovery information in "_dispy_20180320141131"
2018-03-20 14:11:32 dispy - dispy client at 192.168.1.102:51347
2018-03-20 14:11:32 dispy - Discovered 192.168.1.102:51348 (Giri-Mac.lan) with 2 cpus
2018-03-20 14:11:33 dispy - Running job 4322798416 on 192.168.1.102
2018-03-20 14:11:33 dispy - Running job 4323515088 on 192.168.1.102
2018-03-20 14:11:33 dispy - Running job 0 / 4322798416 on 192.168.1.102 (busy: 2 / 2)
2018-03-20 14:11:33 dispy - Running job 1 / 4323515088 on 192.168.1.102 (busy: 2 / 2)
2018-03-20 14:11:42 dispy - Received reply for job 1 / 4323515088 from 192.168.1.102
2018-03-20 14:11:42 dispy - Running job 4323514960 on 192.168.1.102
2018-03-20 14:11:42 dispy - Running job 2 / 4323514960 on 192.168.1.102 (busy: 2 / 2)
2018-03-20 14:11:43 dispy - Received reply for job 0 / 4322798416 from 192.168.1.102
Giri-Mac.lan: func1 (192.168.1.102) executed job 0 at 1521569493.09 with 10
Giri-Mac.lan: func2 (192.168.1.102) executed job 1 at 1521569493.09 with 9
2018-03-20 14:11:43 dispy - Running job 4323515600 on 192.168.1.102
2018-03-20 14:11:43 dispy - Running job 3 / 4323515600 on 192.168.1.102 (busy: 2 / 2)
2018-03-20 14:11:47 dispy - Received reply for job 2 / 4323514960 from 192.168.1.102
Giri-Mac.lan: func1 (192.168.1.102) executed job 2 at 1521569502.11 with 5
2018-03-20 14:11:52 dispy - Received reply for job 3 / 4323515600 from 192.168.1.102
Giri-Mac.lan: func2 (192.168.1.102) executed job 3 at 1521569503.11 with 9
Node | CPUs | Jobs | Sec/Job | Node Time Sec
------------------------------------------------------------------------------
192.168.1.102 (Giri-Mac.lan) | 2 | 4 | 8.260 | 33.040
Total job time: 33.040 sec, wall time: 19.605 sec, speedup: 1.685
2018-03-20 14:11:52 dispy - jobs run: 4
2018-03-20 14:11:52 dispy - Closing node 192.168.1.102 for compute / 1521569492515
2018-03-20 14:11:52 dispy - Shutting down scheduler ...
2018-03-20 14:11:52 dispy - Scheduler quitting: 0
2018-03-20 14:11:52 dispy - Scheduler quit
2018-03-20 14:11:52 pycos - terminating task !timer_proc/4321344864 (daemon)
2018-03-20 14:11:52 pycos - terminating task !udp_server/4325859408 (daemon)
2018-03-20 14:11:52 pycos - terminating task !tcp_server/4322824048 (daemon)
2018-03-20 14:11:52 pycos - pycos terminated
I really want to thank you for all this effort.
The problem was not the client but the server. The dispynode.py process was blocked and killed by my company's firewall. I needed to add an exception for it, and now it works just fine.
Thank you very much.
If you are interested, I will keep you informed about our progress in moving from ParallelPython to dispy/pycos.
Thanks again.
Great!
Yes, your progress reports and comments will be appreciated and helpful.
Sorry to bother again.
With the JobCluster and dispynode combination, everything looks good.
I was now trying to test the SharedJobCluster and dispyscheduler.py option; the pattern I am testing is sketched below.
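This is roughly what I am running (a minimal sketch; 'scheduler_node' is assumed to be the machine where dispyscheduler.py runs):

import dispy

def compute(n):
    return n * 2

if __name__ == '__main__':
    # dispyscheduler.py must already be running on the scheduler node
    cluster = dispy.SharedJobCluster(compute, scheduler_node='127.0.0.1')
    job = cluster.submit(10)
    print(job())
    cluster.close()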
Unfortunately, I get the following error when I try to run dispyscheduler.py:
Traceback (most recent call last):
File "/Users/alfoa/miniconda2/envs/raven_libraries/bin/dispyscheduler.py", line 4, in <module>
__import__('pkg_resources').run_script('dispy==4.8.4', 'dispyscheduler.py')
File "/Users/alfoa/miniconda2/envs/raven_libraries/lib/python2.7/site-packages/pkg_resources/__init__.py", line 750, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/Users/alfoa/miniconda2/envs/raven_libraries/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1527, in run_script
exec(code, namespace, namespace)
File "/Users/alfoa/miniconda2/envs/raven_libraries/lib/python2.7/site-packages/dispy-4.8.4-py2.7.egg/EGG-INFO/scripts/dispyscheduler.py", line 2172, in <module>
scheduler.scheduler_task.value()
AttributeError: '_Scheduler' object has no attribute 'scheduler_task'
Thanks again
I guess you are using it as a daemon. Can you replace
scheduler.scheduler_task.value()
with
scheduler.job_scheduler_task.value()
on line 2170 in dispyscheduler.py? I will commit a fix later, but github is not stable yet.
Yep, thanks, I did it. No crash anymore, but now a problem with binding:
2018-03-20 15:40:47 dispyscheduler - dispyscheduler version 4.8.4
2018-03-20 15:40:47 pycos - version 4.6.5 with kqueue I/O notifier
2018-03-20 15:40:47 dispyscheduler - Could not bind scheduler server to 134.20.218.80:51349
I will work with dispynode.py and JobCluster for now to create the infrastructure, and in case you find a fix I will grab it later.
Thanks in advance.
Edit: PS. I checked that no other process is using that port.
Have you managed to get it working? If this issue is resolved, could you close it?
@pgiri Hi, unfortunately I had to stop for a while. I will get back to this ASAP. If you want, you can close the issue and I will reopen it, or email you directly, in case I experience the same problems.
Thanks.
You can email me directly, of course. I was wondering if you still have issue(s) with the above.
BTW, the current release has a 'delegate.py' example to run different computations in each job, so I will close this issue. You can open a new one in case you have other issues.