ipython / ipyparallel

IPython Parallel: Interactive Parallel Computing in Python
https://ipyparallel.readthedocs.io/

HPC Cluster Problems #428

Open Pas0691 opened 3 years ago

Pas0691 commented 3 years ago

Hey guys, very cool job so far.

I'm not quite sure if this is a huge issue, but I wasn't able to find a solution by myself.

Goal: I want to set up a Python cluster on a Windows HPC cluster.

Installed software: Windows Server 2012 on the head node, HPC Pack 2016 for cluster management, and Anaconda for managing Python.

What I have done so far: installed all ipcluster dependencies and got a cluster (ipcluster start -n 2) working without issues. I have not connected to any engines yet; I thought that would minimize the potential sources of error.

Anyway, when I try to use the WindowsHPC controller, the cluster does not start up but fails with:

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\envs\pythoncluster\lib\site-packages\ipyparallel\apps\ipclusterapp.py", line 543, in start_controller
    self.controller_launcher.start()
  File "C:\ProgramData\Anaconda3\envs\pythoncluster\lib\site-packages\ipyparallel\apps\launcher.py", line 973, in start
    return super(WindowsHPCControllerLauncher, self).start(1)
  File "C:\ProgramData\Anaconda3\envs\pythoncluster\lib\site-packages\ipyparallel\apps\launcher.py", line 914, in start
    output = check_output([self.job_cmd] + args,
  File "C:\ProgramData\Anaconda3\envs\pythoncluster\lib\subprocess.py", line 411, in check_output
    return run(popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "C:\ProgramData\Anaconda3\envs\pythoncluster\lib\subprocess.py", line 512, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['C:\Program Files\Microsoft HPC Pack 2016\Bin\job.EXE', 'submit', '/jobfile:C:\Users\xxx\.ipython\profile_default\ipcontroller_job.xml', '/scheduler:']' returned non-zero exit status 1.

ERROR:tornado.application:Exception in callback functools.partial(<function IPClusterStart.start.<locals>.start at 0x000000B31324A670>)
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\envs\pythoncluster\lib\site-packages\tornado\ioloop.py", line 743, in _run_callback
    ret = callback()
  File "C:\ProgramData\Anaconda3\envs\pythoncluster\lib\site-packages\ipyparallel\apps\ipclusterapp.py", line 588, in start
    self.start_controller()
  File "C:\ProgramData\Anaconda3\envs\pythoncluster\lib\site-packages\ipyparallel\apps\ipclusterapp.py", line 543, in start_controller
    self.controller_launcher.start()
  File "C:\ProgramData\Anaconda3\envs\pythoncluster\lib\site-packages\ipyparallel\apps\launcher.py", line 973, in start
    return super(WindowsHPCControllerLauncher, self).start(1)
  File "C:\ProgramData\Anaconda3\envs\pythoncluster\lib\site-packages\ipyparallel\apps\launcher.py", line 914, in start
    output = check_output([self.job_cmd] + args,
  File "C:\ProgramData\Anaconda3\envs\pythoncluster\lib\subprocess.py", line 411, in check_output
    return run(popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "C:\ProgramData\Anaconda3\envs\pythoncluster\lib\subprocess.py", line 512, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['C:\Program Files\Microsoft HPC Pack 2016\Bin\job.EXE', 'submit', '/jobfile:C:\Users\xxx\.ipython\profile_default\ipcontroller_job.xml', '/scheduler:']' returned non-zero exit status 1.

I suspected wrong paths, but unfortunately that wasn't the problem. I guess the problem isn't that big, but I couldn't trace it to its source. I tried to highlight the most interesting part of the message.

minrk commented 3 years ago

Hi! I’m going through and cleaning up old/stale issues on this repo. Sorry for not responding in a reasonable amount of time!

Can you run the job submit command yourself (outside ipcluster) and maybe get better feedback from there? IPCluster has a habit of hiding the useful errors from the underlying system, but the generated ipcontroller_job.xml should still exist after the failed submission.
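
For anyone who prefers to do that from Python rather than a command prompt, here is a minimal sketch. It assumes the executable and job-file paths shown in the traceback, and the helper names build_submit_cmd and run_and_report are made up for illustration; they are not part of ipyparallel. The point is only to run the same command without check=True so that whatever job.EXE prints to stdout and stderr is kept instead of being swallowed by the CalledProcessError. The scheduler argument defaults to an empty string because that is how '/scheduler:' appears in the logged command.

```python
import subprocess
from pathlib import Path

# Paths copied from the traceback; adjust them for your own install and user.
JOB_EXE = r"C:\Program Files\Microsoft HPC Pack 2016\Bin\job.EXE"
JOB_FILE = r"C:\Users\xxx\.ipython\profile_default\ipcontroller_job.xml"


def build_submit_cmd(job_exe, job_file, scheduler=""):
    """Rebuild the argument list shown in the CalledProcessError message.

    Hypothetical helper: the empty default mirrors the logged '/scheduler:'.
    """
    return [job_exe, "submit", f"/jobfile:{job_file}", f"/scheduler:{scheduler}"]


def run_and_report(cmd):
    """Run cmd without check=True and return (exit status, stdout, stderr)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr


# Only try the real submission when job.EXE is actually present.
if __name__ == "__main__" and Path(JOB_EXE).exists():
    code, out, err = run_and_report(build_submit_cmd(JOB_EXE, JOB_FILE))
    print("exit status:", code)
    print("stdout:", out)
    print("stderr:", err)
```

If the scheduler reports something more specific than "exit status 1", it should show up in the printed stdout or stderr.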