Open Pas0691 opened 3 years ago
Hi! I’m going through and cleaning up old/stale issues on this repo. Sorry for not responding in a reasonable amount of time!
Can you run the job submit command yourself (outside ipcluster) and maybe get better feedback from there? IPCluster has a habit of hiding the useful errors from the underlying system, but the generated ipcontroller_job.xml should still exist after the submission failed.
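For what it's worth, here is a minimal sketch of re-running that submit step by hand from Python so the scheduler's own output is not swallowed. The paths and the empty scheduler value are copied from the traceback in this issue; the helper names `build_submit_command` and `run_submit` are made up for illustration and are not part of ipyparallel.

```python
import subprocess


def build_submit_command(job_cmd, job_file, scheduler):
    """Build the same argument list that appears in the traceback."""
    return [job_cmd, "submit", f"/jobfile:{job_file}", f"/scheduler:{scheduler}"]


def run_submit(cmd):
    """Run the command without check=True so stdout and stderr are kept.

    Returns (returncode, stdout, stderr); returncode is None if the
    executable could not be started at all.
    """
    try:
        result = subprocess.run(cmd, capture_output=True, text=True)
    except OSError as exc:  # e.g. job.exe not found at that path
        return None, "", str(exc)
    return result.returncode, result.stdout, result.stderr


if __name__ == "__main__":
    cmd = build_submit_command(
        r"C:\Program Files\Microsoft HPC Pack 2016\Bin\job.EXE",
        r"C:\Users\xxx\.ipython\profile_default\ipcontroller_job.xml",
        "",  # empty, exactly as in the traceback
    )
    code, out, err = run_submit(cmd)
    print("exit status:", code)
    print("stdout:", out)
    print("stderr:", err)
```

Whatever job.exe prints to stdout or stderr here is the message that ipcluster did not show.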
Hey guys, very, very cool job so far.
I'm not quite sure if this is a huge issue, but I wasn't able to find a solution by myself.
Goal: I want to set up a Python cluster on a Windows HPC cluster.
Installed SW: Windows Server 2012 on the head node, HPC Pack 2016 for cluster management, and Anaconda for managing Python.
What I have done so far: Installed all ipcluster dependencies and got a cluster (ipcluster start -n 2) working without issues. I have not established connections to any engines yet; I thought that would minimize potential sources of error.
Anyway, when I try to use the WindowsHPC controller, the cluster does not start up but fails with:
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\envs\pythoncluster\lib\site-packages\ipyparallel\apps\ipclusterapp.py", line 543, in start_controller
self.controller_launcher.start()
File "C:\ProgramData\Anaconda3\envs\pythoncluster\lib\site-packages\ipyparallel\apps\launcher.py", line 973, in start
return super(WindowsHPCControllerLauncher, self).start(1)
File "C:\ProgramData\Anaconda3\envs\pythoncluster\lib\site-packages\ipyparallel\apps\launcher.py", line 914, in start
output = check_output([self.job_cmd] + args,
File "C:\ProgramData\Anaconda3\envs\pythoncluster\lib\subprocess.py", line 411, in check_output
return run(popenargs, stdout=PIPE, timeout=timeout, check=True,
File "C:\ProgramData\Anaconda3\envs\pythoncluster\lib\subprocess.py", line 512, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['C:\Program Files\Microsoft HPC Pack 2016\Bin\job.EXE', 'submit', '/jobfile:C:\Users\xxx\.ipython\profile_default\ipcontroller_job.xml', '/scheduler:']' returned non-zero exit status 1.
ERROR:tornado.application:Exception in callback functools.partial(<function IPClusterStart.start.<locals>.start at 0x000000B31324A670>)
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\envs\pythoncluster\lib\site-packages\tornado\ioloop.py", line 743, in _run_callback
ret = callback()
File "C:\ProgramData\Anaconda3\envs\pythoncluster\lib\site-packages\ipyparallel\apps\ipclusterapp.py", line 588, in start
self.start_controller()
File "C:\ProgramData\Anaconda3\envs\pythoncluster\lib\site-packages\ipyparallel\apps\ipclusterapp.py", line 543, in start_controller
self.controller_launcher.start()
File "C:\ProgramData\Anaconda3\envs\pythoncluster\lib\site-packages\ipyparallel\apps\launcher.py", line 973, in start
return super(WindowsHPCControllerLauncher, self).start(1)
File "C:\ProgramData\Anaconda3\envs\pythoncluster\lib\site-packages\ipyparallel\apps\launcher.py", line 914, in start
output = check_output([self.job_cmd] + args,
File "C:\ProgramData\Anaconda3\envs\pythoncluster\lib\subprocess.py", line 411, in check_output
return run(popenargs, stdout=PIPE, timeout=timeout, check=True,
File "C:\ProgramData\Anaconda3\envs\pythoncluster\lib\subprocess.py", line 512, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['C:\Program Files\Microsoft HPC Pack 2016\Bin\job.EXE', 'submit', '/jobfile:C:\Users\xxx\.ipython\profile_default\ipcontroller_job.xml', '/scheduler:']' returned non-zero exit status 1.
I thought about wrong paths, but unfortunately that wasn't the problem. I guess the problem isn't that big, but I couldn't dig down to the source. I tried to highlight the most interesting part of the message.
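One detail that stands out in the failing command: the last argument is '/scheduler:' with nothing after the colon. Whether that is the actual cause is not confirmed anywhere in this thread, so treat it as a guess. A small check along these lines (the helper name `find_empty_options` is hypothetical) lists arguments that were passed without a value:

```python
def find_empty_options(args):
    """Return arguments of the form '/name:' that have no value after the colon."""
    return [a for a in args if a.startswith("/") and a.endswith(":")]


# Argument list copied from the CalledProcessError message above.
failed_args = [
    r"C:\Program Files\Microsoft HPC Pack 2016\Bin\job.EXE",
    "submit",
    r"/jobfile:C:\Users\xxx\.ipython\profile_default\ipcontroller_job.xml",
    "/scheduler:",
]

print(find_empty_options(failed_args))  # prints ['/scheduler:']
```

If the launcher fills that argument from a configurable scheduler hostname (an assumption worth verifying against launcher.py), setting it to the head node's name in the profile's ipcluster_config.py might be worth trying.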