soravux / scoop

SCOOP (Scalable COncurrent Operations in Python)
https://github.com/soravux/scoop
GNU Lesser General Public License v3.0

Error with SLURM #25

Closed pmolea closed 8 years ago

pmolea commented 9 years ago

I'm trying to use scoop on a cluster that uses SLURM, starting with the helloworld example from the documentation. The example works when I run it on the head node with a few CPUs (so the installation seems correct, at least up to some level), but when I run it through sbatch it returns the following error:

EXECUTE PYTHON .PY FILE
Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/user/.local/lib/python2.7/site-packages/scoop/__main__.py", line 21, in <module>
    main()
  File "/home/user/.local/lib/python2.7/site-packages/scoop/launcher.py", line 454, in main
    args.external_hostname = [utils.externalHostname(hosts)]
  File "/home/user/.local/lib/python2.7/site-packages/scoop/utils.py", line 101, in externalHostname
    hostname = hosts[0][0]
IndexError: list index out of range
END OF JOBS

I read in the documentation that scoop is compatible with SLURM. Is there a particular configuration step that is not documented? (The SSH keys are already configured.)

Thanks,
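
The last frame of the traceback fails on `hosts[0][0]`, which suggests the host list was empty when it reached `externalHostname`. The sketch below only illustrates that failure mode with a guard around it; `first_hostname` is a hypothetical helper, not SCOOP's code, and the `(hostname, worker_count)` tuple layout is an assumption.

```python
def first_hostname(hosts):
    """Return the name of the first host in `hosts`.

    `hosts` is assumed (hypothetically) to be a list of
    (hostname, worker_count) tuples. An empty list would make
    `hosts[0][0]` raise IndexError, so it is rejected up front
    with a more descriptive error.
    """
    if not hosts:
        raise ValueError("no hosts were detected; the host list is empty")
    return hosts[0][0]
```

With a populated list such as `[("node001", 16)]` the helper returns `"node001"`; with `[]` it raises `ValueError` instead of the bare `IndexError` shown in the traceback.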

inJeans commented 9 years ago

I had a similar error yesterday and managed to fix it by creating a hostfile as described in the scoop docs. To find out the names of the hosts in my current session I just ran

srun bash -c "echo \$HOSTNAME"

then you can just put the names that were output into a hostfile. I hope that helps :smiley:
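
Turning the `srun` output into a hostfile can be scripted. The following is a minimal sketch, assuming `srun` prints one hostname per task and that the hostfile takes one `hostname worker_count` entry per line, as the SCOOP documentation describes; `hostfile_lines` is a hypothetical helper name.

```python
from collections import Counter


def hostfile_lines(srun_output):
    """Convert `srun bash -c "echo \\$HOSTNAME"` output into hostfile lines.

    Each non-empty line of `srun_output` is taken to be one hostname,
    printed once per task. Repeated names are collapsed into a single
    "hostname count" entry, sorted by hostname.
    """
    counts = Counter(
        line.strip() for line in srun_output.splitlines() if line.strip()
    )
    return ["{} {}".format(host, n) for host, n in sorted(counts.items())]


# Example: two tasks ran on node001 and one on node002.
sample = "node001\nnode001\nnode002\n"
print("\n".join(hostfile_lines(sample)))
```

The printed lines can be saved to a file and passed to SCOOP with `--hostfile`.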

croessert commented 9 years ago

I tried to use this with:

hosts=$(srun bash -c hostname)
python -m scoop --host $hosts script.py

However I receive the following output and the python scripts are stuck, usage per core is at 1%.

[2015-08-23 22:25:13,652] launcher  INFO    SCOOP 0.7.1 dev on linux2 using Python 2.7.9 (default, Apr 27 2015, 11:34:09) [GCC 4.4.7 20120313 (Red Hat 4.4.7-11)], API: 1013
[2015-08-23 22:25:13,652] launcher  INFO    Detected SLURM environment.
[2015-08-23 22:25:13,652] launcher  INFO    Deploying 32 worker(s) over 3 host(s).
[2015-08-23 22:25:13,652] launcher  INFO    Worker distribution:
[2015-08-23 22:25:13,652] launcher  INFO       node001:       15 + origin
[2015-08-23 22:25:13,652] launcher  INFO       node002:       15 + origin
[2015-08-23 22:25:20,368] __init__  (127.0.0.1:54413) INFO    Launching advertiser...
[2015-08-23 22:25:20,370] __init__  (127.0.0.1:54413) INFO    Advertiser launched.

Without --host I receive this output and the scripts are working:

[2015-08-23 22:01:31,126] launcher  INFO    SCOOP 0.7.1 dev on linux2 using Python 2.7.9 (default, Apr 27 2015, 11:34:09) [GCC 4.4.7 20120313 (Red Hat 4.4.7-11)], API: 1013
[2015-08-23 22:01:31,126] launcher  INFO    Detected SLURM environment.
[2015-08-23 22:01:31,126] launcher  INFO    Deploying 32 worker(s) over 2 host(s).
[2015-08-23 22:01:31,126] launcher  INFO    Worker distribution: 
[2015-08-23 22:01:31,126] launcher  INFO       node001: 15 + origin
[2015-08-23 22:01:31,127] launcher  INFO       node002: 16 

croessert commented 9 years ago

There seems to be an error in your argument parser. Ending the --host list with another argument, e.g. -v, works:

hosts=$(srun bash -c hostname)
python -m scoop --host $hosts -v script.py
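
One plausible explanation for this behavior is that an option accepting a variable number of values also swallows the script name when nothing separates the two, which would match the "3 host(s)" count in the log above. The sketch below reproduces that effect with Python's standard `argparse`. It is an illustration only: this parser is hypothetical and is not SCOOP's actual launcher code.

```python
import argparse

# Hypothetical parser, NOT SCOOP's own: an option taking any number of
# host names, a verbosity flag, and an optional positional script name.
parser = argparse.ArgumentParser(prog="demo")
parser.add_argument("--host", nargs="*", default=[])
parser.add_argument("-v", action="store_true")
parser.add_argument("executable", nargs="?")

# Without a separating option, --host greedily consumes the script name too.
greedy = parser.parse_args(["--host", "node001", "node002", "script.py"])

# An option such as -v ends the --host list, so script.py is parsed
# as the positional argument.
separated = parser.parse_args(
    ["--host", "node001", "node002", "-v", "script.py"]
)

print(greedy.host, greedy.executable)
print(separated.host, separated.executable)
```

In the first call `script.py` ends up in the host list and no executable is set; in the second, `-v` terminates the list and `script.py` is recognized as the script.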

soravux commented 8 years ago

This seems to be the same issue as #26.