pmolea closed this issue 8 years ago
I had a similar error yesterday and managed to fix it by creating a hostfile as described in the SCOOP docs. To find the names of the hosts in my current session, I ran:
srun bash -c "echo \$HOSTNAME"
Then you can put the names that were output into a hostfile. I hope that helps :smiley:
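For anyone who wants to script that step, here is a minimal sketch. It assumes the hostfile format is one host per line, optionally followed by a worker count, and that srun prints one hostname per task; check both against the SCOOP docs for your version. build_hostfile is a hypothetical helper, not part of SCOOP.

```python
from collections import OrderedDict


def build_hostfile(srun_output):
    """Turn the output of `srun bash -c hostname` into hostfile text.

    Assumption: each task prints its hostname once, so the number of
    times a name appears is the number of workers wanted on that host.
    """
    counts = OrderedDict()
    for line in srun_output.splitlines():
        name = line.strip()
        if name:
            counts[name] = counts.get(name, 0) + 1
    # One "hostname count" line per host, in first-seen order.
    return "".join("{0} {1}\n".format(host, n) for host, n in counts.items())


if __name__ == "__main__":
    sample = "node001\nnode002\nnode001\nnode002\n"
    print(build_hostfile(sample))
```

You would write the returned text to a file and pass that file to the launcher with its hostfile option.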
I tried to use this with:
hosts=$(srun bash -c hostname)
python -m scoop --host $hosts script.py
However, I receive the following output and the Python scripts hang; usage per core is at 1%.
[2015-08-23 22:25:13,652] launcher INFO SCOOP 0.7.1 dev on linux2 using Python 2.7.9 (default, Apr 27 2015, 11:34:09) [GCC 4.4.7 20120313 (Red Hat 4.4.7-11)], API: 1013
[2015-08-23 22:25:13,652] launcher INFO Detected SLURM environment.
[2015-08-23 22:25:13,652] launcher INFO Deploying 32 worker(s) over 3 host(s).
[2015-08-23 22:25:13,652] launcher INFO Worker distribution:
[2015-08-23 22:25:13,652] launcher INFO node001: 15 + origin
[2015-08-23 22:25:13,652] launcher INFO node002: 15 + origin
[2015-08-23 22:25:20,368] __init__ (127.0.0.1:54413) INFO Launching advertiser...
[2015-08-23 22:25:20,370] __init__ (127.0.0.1:54413) INFO Advertiser launched.
Without --host, I receive this output and the scripts work:
[2015-08-23 22:01:31,126] launcher INFO SCOOP 0.7.1 dev on linux2 using Python 2.7.9 (default, Apr 27 2015, 11:34:09) [GCC 4.4.7 20120313 (Red Hat 4.4.7-11)], API: 1013
[2015-08-23 22:01:31,126] launcher INFO Detected SLURM environment.
[2015-08-23 22:01:31,126] launcher INFO Deploying 32 worker(s) over 2 host(s).
[2015-08-23 22:01:31,126] launcher INFO Worker distribution:
[2015-08-23 22:01:31,126] launcher INFO node001: 15 + origin
[2015-08-23 22:01:31,127] launcher INFO node002: 16
There seems to be an error in your parser. Ending the --host argument with another argument, e.g. -v, works:
hosts=$(srun bash -c hostname)
python -m scoop --host $hosts -v script.py
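My guess at what is happening, as a sketch: if --host is declared as a greedy list option, the script name that follows gets swallowed as one more host, which would match the "3 host(s)" in the log above. The parser below is a hypothetical stand-in built with plain argparse, not SCOOP's actual launcher code.

```python
import argparse


def build_parser():
    # Hypothetical parser for illustration only (NOT SCOOP's real one).
    parser = argparse.ArgumentParser(prog="launcher-sketch")
    # nargs="*" consumes every following token that does not look like an option.
    parser.add_argument("--host", nargs="*", default=[])
    parser.add_argument("-v", action="count", default=0)
    parser.add_argument("executable", nargs="?")
    return parser


if __name__ == "__main__":
    parser = build_parser()

    # script.py is taken as a third host and no executable is left over.
    swallowed = parser.parse_args(["--host", "node001", "node002", "script.py"])
    print(swallowed.host, swallowed.executable)

    # Any option after the host list ends it, so script.py is parsed correctly.
    separated = parser.parse_args(["--host", "node001", "node002", "-v", "script.py"])
    print(separated.host, separated.executable)
```

If that is the cause, any option placed between the host list and the script name should work as a separator, not only -v.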
This seems to be the same issue as #26.
I'm trying to use SCOOP on a cluster that uses SLURM. I'm trying to run the helloworld example from the documentation. I've run the example on the head node with a few CPUs and it works (so the installation seems correct, at least up to a point), but when I run it through sbatch it returns the following error:
EXECUTE PYTHON .PY FILE
Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/user/.local/lib/python2.7/site-packages/scoop/__main__.py", line 21, in <module>
    main()
  File "/home/user/.local/lib/python2.7/site-packages/scoop/launcher.py", line 454, in main
    args.external_hostname = [utils.externalHostname(hosts)]
  File "/home/user/.local/lib/python2.7/site-packages/scoop/utils.py", line 101, in externalHostname
    hostname = hosts[0][0]
IndexError: list index out of range
END OF JOBS
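Reading the traceback, hosts[0][0] can only raise IndexError if the host list (or its first entry) is empty, so it looks like no hosts were detected inside the sbatch job. A small sketch of that failure mode follows. It assumes hosts is a list of (hostname, worker count) pairs, which I am inferring only from the indexing in the traceback; first_hostname is a hypothetical helper, not SCOOP's function.

```python
def first_hostname(hosts):
    """Return the hostname of the first entry, with a clearer error.

    Assumption: `hosts` is a list of (hostname, workers) pairs.
    """
    if not hosts or not hosts[0]:
        # hosts[0][0] on an empty list is what produces
        # "IndexError: list index out of range".
        raise ValueError("no hosts detected; pass a host list or hostfile explicitly")
    return hosts[0][0]


if __name__ == "__main__":
    print(first_hostname([("node001", 16), ("node002", 16)]))
    try:
        first_hostname([])
    except ValueError as exc:
        print("error:", exc)
```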
In the documentation I read that SCOOP is compatible with SLURM. Is there a particular configuration step that is not documented (the SSH keys are already configured)?
Thanks,