Closed pgcudahy closed 1 year ago
I changed my hosts.yml to
hosts:
r209u16n01:
address: pgc29@r209u16n01.mccleary.ycrc.yale.edu
paths:
home: /home/pgc29
project: /gpfs/gibbs/project/cudahy/pgc29
sos: /home/pgc29/project/conda_envs/sos/bin/sos
mccleary_scavenge:
description: McCleary day / scavenge queue
address: pgc29@mccleary.ycrc.yale.edu
paths:
home: /home/pgc29
project: /gpfs/gibbs/project/cudahy/pgc29
sos: /home/pgc29/project/conda_envs/sos/bin/sos
...
But sos remote test still complains that
DEBUG: Path /gpfs/gibbs/project/cudahy/pgc29 is not under any specified paths of localhost and is mapped to /gpfs/gibbs/project/cudahy/pgc29 on remote host.
Alias: r209u16n01
Address: pgc29@r209u16n01.mccleary.ycrc.yale.edu
Queue Type: process
ssh: OK
scp: OK
sos: OK
paths: No path_map between local and remote host.
shared: shared directory / not in path_map
And real jobs still fail in the same way. My notebook is running in a directory within /gpfs/gibbs/project/cudahy/pgc29 but I think it's somehow failing to write the .tmp_script.sos file?
The first thing to check is whether the head node and all computing nodes have access to /gpfs/gibbs/project/cudahy/pgc29.
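One quick way to run that check on a given node is a small script like the one below. This is only a sketch: `check_dir_access` is a hypothetical helper, not part of sos, and it reports what the current user on the current node can see.

```python
import os
import socket


def check_dir_access(path):
    """Report whether `path` is usable from the node this runs on."""
    return {
        "host": socket.gethostname(),
        "path": path,
        # Resolves symlinks, e.g. a home-directory link into a project drive.
        "real_path": os.path.realpath(path),
        "exists": os.path.isdir(path),
        "readable": os.access(path, os.R_OK | os.X_OK),
        "writable": os.access(path, os.W_OK),
    }


if __name__ == "__main__":
    # Replace os.getcwd() with the directory you want to verify.
    print(check_dir_access(os.getcwd()))
```

Running it once on the head node and once inside a job on a computing node makes it easy to compare the two views of the same directory.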
@pgcudahy Just to clarify, did you [...] task mechanism, or [...]? The trouble with option 1 is that worker nodes need ssh access to the head node, which is not always feasible.
Thanks for your help with this. I checked and the directory is available to all nodes. I'm submitting a single node job with
%sosrun test -v4 \
-c /home/pgc29/.sos/hosts.yml \
-q mccleary_scavenge \
-r mccleary_scavenge \
mem="4GB" cores=1 walltime="24:00:00" nodes=1
[test]
output: f'/home/pgc29/test.out'
task: walltime='00:02:00', mem='1G', cores=1, nodes=1,
      workdir='/home/pgc29'
run: expand=True
    touch {_output}
Fails with
$ cat .sos/workflows/wc47a7467aab75a37.err
DEBUG: Failed to report to monitor process: cannot access local variable 'm' where it is not associated with a value
Traceback (most recent call last):
File "/gpfs/gibbs/project/cudahy/pgc29/conda_envs/sos/lib/python3.11/site-packages/sos/__main__.py", line 582, in cmd_run
script = SoS_Script(filename=args.script)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/gpfs/gibbs/project/cudahy/pgc29/conda_envs/sos/lib/python3.11/site-packages/sos/parser.py", line 862, in __init__
content, self.sos_script = locate_script(filename, start=".")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/gpfs/gibbs/project/cudahy/pgc29/conda_envs/sos/lib/python3.11/site-packages/sos/utils.py", line 917, in locate_script
raise ValueError(f"Failed to locate {filename}")
ValueError: Failed to locate /gpfs/gibbs/project/cudahy/pgc29/helen_mixed_infection/notebooks/.tmp_script_qfbs3foy.sos
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/gpfs/gibbs/project/cudahy/pgc29/conda_envs/sos/lib/python3.11/site-packages/sos/__main__.py", line 2847, in main
args.func(args, workflow_args)
File "/gpfs/gibbs/project/cudahy/pgc29/conda_envs/sos/lib/python3.11/site-packages/sos/__main__.py", line 725, in cmd_run
env.logger.error(str(e))
^
UnboundLocalError: cannot access local variable 'e' where it is not associated with a value
ERROR: cannot access local variable 'e' where it is not associated with a value
Let me try to reproduce this, but as I said above, there are two ways to run a multi-node job on the cluster:
1. qsub cores=1 sos run -q cluster. This requires the head node to be accessible from computing nodes, which is not the case on my cluster.
2. qsub cores=10 sos run -q none -j 10. This will allocate 10 nodes and let sos distribute jobs directly to computing nodes. This is supposed to be faster for a large number of smaller tasks since there is no overhead of creating and monitoring a large number of tasks.
Let me see if I can make your example work with both options.
One thing I realized was wrong was that I should be using shared: rather than paths: in my hosts.yml, so I changed the localhost to
hosts:
r209u16n01:
address: pgc29@localhost
shared:
home: /vast/palmer/home.mccleary/pgc29/
project: /gpfs/gibbs/project/cudahy/pgc29/
scratch60: /vast/palmer/scratch/cudahy/pgc29/
sos: /gpfs/gibbs/project/cudahy/pgc29/conda_envs/sos/bin/sos
But when I run !sos remote -v4 -c /home/pgc29/.sos/hosts.yml test r209u16n01 I still get
DEBUG: Path /gpfs/gibbs/project/cudahy/pgc29 is not under any specified paths of localhost and is mapped to /gpfs/gibbs/project/cudahy/pgc29 on remote host.
DEBUG: Path /gpfs/gibbs/project/cudahy/pgc29 is not under any specified paths of localhost and is mapped to /gpfs/gibbs/project/cudahy/pgc29 on remote host.
Alias: r209u02n01
Address: pgc29@localhost
Queue Type: process
ssh: OK
scp: OK
sos: OK
paths: No path_map between local and remote host.
shared: shared directory / not in path_map
I did notice that this is run from a notebook within /gpfs/gibbs/project/cudahy/pgc29, and when I move the notebook to /vast/palmer/home.mccleary/pgc29/, !sos remote -v4 -c /home/pgc29/.sos/hosts.yml test r209u16n01 complains with
DEBUG: Path /vast/palmer/home.mccleary/pgc29 is not under any specified paths of localhost and is mapped to /vast/palmer/home.mccleary/pgc29 on remote host.
DEBUG: Path /vast/palmer/home.mccleary/pgc29 is not under any specified paths of localhost and is mapped to /vast/palmer/home.mccleary/pgc29 on remote host.
Alias: r209u02n01
Address: pgc29@localhost
Queue Type: process
ssh: OK
scp: OK
sos: OK
paths: No path_map between local and remote host.
shared: shared directory / not in path_map
Even with these changes I still get the same Failed to locate /gpfs/gibbs/project/cudahy/pgc29/helen_mixed_infection/notebooks/.tmp_script_qfbs3foy.sos error when submitting real jobs.
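For readers following the DEBUG message above ("not under any specified paths of localhost and is mapped to ... on remote host"), the behaviour it describes looks like prefix-based path translation. The sketch below illustrates that general idea only; `map_path` and its arguments are hypothetical and this is not sos's actual implementation. It assumes absolute POSIX paths.

```python
import os


def map_path(path, local_paths, remote_paths):
    """Translate a local path to a remote one by swapping a named prefix.

    local_paths / remote_paths: dicts of name -> absolute directory.
    Returns (mapped_path, matched_name); matched_name is None when the
    path is not under any configured local directory, in which case the
    path is passed through unchanged.
    """
    path = os.path.normpath(path)
    best = None
    for name, local_dir in local_paths.items():
        if name not in remote_paths:
            continue
        local_dir = os.path.normpath(local_dir)
        # commonpath compares whole path components, not raw characters.
        if os.path.commonpath([path, local_dir]) == local_dir:
            if best is None or len(local_dir) > len(best[1]):
                best = (name, local_dir)  # prefer the longest matching prefix
    if best is None:
        return path, None
    name, local_dir = best
    rel = os.path.relpath(path, local_dir)
    return os.path.normpath(os.path.join(remote_paths[name], rel)), name


if __name__ == "__main__":
    local = {"home": "/home/pgc29"}
    remote = {"home": "/home/pgc29"}
    print(map_path("/home/pgc29/test.out", local, remote))
    print(map_path("/gpfs/gibbs/project/cudahy/pgc29", local, remote))
```

In this sketch the second call matches no configured prefix, so the path is passed through as-is, which is the situation the DEBUG line reports.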
@pgcudahy Could you try to execute the same workflow from the Jupyter command line? Basically, could you please create a test.sos file with the [test] workflow you have, open a terminal from JupyterLab, and execute the workflow with sos run -r -q ... test.sos? Right now I suspect that sos-notebook removes the temporary script before the remote host reads and executes it.
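The suspected failure mode is a cleanup race: the temporary script is deleted before a slower consumer opens it. The sketch below reproduces that pattern in isolation. It is an illustration under assumptions, not sos-notebook's code; `READER` and `run_with_temp_script` are hypothetical names, and a delayed local subprocess stands in for the remote host.

```python
import os
import subprocess
import sys
import tempfile

# Stand-in for a slow remote runner: waits, then tries to read the script.
READER = "import sys, time; time.sleep(0.5); open(sys.argv[1]).read()"


def run_with_temp_script(content, remove_early):
    """Write a temp script, start a delayed reader, then clean up.

    Returns the reader's exit code (0 if it found the script).
    """
    fd, path = tempfile.mkstemp(prefix=".tmp_script_", suffix=".sos")
    with os.fdopen(fd, "w") as fh:
        fh.write(content)
    proc = subprocess.Popen(
        [sys.executable, "-c", READER, path],
        stderr=subprocess.DEVNULL,
    )
    try:
        if remove_early:
            os.remove(path)  # cleanup races ahead of the reader
        return proc.wait()
    finally:
        if os.path.exists(path):
            os.remove(path)  # safe: the reader has already finished


if __name__ == "__main__":
    print("remove after reader exits:", run_with_temp_script("[test]\n", False))
    print("remove immediately:       ", run_with_temp_script("[test]\n", True))
```

The general remedy in this pattern is to tie deletion to the consumer's completion rather than to the moment the job is handed off.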
Yup, submitting that way works just fine
OK, then this is a problem with sos-notebook. I am patching sos-notebook now.
@pgcudahy Please let me know if the problem has been addressed with sos-notebook 0.24.1.
Works well! Thanks so much for helping me with these edge cases.
Hello again, I'm now trying to work through issues with porting my setup to a new cluster, have hit a wall, and was hoping you could help. Because SSH requires two-factor authentication, I have to keep the Jupyter notebook, workflow, and tasks all on the cluster. When I launch the Jupyter notebook it runs on a random node, so I adapt the localhost entry of my hosts.yml file to that node. I also have a more general section for submitting jobs to the cluster.
But when I try to run !sos remote -v4 -c /home/pgc29/.sos/hosts.yml test mccleary_scavenge it complains a bit. First it complains about /gpfs/gibbs/project/cudahy/pgc29, which is the true path of the drive that is linked to /home/pgc29/project. Then rsync fails with code 24. I looked it up, and an rsync return code 24 means "the files were existing while rsync was building the list of files to transfer. But they were removed before transferring".
When I try to submit a real job, it is submitted but fails with
ValueError: Failed to locate /gpfs/gibbs/project/cudahy/pgc29/helen_mixed_infection/notebooks/.tmp_script_qv1pm_su.sos
Any idea why it isn't able to find these tmp files?
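Since /home/pgc29/project is described above as a link to /gpfs/gibbs/project/cudahy/pgc29, one thing worth illustrating is how a resolved path can fall outside a directory that the unresolved path appears to be under. The sketch below builds a throwaway directory tree to show the effect. It is illustrative only: the directory names are stand-ins, `is_under` and `demo` are hypothetical helpers, and it assumes a POSIX system where symlinks can be created.

```python
import os
import tempfile


def is_under(path, root):
    """True if `path` equals `root` or lies below it (textual check only)."""
    path, root = os.path.abspath(path), os.path.abspath(root)
    return os.path.commonpath([path, root]) == root


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        base = os.path.realpath(tmp)
        real_project = os.path.join(base, "gpfs", "project")  # stand-in for the /gpfs/... target
        home = os.path.join(base, "home")                      # stand-in for /home/pgc29
        os.makedirs(real_project)
        os.makedirs(home)
        link = os.path.join(home, "project")                   # stand-in for /home/pgc29/project
        os.symlink(real_project, link)
        resolved = os.path.realpath(link)
        # The link looks like it is under home; its resolved target is not.
        return is_under(link, home), is_under(resolved, home)


if __name__ == "__main__":
    print(demo())
```

Any tool that resolves the working directory before comparing it with configured directories would see the second form, which may be why the /gpfs/... path shows up in the messages even though hosts.yml lists /home/pgc29.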