vatlab / sos

SoS workflow system for daily data analysis
http://vatlab.github.io/sos-docs
BSD 3-Clause "New" or "Revised" License

Missing tmp files #1513

Closed pgcudahy closed 1 year ago

pgcudahy commented 1 year ago

Hello again. I'm working through issues porting my setup to a new cluster, have hit a wall, and was hoping you could help. Because SSH requires two-factor authentication, I have to keep the Jupyter notebook, the workflow, and the tasks all on the cluster. When I launch the Jupyter notebook it runs on a random node, so I adapt the localhost entry of my hosts.yml file to that node.

localhost: r209u16n01
hosts:
  r209u16n01:
    address: pgc29@r209u16n01.mccleary.ycrc.yale.edu
    paths:
        home: /home/pgc29
        scratch: /home/pgc29/palmer_scratch
    sos: /home/pgc29/project/conda_envs/sos/bin/sos

And then I have a more general section for submitting jobs to the cluster

  mccleary_scavenge:
    description: McCleary day / scavenge queue
    address: pgc29@mccleary.ycrc.yale.edu
    paths:
        home: /home/pgc29
        scratch: /home/pgc29/palmer_scratch
    sos: /home/pgc29/project/conda_envs/sos/bin/sos
    kill_cmd: scancel {job_id}
    max_cores: 1000
    max_mem: 20000G
    max_running_jobs: 200
    max_walltime: '24:00:00'
    queue_type: pbs
    status_check_interval: 30
    status_cmd: squeue --job {job_id}
    submit_cmd: sbatch {job_file}
    submit_cmd_output: Submitted batch job {job_id}
    task_template: |
        #!/bin/bash
        #SBATCH --time={walltime}
        #SBATCH --nodes={nodes}
        #SBATCH --cpus-per-task={cores}
        #SBATCH --mem-per-cpu={mem // cores // 1000000000}G
        #SBATCH --job-name={task}
        #SBATCH --output=/home/pgc29/.sos/tasks/{task}.out
        #SBATCH --error=/home/pgc29/.sos/tasks/{task}.err
        #SBATCH --partition=day,scavenge
        source /vast/palmer/home.mccleary/pgc29/.bashrc
        conda activate /gpfs/gibbs/project/cudahy/pgc29/conda_envs/sos
        cd {workdir}
        {command}
    workflow_template: |
        #!/bin/bash
        #SBATCH --time={walltime}
        #SBATCH --nodes={nodes}
        #SBATCH --ntasks-per-node={cores}
        #SBATCH --mem={mem}
        #SBATCH --job-name={job_name}
        #SBATCH --output=/home/pgc29/.sos/workflows/{job_name}.out
        #SBATCH --error=/home/pgc29/.sos/workflows/{job_name}.err
        #SBATCH --partition=day,scavenge
        source /vast/palmer/home.mccleary/pgc29/.bashrc
        conda activate /gpfs/gibbs/project/cudahy/pgc29/conda_envs/sos
        {command}

But when I try to run !sos remote -v4 -c /home/pgc29/.sos/hosts.yml test mccleary_scavenge it complains a bit with

DEBUG: Path /gpfs/gibbs/project/cudahy/pgc29 is not under any specified paths of localhost and is mapped to /gpfs/gibbs/project/cudahy/pgc29 on remote host.
Alias:       mccleary_scavenge
Address:     pgc29@mccleary.ycrc.yale.edu
Queue Type:  pbs
ssh:         OK
scp:         OK
sos:         OK
paths:       Failed to receive file from remote host /home/pgc29/: Failed to copy /home/pgc29/.sos_test_26971.txt from mccleary_scavenge using command "rsync -a --no-g -e 'ssh -o 'ControlMaster=auto' -o 'ControlPath=/home/pgc29/.ssh/controlmasters/%r@%h:%p' -o 'ControlPersist=10m' -p 22' pgc29@mccleary.ycrc.yale.edu:/home/pgc29/.sos_test_26971.txt "/home/pgc29"": command return 24
shared:      OK (shared )

First it complains about /gpfs/gibbs/project/cudahy/pgc29, which is the real path of the filesystem that /home/pgc29/project links to. Then rsync fails with code 24. I looked it up, and rsync return code 24 means the source files existed while rsync was building the list of files to transfer but were removed before they could be transferred.
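
For reference, the same copy step can be tested by hand with something like the following (just a sketch: the test file name here is made up and the ssh control-master options are dropped for brevity):

# create a throwaway file in the remote home, then pull it back the same way sos does
ssh pgc29@mccleary.ycrc.yale.edu 'touch /home/pgc29/.sos_manual_test.txt'
rsync -a --no-g -e 'ssh -p 22' \
    pgc29@mccleary.ycrc.yale.edu:/home/pgc29/.sos_manual_test.txt /home/pgc29/
echo "rsync exit code: $?"   # 24 would mean the source file vanished before transfer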

When I try to submit a real job, the generated job script looks like

#!/bin/bash
#SBATCH --time=24:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=4GB
#SBATCH --job-name=w4d79b3d8c033d5df
#SBATCH --output=/home/pgc29/.sos/workflows/w4d79b3d8c033d5df.out
#SBATCH --error=/home/pgc29/.sos/workflows/w4d79b3d8c033d5df.err
#SBATCH --partition=day,scavenge
source /vast/palmer/home.mccleary/pgc29/.bashrc
conda activate /gpfs/gibbs/project/cudahy/pgc29/conda_envs/sos
sos run /gpfs/gibbs/project/cudahy/pgc29/helen_mixed_infection/notebooks/.tmp_script_qv1pm_su.sos run_vcfProcess_gatk -v4 -q mccleary_scavenge -c ~/.sos/config_mccleary_scavenge.yml -M w4d79b3d8c033d5df

and it fails with ValueError: Failed to locate /gpfs/gibbs/project/cudahy/pgc29/helen_mixed_infection/notebooks/.tmp_script_qv1pm_su.sos

Any idea why it isn't able to find these tmp files?

pgcudahy commented 1 year ago

I changed my hosts.yml to

hosts:
  r209u16n01:
    address: pgc29@r209u16n01.mccleary.ycrc.yale.edu
    paths:
        home: /home/pgc29
        project: /gpfs/gibbs/project/cudahy/pgc29
    sos: /home/pgc29/project/conda_envs/sos/bin/sos
  mccleary_scavenge:
    description: McCleary day / scavenge queue
    address: pgc29@mccleary.ycrc.yale.edu
    paths:
        home: /home/pgc29
        project: /gpfs/gibbs/project/cudahy/pgc29
    sos: /home/pgc29/project/conda_envs/sos/bin/sos
...

But sos remote test still complains that

DEBUG: Path /gpfs/gibbs/project/cudahy/pgc29 is not under any specified paths of localhost and is mapped to /gpfs/gibbs/project/cudahy/pgc29 on remote host.
Alias:       r209u16n01
Address:     pgc29@r209u16n01.mccleary.ycrc.yale.edu
Queue Type:  process
ssh:         OK
scp:         OK
sos:         OK
paths:       No path_map between local and remote host.
shared:      shared directory / not in path_map

And real jobs still fail in the same way. My notebook is running in a directory within /gpfs/gibbs/project/cudahy/pgc29, so is it somehow failing to write the .tmp_script.sos file?

BoPeng commented 1 year ago

The first thing to check is whether the head node and all computing nodes have access to /gpfs/gibbs/project/cudahy/pgc29.
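
For example, something along these lines (just a sketch, assuming Slurm's srun and the day partition are available to you) would confirm that the path is visible from a compute node:

# run a one-off command on a compute node to check that the GPFS path is visible there
srun --partition=day --nodes=1 --time=00:01:00 ls -ld /gpfs/gibbs/project/cudahy/pgc29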

BoPeng commented 1 year ago

@pgcudahy Just to clarify, did you

  1. submit a single-node job to a working node, then let the working node submit more jobs using the task mechanism, or
  2. submit a multi-node job and let sos distribute jobs to all nodes?

The trouble with option 1 is that the working nodes need SSH access to the head node, which is not always feasible.

pgcudahy commented 1 year ago

Thanks for your help with this. I checked, and the directory is available to all nodes. I'm submitting a single-node job with

%sosrun test -v4 \
-c /home/pgc29/.sos/hosts.yml \
-q mccleary_scavenge \
-r mccleary_scavenge \
mem="4GB" cores=1 walltime="24:00:00" nodes=1

[test]
output: f'/home/pgc29/test.out'
task: walltime='00:02:00', mem='1G', cores=1, nodes=1, 
    workdir='/home/pgc29'

run: expand=True
    touch {_output}

Fails with

$ cat .sos/workflows/wc47a7467aab75a37.err 
DEBUG: Failed to report to monitor process: cannot access local variable 'm' where it is not associated with a value
  File "/gpfs/gibbs/project/cudahy/pgc29/conda_envs/sos/lib/python3.11/site-packages/sos/__main__.py", line 582, in cmd_run
    script = SoS_Script(filename=args.script)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Traceback (most recent call last):
  File "/gpfs/gibbs/project/cudahy/pgc29/conda_envs/sos/lib/python3.11/site-packages/sos/__main__.py", line 582, in cmd_run
    script = SoS_Script(filename=args.script)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/gibbs/project/cudahy/pgc29/conda_envs/sos/lib/python3.11/site-packages/sos/parser.py", line 862, in __init__
    content, self.sos_script = locate_script(filename, start=".")
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/gibbs/project/cudahy/pgc29/conda_envs/sos/lib/python3.11/site-packages/sos/utils.py", line 917, in locate_script
    raise ValueError(f"Failed to locate {filename}")
ValueError: Failed to locate /gpfs/gibbs/project/cudahy/pgc29/helen_mixed_infection/notebooks/.tmp_script_qfbs3foy.sos
  File "/gpfs/gibbs/project/cudahy/pgc29/conda_envs/sos/lib/python3.11/site-packages/sos/__main__.py", line 2847, in main
    args.func(args, workflow_args)
Traceback (most recent call last):
  File "/gpfs/gibbs/project/cudahy/pgc29/conda_envs/sos/lib/python3.11/site-packages/sos/__main__.py", line 582, in cmd_run
    script = SoS_Script(filename=args.script)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/gibbs/project/cudahy/pgc29/conda_envs/sos/lib/python3.11/site-packages/sos/parser.py", line 862, in __init__
    content, self.sos_script = locate_script(filename, start=".")
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/gibbs/project/cudahy/pgc29/conda_envs/sos/lib/python3.11/site-packages/sos/utils.py", line 917, in locate_script
    raise ValueError(f"Failed to locate {filename}")
ValueError: Failed to locate /gpfs/gibbs/project/cudahy/pgc29/helen_mixed_infection/notebooks/.tmp_script_qfbs3foy.sos

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/gpfs/gibbs/project/cudahy/pgc29/conda_envs/sos/lib/python3.11/site-packages/sos/__main__.py", line 2847, in main
    args.func(args, workflow_args)
  File "/gpfs/gibbs/project/cudahy/pgc29/conda_envs/sos/lib/python3.11/site-packages/sos/__main__.py", line 725, in cmd_run
    env.logger.error(str(e))
                         ^
UnboundLocalError: cannot access local variable 'e' where it is not associated with a value
ERROR: cannot access local variable 'e' where it is not associated with a value

BoPeng commented 1 year ago

Let me try to reproduce this, but as I said above, there are two ways to run a multi-node job on the cluster:

  1. (pseudo command) qsub cores=1 sos run -q cluster. This requires the head node to be accessible from computing nodes, which is not the case on my cluster.
  2. qsub cores=10 sos run -q none -j 10. This will allocate 10 nodes and let sos distribute jobs directly to the computing nodes. This is supposed to be faster for many smaller jobs, since there is no overhead of creating and monitoring a large number of tasks.

Let me see if I can make your example work with both options.
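
In sbatch terms, the two options look roughly like this (only a sketch; workflow.sos stands in for your script and the resource flags are illustrative, not the exact commands sos generates):

# option 1: small single-core outer job; sos on the compute node submits tasks back to the queue
sbatch --cpus-per-task=1 --wrap "sos run workflow.sos -q mccleary_scavenge"
# option 2: one larger allocation; sos executes tasks itself inside it (-q none, -j 10)
sbatch --nodes=1 --ntasks=10 --wrap "sos run workflow.sos -q none -j 10"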

pgcudahy commented 1 year ago

One thing I realized was wrong is that I should be using shared: rather than paths: in my hosts.yml, so I changed the localhost entry to

hosts:
  r209u16n01:
    address: pgc29@localhost
    shared:
      home: /vast/palmer/home.mccleary/pgc29/
      project: /gpfs/gibbs/project/cudahy/pgc29/
      scratch60: /vast/palmer/scratch/cudahy/pgc29/
    sos: /gpfs/gibbs/project/cudahy/pgc29/conda_envs/sos/bin/sos

But when I run !sos remote -v4 -c /home/pgc29/.sos/hosts.yml test r209u16n01 I still get

DEBUG: Path /gpfs/gibbs/project/cudahy/pgc29 is not under any specified paths of localhost and is mapped to /gpfs/gibbs/project/cudahy/pgc29 on remote host.
DEBUG: Path /gpfs/gibbs/project/cudahy/pgc29 is not under any specified paths of localhost and is mapped to /gpfs/gibbs/project/cudahy/pgc29 on remote host.
Alias:       r209u02n01
Address:     pgc29@localhost
Queue Type:  process
ssh:         OK
scp:         OK
sos:         OK
paths:       No path_map between local and remote host.
shared:      shared directory / not in path_map

I did notice that this is run from a notebook within /gpfs/gibbs/project/cudahy/pgc29. When I move the notebook to /vast/palmer/home.mccleary/pgc29/, !sos remote -v4 -c /home/pgc29/.sos/hosts.yml test r209u16n01 instead complains with

DEBUG: Path /vast/palmer/home.mccleary/pgc29 is not under any specified paths of localhost and is mapped to /vast/palmer/home.mccleary/pgc29 on remote host.
DEBUG: Path /vast/palmer/home.mccleary/pgc29 is not under any specified paths of localhost and is mapped to /vast/palmer/home.mccleary/pgc29 on remote host.
Alias:       r209u02n01
Address:     pgc29@localhost
Queue Type:  process
ssh:         OK
scp:         OK
sos:         OK
paths:       No path_map between local and remote host.
shared:      shared directory / not in path_map

Even with these changes, I still get the same Failed to locate /gpfs/gibbs/project/cudahy/pgc29/helen_mixed_infection/notebooks/.tmp_script_qfbs3foy.sos error when submitting real jobs.

BoPeng commented 1 year ago

@pgcudahy Could you try to execute the same workflow from the command line? Basically, could you please create a test.sos file with the [test] workflow you have, open a terminal from JupyterLab, and execute the workflow with sos run -r -q ... test.sos? Right now I suspect that sos-notebook removes the temporary script before the remote host reads and executes it.
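
For example, adapting the %sosrun line you posted above (assuming the workflow is saved as test.sos), something like:

# run from a JupyterLab terminal instead of a notebook cell
sos run test.sos test -v4 \
    -c /home/pgc29/.sos/hosts.yml \
    -q mccleary_scavenge -r mccleary_scavenge \
    mem="4GB" cores=1 walltime="24:00:00" nodes=1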

pgcudahy commented 1 year ago

Yup, submitting that way works just fine

BoPeng commented 1 year ago

OK, then this is a problem with sos-notebook. I am patching sos-notebook now.

BoPeng commented 1 year ago

@pgcudahy Please let me know if the problem has been addressed with sos notebook 0.24.1.

pgcudahy commented 1 year ago

Works well! Thanks so much for helping me with these edge cases.