snakemake / snakemake-executor-plugin-googlebatch

Snakemake executor plugin for Google Batch (under development)
MIT License

local() file not uploaded to storage #17

Closed · vsoch closed this 7 months ago

vsoch commented 7 months ago

@johanneskoester I was able to get my workflow to run by designating pi_MPI.c as local, e.g., the input here:

rule compile:
    input:
        local("pi_MPI.c"),
    output:
        "pi_MPI",
    log:
        "logs/compile.log",
    resources:
        mem_mb=0,
    shell:
        "mpicc -o {output} {input} &> {log}"

but then the file doesn't seem to be uploaded to storage and pulled down to the worker (it is not present in my working directory):

Retrieving input from storage.
Using shell: /usr/bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=0, mem_mib=0
Select jobs to execute...
Execute 1 jobs...
[Sat Dec  9 05:05:23 2023]
localrule compile:
    input: pi_MPI.c
    output: s3://snakemake-testing-llnl/pi_MPI (send to storage)
    log: s3://snakemake-testing-llnl/logs/compile.log (send to storage)
    jobid: 0
    reason: Forced execution
    resources: mem_mb=0, mem_mib=0, disk_mb=<TBD>, tmpdir=/tmp
...
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/snakemake/jobs.py", line 739, in prepare
    await wait_for_files(
  File "/opt/conda/lib/python3.11/site-packages/snakemake/io.py", line 922, in wait_for_files
    raise IOError(
OSError: Missing files after 5 seconds. This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait:
pi_MPI.c
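For context, the error above comes from snakemake's post-job file check: after a job runs, the expected files are polled for up to `--latency-wait` seconds before giving up. Below is a minimal sketch of that polling mechanism (an illustration only, not snakemake's actual implementation; the function name and parameters are simplified):

```python
import os
import time


def wait_for_files(paths, latency_wait=5, interval=0.1):
    """Poll until every path exists, giving up after latency_wait seconds.

    Sketch of the mechanism behind snakemake's "Missing files after N seconds"
    error -- not the real implementation.
    """
    deadline = time.time() + latency_wait
    while True:
        missing = [p for p in paths if not os.path.exists(p)]
        if not missing:
            return
        if time.time() >= deadline:
            # Mirrors the OSError raised in snakemake/io.py
            raise IOError(
                f"Missing files after {latency_wait} seconds: {', '.join(missing)}"
            )
        time.sleep(interval)
```

Here the file never appears because it was never staged at all, so raising `--latency-wait` would not have helped.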

I tried adding a copy step, but I don't think it worked:

rule copy:
    input:
        local("pi_MPI.c"),
    output:
        "pi_MPI.c",
    log:
        "logs/copy.log",
    resources:
        mem_mb=0,
    shell:
        "cp {input} {output} &> {log}"

I thought that might get me a little further, but it still couldn't find it:

    |     (await self.storage_object.managed_mtime()) if self.is_storage else None
    |      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/opt/conda/lib/python3.11/site-packages/snakemake_interface_storage_plugins/storage_object.py", line 158, in managed_mtime
    |     raise WorkflowError(f"Failed to get mtime of {self.query}", e)
    | snakemake_interface_common.exceptions.WorkflowError: Failed to get mtime of s3://snakemake-testing-llnl/pi_MPI.c
    | ClientError: An error occurred (404) when calling the HeadObject operation: Not Found
    +------------------------------------

To step back - how do I tell snakemake to take my local file, put it in storage, and then use it for this step?

johanneskoester commented 7 months ago

In principle you had the right intuition. I will have to test why this did not work.

johanneskoester commented 7 months ago

Actually, jobs with local input files should automatically become local jobs as well, and not run in the cloud.

johanneskoester commented 7 months ago

So, two things to change here. I will try to do that ASAP.

johanneskoester commented 7 months ago

Fix should be here: https://github.com/snakemake/snakemake/pull/2541

vsoch commented 7 months ago

Okay, I tried the run last night (with the copy step) and the file was 404 in storage. I'll try it again today (my computer shut down last night, and maybe that was related).

vsoch commented 7 months ago

Update (ran again)! Here is the current Snakefile:

# https://github.com/snakemake/snakemake/blob/main/tests/test_slurm_mpi/Snakefile
# Note that in reality, the mpi, account, and partition resources should be specified
# via --default-resources, in order to keep such infrastructure specific details out of the
# workflow definition.

localrules:
    all,
    clean,
    copy,

rule all:
    input:
        "pi.calc",

rule clean:
    shell:
        "rm -f pi.calc"

rule copy:
    input:
        local("pi_MPI.c"),
    output:
        "pi_MPI.c",
    log:
        "logs/copy.log",
    resources:
        mem_mb=0,
    shell:
        "cp {input} {output} &> {log}"

# TODO need to flag this with a wrapper only
rule compile:
    input:
        "pi_MPI.c",
    output:
        "pi_MPI",
    log:
        "logs/compile.log",
    resources:
        mem_mb=0,
    shell:
        "mpicc -o {output} {input} &> {log}"

rule calc_pi:
    input:
        "pi_MPI",
    output:
        "pi.calc",
    log:
        "logs/calc_pi.log",
    resources:
        mem_mb=0,
        tasks=1,
        mpi="mpiexec",
    shell:
        # todo where does ppn go?
        "{resources.mpi} -hostfile $BATCH_HOSTS_FILE -n {resources.tasks} {input} 10 > {output} 2> {log}"

I can see in my terminal that it knows to send to storage:

[Mon Dec 11 10:52:52 2023]
localrule copy:
    input: pi_MPI.c
    output: s3://snakemake-testing-llnl/pi_MPI.c (send to storage)
    log: s3://snakemake-testing-llnl/logs/copy.log (send to storage)
    jobid: 3
    reason: Missing output files: s3://snakemake-testing-llnl/pi_MPI.c (send to storage)
    resources: tmpdir=/tmp, mem_mb=0, mem_mib=0

The log is empty, and I don't see the file in storage.

To step back - is there any reason this small file (in my local PWD) would not be uploaded into the working directory context? Why do I need the explicit copy step? And given that I have the copy step, why does it report "green" as if it worked while the file isn't actually there? :thinking: Let me know what I might be doing wrong / what we should try next.

johanneskoester commented 7 months ago

Mhm, this is definitely some kind of bug. It will not "automatically" just upload any local files. It would rather automatically turn jobs with local files into local jobs (not running in the remote executor but on the host).

Can you please post the full log? I think it is related to the upload logic for local jobs. There might be a bug.

vsoch commented 7 months ago

Can you please post the full log? I think it is related to the upload logic for local jobs. There might be a bug.

The full log for copy? It appears to be an empty file:

cat example/hello-world-intel-mpi/.snakemake/storage/s3/snakemake-testing-llnl/logs/copy.log 
# no output

Here is what I see in the terminal:

$ snakemake --jobs 1 --executor googlebatch --googlebatch-region us-central1 --googlebatch-project llnl-flux --default-storage-provider s3 --default-storage-prefix s3://snakemake-testing-llnl --googlebatch-snippets intel-mpi
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided remote nodes: 1
Job stats:
job        count
-------  -------
all            1
calc_pi        1
compile        1
copy           1
total          4

Select jobs to execute...
Execute 1 jobs...

[Mon Dec 11 10:52:52 2023]
localrule copy:
    input: pi_MPI.c
    output: s3://snakemake-testing-llnl/pi_MPI.c (send to storage)
    log: s3://snakemake-testing-llnl/logs/copy.log (send to storage)
    jobid: 3
    reason: Missing output files: s3://snakemake-testing-llnl/pi_MPI.c (send to storage)
    resources: tmpdir=/tmp, mem_mb=0, mem_mib=0

[Mon Dec 11 10:52:53 2023]
Finished job 3.
1 of 4 steps (25%) done
Select jobs to execute...
Execute 1 jobs...

[Mon Dec 11 10:52:53 2023]
rule compile:
    input: s3://snakemake-testing-llnl/pi_MPI.c (retrieve from storage)
    output: s3://snakemake-testing-llnl/pi_MPI (send to storage)
    log: s3://snakemake-testing-llnl/logs/compile.log (send to storage)
    jobid: 2
    reason: Missing output files: s3://snakemake-testing-llnl/pi_MPI (send to storage); Input files updated by another job: s3://snakemake-testing-llnl/pi_MPI.c (retrieve from storage)
    resources: tmpdir=<TBD>, mem_mb=0, mem_mib=0

🌟️ Setup Command:
export HOME=/root
export PATH=/opt/conda/bin:${PATH}
export LANG=C.UTF-8
export SHELL=/bin/bash

sudo yum update -y
sudo yum install -y wget bzip2 ca-certificates gnupg2 squashfs-tools git
cat <<EOF > ./Snakefile
# https://github.com/snakemake/snakemake/blob/main/tests/test_slurm_mpi/Snakefile
# Note that in reality, the mpi, account, and partition resources should be specified
# via --default-resources, in order to keep such infrastructure specific details out of the
# workflow definition.

localrules:
    all,
    clean,
    copy,

rule all:
    input:
        "pi.calc",

rule clean:
    shell:
        "rm -f pi.calc"

rule copy:
    input:
        local("pi_MPI.c"),
    output:
        "pi_MPI.c",
    log:
        "logs/copy.log",
    resources:
        mem_mb=0,
    shell:
        "cp {input} {output} &> {log}"

# TODO need to flag this with a wrapper only
rule compile:
    input:
        "pi_MPI.c",
    output:
        "pi_MPI",
    log:
        "logs/compile.log",
    resources:
        mem_mb=0,
    shell:
        "mpicc -o {output} {input} &> {log}"

rule calc_pi:
    input:
        "pi_MPI",
    output:
        "pi.calc",
    log:
        "logs/calc_pi.log",
    resources:
        mem_mb=0,
        tasks=1,
        mpi="mpiexec",
    shell:
        # todo where does ppn go?
        "{resources.mpi} -hostfile $BATCH_HOSTS_FILE -n {resources.tasks} {input} 10 > {output} 2> {log}"
EOF
cat ./Snakefile
sleep $BATCH_TASK_INDEX

# Note that for this family / image, we are root (do not need sudo)
yum update -y && yum install -y cmake gcc tuned ethtool

# This ONLY works on the hpc-* image family images
google_mpi_tuning --nosmt
# google_install_mpi --intel_mpi
google_install_intelmpi --impi_2021
source /opt/intel/mpi/latest/env/vars.sh

# This is where they are installed to
# ls /opt/intel/mpi/latest/
export PATH=/opt/intel/mpi/latest/bin:$PATH
MPI_LD_PATH=/opt/intel/mpi/latest/lib:/opt/intel/mpi/latest/lib/release
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${MPI_LD_PATH}
# Only the main job should install conda (rest can use it)
echo "I am batch index ${BATCH_TASK_INDEX}"
export PATH=/opt/conda/bin:${PATH}
if [ $BATCH_TASK_INDEX = 0 ] && [ ! -d "/opt/conda" ] ; then
    workdir=$(pwd)
    url=https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    wget ${url} -O ./miniconda.sh
    chmod +x ./miniconda.sh
    bash ./miniconda.sh -b -u -p /opt/conda
    rm -rf ./miniconda.sh
    conda config --system --set channel_priority strict
    which python
    /opt/conda/bin/python --version
    url=https://github.com/snakemake/snakemake-interface-common
    git clone --depth 1 ${url} /tmp/snakemake-common
    cd /tmp/snakemake-common
    /opt/conda/bin/python -m pip install .
    url=https://github.com/snakemake/snakemake-interface-executor-plugins
    git clone --depth 1 ${url} /tmp/snakemake-plugin
    cd /tmp/snakemake-plugin
    /opt/conda/bin/python -m pip install .
    git clone --depth 1 https://github.com/snakemake/snakemake /tmp/snakemake
    cd /tmp/snakemake
    /opt/conda/bin/python -m pip install .
    cd ${workdir}
fi

🐍️ Snakemake Command:

export HOME=/root
export PATH=/opt/conda/bin:${PATH}
export LANG=C.UTF-8
export SHELL=/bin/bash

$(pwd)
ls
which snakemake || whereis snakemake
export PATH=/opt/intel/mpi/latest/bin:$PATH
MPI_LD_PATH=/opt/intel/mpi/latest/lib:/opt/intel/mpi/latest/lib/release
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${MPI_LD_PATH}

find /opt/intel -name mpicc

# This is important - it won't work without sourcing
source /opt/intel/mpi/latest/env/vars.sh

if [ $BATCH_TASK_INDEX = 0 ]; then
  ls
  which mpirun
  echo "pip install --target '.snakemake/pip-deployments' snakemake-storage-plugin-s3 && python -m snakemake --deploy-sources s3://snakemake-testing-llnl/snakemake-workflow-sources.0031744e6280b8613a91533c9ebe9530ee78e44f17f78854c9e678a1971d4203.tar.xz 0031744e6280b8613a91533c9ebe9530ee78e44f17f78854c9e678a1971d4203 --default-storage-prefix 's3://snakemake-testing-llnl' --default-storage-provider 's3' --storage-s3-retries 5 && python -m snakemake --snakefile 'Snakefile' --target-jobs 'compile:' --allowed-rules 'compile' --cores 'all' --attempt 1 --force-use-threads  --resources 'mem_mb=0' 'mem_mib=0'  --force --target-files-omit-workdir-adjustment --keep-storage-local-copies --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --rerun-triggers mtime params input code software-env --conda-frontend 'mamba' --shared-fs-usage 'none' --wrapper-prefix 'https://github.com/snakemake/snakemake-wrappers/raw/' --latency-wait 5 --scheduler 'greedy' --storage-s3-retries 5 --default-storage-prefix 's3://snakemake-testing-llnl' --default-storage-provider 's3' --default-resources 'tmpdir=system_tmpdir' --mode 'remote'"
  pip install --target '.snakemake/pip-deployments' snakemake-storage-plugin-s3 && python -m snakemake --deploy-sources s3://snakemake-testing-llnl/snakemake-workflow-sources.0031744e6280b8613a91533c9ebe9530ee78e44f17f78854c9e678a1971d4203.tar.xz 0031744e6280b8613a91533c9ebe9530ee78e44f17f78854c9e678a1971d4203 --default-storage-prefix 's3://snakemake-testing-llnl' --default-storage-provider 's3' --storage-s3-retries 5 && python -m snakemake --snakefile 'Snakefile' --target-jobs 'compile:' --allowed-rules 'compile' --cores 'all' --attempt 1 --force-use-threads  --resources 'mem_mb=0' 'mem_mib=0'  --force --target-files-omit-workdir-adjustment --keep-storage-local-copies --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --rerun-triggers mtime params input code software-env --conda-frontend 'mamba' --shared-fs-usage 'none' --wrapper-prefix 'https://github.com/snakemake/snakemake-wrappers/raw/' --latency-wait 5 --scheduler 'greedy' --storage-s3-retries 5 --default-storage-prefix 's3://snakemake-testing-llnl' --default-storage-provider 's3' --default-resources 'tmpdir=system_tmpdir' --mode 'remote'
fi
Job projects/llnl-flux/locations/us-central1/jobs/compile-4c2d49 has state SCHEDULED
STATUS_CHANGED: Job state is set from QUEUED to SCHEDULED for job projects/1040347784593/locations/us-central1/jobs/compile-4c2d49.
Job projects/llnl-flux/locations/us-central1/jobs/compile-4c2d49 has state SCHEDULED

You can safely ignore everything after copy - that is a lot of extra printing for my own FYI as I develop! Also note there are a bunch of these warnings all along the way (I wonder if that is related to the file not uploading?):

/home/vanessa/Desktop/Code/snek/snakemake-executor-plugin-googlebatch/env/lib/python3.11/site-packages/urllib3/connectionpool.py:1061: InsecureRequestWarning: Unverified HTTPS request is being made to host 'snakemake-testing-llnl.s3.amazonaws.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings

I can confirm that I don't see "copy" as a job in batch anymore, so it is running locally. Also, I don't know if this is relevant, but I see that file in the local storage cache directory? (I didn't put it there!) :laughing:

$ cat example/hello-world-intel-mpi/.snakemake/storage/s3/snakemake-testing-llnl/
logs/     pi_MPI.c  
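As an aside, the local path where that file shows up follows a predictable pattern: the storage machinery appears to mirror each query under `.snakemake/storage/<provider>/...`. A tiny helper capturing that mapping (the pattern is inferred from the paths pasted in this thread, not taken from snakemake's source):

```python
def cache_path(query: str) -> str:
    """Map a storage query to its local cache path, e.g.

        s3://bucket/key -> .snakemake/storage/s3/bucket/key

    Pattern inferred from the paths seen in this thread.
    """
    scheme, rest = query.split("://", 1)
    return f".snakemake/storage/{scheme}/{rest}"
```

So `cache_path("s3://snakemake-testing-llnl/pi_MPI.c")` gives exactly the `.snakemake/storage/s3/snakemake-testing-llnl/pi_MPI.c` location listed above.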

johanneskoester commented 7 months ago

Ok, I think I have fixed it in the main branch. The storage upload was deactivated for local jobs. Now it should properly upload the files.

Regarding this:

/home/vanessa/Desktop/Code/snek/snakemake-executor-plugin-googlebatch/env/lib/python3.11/site-packages/urllib3/connectionpool.py:1061: InsecureRequestWarning: Unverified HTTPS request is being made to host 'snakemake-testing-llnl.s3.amazonaws.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings

It is not related, but I would definitely love to fix this.

johanneskoester commented 7 months ago

It is a boto issue: https://github.com/boto/botocore/issues/2630

vsoch commented 7 months ago

That worked! We made it through the compile step, and I see pi_MPI in storage. Now it looks like the mpiexec step (same workflow above) is failing, but I don't see any error explaining why:

[screenshot from the Google Batch console omitted]

The error I see after that is a failure to delete it from storage (but it's an output file, so it was never generated in the first place). This is what I see locally, and I can't see any output files generated:

Job projects/llnl-flux/locations/us-central1/jobs/compile-23bd66 has state RUNNING
Job projects/llnl-flux/locations/us-central1/jobs/compile-23bd66 has state SUCCEEDED
[Tue Dec 12 05:52:34 2023]
Finished job 2.
2 of 4 steps (50%) done
Select jobs to execute...
Execute 1 jobs...

[Tue Dec 12 05:52:34 2023]
rule calc_pi:
    input: s3://snakemake-testing-llnl/pi_MPI (retrieve from storage)
    output: s3://snakemake-testing-llnl/pi.calc (send to storage)
    log: s3://snakemake-testing-llnl/logs/calc_pi.log (send to storage)
    jobid: 1
    reason: Missing output files: s3://snakemake-testing-llnl/pi.calc (send to storage); Input files updated by another job: s3://snakemake-testing-llnl/pi_MPI (retrieve from storage)
    resources: tmpdir=<TBD>, mem_mb=0, mem_mib=0, tasks=1, mpi=mpiexec

🌟️ Setup Command:

🐍️ Snakemake Command:
Job projects/llnl-flux/locations/us-central1/jobs/calc-pi-0689fe has state SCHEDULED
STATUS_CHANGED: Job state is set from QUEUED to SCHEDULED for job projects/1040347784593/locations/us-central1/jobs/calc-pi-0689fe.
Job projects/llnl-flux/locations/us-central1/jobs/calc-pi-0689fe has state SCHEDULED
[... SCHEDULED line repeated ...]
Job projects/llnl-flux/locations/us-central1/jobs/calc-pi-0689fe has state RUNNING
[... RUNNING line repeated while the job ran ...]
Job projects/llnl-flux/locations/us-central1/jobs/calc-pi-0689fe has state FAILED
[Tue Dec 12 06:01:57 2023]
Error in rule calc_pi:
    message: Google Batch job 'projects/llnl-flux/locations/us-central1/jobs/calc-pi-0689fe' failed. For further error details see the cluster/cloud log and the log files of the involved rule(s).
    jobid: 1
    input: s3://snakemake-testing-llnl/pi_MPI (retrieve from storage)
    output: s3://snakemake-testing-llnl/pi.calc (send to storage)
    log: s3://snakemake-testing-llnl/logs/calc_pi.log (send to storage), .snakemake/googlebatch_logs/calc_pi.log (check log file(s) for error details)
    shell:
        mpiexec -hostfile $BATCH_HOSTS_FILE -n 1 .snakemake/storage/s3/snakemake-testing-llnl/pi_MPI 10 > .snakemake/storage/s3/snakemake-testing-llnl/pi.calc 2> .snakemake/storage/s3/snakemake-testing-llnl/logs/calc_pi.log
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    external_jobid: projects/llnl-flux/locations/us-central1/jobs/calc-pi-0689fe

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2023-12-12T054309.090988.snakemake.log
WorkflowError:
At least one job did not complete successfully.

I'm guessing mpiexec returned a non-zero exit code, but it's not clear why (there is no log or clear error, just the message that the output is missing from storage). Is there a good way to debug this?

johanneskoester commented 7 months ago

Indeed, mpiexec seems to fail. I would try it locally first; that should be easier to debug. I am by no means an MPI expert, so I am of little help here. Have you checked the log file you pipe into? (Could it be that it is not uploaded right? That is an error I can fix.)

johanneskoester commented 7 months ago

I think I have a fix here that should make the log file available: https://github.com/snakemake/snakemake/pull/2545

vsoch commented 7 months ago

Okay, I ran that - where should the log be? I don't see anything on the local machine.

 ls .snakemake/storage/s3/snakemake-testing-llnl/logs/
copy.log

And the log referenced:

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2023-12-15T084605.597382.snakemake.log
WorkflowError:
At least one job did not complete successfully.

.snakemake/log/2023-12-15T084605.597382.snakemake.log is available, but it contains the same content as what I see in the terminal. Is there something in s3 that isn't local?

vsoch commented 7 months ago

Looking at the code - I think there might be! I can't connect to my storage at the moment (VPN not working) but will check a bit later and update here.

vsoch commented 7 months ago

Okay, I was able to log into the storage, but I still only see copy.log (so I don't think it's saving anything). I think I'll ask Shamel how to debug this.

vsoch commented 7 months ago

Okay, I'm going to first try removing all piping into a log (to see if I can see output in the console). And then I'm going to live dangerously: set everything up, try adding a sleep, and see if I can shell interactively into an instance.

vsoch commented 7 months ago

I think I see a possible error - it seems the binary was never made executable:


[proxy:0:0@calc-pi-784bed-31d00cde-6651-402c-8320-group0-0-cx12] HYD_spawn (../../../../../src/pm/i_hydra/libhydra/spawn/intel/hydra_spawn.c:151): execvp error on file .snakemake/storage/s3/snakemake-testing-llnl/pi_MPI (Permission denied)

That's strange. I'll look at the file to make sure it's actually a binary, and then ensure it is made executable.
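If the execute bit is the culprit, the storage round-trip would explain it: S3 objects carry no POSIX permission bits, so a binary retrieved from storage lands on disk without +x. A minimal reproduction of the execvp "Permission denied" failure and the chmod fix (the temp file here is just a stand-in for the staged pi_MPI binary):

```python
import os
import stat
import subprocess
import tempfile

# Stand-in for the staged pi_MPI binary: a script written without the execute
# bit, mimicking a file that just came back from S3 (which stores no permissions).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as f:
    f.write("#!/bin/sh\necho ok\n")

try:
    subprocess.run([path], check=True)
except PermissionError:
    # Same failure mode as the execvp error in the hydra log above.
    pass

# Restore the execute bit, as a rule's shell command could do with `chmod +x`.
os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
out = subprocess.run([path], capture_output=True, text=True).stdout
print(out)  # prints: ok
os.unlink(path)
```

A `chmod +x {input}` prepended to the calc_pi shell command would be one pragmatic workaround under this assumption.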

vsoch commented 7 months ago

Okay, this particular problem seems fixed - the workflow finishes successfully, but now I'm debugging an empty output file. Closing - thanks for all the help here!

mbrenner-arbor commented 2 months ago

Can I get a rule to run remotely when one of its inputs/outputs is flagged as local?