Closed: vsoch closed this issue 7 months ago
In principle you had the right intuition; I will have to test why this did not work.
Actually, jobs with local input files should automatically become local jobs as well, and not run in the cloud.
So, there are two things to change here. I will try to do that ASAP.
Fix should be here: https://github.com/snakemake/snakemake/pull/2541
okay I tried the run last night (with the copy step) and the file was a 404 in storage - I'll try it again today (I think my computer shut down last night, and maybe that was related).
Update (ran again)! Here is the current Snakefile:
# https://github.com/snakemake/snakemake/blob/main/tests/test_slurm_mpi/Snakefile
# Note that in reality, the mpi, account, and partition resources should be specified
# via --default-resources, in order to keep such infrastructure specific details out of the
# workflow definition.
localrules:
    all,
    clean,
    copy,


rule all:
    input:
        "pi.calc",


rule clean:
    shell:
        "rm -f pi.calc"


rule copy:
    input:
        local("pi_MPI.c"),
    output:
        "pi_MPI.c",
    log:
        "logs/copy.log",
    resources:
        mem_mb=0,
    shell:
        "cp {input} {output} &> {log}"


# TODO need to flag this with a wrapper only
rule compile:
    input:
        "pi_MPI.c",
    output:
        "pi_MPI",
    log:
        "logs/compile.log",
    resources:
        mem_mb=0,
    shell:
        "mpicc -o {output} {input} &> {log}"


rule calc_pi:
    input:
        "pi_MPI",
    output:
        "pi.calc",
    log:
        "logs/calc_pi.log",
    resources:
        mem_mb=0,
        tasks=1,
        mpi="mpiexec",
    shell:
        # todo where does ppn go?
        "{resources.mpi} -hostfile $BATCH_HOSTS_FILE -n {resources.tasks} {input} 10 > {output} 2> {log}"
I can see in my terminal that it knows to send to storage:
[Mon Dec 11 10:52:52 2023]
localrule copy:
input: pi_MPI.c
output: s3://snakemake-testing-llnl/pi_MPI.c (send to storage)
log: s3://snakemake-testing-llnl/logs/copy.log (send to storage)
jobid: 3
reason: Missing output files: s3://snakemake-testing-llnl/pi_MPI.c (send to storage)
resources: tmpdir=/tmp, mem_mb=0, mem_mib=0
The log is empty, and I don't see the file in storage.
To step back - is there any reason this small file (in my local PWD) would not be uploaded into the working directory context? Why do I need the explicit copy step? And given that I have the copy step, why does it report "green" as if it's working when the file isn't in storage? :thinking: Let me know what I might be doing wrong / what we should try next.
Mhm, this is definitely some kind of bug. Snakemake will not "automatically" just upload any local files; rather, it automatically turns jobs with local files into local jobs (run on the host instead of via the remote executor).
Can you please post the full log? I think it is related to the upload logic for local jobs. There might be a bug.
The full log for copy? It appears to be an empty file:
cat example/hello-world-intel-mpi/.snakemake/storage/s3/snakemake-testing-llnl/logs/copy.log
# no output
Here is what I see in the terminal:
$ snakemake --jobs 1 --executor googlebatch --googlebatch-region us-central1 --googlebatch-project llnl-flux --default-storage-provider s3 --default-storage-prefix s3://snakemake-testing-llnl --googlebatch-snippets intel-mpi
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided remote nodes: 1
Job stats:
job        count
-------  -------
all            1
calc_pi        1
compile        1
copy           1
total          4
Select jobs to execute...
Execute 1 jobs...
[Mon Dec 11 10:52:52 2023]
localrule copy:
input: pi_MPI.c
output: s3://snakemake-testing-llnl/pi_MPI.c (send to storage)
log: s3://snakemake-testing-llnl/logs/copy.log (send to storage)
jobid: 3
reason: Missing output files: s3://snakemake-testing-llnl/pi_MPI.c (send to storage)
resources: tmpdir=/tmp, mem_mb=0, mem_mib=0
[Mon Dec 11 10:52:53 2023]
Finished job 3.
1 of 4 steps (25%) done
Select jobs to execute...
Execute 1 jobs...
[Mon Dec 11 10:52:53 2023]
rule compile:
input: s3://snakemake-testing-llnl/pi_MPI.c (retrieve from storage)
output: s3://snakemake-testing-llnl/pi_MPI (send to storage)
log: s3://snakemake-testing-llnl/logs/compile.log (send to storage)
jobid: 2
reason: Missing output files: s3://snakemake-testing-llnl/pi_MPI (send to storage); Input files updated by another job: s3://snakemake-testing-llnl/pi_MPI.c (retrieve from storage)
resources: tmpdir=<TBD>, mem_mb=0, mem_mib=0
🌟️ Setup Command:
export HOME=/root
export PATH=/opt/conda/bin:${PATH}
export LANG=C.UTF-8
export SHELL=/bin/bash
sudo yum update -y
sudo yum install -y wget bzip2 ca-certificates gnupg2 squashfs-tools git
cat <<EOF > ./Snakefile
# https://github.com/snakemake/snakemake/blob/main/tests/test_slurm_mpi/Snakefile
# Note that in reality, the mpi, account, and partition resources should be specified
# via --default-resources, in order to keep such infrastructure specific details out of the
# workflow definition.
localrules:
    all,
    clean,
    copy,


rule all:
    input:
        "pi.calc",


rule clean:
    shell:
        "rm -f pi.calc"


rule copy:
    input:
        local("pi_MPI.c"),
    output:
        "pi_MPI.c",
    log:
        "logs/copy.log",
    resources:
        mem_mb=0,
    shell:
        "cp {input} {output} &> {log}"


# TODO need to flag this with a wrapper only
rule compile:
    input:
        "pi_MPI.c",
    output:
        "pi_MPI",
    log:
        "logs/compile.log",
    resources:
        mem_mb=0,
    shell:
        "mpicc -o {output} {input} &> {log}"


rule calc_pi:
    input:
        "pi_MPI",
    output:
        "pi.calc",
    log:
        "logs/calc_pi.log",
    resources:
        mem_mb=0,
        tasks=1,
        mpi="mpiexec",
    shell:
        # todo where does ppn go?
        "{resources.mpi} -hostfile $BATCH_HOSTS_FILE -n {resources.tasks} {input} 10 > {output} 2> {log}"
EOF
cat ./Snakefile
sleep $BATCH_TASK_INDEX
# Note that for this family / image, we are root (do not need sudo)
yum update -y && yum install -y cmake gcc tuned ethtool
# This ONLY works on the hpc-* image family images
google_mpi_tuning --nosmt
# google_install_mpi --intel_mpi
google_install_intelmpi --impi_2021
source /opt/intel/mpi/latest/env/vars.sh
# This is where they are installed to
# ls /opt/intel/mpi/latest/
export PATH=/opt/intel/mpi/latest/bin:$PATH
MPI_LD_PATH=/opt/intel/mpi/latest/lib:/opt/intel/mpi/latest/lib/release
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${MPI_LD_PATH}
# Only the main job should install conda (rest can use it)
echo "I am batch index ${BATCH_TASK_INDEX}"
export PATH=/opt/conda/bin:${PATH}
if [ $BATCH_TASK_INDEX = 0 ] && [ ! -d "/opt/conda" ] ; then
workdir=$(pwd)
url=https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
wget ${url} -O ./miniconda.sh
chmod +x ./miniconda.sh
bash ./miniconda.sh -b -u -p /opt/conda
rm -rf ./miniconda.sh
conda config --system --set channel_priority strict
which python
/opt/conda/bin/python --version
url=https://github.com/snakemake/snakemake-interface-common
git clone --depth 1 ${url} /tmp/snakemake-common
cd /tmp/snakemake-common
/opt/conda/bin/python -m pip install .
url=https://github.com/snakemake/snakemake-interface-executor-plugins
git clone --depth 1 ${url} /tmp/snakemake-plugin
cd /tmp/snakemake-plugin
/opt/conda/bin/python -m pip install .
git clone --depth 1 https://github.com/snakemake/snakemake /tmp/snakemake
cd /tmp/snakemake
/opt/conda/bin/python -m pip install .
cd ${workdir}
fi
🐍️ Snakemake Command:
export HOME=/root
export PATH=/opt/conda/bin:${PATH}
export LANG=C.UTF-8
export SHELL=/bin/bash
$(pwd)
ls
which snakemake || whereis snakemake
export PATH=/opt/intel/mpi/latest/bin:$PATH
MPI_LD_PATH=/opt/intel/mpi/latest/lib:/opt/intel/mpi/latest/lib/release
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${MPI_LD_PATH}
find /opt/intel -name mpicc
# This is important - it won't work without sourcing
source /opt/intel/mpi/latest/env/vars.sh
if [ $BATCH_TASK_INDEX = 0 ]; then
ls
which mpirun
echo "pip install --target '.snakemake/pip-deployments' snakemake-storage-plugin-s3 && python -m snakemake --deploy-sources s3://snakemake-testing-llnl/snakemake-workflow-sources.0031744e6280b8613a91533c9ebe9530ee78e44f17f78854c9e678a1971d4203.tar.xz 0031744e6280b8613a91533c9ebe9530ee78e44f17f78854c9e678a1971d4203 --default-storage-prefix 's3://snakemake-testing-llnl' --default-storage-provider 's3' --storage-s3-retries 5 && python -m snakemake --snakefile 'Snakefile' --target-jobs 'compile:' --allowed-rules 'compile' --cores 'all' --attempt 1 --force-use-threads --resources 'mem_mb=0' 'mem_mib=0' --force --target-files-omit-workdir-adjustment --keep-storage-local-copies --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --rerun-triggers mtime params input code software-env --conda-frontend 'mamba' --shared-fs-usage 'none' --wrapper-prefix 'https://github.com/snakemake/snakemake-wrappers/raw/' --latency-wait 5 --scheduler 'greedy' --storage-s3-retries 5 --default-storage-prefix 's3://snakemake-testing-llnl' --default-storage-provider 's3' --default-resources 'tmpdir=system_tmpdir' --mode 'remote'"
pip install --target '.snakemake/pip-deployments' snakemake-storage-plugin-s3 && python -m snakemake --deploy-sources s3://snakemake-testing-llnl/snakemake-workflow-sources.0031744e6280b8613a91533c9ebe9530ee78e44f17f78854c9e678a1971d4203.tar.xz 0031744e6280b8613a91533c9ebe9530ee78e44f17f78854c9e678a1971d4203 --default-storage-prefix 's3://snakemake-testing-llnl' --default-storage-provider 's3' --storage-s3-retries 5 && python -m snakemake --snakefile 'Snakefile' --target-jobs 'compile:' --allowed-rules 'compile' --cores 'all' --attempt 1 --force-use-threads --resources 'mem_mb=0' 'mem_mib=0' --force --target-files-omit-workdir-adjustment --keep-storage-local-copies --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --rerun-triggers mtime params input code software-env --conda-frontend 'mamba' --shared-fs-usage 'none' --wrapper-prefix 'https://github.com/snakemake/snakemake-wrappers/raw/' --latency-wait 5 --scheduler 'greedy' --storage-s3-retries 5 --default-storage-prefix 's3://snakemake-testing-llnl' --default-storage-provider 's3' --default-resources 'tmpdir=system_tmpdir' --mode 'remote'
fi
Job projects/llnl-flux/locations/us-central1/jobs/compile-4c2d49 has state SCHEDULED
STATUS_CHANGED: Job state is set from QUEUED to SCHEDULED for job projects/1040347784593/locations/us-central1/jobs/compile-4c2d49.
Job projects/llnl-flux/locations/us-central1/jobs/compile-4c2d49 has state SCHEDULED
You can safely ignore everything after copy - that's a lot of extra printing for my own benefit as I develop! Also note that there are a bunch of these warnings along the way (I wonder if that is related to the file not uploading?):
/home/vanessa/Desktop/Code/snek/snakemake-executor-plugin-googlebatch/env/lib/python3.11/site-packages/urllib3/connectionpool.py:1061: InsecureRequestWarning: Unverified HTTPS request is being made to host 'snakemake-testing-llnl.s3.amazonaws.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
I can confirm that I don't see "copy" as a job in batch anymore, so it is running locally. Also I don't know if this is relevant, but I see that file in the logs directory? (I didn't put it there!) :laughing:
$ ls example/hello-world-intel-mpi/.snakemake/storage/s3/snakemake-testing-llnl/
logs/ pi_MPI.c
Ok, I think I have fixed it in the main branch. The storage upload was deactivated for local jobs. Now it should properly upload the files.
Regarding this:
/home/vanessa/Desktop/Code/snek/snakemake-executor-plugin-googlebatch/env/lib/python3.11/site-packages/urllib3/connectionpool.py:1061: InsecureRequestWarning: Unverified HTTPS request is being made to host 'snakemake-testing-llnl.s3.amazonaws.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
It is not related, but I would definitely love to fix this.
It is a boto issue: https://github.com/boto/botocore/issues/2630
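As a stopgap (not a fix - the proper fix is enabling certificate verification, per the boto issue), Python honors the `PYTHONWARNINGS` environment variable, so the warning category can be silenced without code changes, as long as the variable is exported before the snakemake process starts. A sketch using `UserWarning` as a stand-in category; for the real case the value would be `ignore::urllib3.exceptions.InsecureRequestWarning`:

```shell
# Silence a specific warning category via the environment. UserWarning is a
# stand-in here; urllib3's InsecureRequestWarning works the same way, since
# Python resolves the category by importing it at interpreter startup.
export PYTHONWARNINGS="ignore::UserWarning"
python3 -c 'import warnings; warnings.warn("unverified HTTPS request")'
echo "no warning was printed above"
```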
That worked! We made it through the compile step, and I see pi_MPI in storage. Now it looks like mpiexec (same workflow as above) is failing, but I don't see any error explaining why:
The error I see after that is a failure to delete it from storage (but it's an output file, so it wasn't generated yet). This is what I see locally, and I can't see any output files generated:
Job projects/llnl-flux/locations/us-central1/jobs/compile-23bd66 has state RUNNING
Job projects/llnl-flux/locations/us-central1/jobs/compile-23bd66 has state SUCCEEDED
[Tue Dec 12 05:52:34 2023]
Finished job 2.
2 of 4 steps (50%) done
Select jobs to execute...
Execute 1 jobs...
[Tue Dec 12 05:52:34 2023]
rule calc_pi:
input: s3://snakemake-testing-llnl/pi_MPI (retrieve from storage)
output: s3://snakemake-testing-llnl/pi.calc (send to storage)
log: s3://snakemake-testing-llnl/logs/calc_pi.log (send to storage)
jobid: 1
reason: Missing output files: s3://snakemake-testing-llnl/pi.calc (send to storage); Input files updated by another job: s3://snakemake-testing-llnl/pi_MPI (retrieve from storage)
resources: tmpdir=<TBD>, mem_mb=0, mem_mib=0, tasks=1, mpi=mpiexec
🌟️ Setup Command:
🐍️ Snakemake Command:
Job projects/llnl-flux/locations/us-central1/jobs/calc-pi-0689fe has state SCHEDULED
STATUS_CHANGED: Job state is set from QUEUED to SCHEDULED for job projects/1040347784593/locations/us-central1/jobs/calc-pi-0689fe.
Job projects/llnl-flux/locations/us-central1/jobs/calc-pi-0689fe has state SCHEDULED
[... same SCHEDULED line repeated while the job waits ...]
Job projects/llnl-flux/locations/us-central1/jobs/calc-pi-0689fe has state RUNNING
[... same RUNNING line repeated while the job runs ...]
Job projects/llnl-flux/locations/us-central1/jobs/calc-pi-0689fe has state FAILED
[Tue Dec 12 06:01:57 2023]
Error in rule calc_pi:
message: Google Batch job 'projects/llnl-flux/locations/us-central1/jobs/calc-pi-0689fe' failed. For further error details see the cluster/cloud log and the log files of the involved rule(s).
jobid: 1
input: s3://snakemake-testing-llnl/pi_MPI (retrieve from storage)
output: s3://snakemake-testing-llnl/pi.calc (send to storage)
log: s3://snakemake-testing-llnl/logs/calc_pi.log (send to storage), .snakemake/googlebatch_logs/calc_pi.log (check log file(s) for error details)
shell:
mpiexec -hostfile $BATCH_HOSTS_FILE -n 1 .snakemake/storage/s3/snakemake-testing-llnl/pi_MPI 10 > .snakemake/storage/s3/snakemake-testing-llnl/pi.calc 2> .snakemake/storage/s3/snakemake-testing-llnl/logs/calc_pi.log
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
external_jobid: projects/llnl-flux/locations/us-central1/jobs/calc-pi-0689fe
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2023-12-12T054309.090988.snakemake.log
WorkflowError:
At least one job did not complete successfully.
I'm guessing mpiexec returned a non-zero exit code, but it's not clear why (it doesn't give me a log or clear error, just tells me it can't find the output for storage). Is there a good way to debug this?
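One reason the console shows no cause: the rule redirects stderr into the log (`2> {log}`), and under strict mode the job aborts on the first failing command, so the real error text lives in the log file, not the terminal. A stand-in demonstration with a deliberately bogus command name:

```shell
set -eu   # Snakemake's strict mode additionally sets pipefail

log=$(mktemp)
# The failing command's stderr goes into the log, so the console only
# sees the generic failure message, not the actual cause.
( some_bogus_command_xyz 2> "$log" ) || echo "command failed; the cause is in the log"
cat "$log"   # e.g. "some_bogus_command_xyz: command not found"
rm -f "$log"
```

If that log never makes it back to storage, the error is effectively invisible, which matches what happens here.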
Indeed, mpiexec seems to fail. I would try it locally first; that should be easier to debug. I am by no means an MPI expert, so I am of little help here. Have you checked the log file you pipe into? (Could it be that it is not uploaded correctly? That would be an error I can fix.)
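A minimal local repro along those lines might look like the following. It assumes an MPI toolchain (e.g. OpenMPI or Intel MPI) is on the PATH and that pi_MPI.c is in the current directory, hence the guard:

```shell
# Run the exact compile + run steps from the workflow locally, outside of
# Google Batch, where stderr is visible on the console.
if command -v mpicc >/dev/null 2>&1 && [ -f pi_MPI.c ]; then
    mpicc -o pi_MPI pi_MPI.c
    mpiexec -n 1 ./pi_MPI 10 > pi.calc
    echo "mpiexec exit code: $?"
else
    echo "mpicc or pi_MPI.c not available; install an MPI implementation to reproduce locally"
fi
```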
I think I have a fix here that should make the log file available: https://github.com/snakemake/snakemake/pull/2545
okay ran that - where should the log be? I don't see anything on the local machine:
ls .snakemake/storage/s3/snakemake-testing-llnl/logs/
copy.log
And the log referenced:
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2023-12-15T084605.597382.snakemake.log
WorkflowError:
At least one job did not complete successfully.
.snakemake/log/2023-12-15T084605.597382.snakemake.log is available, but it has the same content as what I see in the terminal. Is there something in S3 that isn't local?
Looking at the code - I think there might be! I can't connect to my storage atm (VPN not working) but will check a bit later and update here.
okay I was able to log into the storage, but I still only see the copy.log (so I don't think it's saving anything). I think I'll ask shamel how to debug this.
okay I'm going to first try removing all piping into a log (to see if errors show up in the console). And then I'm going to live dangerously: set everything up, add a sleep, and see if I can shell interactively into an instance.
I think I see a possible error - it seems the binary was never made executable:
[proxy:0:0@calc-pi-784bed-31d00cde-6651-402c-8320-group0-0-cx12] HYD_spawn (../../../../../src/pm/i_hydra/libhydra/spawn/intel/hydra_spawn.c:151): execvp error on file .snakemake/storage/s3/snakemake-testing-llnl/pi_MPI (Permission denied)
That's strange. I'll look at the file to make sure it's actually a binary, and then ensure it is made executable.
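That would be consistent with the execvp error: S3 object storage does not preserve POSIX permission bits, so a binary retrieved from storage can come back without the executable bit. A throwaway-script demonstration (the temp file stands in for the retrieved pi_MPI binary):

```shell
tmp=$(mktemp)                         # stand-in for .snakemake/storage/s3/<bucket>/pi_MPI
printf '#!/bin/sh\necho 3.14\n' > "$tmp"
chmod -x "$tmp"                       # simulate the permission bits lost in storage

"$tmp" 2>/dev/null || echo "Permission denied, as in the execvp error"
chmod +x "$tmp"                       # the fix: restore +x after retrieval
"$tmp"                                # prints 3.14
rm -f "$tmp"
```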
okay this particular problem seems fixed - the workflow finishes successfully, but now I'm debugging an empty output file. Closing - thanks for all the help here!
Can I get a rule to run remotely when one of its inputs/outputs is flagged as local?
@johanneskoester I was able to get my workflow to run by designating the pi_MPI.c as local, e.g., the input here:
but then it doesn't seem to be uploaded to storage and pulled down to the worker (and it is not present in my working directory):
I tried adding a copy step, but I don't think it worked:
I thought that might get me a little further, but it still couldn't find it:
To step back - how do I tell Snakemake to take my local file, put it in storage, and then use it for this step?
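For reference, the pattern this thread converged on: keep the file out of storage with `local()`, and add a local copy rule whose plain output path gets rewritten to the default storage prefix, so the upload happens through the normal output handling (names are the ones from the example above; this relies on the upload-for-local-jobs fix from the main branch):

```
localrules:
    copy,

rule copy:
    input:
        local("pi_MPI.c"),   # stays on the submission host
    output:
        "pi_MPI.c",          # rewritten to s3://<prefix>/pi_MPI.c via --default-storage-provider
    shell:
        "cp {input} {output}"
```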