snakemake / snakemake-executor-plugin-slurm

A Snakemake executor plugin for submitting jobs to a SLURM cluster
MIT License

Invalid --distribution specification? #103

Closed: brantfaircloth closed this issue 5 months ago

brantfaircloth commented 5 months ago

Good afternoon,

I'm using the snakemake-executor-plugin-slurm with a snakemake workflow meant for calling SNPs in genomic data (https://github.com/harvardinformatics/snpArcher). When snakemake attempts to submit jobs, those submissions are failing with the error message:

SLURM job submission failed. The error message was sbatch: error: Invalid --distribution specification

I've pored over the actual command being submitted (below) but cannot find why this particular error is being thrown - and a pointer to track this down would be super helpful. I've searched existing issues here, on the snakemake issues page, and also on the snparcher issues page, but I haven't tracked down anything similar or anything helpful (yet).

The version of slurm is 23.11.6. Happy to provide any additional information, as well. I realize this is less likely a bug and more likely something to do with how our university HPC is set up.

Thanks much, -brant

# The offending sbatch call:

sbatch --job-name 8cf30205-818c-4a01-8c15-ecf5ebe02650 --output /ddnA/work/brant/snpArcher-test/projects/anna-test/.snakemake/slurm_logs/rule_download_reference/GCA_019023105.1_LSU_DiBr_2.0_genomic.fna/%j.log --export=ALL --comment rule_download_reference_wildcards_GCA_019023105.1_LSU_DiBr_2.0_genomic.fna -A 'hpc_deepbayou' -p single -t 720 --mem 4000 --ntasks=1 --cpus-per-task=1 -D /ddnA/work/brant/snpArcher-test/projects/anna-test --wrap="/project/brant/db-home/miniconda/envs/snparcher/bin/python3.11 -m snakemake --snakefile /ddnA/work/brant/snpArcher-test/snpArcher/workflow/Snakefile --target-jobs 'download_reference:refGenome=GCA_019023105.1_LSU_DiBr_2.0_genomic.fna' --allowed-rules 'download_reference' --cores all --attempt 1 --force-use-threads  --resources 'mem_mb=4000' 'mem_mib=3815' 'disk_mb=1000' 'disk_mib=954' 'mem_mb_reduced=3600' --wait-for-files '/ddnA/work/brant/snpArcher-test/projects/anna-test/.snakemake/tmp.x2lj0io7' '/home/brant/work/snpArcher-test/projects/anna-test/reference' '/ddnA/work/brant/snpArcher-test/projects/anna-test/.snakemake/conda/8ecf006a88f493174cca4b84629295d3_' --force --target-files-omit-workdir-adjustment --keep-storage-local-copies --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --verbose  --rerun-triggers input params mtime code software-env --deployment-method conda --conda-frontend mamba --conda-base-path /project/brant/db-home/miniconda --apptainer-prefix /work/brant/.singularity/ --shared-fs-usage persistence software-deployment input-output sources source-cache storage-local-copies --wrapper-prefix https://github.com/snakemake/snakemake-wrappers/raw/ --latency-wait 100 --scheduler ilp --local-storage-prefix .snakemake/storage --scheduler-solver-path /project/brant/db-home/miniconda/envs/snparcher/bin --set-threads base64//ZG93bmxvYWRfcmVmZXJlbmNlPTE= base64//aW5kZXhfcmVmZXJlbmNlPTE= base64//Zm9ybWF0X2ludGVydmFsX2xpc3Q9MQ== base64//Y3JlYXRlX2d2Y2ZfaW50ZXJ2YWxzPTE= base64//Y3JlYXRlX2RiX2ludGVydmFscz0x base64//cGljYXJkX2ludGVydmFscz0x base64//Z2VubWFwPTEy base64//bWFwcGFiaWxpdHlfYmVkPTE= base64//Z2V0X2Zhc3RxX3BlPTEy base64//ZmFzdHA9MTI= base64//YndhX21hcD0xMg== base64//ZGVkdXA9MTI= base64//bWVyZ2VfYmFtcz0x base64//YmFtMmd2Y2Y9MQ== base64//Y29uY2F0X2d2Y2ZzPTE= base64//YmNmdG9vbHNfbm9ybT0x base64//Y3JlYXRlX2RiX21hcGZpbGU9MQ== base64//Z3ZjZjJEQj0x base64//REIydmNmPTE= base64//ZmlsdGVyVmNmcz0x base64//c29ydF9nYXRoZXJWY2ZzPTE= base64//Y29tcHV0ZV9kND0x base64//Y3JlYXRlX2Nvdl9iZWQ9MQ== base64//bWVyZ2VfZDQ9MQ== base64//YmFtX3N1bXN0YXRzPTE= base64//Y29sbGVjdF9jb3ZzdGF0cz0x base64//Y29sbGVjdF9mYXN0cF9zdGF0cz0x base64//Y29sbGVjdF9zdW1zdGF0cz0x base64//cWNfYWRtaXh0dXJlPTE= base64//cWNfY2hlY2tfZmFpPTE= base64//cWNfZ2VuZXJhdGVfY29vcmRzX2ZpbGU9MQ== base64//cWNfcGxpbms9MQ== base64//cWNfcWNfcGxvdHM9MQ== base64//cWNfc2V0dXBfYWRtaXh0dXJlPTE= base64//cWNfc3Vic2FtcGxlX3NucHM9MQ== base64//cWNfdmNmdG9vbHNfaW5kaXZpZHVhbHM9MQ== base64//bWtfZGVnZW5vdGF0ZT0x base64//bWtfcHJlcF9nZW5vbWU9MQ== base64//bWtfc3BsaXRfc2FtcGxlcz0x base64//cG9zdHByb2Nlc3Nfc3RyaWN0X2ZpbHRlcj0x base64//cG9zdHByb2Nlc3NfYmFzaWNfZmlsdGVyPTE= base64//cG9zdHByb2Nlc3NfZmlsdGVyX2luZGl2aWR1YWxzPTE= base64//cG9zdHByb2Nlc3Nfc3Vic2V0X2luZGVscz0x base64//cG9zdHByb2Nlc3Nfc3Vic2V0X3NucHM9MQ== base64//cG9zdHByb2Nlc3NfdXBkYXRlX2JlZD0x base64//dHJhY2todWJfYmNmdG9vbHNfZGVwdGg9MQ== base64//dHJhY2todWJfYmVkZ3JhcGhfdG9fYmlnd2lnPTE= base64//dHJhY2todWJfY2FsY19waT0x base64//dHJhY2todWJfY2FsY19zbnBkZW49MQ== base64//dHJhY2todWJfY2FsY190YWppbWE9MQ== base64//dHJhY2todWJfY2hyb21fc2l6ZXM9MQ== 
base64//dHJhY2todWJfY29udmVydF90b19iZWRncmFwaD0x base64//dHJhY2todWJfc3RyaXBfdmNmPTE= base64//dHJhY2todWJfdmNmdG9vbHNfZnJlcT0x base64//dHJhY2todWJfd3JpdGVfaHViX2ZpbGVzPTE= base64//c2VudGllb25fbWFwPTE= base64//c2VudGllb25fZGVkdXA9MQ== base64//c2VudGllb25faGFwbG90eXBlcj0x base64//c2VudGllb25fY29tYmluZV9ndmNmPTE= base64//c2VudGllb25fYmFtX3N0YXRzPTE= --default-resources base64//bWVtX21iPWF0dGVtcHQgKiA0MDAw base64//ZGlza19tYj1tYXgoMippbnB1dC5zaXplX21iLCAxMDAwKQ== base64//dG1wZGlyPXN5c3RlbV90bXBkaXI= base64//bWVtX21iX3JlZHVjZWQ9KGF0dGVtcHQgKiA0MDAwKSAqIDAuOQ== base64//c2x1cm1fcGFydGl0aW9uPXNpbmdsZQ== base64//c2x1cm1fYWNjb3VudD1ocGNfZGVlcGJheW91 base64//cnVudGltZT03MjA= --executor slurm-jobstep --jobs 1 --mode remote"
cmeesters commented 5 months ago

Hi,

edit: I noticed that you submitted from the head node, as intended. So, could you please run

$ sbatch test.sh

with test.sh being

#!/bin/bash
#SBATCH --job-name 8cf30205-818c-4a01-8c15-ecf5ebe02650
#SBATCH --output /ddnA/work/brant/snpArcher-test/projects/anna-test/.snakemake/slurm_logs/rule_download_reference/GCA_019023105.1_LSU_DiBr_2.0_genomic.fna/%j.log 
#SBATCH --export=ALL 
#SBATCH --comment rule_download_reference_wildcards_GCA_019023105.1_LSU_DiBr_2.0_genomic.fna 
#SBATCH -A 'hpc_deepbayou' -p single -t 720 
#SBATCH --mem 4000 
#SBATCH --ntasks=1 
#SBATCH --cpus-per-task=1

srun "Hello from $(hostname)"

please? You will notice that this is the exact submission command you implicitly used by running Snakemake.

I am curious whether it throws the same error.

brantfaircloth commented 5 months ago

Hi Christian,

Thanks for your help. I forgot to add, above, that I'm running on the head node at the moment because that's how our HPC staff prefer workflow engines like snakemake to be tested.

The command seems to run; here is the output written to the specified log file:

srun: lua: Submitted job 77901
slurmstepd: error: execve(): Hello from db002: No such file or directory
srun: error: db002: task 0: Exited with exit code 2

-b

brantfaircloth commented 5 months ago

After playing around with this a bit, it seems that, for whatever reason, the Python module flag -m enclosed in the --wrap="{stuff}" argument is being interpreted as the sbatch -m option, which is the short form of the sbatch --distribution option, and that is what causes the error.
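
For reference, sbatch's -m is indeed the short form of --distribution. A hypothetical illustration of how that interpretation could arise (made-up paths, not the workflow's actual call): if the value of --wrap reaches sbatch without its quoting, the tokens after the interpreter path are parsed as sbatch options:

# Hypothetical: the --wrap value has lost its quoting and been word-split.
sbatch --ntasks=1 --wrap=/path/to/python3.11 -m snakemake
# sbatch now parses "-m snakemake" as --distribution=snakemake, an invalid value:
#   sbatch: error: Invalid --distribution specification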

cmeesters commented 5 months ago

arrgh, of course I meant srun echo "Hello ...", but we can ignore this error.

OK, it works. But what is srun: lua: Submitted job ...? You did start with sbatch, didn't you?

Anyway, --distribution is a SLURM-specific option intended for MPI programs with non-standard rank topologies. I have no idea where this error of yours is being triggered. --wrap was chosen because otherwise we would need tricky, error-prone ad hoc job scripts.
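
For context, the two submission styles roughly compare as follows (an illustrative sketch with made-up arguments, not the plugin's actual code):

# Plugin's approach: sbatch itself wraps the command in a generated batch script.
sbatch --ntasks=1 --wrap="python3 -m snakemake --cores 1"

# Alternative the plugin avoids: write and submit a temporary job script by hand.
cat > job.sh <<'EOF'
#!/bin/bash
python3 -m snakemake --cores 1
EOF
sbatch --ntasks=1 job.sh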

The latest version of SLURM I have access to is 23.02.7; I will ask some contacts to carry out tests. This error is highly disturbing ...

cmeesters commented 5 months ago

PS: How did you arrive at your conclusion?

brantfaircloth commented 5 months ago

haha, yeah, I thought about prettying it up, but the output indicated it worked either way. I did submit w/ sbatch. I'm not sure about the lua aspect of sbatch; that's something that has shown up recently with upgrades to our queuing system. I think it relates to the lua job submit plugin (https://slurm.schedmd.com/job_submit_plugins.html).

That said, I wonder if there is a bug in that submit plugin that is altering the way "--wrap" should function. I'll see if it's possible to turn off that plugin for a test.
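
In the meantime, one read-only way to check which job submit plugin is configured (assuming scontrol is available to regular users, as it usually is):

# List the configured job submit plugin(s), if any:
scontrol show config | grep -i JobSubmitPlugins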

As for the -m, a simple test like this:

sbatch --job-name c1bc406d-e80f-444e-bb1a-91364f7e84a3 --output /ddnA/work/brant/snpArcher-test/projects/anna-test/.snakemake/slurm_logs/rule_download_reference/GCA_019023105.1_LSU_DiBr_2.0_genomic.fna/%j.log --export=ALL --comment rule_download_reference_wildcards_GCA_019023105.1_LSU_DiBr_2.0_genomic.fna -A 'hpc_deepbayou' -p single -t 720 --mem 4000 --ntasks=1 --cpus-per-task=1 -D /ddnA/work/brant/snpArcher-test/projects/anna-test \
--wrap="/project/brant/db-home/miniconda/envs/snparcher/bin/python3.11"

submits and runs without error (although it doesn't do anything):

sbatch: Job estimates 12.00 SUs for -p single --nodes=1 --ntasks=1 --cpus-per-task=1
sbatch: lua: Submitted job 77907
Submitted batch job 77907

while:

sbatch --job-name c1bc406d-e80f-444e-bb1a-91364f7e84a3 --output /ddnA/work/brant/snpArcher-test/projects/anna-test/.snakemake/slurm_logs/rule_download_reference/GCA_019023105.1_LSU_DiBr_2.0_genomic.fna/%j.log --export=ALL --comment rule_download_reference_wildcards_GCA_019023105.1_LSU_DiBr_2.0_genomic.fna -A 'hpc_deepbayou' -p single -t 720 --mem 4000 --ntasks=1 --cpus-per-task=1 -D /ddnA/work/brant/snpArcher-test/projects/anna-test \
--wrap="/project/brant/db-home/miniconda/envs/snparcher/bin/python3.11 -m snakemake"

produces the error that I've been seeing:

sbatch: error: Invalid --distribution specification
cmeesters commented 5 months ago

Thanks for your detailed feedback: I will test a subtle change. (Doubt it will work.)

brantfaircloth commented 5 months ago

I'm also chatting w/ our sysadmin who works on slurm to see if he has any suggestions/fixes.

brantfaircloth commented 5 months ago

Hi Christian,

I think we may have found the culprit: there was a site customization to the sbatch command that is/was stripping the quotes from around the value passed to --wrap. That caused sbatch to interpret the wrapped string as sbatch options rather than as a command to be wrapped in a shell script. Going to do some testing to confirm.
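
To make the failure mode concrete, here is a purely hypothetical sketch (not our site's actual customization) of how a shell wrapper around sbatch can strip the quoting by re-expanding its arguments unquoted:

#!/bin/bash
# Hypothetical wrapper around the real sbatch binary (path is made up).
# Broken variant: unquoted $@ word-splits --wrap="python3 -m snakemake ...",
# so sbatch itself parses "-m" and reports "Invalid --distribution specification".
#   exec /usr/bin/sbatch.real $@
# Fixed variant: "$@" preserves each argument as a single word, keeping --wrap intact.
exec /usr/bin/sbatch.real "$@"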

brantfaircloth commented 5 months ago

Yep, that did the trick. Thanks for your help and apologies for the bother! At the very least, if someone else hits the same issue, this could be a fix.

-b

cmeesters commented 5 months ago

Oh? That was rather fast. Sorry, for me, it was dinner time and today the kids had little for lunch, so no more work for today.

One more thing, though: there is no need to apologize, bugs do happen, and sometimes it's hard to find the real reason. I am, however, rather interested in learning the source and the remedy. Other than that, I am just glad it is working for you now.

brantfaircloth commented 5 months ago

Running perfectly now. Thanks again and have a good evening,

-b