harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.
MIT License
63 stars 30 forks

Invalid --distribution specification? #202

Closed · brantfaircloth closed 6 days ago

brantfaircloth commented 1 week ago

Hi y'all,

Working to get snpArcher running on our HPC and have bumped up against a problem with batch job submission through snakemake-executor-plugin-slurm. The issue is that sbatch commands fail with a somewhat cryptic error that I can't track down:

SLURM job submission failed. The error message was sbatch: error: Invalid --distribution specification

I've submitted an issue upstream to the snakemake-executor-plugin-slurm crew to see if they have any suggestions and will post back if I get it sorted.

Thanks, -brant

cademirch commented 1 week ago

That's a new one to me. Right idea opening an issue there too. Curious if you're able to run any workflow using the plugin? Here is a simple one to test:

Snakefile:

rule all:
    input: expand("test_output/hi_{i}.txt", i=range(4))

rule a:
    output: 
        "test_output/hi_{i}.txt"
    shell:
        """
        echo {wildcards.i} > {output}
        """

Other info that would be helpful: how did you run/submit the actual Snakemake run itself?

brantfaircloth commented 1 week ago

It's a bit of a long story, but I'm working with HPC staff to figure out how they prefer jobs be submitted. At the moment [on their advice, as we test] I'm submitting on the head node and monitoring. The call to run is/was:

snakemake --verbose -s snpArcher/workflow/Snakefile -d projects/anna-test --workflow-profile snpArcher/profiles/slurm

As I posted in the issue for the plugin, the offending sbatch call is:

sbatch --job-name 8cf30205-818c-4a01-8c15-ecf5ebe02650 --output /ddnA/work/brant/snpArcher-test/projects/anna-test/.snakemake/slurm_logs/rule_download_reference/GCA_019023105.1_LSU_DiBr_2.0_genomic.fna/%j.log --export=ALL --comment rule_download_reference_wildcards_GCA_019023105.1_LSU_DiBr_2.0_genomic.fna -A 'hpc_deepbayou' -p single -t 720 --mem 4000 --ntasks=1 --cpus-per-task=1 -D /ddnA/work/brant/snpArcher-test/projects/anna-test --wrap="/project/brant/db-home/miniconda/envs/snparcher/bin/python3.11 -m snakemake --snakefile /ddnA/work/brant/snpArcher-test/snpArcher/workflow/Snakefile --target-jobs 'download_reference:refGenome=GCA_019023105.1_LSU_DiBr_2.0_genomic.fna' --allowed-rules 'download_reference' --cores all --attempt 1 --force-use-threads  --resources 'mem_mb=4000' 'mem_mib=3815' 'disk_mb=1000' 'disk_mib=954' 'mem_mb_reduced=3600' --wait-for-files '/ddnA/work/brant/snpArcher-test/projects/anna-test/.snakemake/tmp.x2lj0io7' '/home/brant/work/snpArcher-test/projects/anna-test/reference' '/ddnA/work/brant/snpArcher-test/projects/anna-test/.snakemake/conda/8ecf006a88f493174cca4b84629295d3_' --force --target-files-omit-workdir-adjustment --keep-storage-local-copies --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --verbose  --rerun-triggers input params mtime code software-env --deployment-method conda --conda-frontend mamba --conda-base-path /project/brant/db-home/miniconda --apptainer-prefix /work/brant/.singularity/ --shared-fs-usage persistence software-deployment input-output sources source-cache storage-local-copies --wrapper-prefix https://github.com/snakemake/snakemake-wrappers/raw/ --latency-wait 100 --scheduler ilp --local-storage-prefix .snakemake/storage --scheduler-solver-path /project/brant/db-home/miniconda/envs/snparcher/bin --set-threads base64//ZG93bmxvYWRfcmVmZXJlbmNlPTE= base64//aW5kZXhfcmVmZXJlbmNlPTE= base64//Zm9ybWF0X2ludGVydmFsX2xpc3Q9MQ== base64//Y3JlYXRlX2d2Y2ZfaW50ZXJ2YWxzPTE= 
base64//Y3JlYXRlX2RiX2ludGVydmFscz0x base64//cGljYXJkX2ludGVydmFscz0x base64//Z2VubWFwPTEy base64//bWFwcGFiaWxpdHlfYmVkPTE= base64//Z2V0X2Zhc3RxX3BlPTEy base64//ZmFzdHA9MTI= base64//YndhX21hcD0xMg== base64//ZGVkdXA9MTI= base64//bWVyZ2VfYmFtcz0x base64//YmFtMmd2Y2Y9MQ== base64//Y29uY2F0X2d2Y2ZzPTE= base64//YmNmdG9vbHNfbm9ybT0x base64//Y3JlYXRlX2RiX21hcGZpbGU9MQ== base64//Z3ZjZjJEQj0x base64//REIydmNmPTE= base64//ZmlsdGVyVmNmcz0x base64//c29ydF9nYXRoZXJWY2ZzPTE= base64//Y29tcHV0ZV9kND0x base64//Y3JlYXRlX2Nvdl9iZWQ9MQ== base64//bWVyZ2VfZDQ9MQ== base64//YmFtX3N1bXN0YXRzPTE= base64//Y29sbGVjdF9jb3ZzdGF0cz0x base64//Y29sbGVjdF9mYXN0cF9zdGF0cz0x base64//Y29sbGVjdF9zdW1zdGF0cz0x base64//cWNfYWRtaXh0dXJlPTE= base64//cWNfY2hlY2tfZmFpPTE= base64//cWNfZ2VuZXJhdGVfY29vcmRzX2ZpbGU9MQ== base64//cWNfcGxpbms9MQ== base64//cWNfcWNfcGxvdHM9MQ== base64//cWNfc2V0dXBfYWRtaXh0dXJlPTE= base64//cWNfc3Vic2FtcGxlX3NucHM9MQ== base64//cWNfdmNmdG9vbHNfaW5kaXZpZHVhbHM9MQ== base64//bWtfZGVnZW5vdGF0ZT0x base64//bWtfcHJlcF9nZW5vbWU9MQ== base64//bWtfc3BsaXRfc2FtcGxlcz0x base64//cG9zdHByb2Nlc3Nfc3RyaWN0X2ZpbHRlcj0x base64//cG9zdHByb2Nlc3NfYmFzaWNfZmlsdGVyPTE= base64//cG9zdHByb2Nlc3NfZmlsdGVyX2luZGl2aWR1YWxzPTE= base64//cG9zdHByb2Nlc3Nfc3Vic2V0X2luZGVscz0x base64//cG9zdHByb2Nlc3Nfc3Vic2V0X3NucHM9MQ== base64//cG9zdHByb2Nlc3NfdXBkYXRlX2JlZD0x base64//dHJhY2todWJfYmNmdG9vbHNfZGVwdGg9MQ== base64//dHJhY2todWJfYmVkZ3JhcGhfdG9fYmlnd2lnPTE= base64//dHJhY2todWJfY2FsY19waT0x base64//dHJhY2todWJfY2FsY19zbnBkZW49MQ== base64//dHJhY2todWJfY2FsY190YWppbWE9MQ== base64//dHJhY2todWJfY2hyb21fc2l6ZXM9MQ== base64//dHJhY2todWJfY29udmVydF90b19iZWRncmFwaD0x base64//dHJhY2todWJfc3RyaXBfdmNmPTE= base64//dHJhY2todWJfdmNmdG9vbHNfZnJlcT0x base64//dHJhY2todWJfd3JpdGVfaHViX2ZpbGVzPTE= base64//c2VudGllb25fbWFwPTE= base64//c2VudGllb25fZGVkdXA9MQ== base64//c2VudGllb25faGFwbG90eXBlcj0x base64//c2VudGllb25fY29tYmluZV9ndmNmPTE= base64//c2VudGllb25fYmFtX3N0YXRzPTE= --default-resources base64//bWVtX21iPWF0dGVtcHQgKiA0MDAw 
base64//ZGlza19tYj1tYXgoMippbnB1dC5zaXplX21iLCAxMDAwKQ== base64//dG1wZGlyPXN5c3RlbV90bXBkaXI= base64//bWVtX21iX3JlZHVjZWQ9KGF0dGVtcHQgKiA0MDAwKSAqIDAuOQ== base64//c2x1cm1fcGFydGl0aW9uPXNpbmdsZQ== base64//c2x1cm1fYWNjb3VudD1ocGNfZGVlcGJheW91 base64//cnVudGltZT03MjA= --executor slurm-jobstep --jobs 1 --mode remote"
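As an aside, the `base64//` tokens aren't part of the problem - Snakemake base64-encodes the `--set-threads`/`--default-resources` values so they pass safely through the shell. You can decode one to inspect it (a quick sketch using Python's stdlib `base64`):

```python
import base64

# Snakemake base64-encodes set-threads/default-resources values so they
# survive shell quoting; decode the first set-threads token from above:
token = "ZG93bmxvYWRfcmVmZXJlbmNlPTE="
print(base64.b64decode(token).decode())  # -> download_reference=1
```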

and the slurm profile I'm applying is pretty close to stock - see below.

executor: slurm
use-conda: True
jobs: 15 # Have up to N jobs submitted at any given time
latency-wait: 100 # Wait N seconds for output files due to latency
retries: 0 # Retry jobs N times.

# These resources will be applied to all rules. Can be overridden on a per-rule basis below.
default-resources:
  mem_mb: attempt * 4000
  mem_mb_reduced: (attempt * 4000) * 0.9 # Mem allocated to java for GATK rules (tries to prevent OOM errors)
  slurm_partition: "single"
  slurm_account: "hpc_deepbayou" # Same as sbatch -A. Not all clusters use this.
  runtime: 720 # In minutes

# Control number of threads each rule will use.
set-threads:
  # Reference Genome Processing. Does NOT use more than 1 thread.
  download_reference: 1
  index_reference: 1
  # Interval Generation. Does NOT use more than 1 thread.
  format_interval_list: 1
  create_gvcf_intervals: 1
  create_db_intervals: 1
  picard_intervals: 1
  # Mappability
  genmap: 12 # Can use more than 1 thread
  mappability_bed: 1 # Does NOT use more than 1 thread
  # Fastq Processing. Can use more than 1 thread.
  get_fastq_pe: 12
  fastp: 12
  # Alignment. Can use more than 1 thread, except merge_bams.
  bwa_map: 12
  dedup: 12
  merge_bams: 1 # Does NOT use more than 1 thread.
  # GVCF
  bam2gvcf: 1 # Should be run with no more than 2 threads.
  concat_gvcfs: 1 # Does NOT use more than 1 thread.
  bcftools_norm: 1 # Does NOT use more than 1 thread.
  create_db_mapfile: 1 # Does NOT use more than 1 thread.
  gvcf2DB: 1 # Should be run with no more than 2 threads.
  # VCF
  DB2vcf: 1 # Should be run with no more than 2 threads.
  filterVcfs: 1 # Should be run with no more than 2 threads.
  sort_gatherVcfs: 1 # Should be run with no more than 2 threads.
  # Callable Bed
  compute_d4: 1 # Can use more than 1 thread
  create_cov_bed: 1 # Does NOT use more than 1 thread.
  merge_d4: 1 # Does NOT use more than 1 thread.
  # Summary Stats Does NOT use more than 1 thread.
  bam_sumstats: 1
  collect_covstats: 1
  collect_fastp_stats: 1
  collect_sumstats: 1
  # QC Module Does NOT use more than 1 thread.
  qc_admixture: 1
  qc_check_fai: 1
  qc_generate_coords_file: 1
  qc_plink: 1
  qc_qc_plots: 1
  qc_setup_admixture: 1
  qc_subsample_snps: 1
  qc_vcftools_individuals: 1
  # MK Module Does NOT use more than 1 thread.
  mk_degenotate: 1
  mk_prep_genome: 1
  mk_split_samples: 1
  # Postprocess Module Does NOT use more than 1 thread.
  postprocess_strict_filter: 1
  postprocess_basic_filter: 1
  postprocess_filter_individuals: 1
  postprocess_subset_indels: 1
  postprocess_subset_snps: 1
  postprocess_update_bed: 1
  # Trackhub Module Does NOT use more than 1 thread.
  trackhub_bcftools_depth: 1
  trackhub_bedgraph_to_bigwig: 1
  trackhub_calc_pi: 1
  trackhub_calc_snpden: 1
  trackhub_calc_tajima: 1
  trackhub_chrom_sizes: 1
  trackhub_convert_to_bedgraph: 1
  trackhub_strip_vcf: 1
  trackhub_vcftools_freq: 1
  trackhub_write_hub_files: 1
  # Sentieon Tools. Can use more than 1 thread, except sentieon_bam_stats.
  sentieon_map: 1
  sentieon_dedup: 1
  sentieon_haplotyper: 1
  sentieon_combine_gvcf: 1
  sentieon_bam_stats: 1 # Does NOT use more than 1 thread.
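(As the comment in the profile notes, the default resources can also be overridden per rule; in a Snakemake 8 workflow profile that looks roughly like the sketch below - the rule name and values are illustrative, not from my actual profile.)

```yaml
# Hypothetical per-rule override via set-resources (example values only):
set-resources:
  bwa_map:
    mem_mb: attempt * 16000
    runtime: 1440
```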

-b

brantfaircloth commented 6 days ago

So, we were able to figure it out. Basically, a site optimization for the sbatch command was stripping the quotes from the string passed to sbatch --wrap. That caused the wrapped command to be interpreted as options/arguments to sbatch itself - and the first of those was Python's -m flag (used to run the snakemake module). sbatch read it as its own -m, which is shorthand for --distribution (how processes are distributed to nodes), hence the error.
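For anyone hitting this later, the failure mode is easy to see with Python's `shlex`: with the quotes intact, everything after `--wrap=` is a single shell word, but with them stripped, `-m` leaks out as a bare token that sbatch parses as its own `-m`/`--distribution` option. A minimal sketch (the command is abbreviated, not the real sbatch call):

```python
import shlex

# Quotes intact: everything after --wrap= is one shell word.
quoted = 'sbatch -p single --wrap="python3.11 -m snakemake --cores all"'
print(shlex.split(quoted))
# ['sbatch', '-p', 'single', '--wrap=python3.11 -m snakemake --cores all']

# Quotes stripped: '-m' becomes a separate token, which sbatch
# interprets as its -m/--distribution flag.
stripped = 'sbatch -p single --wrap=python3.11 -m snakemake --cores all'
print(shlex.split(stripped))
# ['sbatch', '-p', 'single', '--wrap=python3.11', '-m', 'snakemake', '--cores', 'all']
```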