harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.

Jobs fail to submit before snakemake fails #206

Closed. azwadriqbal closed this issue 1 week ago.

azwadriqbal commented 2 weeks ago

Hello!

This may be more of a core snakemake issue than a snpArcher one, so I'll post an issue there as well.

Basically, the pipeline seems to run fine up until the gvcf2DB step, after which job submission appears to hang for several hours (on "Select jobs to execute...") before resuming, effectively submitting jobs in large bursts rather than continuously.

This continued until the Snakemake process died after more than 12 hours stuck on "Select jobs to execute...", with this error message:

Traceback (most recent call last):
  File "/lustre2/home/nt246_0001/ari22/miniforge3/envs/snparcher/lib/python3.11/site-packages/snakemake/cli.py", line 2078, in args_to_api
    dag_api.execute_workflow(
  File "/lustre2/home/nt246_0001/ari22/miniforge3/envs/snparcher/lib/python3.11/site-packages/snakemake/api.py", line 589, in execute_workflow
    workflow.execute(
  File "/lustre2/home/nt246_0001/ari22/miniforge3/envs/snparcher/lib/python3.11/site-packages/snakemake/workflow.py", line 1247, in execute
    raise e
  File "/lustre2/home/nt246_0001/ari22/miniforge3/envs/snparcher/lib/python3.11/site-packages/snakemake/workflow.py", line 1243, in execute
    success = self.scheduler.schedule()
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre2/home/nt246_0001/ari22/miniforge3/envs/snparcher/lib/python3.11/site-packages/snakemake/scheduler.py", line 279, in schedule
    run = self.job_selector(needrun)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre2/home/nt246_0001/ari22/miniforge3/envs/snparcher/lib/python3.11/site-packages/snakemake/scheduler.py", line 603, in job_selector_ilp
    self._solve_ilp(prob)
  File "/lustre2/home/nt246_0001/ari22/miniforge3/envs/snparcher/lib/python3.11/site-packages/snakemake/scheduler.py", line 655, in _solve_ilp
    prob.solve(solver)
  File "/lustre2/home/nt246_0001/ari22/miniforge3/envs/snparcher/lib/python3.11/site-packages/pulp/pulp.py", line 1883, in solve
    status = solver.actualSolve(self, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre2/home/nt246_0001/ari22/miniforge3/envs/snparcher/lib/python3.11/site-packages/pulp/apis/coin_api.py", line 112, in actualSolve
    return self.solve_CBC(lp, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre2/home/nt246_0001/ari22/miniforge3/envs/snparcher/lib/python3.11/site-packages/pulp/apis/coin_api.py", line 128, in solve_CBC
    vs, variablesNames, constraintsNames, objectiveName = lp.writeMPS(
                                                          ^^^^^^^^^^^^
  File "/lustre2/home/nt246_0001/ari22/miniforge3/envs/snparcher/lib/python3.11/site-packages/pulp/pulp.py", line 1748, in writeMPS
    return mpslp.writeMPS(
           ^^^^^^^^^^^^^^^
  File "/lustre2/home/nt246_0001/ari22/miniforge3/envs/snparcher/lib/python3.11/site-packages/pulp/mps_lp.py", line 229, in writeMPS
    for v, value in c.items():
KeyError: job_2548

Based on the error message, I believe the issue might be related to job scheduling with the ILP solver, since during the run Snakemake seems to have fallen back to the "greedy" solver several times (I've attached the log, which is a resumption of a partially complete run). I'll see whether changing the --scheduler flag to greedy helps, but wanted to put this on your radar in case others run into a similar issue.
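In case it helps anyone who hits the same thing: the greedy scheduler can be selected either with the --scheduler flag or directly in the workflow profile. The snippet below is only a sketch of how I would try it with the profile attached further down (the scheduler: key mirrors the --scheduler command-line option); everything else in the profile stays as posted.

executor: slurm
use-conda: True
jobs: 150
# Hypothetical addition: pin the scheduler to greedy so Snakemake never calls
# the ILP/CBC solver that crashed inside PuLP's writeMPS in the traceback above.
scheduler: greedy

Side-stepping the ILP solver seems like the least invasive workaround while the upstream bug gets sorted out, since the traceback dies inside PuLP rather than in snpArcher's own rules.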

I've attached my snakemake log and SLURM profile in case those are helpful.

Thank you!

Attachment: snparcher_wc_ec_20240612.zip

SLURM profile:

executor: slurm
use-conda: True
jobs: 150 # Have up to N jobs submitted at any given time
latency-wait: 180 # Wait N seconds for output files due to latency
retries: 3 # Retry jobs N times.

# These resources will be applied to all rules. Can be overridden on a per-rule basis below.
default-resources:
  mem_mb: attempt * 4000
  mem_mb_reduced: (attempt * 4000) * 0.9 # Mem allocated to java for GATK rules (tries to prevent OOM errors)
  slurm_partition: "regular,long7,long30"
  slurm_account: "nt246_0001" # Same as sbatch -A. Not all clusters use this.
  runtime: # In minutes

# Control number of threads each rule will use.
set-threads:
  # Reference Genome Processing. Does NOT use more than 1 thread.
  download_reference: 1
  index_reference: 1
  # Interval Generation. Does NOT use more than 1 thread.
  format_interval_list: 1
  create_gvcf_intervals: 1
  create_db_intervals: 1
  picard_intervals: 1
  # Mappability
  genmap: 2 # Can use more than 1 thread
  mappability_bed: 1 # Does NOT use more than 1 thread
  # Fastq Processing. Can use more than 1 thread.
  get_fastq_pe: 1
  fastp: 1
  # Alignment. Can use more than 1 thread, except merge_bams.
  bwa_map: 4
  dedup: 4
  merge_bams: 1 # Does NOT use more than 1 thread.
  # GVCF
  bam2gvcf: 2 # Should be run with no more than 2 threads.
  concat_gvcfs: 1 # Does NOT use more than 1 thread.
  bcftools_norm: 1 # Does NOT use more than 1 thread.
  create_db_mapfile: 1 # Does NOT use more than 1 thread.
  gvcf2DB: 2 # Should be run with no more than 2 threads.
  # VCF
  DB2vcf: 2 # Should be run with no more than 2 threads.
  filterVcfs: 2 # Should be run with no more than 2 threads.
  sort_gatherVcfs: 2 # Should be run with no more than 2 threads.
  # Callable Bed
  compute_d4: 1 # Can use more than 1 thread
  create_cov_bed: 1 # Does NOT use more than 1 thread.
  merge_d4: 1 # Does NOT use more than 1 thread.
  # Summary Stats Does NOT use more than 1 thread.
  bam_sumstats: 1
  collect_covstats: 1
  collect_fastp_stats: 1
  collect_sumstats: 1
  # QC Module Does NOT use more than 1 thread.
  qc_admixture: 1
  qc_check_fai: 1
  qc_generate_coords_file: 1
  qc_plink: 1
  qc_qc_plots: 1
  qc_setup_admixture: 1
  qc_subsample_snps: 1
  qc_vcftools_individuals: 1
  # MK Module Does NOT use more than 1 thread.
  mk_degenotate: 1
  mk_prep_genome: 1
  mk_split_samples: 1
  # Postprocess Module Does NOT use more than 1 thread.
  postprocess_strict_filter: 1
  postprocess_basic_filter: 1
  postprocess_filter_individuals: 1
  postprocess_subset_indels: 1
  postprocess_subset_snps: 1
  postprocess_update_bed: 1
  # Trackhub Module Does NOT use more than 1 thread.
  trackhub_bcftools_depth: 1
  trackhub_bedgraph_to_bigwig: 1
  trackhub_calc_pi: 1
  trackhub_calc_snpden: 1
  trackhub_calc_tajima: 1
  trackhub_chrom_sizes: 1
  trackhub_convert_to_bedgraph: 1
  trackhub_strip_vcf: 1
  trackhub_vcftools_freq: 1
  trackhub_write_hub_files: 1
  # Sentieon Tools. Can use more than 1 thread, except sentieon_bam_stats.
  sentieon_map: 1
  sentieon_dedup: 1
  sentieon_haplotyper: 1
  sentieon_combine_gvcf: 1
  sentieon_bam_stats: 1 # Does NOT use more than 1 thread.

# Control other resources used by each rule.
set-resources:
  #   # Reference Genome Processing
  #   copy_reference:
  #     mem_mb: attempt * 2000
  #     slurm_partition:
  #     runtime:
  #     cpus_per_task:
  #   download_reference:
  #     mem_mb: attempt * 2000
  #     slurm_partition:
  #     runtime:
  #     cpus_per_task:
  #   index_reference:
  #     mem_mb: attempt * 2000
  #     slurm_partition:
  #     runtime:
  #     cpus_per_task:

  #   # Interval Generation
  #   format_interval_list:
  #     mem_mb: attempt * 2000
  #     slurm_partition:
  #     runtime:
  #     cpus_per_task:
  #   create_gvcf_intervals:
  #     mem_mb: attempt * 2000
  #     slurm_partition:
  #     runtime:
  #     cpus_per_task:
  #   create_db_intervals:
  #     mem_mb: attempt * 2000
  #     slurm_partition:
  #     runtime:
  #     cpus_per_task:
  #   picard_intervals:
  #     mem_mb: attempt * 2000
  #     slurm_partition:
  #     runtime:
  #     cpus_per_task:

  #   # Mappability
  #   genmap:
  #     mem_mb: attempt * 2000
  #     slurm_partition:
  #     runtime:
  #     cpus_per_task:
  #   mappability_bed:
  #     mem_mb: attempt * 2000
  #     slurm_partition:
  #     runtime:
  #     cpus_per_task:

  #   # Fastq Processing
  #   get_fastq_pe:
  #     mem_mb: attempt * 2000
  #     slurm_partition:
  #     runtime:
  #     cpus_per_task:
  #   fastp:
  #     mem_mb: attempt * 2000
  #     slurm_partition:
  #     runtime:
  #     cpus_per_task:

  #   # Alignment
  bwa_map:
    mem_mb: attempt * 4000
    slurm_partition: "long7,long30"
    # runtime:
    # cpus_per_task:
  #   dedup:
  #     mem_mb: attempt * 2000
  #     slurm_partition:
  #     runtime:
  #     cpus_per_task:
  #   merge_bams:
  #     mem_mb: attempt * 2000
  #     slurm_partition:
  #     runtime:
  #     cpus_per_task:

  #   # GVCF
  #   bam2gvcf: # HaplotypeCaller
  #     mem_mb: attempt * 2000
  #     mem_mb_reduced: (attempt * 2000) * 0.9 # Mem allocated to java (tries to prevent OOM errors)
  #     slurm_partition:
  #     runtime:
  #     cpus_per_task: # Mem allocated to the snakemake job
  #   concat_gvcfs:
  #     mem_mb: attempt * 2000
  #     slurm_partition:
  #     runtime:
  #     cpus_per_task:
  #   bcftools_norm:
  #     mem_mb: attempt * 2000
  #     slurm_partition:
  #     runtime:
  #     cpus_per_task:
  #   create_db_mapfile:
  #     mem_mb: attempt * 2000
  #     slurm_partition:
  #     runtime:
  #     cpus_per_task:
  #   gvcf2DB: # GenomicsDBImport
  #     mem_mb: attempt * 2000
  #     mem_mb_reduced: (attempt * 2000) * 0.9 # Mem allocated to java (tries to prevent OOM errors)
  #     slurm_partition:
  #     runtime:
  #     cpus_per_task:

  #   # VCF
  #   DB2vcf: # GenotypeGVCFs
  #     mem_mb: attempt * 2000
  #     mem_mb_reduced: (attempt * 2000) * 0.9 # Mem allocated to java (tries to prevent OOM errors)
  #     slurm_partition:
  #     runtime:
  #     cpus_per_task:

  #   filterVcfs:
  #     mem_mb: attempt * 2000
  #     slurm_partition:
  #     runtime:
  #     cpus_per_task:
  #   sort_gatherVcfs:
  #     mem_mb: attempt * 2000
  #     slurm_partition:
  #     runtime:
  #     cpus_per_task:

  #   # Callable Bed
  #   compute_d4:
  #     mem_mb: attempt * 2000
  #     slurm_partition:
  #     runtime:
  #     cpus_per_task:
  create_cov_bed:
    mem_mb: attempt * 4000
    slurm_partition: "long7"
  #     runtime:
  #     cpus_per_task:
  #   merge_d4:
  #     mem_mb: attempt * 2000
  #     slurm_partition:
  #     runtime:
  #     cpus_per_task:

  #   # Summary Stats
  #   bam_sumstats:
  #     mem_mb: attempt * 2000
  #     slurm_partition:
  #     runtime:
  #     cpus_per_task:
  #   collect_covstats:
  #     mem_mb: attempt * 2000
  #     slurm_partition:
  #     runtime:
  #     cpus_per_task:
  #   collect_fastp_stats:
  #     mem_mb: attempt * 2000
  #     slurm_partition:
  #     runtime:
  #     cpus_per_task:
  collect_sumstats:
    mem_mb: attempt * 8000
    slurm_partition: "long7,long30"
#     runtime:
#     cpus_per_task:

#   # QC Module
#   qc_admixture:
#     mem_mb: attempt * 2000
#     slurm_partition:
#     runtime:
#     cpus_per_task:
#   qc_check_fai:
#     mem_mb: attempt * 2000
#     slurm_partition:
#     runtime:
#     cpus_per_task:
#   qc_generate_coords_file:
#     mem_mb: attempt * 2000
#     slurm_partition:
#     runtime:
#     cpus_per_task:
#   qc_plink:
#     mem_mb: attempt * 2000
#     slurm_partition:
#     runtime:
#     cpus_per_task:
#   qc_qc_plots:
#     mem_mb: attempt * 2000
#     slurm_partition:
#     runtime:
#     cpus_per_task:
#   qc_setup_admixture:
#     mem_mb: attempt * 2000
#     slurm_partition:
#     runtime:
#     cpus_per_task:
#   qc_subsample_snps:
#     mem_mb: attempt * 2000
#     slurm_partition:
#     runtime:
#     cpus_per_task:
#   qc_vcftools_individuals:
#     mem_mb: attempt * 2000
#     slurm_partition:
#     runtime:
#     cpus_per_task:

#   # MK Module
#   mk_degenotate:
#     mem_mb: attempt * 2000
#     slurm_partition:
#     runtime:
#     cpus_per_task:
#   mk_prep_genome:
#     mem_mb: attempt * 2000
#     slurm_partition:
#     runtime:
#     cpus_per_task:
#   mk_split_samples:
#     mem_mb: attempt * 2000
#     slurm_partition:
#     runtime:
#     cpus_per_task:

#   # Postprocess Module
#   postprocess_strict_filter:
#     mem_mb: attempt * 2000
#     slurm_partition:
#     runtime:
#     cpus_per_task:
#   postprocess_basic_filter:
#     mem_mb: attempt * 2000
#     slurm_partition:
#     runtime:
#     cpus_per_task:
#   postprocess_filter_individuals:
#     mem_mb: attempt * 2000
#     slurm_partition:
#     runtime:
#     cpus_per_task:
#   postprocess_subset_indels:
#     mem_mb: attempt * 2000
#     slurm_partition:
#     runtime:
#     cpus_per_task:
#   postprocess_subset_snps:
#     mem_mb: attempt * 2000
#     slurm_partition:
#     runtime:
#     cpus_per_task:
#   postprocess_update_bed:
#     mem_mb: attempt * 2000
#     slurm_partition:
#     runtime:
#     cpus_per_task:

#   # Trackhub Module
#   trackhub_bcftools_depth:
#     mem_mb: attempt * 2000
#     slurm_partition:
#     runtime:
#     cpus_per_task:
#   trackhub_bedgraph_to_bigwig:
#     mem_mb: attempt * 2000
#     slurm_partition:
#     runtime:
#     cpus_per_task:
#   trackhub_calc_pi:
#     mem_mb: attempt * 2000
#     slurm_partition:
#     runtime:
#     cpus_per_task:
#   trackhub_calc_snpden:
#     mem_mb: attempt * 2000
#     slurm_partition:
#     runtime:
#     cpus_per_task:
#   trackhub_calc_tajima:
#     mem_mb: attempt * 2000
#     slurm_partition:
#     runtime:
#     cpus_per_task:
#   trackhub_chrom_sizes:
#     mem_mb: attempt * 2000
#     slurm_partition:
#     runtime:
#     cpus_per_task:
#   trackhub_convert_to_bedgraph:
#     mem_mb: attempt * 2000
#     slurm_partition:
#     runtime:
#     cpus_per_task:
#   trackhub_strip_vcf:
#     mem_mb: attempt * 2000
#     slurm_partition:
#     runtime:
#     cpus_per_task:
#   trackhub_vcftools_freq:
#     mem_mb: attempt * 2000
#     slurm_partition:
#     runtime:
#     cpus_per_task:
#   trackhub_write_hub_files:
#     mem_mb: attempt * 2000
#     slurm_partition:
#     runtime:
#     cpus_per_task:

#   # Sentieon Tools
#   sentieon_map:
#     mem_mb: attempt * 2000
#     slurm_partition:
#     runtime:
#     cpus_per_task:
#   sentieon_dedup:
#     mem_mb: attempt * 2000
#     slurm_partition:
#     runtime:
#     cpus_per_task:
#   sentieon_haplotyper:
#     mem_mb: attempt * 2000
#     slurm_partition:
#     runtime:
#     cpus_per_task:
#   sentieon_combine_gvcf:
#     mem_mb: attempt * 2000
#     slurm_partition:
#     runtime:
#     cpus_per_task:
#   sentieon_bam_stats:
#     mem_mb: attempt * 2000
#     slurm_partition:
#     runtime:
#     cpus_per_task:
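To make the commented template above concrete, here is how one of those stanzas would look if filled in, using gvcf2DB (the step where submission started to hang) as the example. The memory, runtime, and partition values below are placeholders for illustration only, not recommendations:

set-resources:
  gvcf2DB: # GenomicsDBImport
    mem_mb: attempt * 8000
    mem_mb_reduced: (attempt * 8000) * 0.9 # Mem allocated to java (tries to prevent OOM errors)
    slurm_partition: "regular,long7,long30"
    runtime: 720 # In minutes
    cpus_per_task: 2 # Matches the 2 threads set for gvcf2DB under set-threads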
cademirch commented 1 week ago

Hi @azwadriqbal, sorry for the delay in replying. This definitely does look like an upstream bug. If you can put together a minimal reproducible example, I'd be happy to take a look. It would also be worth opening an issue in the main Snakemake repo.

azwadriqbal commented 1 week ago

Thanks @cademirch! I actually just tried running it again and it completed successfully, so I'm not entirely sure what the issue was. I'll post on the Snakemake repo!