hillerlab / make_lastz_chains

Portable solution to generate genome alignment chains using lastz
MIT License

speeding up/parallelizing make_chains.py with Slurm: NOTE: Error submitting process 'execute_jobs (##)' for execution -- Execution is retried #58

Open SomePersonSomeWhereInTheWorld opened 6 months ago

SomePersonSomeWhereInTheWorld commented 6 months ago

I'm trying to help a researcher speed up make_chains.py runs for a mammal. Taking this closed issue on parallelization as inspiration, we'd like to speed up the process on our Slurm cluster running RHEL 8. I tried requesting a node via an interactive srun session, starting with 16 CPUs via --ntasks and -c. Using --executor local, as suggested in the closed thread, was painfully slow. The user there mentions --cluster_parameters, but that results in: make_chains.py: error: unrecognized arguments: --cluster_parameters cpus=16

./make_chains.py MesAur_chr_folded mm10  /path/to/me/make_lastz_chains/MesAur_chr_folded.2bit /path/to/me/make_lastz_chains/mm10.2bit --pd test_out_1 -f --chaining_memory 16   --cluster_executor slurm 
# Make Lastz Chains #
Version 2.0.8
Commit: 187e313afc10382fe44c96e47f27c4466d63e114
Branch: main

* found run_lastz.py at /path/to/me/make_lastz_chains/standalone_scripts/run_lastz.py
* found run_lastz_intermediate_layer.py at /path/to/me/make_lastz_chains/standalone_scripts/run_lastz_intermediate_layer.py
* found chain_gap_filler.py at /path/to/me/make_lastz_chains/standalone_scripts/chain_gap_filler.py
* found faToTwoBit at /cluster/opt/lastz/1.04.15/faToTwoBit
* found twoBitToFa at /cluster/opt/lastz/1.04.15/twoBitToFa
* found pslSortAcc at /cluster/opt/lastz/1.04.15/pslSortAcc
* found axtChain at /cluster/opt/lastz/1.04.15/axtChain
* found axtToPsl at /cluster/opt/lastz/1.04.15/axtToPsl
* found chainAntiRepeat at /cluster/opt/lastz/1.04.15/chainAntiRepeat
* found chainMergeSort at /cluster/opt/lastz/1.04.15/chainMergeSort
* found chainCleaner at /cluster/opt/lastz/1.04.15/chainCleaner
* found chainSort at /cluster/opt/lastz/1.04.15/chainSort
* found chainScore at /cluster/opt/lastz/1.04.15/chainScore
* found chainNet at /cluster/opt/lastz/1.04.15/chainNet
* found chainFilter at /cluster/opt/lastz/1.04.15/chainFilter
* found lastz at /cluster/opt/lastz/1.04.15/lastz
* found nextflow at /cluster/opt/nextflow/23.10.1/nextflow
All necessary executables found.
Making chains for /path/to/me/make_lastz_chains/MesAur_chr_folded.2bit and /path/to/me/make_lastz_chains/mm10.2bit files, saving results to /path/to/me/make_lastz_chains/test_out_1
Pipeline started at 2024-04-30 11:24:17.231861
* Setting up genome sequences for target
genomeID: MesAur_chr_folded
input sequence file: /path/to/me/make_lastz_chains/MesAur_chr_folded.2bit
is 2bit: True
planned genome dir location: /path/to/me/make_lastz_chains/test_out_1/target.2bit
Created symlink from /path/to/me/make_lastz_chains/MesAur_chr_folded.2bit to /path/to/me/make_lastz_chains/test_out_1/target.2bit
For MesAur_chr_folded (target) sequence file: /path/to/me/make_lastz_chains/test_out_1/target.2bit; chrom sizes saved to: /path/to/me/make_lastz_chains/test_out_1/target.chrom.sizes
* Setting up genome sequences for query
genomeID: mm10
input sequence file: /path/to/me/make_lastz_chains/mm10.2bit
is 2bit: True
planned genome dir location: /path/to/me/make_lastz_chains/test_out_1/query.2bit
Created symlink from /path/to/me/make_lastz_chains/mm10.2bit to /path/to/me/make_lastz_chains/test_out_1/query.2bit
For mm10 (query) sequence file: /path/to/me/make_lastz_chains/test_out_1/query.2bit; chrom sizes saved to: /path/to/me/make_lastz_chains/test_out_1/query.chrom.sizes

### Partition Step ###

# Partitioning for target
Saving partitions and creating 238 buckets for lastz output
In particular, 19 partitions for bigger chromosomes
And 219 buckets for smaller scaffolds
Saving target partitions to: /path/to/me/make_lastz_chains/test_out_1/target_partitions.txt
# Partitioning for query
Saving partitions and creating 65 buckets for lastz output
In particular, 64 partitions for bigger chromosomes
And 1 buckets for smaller scaffolds
Saving query partitions to: /path/to/me/make_lastz_chains/test_out_1/query_partitions.txt
Num. target partitions: 19
Num. query partitions: 64
Num. lastz jobs: 1216

### Lastz Alignment Step ###

LASTZ: making jobs
LASTZ: saved 15470 jobs to /path/to/me/make_lastz_chains/test_out_1/temp_lastz_run/lastz_joblist.txt
Parallel manager: pushing job /cluster/opt/nextflow/23.10.1/nextflow /path/to/me/make_lastz_chains/parallelization/execute_joblist.nf --joblist /path/to/me/make_lastz_chains/test_out_1/temp_lastz_run/lastz_joblist.txt -c /path/to/me/make_lastz_chains/test_out_1/temp_lastz_run/lastz_config.nf
N E X T F L O W  ~  version 23.10.1
Launching `/path/to/me/make_lastz_chains/parallelization/execute_joblist.nf` [maniac_thompson] DSL2 - revision: 0483b29723
[84/955b71] process > execute_jobs (27) [  0%] 28 of 3913, failed: 28, retries: 28
[c5/32a7bd] NOTE: Error submitting process 'execute_jobs (18)' for execution -- Execution is retried (1)
[26/dd5dc9] NOTE: Error submitting process 'execute_jobs (4)' for execution -- Execution is retried (1)

May I request assistance here to get the correct syntax?

P.S. I can confirm the suggested shebang fix in this thread also works to start the sample jobs.

MichaelHiller commented 6 months ago

@kirilenkobm Could you pls have a look if the --cluster_parameters is a retired parameter? Thx

SomePersonSomeWhereInTheWorld commented 6 months ago

Here is the top part of the .nextflow.log. Is there another option I need to use?

Apr-30 12:32:43.670 [main] DEBUG nextflow.cli.Launcher - $> nextflow /path/to/me/make_lastz_chains/parallelization/execute_joblist.nf --joblist /path/to/me/make_lastz_chains/test_out_1/temp_lastz_run/lastz_joblist.txt -c /path/to/me/make_lastz_chains/test_out_1/temp_lastz_run/lastz_config.nf
Apr-30 12:32:43.723 [main] INFO  nextflow.cli.CmdRun - N E X T F L O W  ~  version 23.10.1
Apr-30 12:32:43.740 [main] DEBUG nextflow.plugin.PluginsFacade - Setting up plugin manager > mode=prod; embedded=false; plugins-dir=/cluster/home/me/.nextflow/plugins; core-plugins: nf-amazon@2.1.4,nf-azure@1.3.3,nf-cloudcache@0.3.0,nf-codecommit@0.1.5,nf-console@1.0.6,nf-ga4gh@1.1.0,nf-google@1.8.3,nf-tower@1.6.3,nf-wave@1.0.1
Apr-30 12:32:43.749 [main] INFO  o.pf4j.DefaultPluginStatusProvider - Enabled plugins: []
Apr-30 12:32:43.750 [main] INFO  o.pf4j.DefaultPluginStatusProvider - Disabled plugins: []
Apr-30 12:32:43.752 [main] INFO  org.pf4j.DefaultPluginManager - PF4J version 3.4.1 in 'deployment' mode
Apr-30 12:32:43.766 [main] INFO  org.pf4j.AbstractPluginManager - No plugins
Apr-30 12:32:43.784 [main] DEBUG nextflow.config.ConfigBuilder - User config file: /path/to/me/make_lastz_chains/test_out_1/temp_lastz_run/lastz_config.nf
Apr-30 12:32:43.785 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /path/to/me/make_lastz_chains/test_out_1/temp_lastz_run/lastz_config.nf
Apr-30 12:32:43.804 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
Apr-30 12:32:44.203 [main] DEBUG nextflow.cli.CmdRun - Applied DSL=2 from script declararion
Apr-30 12:32:44.218 [main] INFO  nextflow.cli.CmdRun - Launching `/path/to/me/make_lastz_chains/parallelization/execute_joblist.nf` [ridiculous_mcnulty] DSL2 - revision: 0483b29723
Apr-30 12:32:44.219 [main] DEBUG nextflow.plugin.PluginsFacade - Plugins default=[]
Apr-30 12:32:44.219 [main] DEBUG nextflow.plugin.PluginsFacade - Plugins resolved requirement=[]
Apr-30 12:32:44.227 [main] DEBUG n.secret.LocalSecretsProvider - Secrets store: /cluster/home/me/.nextflow/secrets/store.json
Apr-30 12:32:44.230 [main] DEBUG nextflow.secret.SecretsLoader - Discovered secrets providers: [nextflow.secret.LocalSecretsProvider@10f7c76] - activable => nextflow.secret.LocalSecretsProvider@10f7c76
Apr-30 12:32:44.275 [main] DEBUG nextflow.Session - Session UUID: 5fae7dbe-8c74-4805-926b-aa6223f5ae87
Apr-30 12:32:44.275 [main] DEBUG nextflow.Session - Run name: ridiculous_mcnulty
Apr-30 12:32:44.276 [main] DEBUG nextflow.Session - Executor pool size: 24
Apr-30 12:32:44.282 [main] DEBUG nextflow.file.FilePorter - File porter settings maxRetries=3; maxTransfers=50; pollTimeout=null
Apr-30 12:32:44.285 [main] DEBUG nextflow.util.ThreadPoolBuilder - Creating thread pool 'FileTransfer' minSize=10; maxSize=72; workQueue=LinkedBlockingQueue[10000]; allowCoreThreadTimeout=false
Apr-30 12:32:44.382 [main] DEBUG nextflow.cli.CmdRun - 
  Version: 23.10.1 build 5891
  Created: 12-01-2024 22:01 UTC (17:01 EDT)
  System: Linux 4.18.0-193.el8.x86_64
  Runtime: Groovy 3.0.19 on Java HotSpot(TM) 64-Bit Server VM 20.0.1+9-29
  Encoding: UTF-8 (UTF-8)
  Process: 322176@g261 [10.197.17.16]
  CPUs: 24 - Mem: 50 GB (47.8 GB) - Swap: 0 (0)
Apr-30 12:32:44.424 [main] DEBUG nextflow.Session - Work-dir: /path/to/me/make_lastz_chains/test_out_1/temp_lastz_run/work [lustre]
Apr-30 12:32:44.424 [main] DEBUG nextflow.Session - Script base path does not exist or is not a directory: /path/to/me/make_lastz_chains/parallelization/bin
Apr-30 12:32:44.434 [main] DEBUG nextflow.executor.ExecutorFactory - Extension executors providers=[]
Apr-30 12:32:44.442 [main] DEBUG nextflow.Session - Observer factory: DefaultObserverFactory
Apr-30 12:32:44.458 [main] DEBUG nextflow.cache.CacheFactory - Using Nextflow cache factory: nextflow.cache.DefaultCacheFactory
Apr-30 12:32:44.468 [main] DEBUG nextflow.util.CustomThreadPool - Creating default thread pool > poolSize: 25; maxThreads: 1000
Apr-30 12:32:44.551 [main] DEBUG nextflow.Session - Session start
Apr-30 12:32:44.692 [main] DEBUG nextflow.script.ScriptRunner - > Launching execution
Apr-30 12:32:44.801 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: slurm
Apr-30 12:32:44.802 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'slurm'
Apr-30 12:32:44.809 [main] DEBUG nextflow.executor.Executor - [warm up] executor > slurm
Apr-30 12:32:44.814 [main] DEBUG n.processor.TaskPollingMonitor - Creating task monitor for executor 'slurm' > capacity: 1000; pollInterval: 5s; dumpInterval: 5m 
Apr-30 12:32:44.816 [main] DEBUG n.processor.TaskPollingMonitor - >>> barrier register (monitor: slurm)
Apr-30 12:32:44.817 [main] DEBUG n.executor.AbstractGridExecutor - Creating executor 'slurm' > queue-stat-interval: 1m
Apr-30 12:32:44.869 [main] DEBUG nextflow.Session - Workflow process names [dsl2]: execute_jobs
Apr-30 12:32:44.869 [main] DEBUG nextflow.Session - Igniting dataflow network (2)
Apr-30 12:32:44.874 [main] DEBUG nextflow.processor.TaskProcessor - Starting process > execute_jobs
Apr-30 12:32:44.874 [main] DEBUG nextflow.script.ScriptRunner - Parsed script files:
  Script_f6c411a586096bcb: /path/to/me/make_lastz_chains/parallelization/execute_joblist.nf
Apr-30 12:32:44.874 [main] DEBUG nextflow.script.ScriptRunner - > Awaiting termination 
Apr-30 12:32:44.874 [main] DEBUG nextflow.Session - Session await
Apr-30 12:32:45.049 [Task submitter] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=execute_jobs (5); work-dir=/path/to/me/make_lastz_chains/test_out_1/temp_lastz_run/work/4e/e79fd76c40079431a8db2ed4875930
  error [nextflow.exception.ProcessFailedException]: Error submitting process 'execute_jobs (5)' for execution
Apr-30 12:32:45.057 [Task submitter] INFO  nextflow.processor.TaskProcessor - [4e/e79fd7] NOTE: Error submitting process 'execute_jobs (5)' for execution -- Execution is retried (1)
Apr-30 12:32:45.091 [Task submitter] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=execute_jobs (1); work-dir=/path/to/me/make_lastz_chains/test_out_1/temp_lastz_run/work/3d/0dd6dde2634f3b7abf79184db82243
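
For anyone triaging a log like the one above, here is a minimal Python sketch that tallies the "Error submitting process" retries per task and records each task's work-dir. The regular expressions are assumptions derived only from the lines quoted in this thread (Nextflow 23.10.1), and `summarize_submit_errors` is a hypothetical helper name, not part of make_lastz_chains or Nextflow.

```python
import re
from collections import Counter

# Patterns inferred from the log lines quoted in this thread (an assumption;
# other Nextflow versions may format these messages differently).
SUBMIT_ERROR = re.compile(
    r"\[(?P<hash>[0-9a-f]{2}/[0-9a-f]{6})\] NOTE: Error submitting process "
    r"'(?P<name>[^']+)' for execution -- Execution is retried \((?P<attempt>\d+)\)"
)
WORK_DIR = re.compile(r"task: name=(?P<name>.+?); work-dir=(?P<workdir>\S+)")


def summarize_submit_errors(log_text):
    """Return (retry count per task name, last seen work-dir per task name)."""
    retries = Counter()
    work_dirs = {}
    for line in log_text.splitlines():
        match = SUBMIT_ERROR.search(line)
        if match:
            retries[match.group("name")] += 1
            continue
        match = WORK_DIR.search(line)
        if match:
            work_dirs[match.group("name")] = match.group("workdir")
    return retries, work_dirs
```

Calling it on the text of .nextflow.log gives the list of affected tasks and their work directories. Each work directory should contain the `.command.run` wrapper that Nextflow hands to the scheduler, which is a reasonable place to look for the submission options that are being rejected.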

SomePersonSomeWhereInTheWorld commented 6 months ago

@MichaelHiller am I understanding the documentation correctly?

To run the pipeline on a Slurm cluster, for instance, add the --executor slurm option. Refer to the Nextflow documentation for a list of supported executors.

The Nextflow Slurm page says:

To enable the SLURM executor, set process.executor = 'slurm' in the nextflow.config file. Resource requests and other job characteristics can be controlled via the following process directives: clusterOptions cpus memory queue time
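
To make the directives in that quote concrete, here is a small Python sketch that renders a Nextflow `process` scope using them. The function name `render_slurm_process_config`, the queue name `general`, and the account `myacct` are placeholders for illustration, not values from this cluster. Whether make_chains.py honours such settings alongside the lastz_config.nf it generates is exactly the open question in this thread, so treat this only as an illustration of the Nextflow side.

```python
def render_slurm_process_config(cpus, memory, queue, time, cluster_options=""):
    """Render a Nextflow `process` scope with the directives named in the docs quote."""
    lines = [
        "process {",
        "    executor = 'slurm'",
        f"    cpus = {int(cpus)}",
        f"    memory = '{memory}'",
        f"    queue = '{queue}'",
        f"    time = '{time}'",
    ]
    # clusterOptions passes extra flags verbatim to the scheduler's submit command.
    if cluster_options:
        lines.append(f"    clusterOptions = '{cluster_options}'")
    lines.append("}")
    return "\n".join(lines) + "\n"


# Placeholder values (assumptions): adjust queue/account to the local cluster.
print(render_slurm_process_config(16, "16 GB", "general", "24h", "--account=myacct"))
```

The printed block could be saved as a nextflow.config in the launch directory, which Nextflow reads in addition to any file passed with -c; I have not verified how that interacts with the config the pipeline writes itself.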

I know --cluster_executor slurm works. So if --ntasks is specified in an interactive (srun) or non-interactive (sbatch) job, does make_chains.py take Slurm options into account, as described in the Nextflow docs?

FWIW I do not see --cluster_parameters on the Full list of the pipeline CLI parameters