hillerlab / make_lastz_chains

Portable solution to generate genome alignment chains using lastz
MIT License

The number of tasks submitted by SLURM exceeded the limit #64

Open aaannaw opened 1 month ago

aaannaw commented 1 month ago

Hello, Professor. I was running the pipeline to align my genome assembly against the mm10 genome via SLURM:

./make_chains.py target query mm10.fasta Bsu.softmask.fasta --pd mm-Bsu -f --chaining_memory 30 --cluster_queue pNormal --executor slurm --nextflow_executable /data/01/user157/software/bin/nextflow

After running the command for several minutes, I encountered this error:

[fe/5bafab] NOTE: Error submitting process 'execute_jobs (206)' for execution -- Execution is retried (3)
[ff/d8223b] NOTE: Error submitting process 'execute_jobs (212)' for execution -- Execution is retried (3)
[4a/34ad45] NOTE: Error submitting process 'execute_jobs (209)' for execution -- Execution is retried (3)
ERROR ~ Error executing process > 'execute_jobs (91)'

Caused by:
  Failed to submit process to grid scheduler for execution

Command executed:
  sbatch .command.run

Command exit status:
  1

Command output:
  sbatch: error: QOSMaxSubmitJobPerUserLimit
  sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

Work dir:
  /data/01/p1/user157/software/make_lastz_chains/mm-Bsu/temp_lastz_run/work/23/a09dba9e82d536f1f39b26de92d7d0

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

-- Check '.nextflow.log' file for details

The error occurs because our server limits the maximum number of submitted jobs per user to 100, and the default chunk size generates 1955 jobs, which is far above that limit. Thus, I attempted to increase the chunk size like this:

./make_chains.py target query mm10.fasta Bsu.softmask.fasta --pd mm-Bsu -f --chaining_memory 30 --cluster_queue pNormal --executor slurm --nextflow_executable /data/01/user157/software/bin/nextflow --seq1_chunk 500000000 --seq2_chunk 500000000

However, this still generated 270 jobs. I checked and found that when there are many scaffolds, at most 100 scaffolds are put into one chunk, even when they do not add up to the chunk size, and I do not understand why. In any case, I think there should be a way, without increasing the chunk size further (as I understand that a larger chunk size increases the runtime), to run multiple lastz commands per submitted task, so that the 1955 commands could be completed with fewer than 100 submitted jobs. Looking forward to your suggestions! Best wishes! Na Wan

MichaelHiller commented 1 month ago

100 jobs per user is very very restrictive. I typically submit a few thousand jobs.

To get the number down to fewer than 100, you likely have to further increase the chunk size AND the seq limit parameter (the number of scaffolds that can be bundled into one job). Hope that helps.
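
For illustration, a hedged sketch of what such a call might look like (the --seq1_limit/--seq2_limit flag names and the values below are assumptions; please confirm the exact names with ./make_chains.py -h on your install):

# Hypothetical: larger chunks AND a higher scaffold-bundling limit, so fewer cluster jobs are produced
./make_chains.py target query mm10.fasta Bsu.softmask.fasta \
  --pd mm-Bsu -f \
  --executor slurm --cluster_queue pNormal \
  --seq1_chunk 200000000 --seq2_chunk 200000000 \
  --seq1_limit 5000 --seq2_limit 5000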

ohdongha commented 1 month ago

Pardon me for hitchhiking.

100 jobs per user is very very restrictive. I typically submit a few thousand jobs.

@MichaelHiller In the legacy version (v1.0.0), there was a parameter EXECUTOR_QUEUESIZE, which I believe could limit the number of jobs submitted at once:

  --executor_queuesize EXECUTOR_QUEUESIZE
                        Controls NextFlow queueSize parameter: maximal number of
                        jobs in the queue (default 2000)

I realize that v.2.0.8 does not have this parameter. Was there a reason to remove this parameter?

It would be convenient to have a parameter that limits the number of jobs submitted at once. We could create several thousand jobs and let them run 100 at a time. It would take longer, for sure, but we would not need to worry about the submission limit.

ohdongha commented 1 month ago

One thing we could try is to add a generic NextFlow config file with executor.queueSize set to 100 (for slurm in this case). https://www.nextflow.io/docs/latest/config.html#configuration-file

Perhaps the parameter can be added to $HOME/.nextflow/config so that all NextFlow processes can use it (unless overridden by another config file or arguments with higher priority).
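
For example, a minimal sketch of such a config (scoping the limit to the slurm executor so that a later local run is not capped; the value 100 matches the per-user limit discussed here) could be:

// $HOME/.nextflow/config : minimal sketch, assuming the SLURM executor
executor {
  $slurm {
    queueSize = 100  // keep at most 100 jobs submitted/queued at any time
  }
}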

aaannaw commented 1 month ago

One thing we could try is to add a generic NextFlow config file with executor.queueSize set to 100 (for slurm in this case). https://www.nextflow.io/docs/latest/config.html#configuration-file

Perhaps the parameter can be added to $HOME/.nextflow/config so that all NextFlow processes can use it (unless overridden by another config file or arguments with higher priority).

Some genomes (e.g. GCF_001194135.2) have large numbers of very small scaffolds. It could help to remove all scaffolds <2Kb or <1.5Kb since I am not sure if we will get anything useful from aligning them.

Hello, @ohdongha Is this advice meant for me? However, I cannot find a config file in the .nextflow directory, so I created one and edited it as follows:

executor {
  name = 'slurm'
  queueSize = 100  // Set your desired queue size here
}

However, I got the same error (sbatch: error: QOSMaxSubmitJobPerUserLimit), even though I ran "source ~/.zshrc" afterwards. Could you give me any suggestions?

MichaelHiller commented 1 month ago

Sorry, I am not so familiar with NextFlow, but the queueSize parameter could be a good idea. @kirilenkobm Could you please comment on why this was removed? Maybe it is no longer compatible with the newer NextFlow version that we had to update to?

aaannaw commented 1 month ago

I attempted to run the older version (v1.0.0) to solve the problem with:

./make_chains.py target query mm10.fasta 1.Hgl.softmask.fasta --pd mm-Hgl --force_def --chaining_memory 70 --executor_partition pNormal --executor slurm --executor_queuesize 100

However, I again got this error:

N E X T F L O W  ~  version 23.10.1
Nextflow DSL1 is no longer supported — Update your script to DSL2, or use Nextflow 22.10.x or earlier
/data/00/user/user157/miniconda3/lib/python3.9/site-packages/py_nf/py_nf.py:404: UserWarning: Nextflow pipeline lastz_targetquery failed! Execute function returns 1.
  warnings.warn(msg)
Uncaught exception from user code:
        Command failed:
        /data/01/p1/user157/software/make_lastz_chains-1.0.0/mm-Hgl/TEMP_run.lastz/doClusterRun.sh                                              
        HgAutomate::run("/data/01/p1/user157/software/make_lastz_chains-1.0.0/mm-Hgl/T"...) called at /data/01/p1/user157/software/make_lastz_chains-1.0.0/doLastzChains/HgRemoteScript.pm line 117
        HgRemoteScript::execute(HgRemoteScript=HASH(0x55b2dac4aa10)) called at /data/01/p1/user157/software/make_lastz_chains-1.0.0/doLastzChains/doLastzChain.pl line 423
        main::doLastzClusterRun() called at /data/01/p1/user157/software/make_lastz_chains-1.0.0/doLastzChains/HgStepManager.pm line 169        
        HgStepManager::execute(HgStepManager=HASH(0x55b2dad4d188)) called at /data/01/p1/user157/software/make_lastz_chains-1.0.0/doLastzChains/doLastzChain.pl line 877
Error!!! Output file /data/01/p1/user157/software/make_lastz_chains-1.0.0/mm-Hgl/target.query.allfilled.chain.gz not found!                     
The pipeline crashed. Please contact developers by creating an issue at:                                                                        
https://github.com/hillerlab/make_lastz_chains

ohdongha commented 1 month ago

version 23.10.1: Nextflow DSL1 is no longer supported — Update your script to DSL2, or use Nextflow 22.10.x or earlier

@aaannaw For this, one workaround that worked for me was to set this environment variable (globally, on the node where you run nextflow) when running make_lastz_chains v1:

export NXF_VER=22.10.0 

Note: I am not sure if this worked for me because I installed an older version of nextflow first and then updated it with nextflow self-update. After updating, I had the same error as you when running the legacy make_lastz_chains. Setting the variable above solved the problem for me.

Note 2: maybe it will work as long as the node can download the jar for the older nextflow version: https://github.com/nextflow-io/nextflow/issues/1613
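
A minimal shell sketch of that workaround (assuming the node can reach the internet to fetch the older runtime, and reusing the command from earlier in this thread):

# Pin the Nextflow runtime version for this shell session
export NXF_VER=22.10.0
nextflow -version   # should now report 22.10.0

# then re-run the legacy pipeline as before
./make_chains.py target query mm10.fasta 1.Hgl.softmask.fasta --pd mm-Hgl \
  --force_def --executor slurm --executor_partition pNormal --executor_queuesize 100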

aaannaw commented 1 month ago

@ohdongha I installed nextflow v22.10.8 and there is the parameter --executor_queuesize.

./make_chains.py -h                                                                                                                           
usage: make_chains.py [-h] [--project_dir PROJECT_DIR] [--DEF DEF] [--force_def] [--continue_arg CONTINUE_ARG] [--executor EXECUTOR]
                      [--executor_queuesize EXECUTOR_QUEUESIZE] [--executor_partition EXECUTOR_PARTITION]
                      [--cluster_parameters CLUSTER_PARAMETERS] [--lastz LASTZ] [--seq1_chunk SEQ1_CHUNK] [--seq2_chunk SEQ2_CHUNK]
                      [--blastz_h BLASTZ_H] [--blastz_y BLASTZ_Y] [--blastz_l BLASTZ_L] [--blastz_k BLASTZ_K]                                   
                      [--fill_prepare_memory FILL_PREPARE_MEMORY] [--chaining_memory CHAINING_MEMORY] [--chain_clean_memory CHAIN_CLEAN_MEMORY]
                      target_name query_name target_genome query_genome

Now I ran the pipeline with the command:

./make_chains.py target query mm10.fasta 1.Hgl.softmask.fasta --pd mm-Hgl --force_def --chaining_memory 70 --executor_partition pNormal --executor slurm --executor_queuesize 100

However, I got a similar error:

executor >  slurm (100)
[87/924a24] process > execute_jobs (100) [  0%] 0 of 1496

executor >  slurm (100)
[87/924a24] process > execute_jobs (100) [  0%] 0 of 1496
WARN: [SLURM] queue (pNormal) status cannot be fetched
- cmd executed: squeue --noheader -o %i %t -t all -p pNormal -u user157
- exit status : 1
- output      :
  slurm_load_jobs error: Unexpected message received

executor >  slurm (100)
[87/924a24] process > execute_jobs (100) [  0%] 0 of 1496
WARN: [SLURM] queue (pNormal) status cannot be fetched
- cmd executed: squeue --noheader -o %i %t -t all -p pNormal -u user157
- exit status : 1
- output      :
- slurm_load_jobs error: Unexpected message received

It displays "[SLURM] queue (pNormal) status cannot be fetched", but the pNormal partition is correct:

sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
pLiu         up   infinite      1  down* lz17
pLiu         up   infinite     11    mix lz[00,02-09,18-19]
pLiu         up   infinite      1  alloc lz20
pNormal*     up   infinite      3   drng lz[32,35,40]
pNormal*     up   infinite     11    mix lz[25-28,30,33-34,36-39]
pNormal*     up   infinite      1   down lz29
pBig         up   infinite      2    mix lz[10,31]
pBig         up   infinite      1   idle lz11

The log file is attached. Could you give me any suggestions?

1.make_chains.log

MichaelHiller commented 1 month ago

This is likely an issue with your cluster. Can you test submitting any other jobs via Nextflow? The error message is unfortunately completely useless. Maybe @kirilenkobm can have a look?
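
To test plain Nextflow-to-SLURM submission outside the pipeline, something along these lines should be enough (a minimal DSL2 sketch; test_slurm.nf is a hypothetical file name and pNormal the partition used above):

// test_slurm.nf : submit one trivial job to SLURM and print its output
process say_hello {
  executor 'slurm'
  queue 'pNormal'

  output:
  stdout

  script:
  """
  hostname
  """
}

workflow {
  say_hello()
  say_hello.out.view()
}

Run it with nextflow run test_slurm.nf; if this already fails, the problem sits between Nextflow and SLURM rather than in make_lastz_chains.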

aaannaw commented 1 month ago

@MichaelHiller As the size of the log file exceeded the limit, I only included the first 1000 lines above. At the end of the log file, another error is reported (see the attached screenshot); I doubt that this parameter actually limits the number of submitted tasks to 100.

This is likely an issue with your cluster. Can you test submitting any other jobs via Nextflow? The error message is unfortunately completely useless. Maybe @kirilenkobm can have a look?

ohdongha commented 1 month ago

@ohdongha I installed nextflow v22.10.8 and there is the parameter --executor_queuesize.

@aaannaw If --executor slurm does not work, perhaps you could try this (see also https://github.com/hillerlab/make_lastz_chains/issues/60#issuecomment-2104525776): just submit the entire run to a single computing node with many CPU cores (threads) and set --executor local --executor_queuesize N, where N is the number of CPU cores on that node. I typically add --chaining_memory 200000 (for larger genomes) and ask for a node with >=32 cores and >=200 GB RAM. It can take a day or two of wall-clock time for larger genomes.
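
As a rough sketch of such a single-node submission (partition name, core count, memory, walltime, and file names are placeholders to adapt to your cluster; the flags follow the v1.0.0 commands used earlier in this thread):

#!/bin/bash
# single_node_chains.sh : hypothetical sbatch script to run the whole pipeline on one node
#SBATCH --partition=pNormal
#SBATCH --nodes=1
#SBATCH --cpus-per-task=40
#SBATCH --mem=200G
#SBATCH --time=3-00:00:00

# run all jobs locally on this node; queue size = number of cores on the node
./make_chains.py target query mm10.fasta 1.Hgl.softmask.fasta --pd mm-Hgl --force_def \
  --executor local --executor_queuesize 40 --chaining_memory 200000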

I also plan to try running v.2.0.8 on multiple computing nodes on our HPC system, submitting jobs from the login (head) node that has permission to do so. I will see how it goes.

aaannaw commented 1 month ago

@ohdongha Sorry for the delayed response. I have submitted the entire run to a single computing node with 40 CPUs; on my server, 40 is the number of CPU cores per node.

However, it seems that the run is not using all CPUs in parallel.

After running for 39 hours, only 14% of the process has finished, as shown in the make_chains.log (screenshots attached).

ohdongha commented 1 month ago

@aaannaw

However, it seems that the run is not using all CPUs in parallel. After running for 39 hours, only 14% of the process has finished, as shown in the make_chains.log.

For the parallel run, you may need to check the wall time and CPU time if your system reports them after the job is done.

In my case, a recent alignment of human vs. Chinese hamster, for example, took 21.7 hours in wall time and 506.0 hours in CPU time, which means (506/21.7=) 23.3 CPU cores have been used on average. I asked for a node with 32 CPUs for this run. I guess the ratio was not closer to 32 because, after the first lastz step, other steps may have run as fewer parallel jobs or even a single job (e.g., the cleanChain step).

You may want to check this ratio first, perhaps using a smaller genome pair that creates fewer lastz jobs (but more than 40).
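
If SLURM accounting is enabled on your cluster, one way to get those two numbers after the job finishes is (a sketch; replace <jobid> with the actual job ID):

# Elapsed = wall time, TotalCPU = summed CPU time, AllocCPUS = cores allocated
# average cores actually used ≈ TotalCPU / Elapsed (e.g. 506 h / 21.7 h ≈ 23.3 in the human vs. Chinese hamster example above)
sacct -j <jobid> --format=JobID,JobName,Elapsed,TotalCPU,AllocCPUS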

If the run is slow, you may also want to check whether the two genomes have been masked thoroughly enough. Michael always emphasizes using RepeatModeler + RepeatMasker. Masking further with windowmasker may also help. Repeats that escape the masking step will increase the runtime and generate a lot of short, useless "pile-up" alignments.

aaannaw commented 1 month ago

@ohdongha I am sure that our genomes were masked with RepeatMasker, RepeatModeler, TRF and LTR. I am trying to determine whether the mm10 genome I provided needs additional repeat masking.

aaannaw commented 1 month ago

@ohdongha Perhaps the required input is a hard-masked file, but my masked input is a soft-masked file.

ohdongha commented 1 month ago

@aaannaw

Perhaps the required input is a hard-masked file, but my masked input is a soft-masked file.

Soft-masked fasta files should be fine (and perhaps needed for the fillChain step). I use soft-masked files, and I see a substantial reduction in runtime and in the number of chains (and very often only a slight reduction in alignment coverage of CDS, etc.) when I apply more aggressive masking (with windowmasker).

I checked the UCSC mm10 (fasta), and it has ~43.9% of all nucleotides soft-masked. That is close to what I have previously used for mouse GRCm38 (~44.5% masked by windowmasker with the -t_thres parameter set to the equivalent of 97.5%).
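
For reference, a sketch of the usual two-pass windowmasker run from the NCBI BLAST+ package (file names are placeholders; thresholds such as -t_thres need to be tuned per genome rather than copied from here):

# Pass 1: collect genome-specific N-mer counts
windowmasker -mk_counts -in genome.fa -out genome.counts
# Pass 2: soft-mask (lowercase) over-represented intervals using those counts
windowmasker -ustat genome.counts -in genome.fa -outfmt fasta -out genome.softmasked.fa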

It is hard to know if the slow progress is due to repeats or the SLURM node not firing up all gears. I guess some tests, e.g., aligning a smaller genome pair, may help.