aaannaw opened this issue 1 month ago
100 jobs per user is very very restrictive. I typically submit a few thousand jobs.
To get the number of jobs below 100, you likely have to further increase the chunk size AND the seq limit parameter (the number of scaffolds that can be bundled into one job). Hope that helps
Pardon me for hitchhiking.
100 jobs per user is very very restrictive. I typically submit a few thousand jobs.
@MichaelHiller In the legacy version (v1.0.0), there was a parameter EXECUTOR_QUEUESIZE, which I believe could limit the number of jobs submitted at once:
--executor_queuesize EXECUTOR_QUEUESIZE
                     Controls NextFlow queueSize parameter: maximal number of
                     jobs in the queue (default 2000)
I realize that v2.0.8 does not have this parameter. Was there a reason to remove it?
It would be convenient to have a parameter that limits the number of jobs submitted at once. We could create several thousand jobs and let them run 100 at a time. It would take time for sure, but we would not need to worry about the job limit, etc.
One thing we could try is to add a generic NextFlow config file with executor.queueSize set to 100 (for slurm in this case): https://www.nextflow.io/docs/latest/config.html#configuration-file
Perhaps the parameter can be added to $HOME/.nextflow/config so that all NextFlow processes can use it (unless overridden by another config file or arguments with higher priority).
Some genomes (e.g. GCF_001194135.2) have large numbers of very small scaffolds. It could help to remove all scaffolds <2 kb or <1.5 kb, since I am not sure we will get anything useful from aligning them.
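Removing tiny scaffolds before running the pipeline can be done with a few lines of Python. This is a minimal sketch, not part of make_lastz_chains; the function name, file paths, and the 2000 bp cutoff are illustrative:

```python
# Hedged sketch: keep only scaffolds of at least `min_len` bp from a FASTA.
# The 2000 bp threshold mirrors the <2 kb suggestion above; adjust as needed.

def filter_fasta(in_path, out_path, min_len=2000):
    """Write records of length >= min_len to out_path; return how many were kept."""
    def records(handle):
        name, seq = None, []
        for line in handle:
            line = line.rstrip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(seq)
                name, seq = line, []
            else:
                seq.append(line)
        if name is not None:
            yield name, "".join(seq)  # emit the final record

    kept = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        for name, seq in records(fin):
            if len(seq) >= min_len:
                fout.write(f"{name}\n{seq}\n")
                kept += 1
    return kept
```

Run it on the query assembly before calling make_chains.py so the chunking step sees far fewer sequences.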
Hello @ohdongha, is this advice meant for me? However, I cannot find a config file in the .nextflow directory. Thus I created the config file and edited it as follows:
executor {
    name = 'slurm'
    queueSize = 100 // set the desired queue size here
}
However, I got the same error: sbatch: error: QOSMaxSubmitJobPerUserLimit, although I have run "source ~/.zshrc".
Could you give me any suggestions?
Sorry, I am not so familiar with NextFlow, but the queueSize parameter could be a good idea. @kirilenkobm Could you pls comment on why this was removed? Maybe it is no longer compatible with the newer NextFlow version that we had to update to?
I attempted to run the older version (v1.0.0) to solve the problem with ./make_chains.py target query mm10.fasta 1.Hgl.softmask.fasta --pd mm-Hgl --force_def --chaining_memory 70 --executor_partition pNormal --executor slurm --executor_queuesize 100. However, I again got the error:
N E X T F L O W ~ version 23.10.1
Nextflow DSL1 is no longer supported — Update your script to DSL2, or use Nextflow 22.10.x or earlier
/data/00/user/user157/miniconda3/lib/python3.9/site-packages/py_nf/py_nf.py:404: UserWarning: Nextflow pipeline lastz_targetquery failed! Execute function returns 1.
warnings.warn(msg)
Uncaught exception from user code:
Command failed:
/data/01/p1/user157/software/make_lastz_chains-1.0.0/mm-Hgl/TEMP_run.lastz/doClusterRun.sh
HgAutomate::run("/data/01/p1/user157/software/make_lastz_chains-1.0.0/mm-Hgl/T"...) called at /data/01/p1/user157/software/make_lastz_chains-1.0.0/doLastzChains/HgRemoteScript.pm line 117
HgRemoteScript::execute(HgRemoteScript=HASH(0x55b2dac4aa10)) called at /data/01/p1/user157/software/make_lastz_chains-1.0.0/doLastzChains/doLastzChain.pl line 423
main::doLastzClusterRun() called at /data/01/p1/user157/software/make_lastz_chains-1.0.0/doLastzChains/HgStepManager.pm line 169
HgStepManager::execute(HgStepManager=HASH(0x55b2dad4d188)) called at /data/01/p1/user157/software/make_lastz_chains-1.0.0/doLastzChains/doLastzChain.pl line 877
Error!!! Output file /data/01/p1/user157/software/make_lastz_chains-1.0.0/mm-Hgl/target.query.allfilled.chain.gz not found!
The pipeline crashed. Please contact developers by creating an issue at:
https://github.com/hillerlab/make_lastz_chains
@aaannaw For this, one workaround that worked for me was to set this (as a global variable on the node where you run nextflow) when running make_lastz_chains v1:
export NXF_VER=22.10.0
Note: I am not sure if this worked for me because I installed an older version of nextflow first and then updated it with nextflow self-update. After updating, I had the same error as you when running the legacy make_lastz_chains. Setting the variable above solved the problem for me.
Note 2: maybe it will work as long as the node can download the jar for the older nextflow version: https://github.com/nextflow-io/nextflow/issues/1613
@ohdongha
I installed nextflow v22.10.8 and there is the parameter --executor_queuesize.
./make_chains.py -h
usage: make_chains.py [-h] [--project_dir PROJECT_DIR] [--DEF DEF] [--force_def] [--continue_arg CONTINUE_ARG] [--executor EXECUTOR]
[--executor_queuesize EXECUTOR_QUEUESIZE] [--executor_partition EXECUTOR_PARTITION]
[--cluster_parameters CLUSTER_PARAMETERS] [--lastz LASTZ] [--seq1_chunk SEQ1_CHUNK] [--seq2_chunk SEQ2_CHUNK]
[--blastz_h BLASTZ_H] [--blastz_y BLASTZ_Y] [--blastz_l BLASTZ_L] [--blastz_k BLASTZ_K]
[--fill_prepare_memory FILL_PREPARE_MEMORY] [--chaining_memory CHAINING_MEMORY] [--chain_clean_memory CHAIN_CLEAN_MEMORY]
target_name query_name target_genome query_genome
Now I run the pipeline with the command:
./make_chains.py target query mm10.fasta 1.Hgl.softmask.fasta --pd mm-Hgl --force_def --chaining_memory 70 --executor_partition pNormal --executor slurm --executor_queuesize 100
However, I get a similar error:
executor > slurm (100)
[87/924a24] process > execute_jobs (100) [ 0%] 0 of 1496
WARN: [SLURM] queue (pNormal) status cannot be fetched
- cmd executed: squeue --noheader -o %i %t -t all -p pNormal -u user157
- exit status : 1
- output :
slurm_load_jobs error: Unexpected message received
It displays "[SLURM] queue (pNormal) status cannot be fetched", but the pNormal partition is correct:
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
pLiu up infinite 1 down* lz17
pLiu up infinite 11 mix lz[00,02-09,18-19]
pLiu up infinite 1 alloc lz20
pNormal* up infinite 3 drng lz[32,35,40]
pNormal* up infinite 11 mix lz[25-28,30,33-34,36-39]
pNormal* up infinite 1 down lz29
pBig up infinite 2 mix lz[10,31]
pBig up infinite 1 idle lz11
The log file is attached. Could you give me any suggestions?
This is likely an issue with your cluster. Can you test submitting any other jobs via Nextflow? The error message is unfortunately completely useless. Maybe @kirilenkobm can have a look?
@MichaelHiller As the size of the log file exceeded the limit, I only included the first 1000 lines above. At the end of the log file, the error shown in the image below is reported; I doubt that this parameter actually limits the number of tasks to 100.
@ohdongha I installed nextflow v22.10.8 and there is the parameter --executor_queuesize.
@aaannaw If --executor slurm does not work, perhaps you could try this (see also https://github.com/hillerlab/make_lastz_chains/issues/60#issuecomment-2104525776): just submit the entire run to a single computing node with many CPU cores (threads) and set --executor local --executor_queuesize N, where N is the number of CPU cores in that single node. I typically add --chaining_memory 200000 (for larger genomes) and ask for a node with >=32 cores and >=200 GB RAM. It could take a day or two in wall-clock time for larger genomes.
I also plan to try running v.2.0.8 on multiple computing nodes on our HPC system, submitting jobs from the login (head) node that has permission to do so. I will see how it goes.
@ohdongha Sorry for the delayed response. I have submitted the entire run to a single computing node with 40 CPUs; on my server, 40 is the number of CPU cores per node.
However, the run does not seem to use all CPUs in parallel. After running for 39 hours, only 14% of the process has finished, as shown in make_chains.log.
@aaannaw
However, the run does not seem to use all CPUs in parallel. After running for 39 hours, only 14% of the process has finished, as shown in make_chains.log.
For the parallel run, you may need to check the wall time and CPU time if your system reports them after the job is done.
In my case, a recent alignment of human vs. Chinese hamster, for example, took 21.7 hours in wall time and 506.0 hours in CPU time, which means (506/21.7=) 23.3 CPU cores were used on average. I asked for a node with 32 CPUs for this run. I guess the ratio was not closer to 32 because, after the first lastz step, other steps may have run as fewer parallel jobs or even a single job (e.g., the cleanChain step).
You may want to check this ratio first, perhaps using a smaller genome pair that creates fewer lastz jobs (but more than 40).
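The CPU-time/wall-time ratio above is just a division, but it is a handy quick check; a tiny sketch (function name is mine, not from the pipeline):

```python
# Hedged sketch: average core utilization = CPU time / wall time,
# as in the human vs. Chinese hamster example above.
def avg_cores_used(cpu_hours: float, wall_hours: float) -> float:
    """Average number of CPU cores kept busy over the whole run."""
    return cpu_hours / wall_hours

# 506.0 CPU-hours over 21.7 wall-hours -> ~23.3 cores busy on a 32-core node
print(round(avg_cores_used(506.0, 21.7), 1))  # 23.3
```

If this number is far below the cores you requested, the bottleneck is likely a serial step rather than the SLURM/local executor.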
If the run is slow, you may also want to check whether the two genomes have been masked enough. Michael always emphasizes using RepeatModeler + RepeatMasker. Masking further with windowmasker may also help. Repeats that have escaped the masking step will increase the runtime and generate a lot of short and useless "pile-up" alignments.
@ohdongha I am sure that our genomes are masked with RepeatMasker, RepeatModeler, TRF, and LTR. I'm trying to determine whether the provided mm10 genome needs additional repeat masking.
@ohdongha Perhaps the required input is a hard-masked file, but my input is soft-masked.
@aaannaw
Perhaps the required input is hard-masked file but my masked input is soft-masked file.
Soft-masked fasta files should be fine (and perhaps needed for the fillChain step). I use soft-masked files, and I see a substantial reduction in runtime and in the number of chains (and very often just a slight reduction in alignment coverage of CDS, etc.) when I apply more aggressive masking (with windowmasker).
I checked the UCSC mm10 (fasta), and it has ~43.9% of all nucleotides soft-masked. That is close to what I have previously used for mouse GRCm38 (~44.5% masked by windowmasker with the -t_thres parameter set to the equivalent of 97.5%).
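Checking the soft-masked fraction of your own assembly is straightforward, since soft-masking is just lowercase letters. A minimal sketch (function name is mine; it assumes a plain, uncompressed FASTA):

```python
# Hedged sketch: fraction of soft-masked (lowercase) nucleotides in a FASTA,
# e.g. to compare your assembly against the ~43.9% seen for UCSC mm10.
def softmasked_fraction(path: str) -> float:
    masked = total = 0
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                continue  # skip header lines
            seq = line.strip()
            total += len(seq)
            masked += sum(c.islower() for c in seq)
    return masked / total if total else 0.0
```

A fraction far below ~40% for a mammalian genome would suggest the masking step left a lot of repeats unmasked.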
It is hard to know if the slow progress is due to repeats or the SLURM node not firing up all gears. I guess some tests, e.g., aligning a smaller genome pair, may help.
Hello, professor. I was running the pipeline to align my genome assemblies against the mm10 genome via slurm:
./make_chains.py target query mm10.fasta Bsu.softmask.fasta --pd mm-Bsu -f --chaining_memory 30 --cluster_queue pNormal --executor slurm --nextflow_executable /data/01/user157/software/bin/nextflow
and I encountered an error after running the command for several minutes. The error occurs because our server limits the maximum number of submitted tasks per person to 100, and I found that the default chunk size generates 1955 jobs, well over the 100-job limit. Thus, I attempted to increase the chunk size like this:
./make_chains.py target query mm10.fasta Bsu.softmask.fasta --pd mm-Bsu -f --chaining_memory 30 --cluster_queue pNormal --executor slurm --nextflow_executable /data/01/user157/software/bin/nextflow --seq1_chunk 500000000 --seq2_chunk 500000000
. However, this still generated 270 jobs, as follows. This is unbelievable. I checked and found that when the number of scaffolds is too large, up to 100 scaffolds are put into one chunk for comparison, even though they do not add up to the chunk size. I don't know what is going on here. Anyway, I think there should be a method that, without increasing the chunk size (as I understand that increasing the chunk size increases the runtime), allows me to submit multiple command lines per task, which would guarantee that I complete all 1955 commands with fewer than 100 submitted tasks! Looking forward to your suggestions! Best wishes! Na Wan
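The behavior described above (job count stuck well above expectations despite a huge chunk size) is what happens when chunking is capped by BOTH a total-size target and a per-chunk scaffold limit. This is a minimal sketch of that interaction, not the pipeline's actual code; `count_chunks` and `seq_limit` are my names, and the 100-scaffold cap mirrors what the comment observed:

```python
# Hedged sketch: bundle scaffolds into chunks limited by a total-size target
# AND a max number of scaffolds per chunk. With many tiny scaffolds, the
# scaffold cap, not the size target, determines the job count.
def count_chunks(scaffold_lengths, chunk_size, seq_limit=100):
    chunks, cur_len, cur_n = 0, 0, 0
    for length in scaffold_lengths:
        # start a new chunk if adding this scaffold would break either limit
        if cur_n and (cur_len + length > chunk_size or cur_n >= seq_limit):
            chunks += 1
            cur_len = cur_n = 0
        cur_len += length
        cur_n += 1
    return chunks + (1 if cur_n else 0)

# 20,000 scaffolds of 5 kb each sum to only 100 Mb, yet with a 100-scaffold
# cap they still produce 200 chunks regardless of a 500 Mb chunk size:
assert count_chunks([5_000] * 20_000, 500_000_000) == 200
```

This is why raising the seq limit parameter (the per-chunk scaffold cap), or pre-filtering tiny scaffolds, reduces the job count where a larger chunk size alone cannot.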