hillerlab / make_lastz_chains

Portable solution to generate genome alignment chains using lastz

Error running make_lastz_chains on Slurm cluster #25

Closed philge closed 1 year ago

philge commented 1 year ago

Hi,

I am attaching an error I am getting while running make_lastz_chains on a Slurm cluster. Can you please help me fix the issue?

Thanks,
Philge

make_lastz_error.txt

MichaelHiller commented 1 year ago

Hmm, looks like all lastz jobs consistently failed, though this "sbatch: error: Batch job submission failed: Required partition not available (inactive or drain)" may also point to an error with the submission, which likely requires @kirilenkobm to have a look.

Do these jobs run for several minutes or an hour, or do they crash immediately? If the former, this could indicate a lack of appropriate repeat masking. Did you run RepeatModeler on the genome to build a repeat library, and then run RepeatMasker with that library?

philge commented 1 year ago

Hi Michael,

The jobs crash immediately. I have soft-masked the genomes.

MichaelHiller commented 1 year ago

OK, then this is a Slurm submission problem. Somehow the partition is not available. Can you post some details on your Slurm configuration (which partitions exist, etc.)? @kirilenkobm, can you have a look, please?

philge commented 1 year ago

Hi, my partition was inactive. I activated it. Below are my current partitions. I used queue2 with make_chains.

(base) ubuntu@ip-:~$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
queue1    inact infinite      1 down~ queue1-dy-m5a16xlarge-6
queue1*   inact infinite     19 idle~ queue1-dy-m5a16xlarge-[1-5,7-20]
queue2    up    infinite     10 alloc queue2-dy-m5a16xlarge-[1-10]
queue3    up    infinite     10 idle~ queue3-dy-m5a16xlarge-[1-10]
queue4    up    infinite     10 idle~ queue4-dy-m5a16xlarge-[1-10]
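Since the original failure was "Required partition not available (inactive or drain)", it may help to check a partition's AVAIL column before passing it to --executor_partition. The following is only a minimal sketch: the function name partition_is_up is hypothetical, and it assumes the default sinfo column layout shown in the listing above (partition name first, AVAIL second, default partition marked with a trailing asterisk).

```shell
# partition_is_up NAME
# Reads `sinfo` output on stdin and succeeds (exit 0) only if at least one
# line for partition NAME reports "up" in the AVAIL column.
# Assumes the default sinfo layout: PARTITION first, AVAIL second.
partition_is_up() {
    awk -v p="$1" '
        # Strip the trailing "*" that marks the default partition.
        { name = $1; sub(/\*$/, "", name) }
        name == p && $2 == "up" { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Usage on the cluster (not run here):
#   sinfo | partition_is_up queue2 && echo "queue2 is up"
```

With the listing above, this check would pass for queue2 and fail for queue1.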

The command I used is:

/home/ubuntu/software/make_lastz_chains/make_chains.py hg38 org \
    /home/ubuntu/data/assembly/human/hg38.fa \
    /home/ubuntu/data/assembly/organism/scaffolds_FINAL.fasta.masked \
    --executor slurm --executor_partition queue2 --project_dir make_chains

I get errors like:

executor > slurm (2323)
[00/c950de] process > execute_jobs (272) [ 35%] 823 of 2323, failed: 823, retries: 823
[1d/5f7dd6] NOTE: Process execute_jobs (935) terminated with an error exit status (2) -- Execution is retried (1)
[85/84b67b] NOTE: Process execute_jobs (664) terminated with an error exit status (2) -- Execution is retried (1)
[e7/704793] NOTE: Process execute_jobs (1363) terminated with an error exit status (2) -- Execution is retried (1)
[a3/c8dac8] NOTE: Process execute_jobs (272) terminated with an error exit status (2) -- Execution is retried (1)

kirilenkobm commented 1 year ago

Hi @philge,

Firstly, I'm sorry to hear that you're facing issues with running the pipeline. I'm considering rewriting the pipeline using the Nextflow language. This will help reduce the number of abstraction levels and enhance portability. However, this transition will take some time.

Could you please inform me which version of Nextflow you're currently using? I'd recommend downgrading to 20.10.0 if possible. Additionally, do review the Nextflow logs. They might not be well-organized at the moment, but I'm planning to develop a wrapper to make them more user-friendly.
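One possible way to report and pin the Nextflow version is sketched below. It rests on two assumptions: that `nextflow -version` prints a line of the form "version X.Y.Z build N", and that the Nextflow launcher started by make_chains.py inherits the NXF_VER environment variable from the calling shell (this thread does not confirm the latter). The function name nextflow_version is hypothetical.

```shell
# nextflow_version
# Reads the output of `nextflow -version` on stdin and prints the version
# number. Assumes a line of the form "version 22.10.0 build 5826".
nextflow_version() {
    awk '$1 == "version" { print $2; exit }'
}

# Usage (not run here):
#   nextflow -version | nextflow_version
#
# Pinning the version for the current shell session. NXF_VER is read by the
# Nextflow launcher; whether make_chains.py propagates it is an assumption.
#   export NXF_VER=20.10.0
```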

In the project directory (in your case, make_chains), there should be a directory named TEMP${process_name}/${process_name}. Please navigate to the work directory within. It will have several subdirectories, such as:

ls
00  09  12  1b  24  2d  ...  d7  e0  e9  f2  FB

You can enter any of these directories (it seems all jobs might have failed). Within these, you'll find numerous subdirectories. These will contain several hidden files, for instance:

ls -a
.  ..  .command.begin  .command.err  .command.log  .command.out  .command.run  .command.sh  .exitcode

I suggest checking the .command.err content for more insights.
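With thousands of task directories, opening them one by one is tedious. The inspection described above could be sketched as a small shell function that walks a work directory and prints the tail of the logs for every task with a non-zero exit code. The function name show_failed_tasks is hypothetical; the file names are the hidden files listed above. It looks at .command.out as well as .command.err, on the assumption that a tool may write its error message to standard output.

```shell
# show_failed_tasks WORK_DIR
# For every task directory under WORK_DIR whose .exitcode is not 0, print the
# directory, the exit code, and the last lines of the non-empty log files.
show_failed_tasks() {
    work_dir=$1
    find "$work_dir" -type f -name .exitcode | sort |
    while IFS= read -r code_file; do
        code=$(cat "$code_file")
        if [ "$code" != "0" ]; then
            task_dir=$(dirname "$code_file")
            echo "== $task_dir (exit $code)"
            for f in .command.err .command.out; do
                # -s is true only if the file exists and is not empty
                if [ -s "$task_dir/$f" ]; then
                    echo "-- $f"
                    tail -n 5 "$task_dir/$f"
                fi
            done
        fi
    done
}

# Usage (not run here), from inside the directory that contains "work":
#   show_failed_tasks work
```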

philge commented 1 year ago

Hi @kirilenkobm,

I am using Nextflow 22.10.0. I will downgrade to 20.10.0 and try again. My .command.err files are empty, but I found the error below in .command.out:

AXTTOPSL COMMAND CRASHED
axtToPsl: error while loading shared libraries: libssl.so.1.0.0: cannot open shared object file: No such file or directory

I fixed it as described in https://stackoverflow.com/questions/72133316/libssl-so-1-1-cannot-open-shared-object-file-no-such-file-or-directory and am re-running now. I got the error below in one log:

slurmstepd: error: JOB 26066 ON queue2-dy-m5a16xlarge-4 CANCELLED AT 2023-08-20T13:19:33

@kirilenkobm I am still getting the error below:

AXTTOPSL COMMAND CRASHED
axtToPsl: error while loading shared libraries: libssl.so.1.0.0: cannot open shared object file: No such file or directory

Fixed the above issue. I am using Ubuntu 20.04. What I did:

1. Copied /lib/x86_64-linux-gnu/libssl.so.1.0.0 to $HOME/opt/lib/
2. Copied /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 to $HOME/opt/lib/
3. Added the following line to .bashrc:

export LD_LIBRARY_PATH=$HOME/opt/lib:$LD_LIBRARY_PATH
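To confirm that a fix like this took effect, one option is to ask the dynamic loader which libraries a binary still cannot resolve. The sketch below filters ldd output for unresolved entries. The function name missing_libs is hypothetical, and it assumes the usual glibc ldd format in which an unresolved library is reported as "name => not found". On a cluster, the check is most meaningful when run on a compute node, since the environment there may differ from the login node.

```shell
# missing_libs
# Reads `ldd` output on stdin and prints the name of every shared library
# that the dynamic loader could not resolve ("=> not found" lines).
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Usage (not run here); prints nothing once all libraries resolve:
#   ldd "$(command -v axtToPsl)" | missing_libs
```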