hillerlab / make_lastz_chains

Portable solution to generate genome alignment chains using lastz
MIT License

Resuming a stalled run. #9

Closed ohdongha closed 1 year ago

ohdongha commented 2 years ago

Hi again,

We have been running make_lastz_chains for combinations of species pairs, masking options, and LastZ parameter sets. The pipeline usually runs smoothly, except for a couple of cases where the comparison was heavy (larger and more closely related genomes).

On one occasion, the primary lastz alignment step (1,825 jobs in total) finished, but a later step (doChainRun) stalled, repeatedly failing and retrying 25 jobs. Here is the command used (run on a 32-core node with 4 GB RAM per core):

echo -e "FILL_CHAIN=0" > DEF_noFillChain # we were skipping RepeatFiller for this run
make_chains.py h38w5 cPTRv2w5 h38_primaryAssembly.WMt98.0.fa cPTRv2.WMt98.0.fa \
   --DEF DEF_noFillChain --executor_queuesize 32 \
   --project_dir 221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0 \
   > 221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0.log 2>&1

And the log file: 221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0__failed.log

We can see the following steps were done:

$ ls -ltrh 221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0/TEMP_*/*.done
-rw-r--r-- 1 ohd3 contig 0 Oct 10 09:36 221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0/TEMP_run.lastz/partition.done
-rw-r--r-- 1 ohd3 contig 0 Oct 11 23:16 221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0/TEMP_run.lastz/lastz.done
-rw-r--r-- 1 ohd3 contig 0 Oct 11 23:31 221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0/TEMP_run.cat/cat.done

Hence, we wanted to resume the run on a node with more RAM (8GB per core). We first tried with --resume and received this complaint (path simplified):

Confusion: ./221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0/DEF already exists
Please set --force_def to override 

So we tried again with the following:

echo -e "FILL_CHAIN=0" > DEF_noFillChain # we were skipping RepeatFiller for this run
make_chains.py h38w5 cPTRv2w5 h38_primaryAssembly.WMt98.0.fa cPTRv2.WMt98.0.fa \
   --resume --force_def \
   --DEF DEF_noFillChain --executor_queuesize 32 \
   --project_dir 221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0 \
   > 221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0_resume.log 2>&1

Now the complaint included this line (paths simplified):

doPartition: looks like doPartition was already successful (./221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0/TEMP_run.lastz/partition.done exists).
Either -continue {some later step} or move aside/remove ./221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0/TEMP_run.lastz/ and run again.

Questions:

  1. Since removing the TEMP_run.lastz folder may make the pipeline run the 1,825 lastz jobs again, we would like to try -continue {some later step}. What should we use as the {some later step}? Would it be doChainRun? Is there a list of steps from which the pipeline can be resumed?
  2. Are there any other parameters we can set when resuming? Would increasing --chaining_memory CHAINING_MEMORY to, say, 100000 or 200000 help? The default here seems to be 50000 (MB? per core?).

Thanks a lot! Dong-Ha

P.S.
In case needed, here are the full resume log files: 221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0_resume__1stAttempt.log 221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0_resume__2ndAttempt.log

ohdongha commented 2 years ago

I found this block in doLastzChains/doLastzChain.pl:

# Specify the steps supported with -continue / -stop:
my $stepper = new HgStepManager(
    [ { name => 'partition',  func => \&doPartition },
      { name => 'lastz',     func => \&doLastzClusterRun },
      { name => 'cat',        func => \&doCatRun },
      { name => 'chainRun',   func => \&doChainRun },
      { name => 'chainMerge', func => \&doChainMerge },
      { name => 'fillChains', func => \&doFillChains },
      { name => 'cleanChains', func => \&doCleanChains }
    ]
);

So, perhaps we can try adding -continue chainRun --chaining_memory 100000?

...

EDIT: Nope, -continue chainRun was not recognized by make_chains.py:

make_chains.py: error: unrecognized arguments: -continue chainRun

... EDIT2: I guess there should be a mechanism to pass -continue chainRun down to doLastzChain.pl in the master_script.sh that make_chains.py generates automatically?

$ cat 221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0/master_script.sh # path simplified
#!/bin/bash
### Make chains master script
# Antomatically generated by make_chains.py on 2022-10-10 09:36:10.967274

export PATH=/home/ohd3/make_lastz_chains-main/doLastzChains:$PATH
export PATH=/home/ohd3/make_lastz_chains-main/HL_scripts:$PATH
export PATH=/home/ohd3/make_lastz_chains-main/HL_kent_binaries:$PATH
export PATH=/home/ohd3/make_lastz_chains-main/kent_binaries:$PATH

/home/ohd3/make_lastz_chains-main/doLastzChains/doLastzChain.pl ./221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0/DEF -clusterRunDir ./221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0 --executor local --queueSize 32 2>&1 | tee -a ./221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0/make_chains.log 
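As a manual workaround, one could presumably copy the doLastzChain.pl call out of master_script.sh and append -continue chainRun by hand. The sketch below is NOT verified; the -clusterRunDir/--executor/--queueSize flags are copied from the generated script above, -continue from doLastzChain.pl, and the direct invocation (with PATH already set up as in master_script.sh) is an assumption:

```shell
# Hypothetical sketch (not verified): rebuild the doLastzChain.pl command from
# master_script.sh and append -continue chainRun so completed steps are skipped.
PROJECT=./221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0
CMD="doLastzChain.pl $PROJECT/DEF -clusterRunDir $PROJECT --executor local --queueSize 32 -continue chainRun"
echo "$CMD"   # inspect first; execute manually with: eval "$CMD"
```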

Please help! ;) Dong-Ha

kirilenkobm commented 2 years ago

Hi @ohdongha

Could you please try --resume without --force_def? Conceptually, --force_def was introduced to avoid running different instances of make_lastz_chains in the same directory. It has nothing to do with the -continue option. I believe these two options conflict: --force_def should start the run from scratch, but doLastzChain.pl still sees that some steps are done. I will add a check so that these two cannot be set simultaneously.

Best, Bogdan

ohdongha commented 2 years ago

Hello Bogdan @kirilenkobm,

Could you please try --resume without --force_def?

I have already tried that; below are the results (copied from the post above):

Hence, we wanted to resume the run on a node with more RAM (8GB per core). We first tried with --resume and received this complaint (path simplified):

Confusion: ./221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0/DEF already exists
Please set --force_def to override 

See also: 221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0_resume__1stAttempt.log

It would be great if, when --resume is on, the check for an existing DEF file could be skipped and, more importantly, the temporary folders for steps that were already done could be re-used.

Thanks! Dong-Ha

kirilenkobm commented 2 years ago

Hi @ohdongha

Right, the logic of "to run or not to run" depending on the presence or absence of the DEF file was broken after some pipeline rearrangements. I pushed a fix.

Now it works as follows:

  1. If --resume is set: check whether the DEF file exists; if yes, continue, otherwise exit.
  2. If --force_def is set: run regardless of whether the DEF file exists.
  3. Default: create the DEF file if it does not exist; exit if it is already present.
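In rough Python, the fixed decision logic might look like the sketch below. This is an illustration only; the function name, messages, and layout are hypothetical and not the actual make_chains.py code:

```python
import os
import sys


def check_def_file(def_path: str, resume: bool, force_def: bool) -> None:
    """Sketch of the DEF-file handling described above (names hypothetical)."""
    exists = os.path.isfile(def_path)
    if resume:
        # --resume: a DEF file from the previous run must already exist
        if not exists:
            sys.exit(f"Error: --resume is set but {def_path} does not exist")
    elif force_def:
        # --force_def: proceed regardless of an existing DEF file
        pass
    else:
        # default: create the DEF file if absent, refuse to overwrite it
        if exists:
            sys.exit(f"Confusion: {def_path} already exists")
        open(def_path, "w").close()
```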

Best, Bogdan

ohdongha commented 2 years ago

Thanks, Bogdan @kirilenkobm, I will give it a try.

I have also posted my question on how to proceed with the stalled doChainRun step as #10, since it is separate from the resuming issue.