ohdongha closed this issue 1 year ago.
I found this block in `doLastzChains/doLastzChain.pl`:
```perl
# Specify the steps supported with -continue / -stop:
my $stepper = new HgStepManager(
    [ { name => 'partition',   func => \&doPartition },
      { name => 'lastz',       func => \&doLastzClusterRun },
      { name => 'cat',         func => \&doCatRun },
      { name => 'chainRun',    func => \&doChainRun },
      { name => 'chainMerge',  func => \&doChainMerge },
      { name => 'fillChains',  func => \&doFillChains },
      { name => 'cleanChains', func => \&doCleanChains }
    ]
);
```
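For reference, the `-continue`/`-stop` semantics implied by this step list can be sketched in Python. This is an illustrative model only; the real logic lives in the Perl `HgStepManager` module, and the function name below is made up for this sketch:

```python
# Ordered pipeline steps, as declared in doLastzChain.pl above.
STEPS = ["partition", "lastz", "cat", "chainRun",
         "chainMerge", "fillChains", "cleanChains"]

def steps_to_run(continue_from=None, stop_at=None):
    """Model of -continue/-stop: return the ordered slice of steps to execute."""
    start = STEPS.index(continue_from) if continue_from else 0
    end = STEPS.index(stop_at) + 1 if stop_at else len(STEPS)
    return STEPS[start:end]
```

By this model, `-continue chainRun` would skip `partition`, `lastz`, and `cat`, and run `chainRun` through `cleanChains`.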
So, perhaps we can try adding `-continue chainRun --chaining_memory 100000`?
...
EDIT:
Nope, `-continue chainRun` was not recognized by `make_chains.py`:

```
make_chains.py: error: unrecognized arguments: -continue chainRun
```
...
EDIT2:
I guess there should be a mechanism in the `master_script.sh` automatically generated by `make_chains.py` to deliver `-continue chainRun` to `doLastzChain.pl`?
```shell
$ cat 221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0/master_script.sh  # path simplified
#!/bin/bash
### Make chains master script
# Antomatically generated by make_chains.py on 2022-10-10 09:36:10.967274
export PATH=/home/ohd3/make_lastz_chains-main/doLastzChains:$PATH
export PATH=/home/ohd3/make_lastz_chains-main/HL_scripts:$PATH
export PATH=/home/ohd3/make_lastz_chains-main/HL_kent_binaries:$PATH
export PATH=/home/ohd3/make_lastz_chains-main/kent_binaries:$PATH
/home/ohd3/make_lastz_chains-main/doLastzChains/doLastzChain.pl ./221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0/DEF -clusterRunDir ./221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0 --executor local --queueSize 32 2>&1 | tee -a ./221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0/make_chains.log
```
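As a stopgap, one could patch the generated `master_script.sh` to splice `-continue` into the `doLastzChain.pl` call. This is a hypothetical workaround, not a supported feature of `make_chains.py`; inspect the result before running it:

```python
def add_continue_flag(script_text, step="chainRun"):
    """Add '-continue <step>' to the doLastzChain.pl line of master_script.sh.

    Hypothetical helper for illustration only.
    """
    out = []
    for line in script_text.splitlines():
        if "doLastzChain.pl" in line and "-continue" not in line:
            # Splice the flag in ahead of the '2>&1 | tee ...' redirection so
            # it is passed to doLastzChain.pl, not to tee. If no redirection
            # is present, partition() returns empty sep/tail and the flag is
            # simply appended at the end of the line.
            head, sep, tail = line.partition(" 2>&1")
            line = f"{head} -continue {step}{sep}{tail}"
        out.append(line)
    return "\n".join(out)
```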
Please help! ;) Dong-Ha
Hi @ohdongha
Could you please try --resume without --force_def? Conceptually, --force_def was introduced to avoid running different instances of make_lastz_chains in the same directory; it has nothing to do with the -continue option. I believe these options are in conflict (--force_def should start the run from scratch, but doLastzChain still sees that something is already done). I will add a condition so that these two cannot be set simultaneously.
Best, Bogdan
Hello Bogdan @kirilenkobm,

> Could you please try --resume without --force_def?

I have already tried that; below are the results (copied from the post above):
> Hence, we wanted to resume the run on a node with more RAM (8GB per core). We first tried with --resume and received this complaint (path simplified):
>
> ```
> Confusion: ./221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0/DEF already exists Please set --force_def to override
> ```

See also: 221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0_resume__1stAttempt.log
It would be great if, when --resume is on, the check for an existing DEF file could be turned off and, more importantly, all the temporary folders for steps that were already done could be re-used.
Thanks! Dong-Ha
Hi @ohdongha
Right, the logic of "to run or not to run" based on the presence or absence of the DEF file was broken after some pipeline rearrangements. I pushed a fix.
Now it works as follows:
- if --resume is set: check whether the DEF file exists; if yes, continue, otherwise exit
- if --force_def is set: run regardless of whether the DEF file exists
- default: create the DEF file if it does not exist; exit if it is already present
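In Python-ish pseudocode (function and return-value names are assumed for illustration; this is not the actual make_chains.py implementation), the described behavior amounts to:

```python
import os

def def_file_action(def_path, resume=False, force_def=False):
    """Sketch of the fixed DEF-file handling described above."""
    exists = os.path.exists(def_path)
    if resume:
        if not exists:
            raise SystemExit(f"cannot resume: {def_path} does not exist")
        return "continue"      # reuse the existing DEF file
    if force_def:
        return "overwrite"     # run regardless of an existing DEF file
    if exists:
        raise SystemExit(f"{def_path} already exists; set --force_def to override")
    return "create"            # default: create a fresh DEF file
```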
Best, Bogdan
Thanks, Bogdan @kirilenkobm, I will give it a try.
I have also added my question about how to proceed with the stalled doChainRun step as #10, since it is separate from the issue with resuming.
Hi again,

We have been running make_lastz_chains for combinations of species pairs, masking options, and LastZ parameter sets. The pipeline usually runs smoothly, except for a couple of heavy comparisons (larger and more closely related genomes). On one occasion, the primary lastz alignment step (1,825 jobs in total) finished, but in a later step (doChainRun) the pipeline stalled, repeatedly failing and retrying 25 jobs. Here is the command used (run on a node with 32 cores and 4GB RAM per core):

And the log file: 221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0__failed.log
We can see the following steps were done:

Hence, we wanted to resume the run on a node with more RAM (8GB per core). We first tried with --resume and received this complaint (path simplified):

So we tried again with the following:

Now the complaint included this line (paths simplified):
Questions:

1. Re-running in a way that ignores the `TEMP_run.lastz` folder may make the pipeline run the 1,825 `lastz` jobs again, so we would like to try `-continue {some later step}`. What should we use as the `{some later step}`? Will it be `doChainRun`? Is there a list of steps where we can resume the pipeline?
2. Would increasing `--chaining_memory CHAINING_MEMORY` to, say, 100000 or 200000 help? The default here seems to be 50000 (MB? per core?).

Thanks a lot! Dong-Ha
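On question 1: going by the step list quoted at the top of this thread, the natural `-continue` target is the first step that did not complete. A tiny sketch (hypothetical helper; only the step order is taken from doLastzChain.pl):

```python
# Step order as declared in doLastzChain.pl's HgStepManager block.
STEPS = ["partition", "lastz", "cat", "chainRun",
         "chainMerge", "fillChains", "cleanChains"]

def resume_step(done):
    """Return the first step not in the set of completed steps."""
    for step in STEPS:
        if step not in done:
            return step
    return None  # the whole pipeline finished
```

If partition, lastz, and cat completed but chainRun stalled, this picks `chainRun`, matching the guess above.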
P.S.
In case needed, here are the full resume log files: 221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0_resume__1stAttempt.log 221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0_resume__2ndAttempt.log