mschubert / clustermq

R package to send function calls as jobs on LSF, SGE, Slurm, PBS/Torque, or any of these via SSH
https://mschubert.github.io/clustermq/
Apache License 2.0

Clustermq timeout at high number of targets workers #229

Closed · liutiming closed this 3 years ago

liutiming commented 3 years ago

I have a target that generates around 1000 branches, each of which contains a foreach loop using %dopar%.

The default number of cores is 32, and n_jobs is 32.

When submitting tar_make_clustermq(workers = 32, log_worker = TRUE), all logs have timeout errors.

When submitting the same targets script with tar_make_clustermq(workers = 2, log_worker = TRUE), the jobs complete successfully. I am wondering whether it is the overhead of targets with more workers that potentially leads to the timeout. I am running it on a cluster with a fair use policy (so some jobs will be at the PEND stage at the start).

@wlandau, I think this is the issue you mentioned before, where clustermq fails silently? targets was still running in the interactive terminal, but the jobs had already been killed. I know @mschubert is working on this and I really appreciate that!

liutiming commented 3 years ago

Oops, workers = 2 failed again when I tried to scale up (using the real dataset instead of a truncated one). The common data is 8736.4 Mb... I read in NEWS.md that this was improved, but I can look into those issues...

mschubert commented 3 years ago

You can change the worker timeout: https://github.com/mschubert/clustermq/blob/master/vignettes/userguide.Rmd#L343-L344
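
For example (untested; see below for where this option actually needs to be visible), in the R session that submits the jobs:

options(clustermq.worker.timeout = 3600)  # seconds; the default is 600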

Does that help?

liutiming commented 3 years ago

Thanks! I think I have adjusted that to 6000 and will let you know!

On a side note, I raised an issue here: https://github.com/wlandau/targets/issues/251

I am curious what your take on this is. In the clustermq logs (one is produced per target worker), I do not see any sign of further jobs being submitted as a result of using foreach. Is there anywhere I can verify that?

wlandau commented 3 years ago

Are you using foreach to submit more clustermq workers within targets that are already running within clustermq workers? Or am I misunderstanding and you are using multicore parallelism within targets (recommended)?
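
For the latter (multicore within a single worker), here is a rough sketch of what I mean, using doParallel for the inner loop instead of submitting more cluster jobs (run_inside_target is a hypothetical stand-in for your target's function):

# sketch: local multicore foreach inside a target's function
library(foreach)
library(doParallel)

registerDoParallel(cores = 4)  # 4 local cores on the worker node, no new jobs

run_inside_target <- function(list_1) {
    foreach(df_1 = list_1, .combine = "rbind") %dopar% {
        # each iteration runs on a local core of the same node
        df_1
    }
}
# e.g. run_inside_target(list(head(mtcars), tail(mtcars)))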

liutiming commented 3 years ago

@wlandau Mine looks like this:

# packages assumed by this snippet
library(foreach)   # %dopar%
library(magrittr)  # %>%
library(purrr)     # pluck
library(tibble)

test_foreach <- function(list_1, list_2) {
    df_2 <- list_2 %>% pluck(1)
    x <- foreach(df_1 = list_1, .combine = "rbind") %dopar% {
        # do some test
        tibble()
    }
    x
}

options(clustermq.scheduler = "lsf", clustermq.template = "_targets_lsf.tmpl", clustermq.worker.timeout = 6000)
clustermq::register_dopar_cmq(n_jobs=32, memory=20000)

target <- list(
    tar_target(test, test_foreach(list_1, list_2), pattern = map(list_2))
    , tar_target(list_1, function_1())
    , tar_target(list_2, function_2())
)

So I think I am using multicore parallelism within targets?

P.S. I sourced the function from a script

liutiming commented 3 years ago

@mschubert unfortunately, after I set clustermq.worker.timeout = 6000, most workers still exit with a timeout (some finished before that).

Here is an example output. FYI, bc-30-2-12 is the node where I needed to start R (otherwise there would not be enough memory to send the jobs); the work is then submitted to other nodes. The R session was submitted with bsub -q yesterday -R"select[mem>30000] rusage[mem=30000]" -M30000 -Is R, and I am now trying to submit it to a normal queue to see if there is any change (update: submitting from another R session on a different queue did not help). It is very strange, because previously there were no such timeout errors (maybe because targets were generated quite quickly, unlike this time?) even though a lot of the submitted jobs were pending. This time the jobs are not pending, but it still results in a timeout.

> clustermq:::worker("tcp://bc-30-2-12:9332")
2020-12-30 22:11:15.917879 | Master: tcp://bc-30-2-12:9332
2020-12-30 22:11:15.940585 | WORKER_UP to: tcp://bc-30-2-12:9332
2020-12-30 22:12:14.044555 | > DO_SETUP (18.293s wait)
2020-12-30 22:12:14.046469 | token from msg: set_common_data_token
2020-12-30 22:18:58.082127 | > DO_CALL (58.606s wait)
2020-12-30 22:42:39.818437 | eval'd: target_run_workertarget
Error in clustermq:::worker("tcp://bc-30-2-12:9332") : 
  Timeout reached, terminating
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Execution halted

liutiming commented 3 years ago

To add on: most jobs time out after 600 seconds instead of 6000, which suggests the option is not being picked up?

mschubert commented 3 years ago

options([...], clustermq.worker.timeout = 6000)

Your timeout is set in the master process, but the worker reads the option again in its own R session (and hence doesn't know about it).

You need to either:

* set the option in an .Rprofile that the worker R sessions also read, or
* pass the timeout explicitly in your scheduler template

See here: https://github.com/mschubert/clustermq/blob/master/R/worker.r#L10-L11
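
For the .Rprofile route, that would be something like this in a profile the worker R sessions also read:

options(clustermq.worker.timeout = 6000)  # read via getOption() when the worker starts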

liutiming commented 3 years ago

Thanks so much! I will add that to my .Rprofile for now. How can I do it via the template? I am not familiar with the template file, so I would really appreciate some resources on this so I do not have to bother you too much ><

liutiming commented 3 years ago

Do you add it to this line?

CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'

but I am not too sure where to add it relative to the curly brackets.

mschubert commented 3 years ago

It would be (untested):

CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}", timeout=getOption("clustermq.worker.timeout", 600))'
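
For context, that call is the last line of the LSF template, so the full _targets_lsf.tmpl would look roughly like this (a sketch; keep whatever #BSUB directives your template already has):

#BSUB-J {{ job_name }}               # job name, filled in by clustermq
#BSUB-o {{ log_file | /dev/null }}   # per-worker log file
#BSUB-M {{ memory | 4096 }}          # memory requirement in Mbytes
#BSUB-R rusage[mem={{ memory | 4096 }}]

CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}", timeout=getOption("clustermq.worker.timeout", 600))'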

liutiming commented 3 years ago

Thanks! I will test that later.

For now though, after setting it in .Rprofile, the timeout issue might have been fixed, but there is now a memory allocation error:

> clustermq:::worker("tcp://node-12-2-3:7199")
2020-12-31 09:56:41.175863 | Master: tcp://node-12-2-3:7199
2020-12-31 09:56:41.194812 | WORKER_UP to: tcp://node-12-2-3:7199
2020-12-31 09:57:08.695212 | > DO_SETUP (17.643s wait)
2020-12-31 09:57:08.697145 | token from msg: set_common_data_token
2020-12-31 09:59:48.006753 | > DO_CALL (57.754s wait)
2020-12-31 10:23:31.016150 | eval'd: target_run_workertarget
Error: cannot allocate vector of size 5.2 Gb
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Execution halted

This is strange, because my tmpl file contains

#BSUB-R "span[hosts=1] select[mem>20000] rusage[mem=20000]"  # Memory requirements in Mbytes

and my clustermq register call:

clustermq::register_dopar_cmq(n_jobs=32, memory=20000)

both request 20000 MB of memory, so the requirement should have been met and there should not be a memory issue?

The master has 30 GB of memory and the error does not come from there (targets still appears to be running even though the workers have failed), so I do not think that is where the issue is...

liutiming commented 3 years ago

I am adding

options(clustermq.scheduler = "lsf", clustermq.template = "_targets_lsf.tmpl", clustermq.worker.timeout = 6000)
clustermq::register_dopar_cmq(n_jobs=32, memory=20000)

to .Rprofile as a test

mschubert commented 3 years ago

Error: cannot allocate vector of size 5.2 Gb

This error means that the worker cannot allocate 5.2 Gb on top of what it is already using. Most likely this brings you over 20 Gb in total, and then it crashes.
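
If the calculations really do need that much, request more memory per worker, e.g. (untested):

clustermq::register_dopar_cmq(n_jobs = 32, memory = 30000)  # Mbytes per worker

Note that your template hard-codes 20000 in its #BSUB-R line, so the memory= argument only reaches the scheduler if the template uses the {{ memory }} placeholder there instead.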

wlandau commented 3 years ago

It's worth noting here that targets runs tar_make() etc. in a callr process, callr::r() by default. callr::r() ignores your personal global .Rprofile file, so the option would need to be set either in your project's specific .Rprofile or in _targets.R.
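
So something like this at the top of _targets.R (a sketch reusing the options already quoted in this thread):

# _targets.R
library(targets)
options(
    clustermq.scheduler = "lsf",
    clustermq.template = "_targets_lsf.tmpl",
    clustermq.worker.timeout = 6000
)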

liutiming commented 3 years ago

Thank you both @mschubert @wlandau!

I have put the following config

options(clustermq.scheduler = "lsf", clustermq.template = "_targets_lsf.tmpl", clustermq.worker.timeout = 6000)
clustermq::register_dopar_cmq(n_jobs=32, memory=20000)

in both my main ~/.Rprofile and _targets.R. The log file now shows jobs being submitted like this, which I never saw before, so I am not sure whether this is the right way to do it...

Submitting 32 worker jobs (ID: cmq7002) ...
Running 29,695 calculations (2 objs/0.1 Mb common; 1 calls/chunk) ...
Master: [3.9s 62.4% CPU]; Worker: [avg 10.2% CPU, max 248.6 Mb]

However, the jobs have now been running for a few hours and I think something has gone wrong; one target usually takes only about 30 min, so I had to reduce the timeout again to see what the cause is...

mschubert commented 3 years ago

To sum up, this does not seem to be a timeout issue, but rather one related to nesting: targets jobs run an in-target foreach that itself submits jobs.

I will close this, but please reopen if you have a reproducible example that illustrates a problem on clustermq's side (aside from worker crashes not yet being tracked, which is https://github.com/mschubert/clustermq/issues/223).