It kept going!
✔ 2020-09-21 20:59:41.44 +1200 GMT skip branch polygons_point_sample_60143e02
✔ 2020-09-21 20:59:44.64 +1200 GMT skip branch polygons_point_sample_18988ee7
✔ 2020-09-21 20:59:47.37 +1200 GMT skip branch polygons_point_sample_4d3c30f8
✔ 2020-09-21 20:59:50.36 +1200 GMT skip branch polygons_point_sample_8fee86ed
✔ 2020-09-21 20:59:53.56 +1200 GMT skip branch polygons_point_sample_7dd1855c
✔ 2020-09-21 20:59:56.62 +1200 GMT skip branch polygons_point_sample_a2b67b49
✔ 2020-09-21 20:59:59.58 +1200 GMT skip branch polygons_point_sample_72a5bcb1
✔ 2020-09-21 21:00:02.51 +1200 GMT skip branch polygons_point_sample_6f707e4c
✔ 2020-09-21 21:00:05.40 +1200 GMT skip branch polygons_point_sample_33db920f
Still much, much slower to check the targets with tar_make_clustermq() than with tar_make(), but at least it was not stuck.
And here tar_make() starts allocating targets within a few seconds, versus about 10 minutes using clustermq.
This is getting delayed on a dynamic target that is not that big: 300 groups.
I also tried removing the cache and starting again, so the cache doesn't seem corrupted. It is just very slow to check that dynamic target and get going.
@kendonB, I mocked up a quick example based on what you described and tried it on an SGE cluster, and I cannot reproduce what you see.
# _targets.R
library(targets)
options(clustermq.scheduler = "sge", clustermq.template = "cmq.tmpl")
tar_pipeline(
tar_target(x, seq_len(300)),
tar_target(y, x, pattern = map(x))
)
# console
library(targets)
system.time(tar_make(reporter = "silent")) # runs locally
#> user system elapsed
#> 1.858 0.434 5.858
system.time(tar_make(reporter = "silent")) # skips
#> user system elapsed
#> 1.254 0.252 1.936
system.time(tar_make_clustermq(reporter = "silent")) # skips
#> user system elapsed
#> 1.285 0.230 1.867
tar_destroy()
system.time(tar_make_clustermq(reporter = "silent")) # runs on SGE
#> Master: [9.4s 15.3% CPU]; Worker: [avg 19.6% CPU, max 251.9 Mb]
#> user system elapsed
#> 2.133 0.513 10.673
system.time(tar_make_clustermq(reporter = "silent")) # skips
#> user system elapsed
#> 1.323 0.256 2.235
I am surprised this is happening for you because tar_make_clustermq() actually runs just like tar_make() right up until it finds an outdated target with deployment = "remote".
So I really do need a runnable reprex so I can replicate the bottleneck myself and get to the bottom of it.
In the meantime, it might be easier to profile your example and post the flame graphs. For profiling purposes, you will need to set callr_function = NULL.
library(proffer)
px <- pprof(
tar_make_clustermq(callr_function = NULL),
host = "0.0.0.0", # So you can navigate a browser to the head node of the cluster
port = 8888
)
Sys.info()["nodename"] # used to build the URL of the current node
browseURL("http://url_of_current_node.com:8888")
I also recommend updating to the latest version of targets. A while back, I fixed an issue that was causing the slice dependencies of branches to get re-hashed every time.
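For reference, a one-liner to grab the development version from GitHub (this assumes you have the remotes package installed):
# Install the development version of targets from GitHub.
remotes::install_github("wlandau/targets")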
The targets version is recent:
targets * 0.0.0.9000 2020-09-21 [1] Github (wlandau/targets@17a2346)
This is what I get when trying to profile.
> browseURL("http://mahuika02.mahuika.nesi.org.nz:8888")
> xdg-open: no method available for opening 'http://mahuika02.mahuika.nesi.org.nz:8888'
Let me see if I can pull out just the public data from the project and send it to you.
polygons_point_sample above isn't small (~2 GB), so it might be that it's loading the whole lot before checking with clustermq vs regular make?
Actually the behaviour would be consistent with checking the dependencies for all the slices before sending work out. Is there an easier way for me to step through what is being run on the head process from an interactive session?
Have you updated your installation of targets since https://github.com/wlandau/targets/commit/14eb32e5026fc5d5a01ff9b4fdeda36e9b7fe893? Because if _targets/meta/meta was produced before that commit, that could easily explain the bottleneck you're seeing now when it comes to skipping up-to-date targets.
If you run any of the tar_make*() functions with callr_function = NULL, it will run the master process in the current R session, and all the usual debugging tools will be available to you. And profiling with proffer will work.
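A minimal sketch of that workflow (my_slow_function and "R/functions.R" are placeholders for your own pipeline code, not part of targets):
library(targets)
# Source the script that defines your pipeline functions so debug() can see them.
source("R/functions.R")
debug(my_slow_function)  # placeholder name; substitute any function from your pipeline
# With callr_function = NULL, the master process runs in this R session,
# so debug(), browser(), and traceback() all work as usual.
tar_make_clustermq(callr_function = NULL)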
I'm running the current version and removed the targets folder and remade.
What can I put inside debug(.) to get it to browse after it starts checking dependencies? I'm struggling to step through.
First of all, is the problem gone after the update? If not, would you please post a reprex that I can run so I can figure out where the bottleneck is coming from?
If not, I recommend profiling with proffer so we're really sure we're debugging where we need to. It might involve tar_load_dep(), but we should really profile to be sure.
I have been running the latest version. To be sure, I also removed the targets folder yesterday and remade. Still seeing the problem. Profiling with proffer doesn't seem to be working for me on my cluster:
> .Last.error.trace
Stack trace:
1. proffer:::pprof(tar_make_clustermq(callr_function = NULL, names = "consolid ...
2. proffer:::serve_pprof(pprof = pprof, host = host, port = port, ...
3. proffer:::serve_pprof_impl(args)
4. proffer:::with_safe_path(Sys.getenv("PROFFER_GRAPHVIZ_BIN"), ...
5. processx::process$new(command = pprof_path(), args = args, stdout = "|", ...
6. .subset2(public_bind_env, "initialize")(...)
7. processx:::process_initialize(self, private, command, args, stdin, ...
8. rethrow_call(c_processx_exec, command, c(command, args), stdin, ...
x cannot start processx process '' (system error 2, No such file or directory) @unix/processx.c:590 (processx_exec)
>
I should be able to send the project to you - I'm just in the process of getting it to work with just the public data
What does proffer::pprof_sitrep() say? You may need to locally install Go and pprof with proffer::install_go(). See https://r-prof.github.io/proffer/#non-r-dependencies and https://r-prof.github.io/proffer/#configuration for details.
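Roughly, the check-and-install steps look like this:
library(proffer)
pprof_sitrep()  # reports whether Go and the pprof tool can be found
install_go()    # installs a local copy of Go (and pprof) if they are missing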
Got it - proffer won't work because I would need to authenticate through the browser and I have no way to do that on my cluster
You could run record_pprof() on the cluster, download the profiling samples, and then run serve_pprof() locally.
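A rough sketch of that split workflow, assuming record_pprof() returns the path to the collected samples:
# On the cluster:
library(proffer)
samples <- record_pprof(tar_make_clustermq(callr_function = NULL))
# Download the samples file to your local machine, then:
serve_pprof(samples)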
As per Murphy's law, when I remove the non-public part of the project it gets going pretty fast... I will try recording the profile.
So there seems to be some interaction with the downstream targets.
@wlandau I think I need to give up on this one as it's taking a bit too much time. I made the targets a bit bigger (and fewer) and now targets seems to run fine. I will revisit if I come across it again.
Sure, just let me know.
For reference, here is a counter-reprex with dynamic branching over subsets of a large data frame:
# _targets.R
options(
tidyverse.quiet = TRUE,
clustermq.scheduler = "sge",
clustermq.template = "sge.tmpl"
)
library(targets)
library(tidyverse)
tar_option_set(format = "fst_tbl")
big_data <- function(groups = 10, reps = 100) {
expand_grid(
tar_group = seq_len(groups),
rep = seq_len(reps)
) %>%
mutate(value = rnorm(n()))
}
mean_data <- function(data) {
data %>%
summarize(group = tar_group[1], mean = mean(value))
}
tar_pipeline(
tar_target(
data,
big_data(300, 4e5),
iteration = "group",
deployment = "local"
),
tar_target(mean, mean_data(data), pattern = map(data))
)
where sge.tmpl is:
#$ -N {{ job_name }}
#$ -t 1-{{ n_jobs }}
#$ -j y
#$ -o logs/
#$ -cwd
#$ -V
module load R/3.6.3
CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'
Up-to-date targets get skipped fast enough.
> library(targets)
> system.time(tar_make_clustermq(workers = 10))
● run target data
● run branch mean_494e7e27
● run branch mean_f641ae60
# ...
● run branch mean_cf2d91a1
Master: [199.1s 95.2% CPU]; Worker: [avg 1.5% CPU, max 325.4 Mb]
user system elapsed
273.531 125.723 409.420
>
> system.time(tar_make_clustermq(workers = 10))
✔ skip target data
✔ skip branch mean_cf2d91a1
✔ skip branch mean_494e7e27
# ...
✔ skip branch mean_29eeecfa
✔ Already up to date.
user system elapsed
2.622 0.440 4.755
>
> tar_meta(data, bytes) # 1.14 GB
# A tibble: 1 x 2
name bytes
<chr> <int>
1 data 1138662810
In answer to the comment you deleted, the low occupancy is probably due to the fact that the master process has trouble keeping up with so many quick-to-build branches (sub-targets). If you have 400 workers and the targets run quickly, the master process has to work really hard to keep all the workers busy.
When you start getting into the realm of thousands of branches, it's usually best to start batching computation into a smaller number of slower targets. See https://github.com/wlandau/targets-stan for an example of batching. tarchetypes::tar_rep() tries to make batched replication easier. Other (mostly static) branching utilities are at https://wlandau.github.io/tarchetypes/reference/index.html#section-branching.
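A minimal sketch of batched replication with tarchetypes::tar_rep(), assuming the tar_pipeline() API used elsewhere in this thread:
# _targets.R
library(targets)
library(tarchetypes)
tar_pipeline(
  tar_rep(
    sims,
    data.frame(value = rnorm(100)),  # one replication
    batches = 30,                    # 30 dynamic branches...
    reps = 10                        # ...each running 10 replications
  )
)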
Even this line from clustermq tells me that the master process in https://github.com/wlandau/targets/issues/169#issuecomment-697040994 had to work hard to keep 10 workers busy, and those 10 workers still were not that busy. It tells me that I may have needed to batch a bit more.
Master: [199.1s 95.2% CPU]; Worker: [avg 1.5% CPU, max 325.4 Mb]
I now think that it was partly due to garbage_collection = TRUE being on. Does this run in between subtargets that get checked but not run?
The targets each take around a minute when running in a single process or with 100 workers. When I bump it up to 200, the whole system just seems to stall after the first round of allocated targets. Have you tried a project with 200 workers going at once?
Every log I examine shows this:
2020-09-23 13:01:53.719059 | > WORKER_WAIT (0.001s wait)
2020-09-23 13:01:53.719415 | waiting 8.60s
2020-09-23 13:02:02.357401 | > WORKER_WAIT (0.000s wait)
2020-09-23 13:02:02.357942 | waiting 8.60s
2020-09-23 13:02:10.970166 | > WORKER_WAIT (0.001s wait)
2020-09-23 13:02:10.970613 | waiting 8.60s
2020-09-23 13:02:19.605107 | > WORKER_WAIT (0.001s wait)
2020-09-23 13:02:19.605544 | waiting 8.60s
2020-09-23 13:02:40.884501 | > DO_CALL (0.403s wait)
Registered S3 method overwritten by 'pryr':
method from
print.bytes Rcpp
237 MB # This is printed by the target code
2020-09-23 13:04:39.639825 | eval'd: target_run_remotetargetgarbage_collection
and the master is idle: CPU = 0%. Same as the one I posted before, where not all of the targets (172/200) initially got allocated.
Note that this is without garbage collection on.
I now think that it was partly due to garbage_collection = TRUE being on. Does this run in between subtargets that get checked but not run?
Thanks for catching that. Garbage collection should not run for skipped targets. Should be fixed in b1f1a3b16239d5dafa112b7e63e9ecb4b618edde.
An inexplicable stall sounds a lot like a memory consumption issue to me. Once again, I really need a reprex.
Also, like I said in https://github.com/wlandau/targets/issues/169#issuecomment-697042322, more workers do not always help. You might not actually be using them all. The master process has a lot of work to do, especially if you task it with the data management (the default), and it may not be able to keep up. Please keep an eye on those clustermq occupancy messages and consider batching. One minute per branch is actually quite fast, and you may be creating overhead, which in turn means the idle workers need to stay up longer. In fact, my company's sys admins recently told us to keep it to 100 workers or fewer.
Now for some guesswork about that stall: tar_option_set(storage = "remote", retrieval = "remote") might help (equivalent to drake::make(caching = "worker")). But keep in mind that if you do that, each worker is going to have to load the entire upstream dataset to get the slice it needs. Local branching early on could potentially address this by breaking up the data into dedicated branch targets first. Sketch:
# _targets.R
# ...
tar_option_set(storage = "remote", retrieval = "remote")
tar_pipeline(
tar_target(
data,
big_data(300, 4e5),
iteration = "group",
deployment = "local"
),
tar_target(slice, data, pattern = map(data), deployment = "local"),
tar_target(analysis, run_analysis(slice), pattern = map(slice))
)
I did try the following pipeline on 200 workers on SGE, and it completed just fine.
options(
tidyverse.quiet = TRUE,
clustermq.scheduler = "sge",
clustermq.template = "sge.tmpl"
)
library(targets)
library(tidyverse)
tar_option_set(format = "fst_tbl")
big_data <- function(groups = 10, reps = 100) {
expand_grid(
tar_group = seq_len(groups),
rep = seq_len(reps)
) %>%
mutate(value = rnorm(n()))
}
mean_data <- function(data) {
# This sleep line just makes sure the whole array job dequeues before all the targets complete.
# The pipeline runs fine for me with or without sleeping.
Sys.sleep(360)
data %>%
summarize(group = tar_group[1], mean = mean(value))
}
tar_pipeline(
tar_target(
data,
big_data(300, 1e5),
iteration = "group",
deployment = "local"
),
tar_target(slice, data, pattern = map(data), deployment = "local"),
tar_target(mean, mean_data(slice), pattern = map(slice))
)
Here's another discussion of batching: https://wlandau.github.io/targets-manual/dynamic.html#batching
I did try the following pipeline on 200 workers on SGE, and it completed just fine.
Did you find that all the workers ended up doing something here? You could monitor using reporter = "summary"?
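A quick usage sketch; the summary reporter prints a compact progress summary while the pipeline runs, so you can watch whether branches keep completing:
tar_make_clustermq(workers = 200, reporter = "summary")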
The workers in the array job on my cluster queue slowly, and I observed that new targets deployed pretty much as fast as new SGE workers initialized. So it looks like all the workers ended up doing something.
Prework
Confirm that the issue is most likely a genuine bug in targets and most likely not a user error. (If you run into an error and do not know the cause, please submit a "Trouble" issue instead.)
Description
I tried getting a clustermq job going a couple of times. It successfully got going quickly and I canceled them. Now it is just stuck before getting going. I'm guessing the cache is corrupted somehow.
It would get going very quickly when running with tar_make immediately before.
Output code