ropensci / drake

An R-focused pipeline toolkit for reproducibility and high-performance computing

https://docs.ropensci.org/drake

GNU General Public License v3.0

implement exponential backoff in mc_master? #537

Closed kendonB closed 6 years ago

kendonB commented 6 years ago

I am using future_lapply parallelism and see that fl_master is consuming CPU resources.

Ultimately I think this ends up in mc_master in a while loop that's checking every 0.1 seconds (by default).

Is it possible to have the default be an exponential backoff going from 0.1 seconds to 2 minutes if nothing happens in the loop?

wlandau commented 6 years ago

Interesting. It is certainly possible, but maybe a niceness level in the system2() call would be easier. I will consider a backoff function too, but I am concerned that it might slow down workflows with quick targets.

kendonB commented 6 years ago

Highly doubtful it would slow anything down if the initial value is set low enough. The first value could be set well lower than 0.1, I believe. Of course, note that an exponential backoff resets when something changes.
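As a rough illustration of the proposal above, the following sketch simulates a polling loop whose wait time doubles after every idle check and resets as soon as something happens. The helper names (`backoff_interval`, `simulate_polling`) and the doubling factor are assumptions made for this example; they are not taken from drake's `mc_master`. The bounds of 0.1 seconds and 2 minutes come from the original request.

```r
# Hypothetical helper: the interval grows geometrically with the number of
# consecutive idle checks and is capped at `max_interval` seconds.
backoff_interval <- function(idle_checks, initial = 0.1,
                             max_interval = 120, factor = 2) {
  min(max_interval, initial * factor^idle_checks)
}

# Simulate a polling loop over a fixed sequence of check outcomes.
# TRUE means "something happened", FALSE means "nothing happened".
simulate_polling <- function(events, ...) {
  idle_checks <- 0
  intervals <- numeric(length(events))
  for (k in seq_along(events)) {
    if (events[k]) {
      idle_checks <- 0                # reset the backoff on any change
    } else {
      idle_checks <- idle_checks + 1  # back off a little further
    }
    intervals[k] <- backoff_interval(idle_checks, ...)
    # A real loop would call Sys.sleep(intervals[k]) here.
  }
  intervals
}

intervals <- simulate_polling(c(FALSE, FALSE, FALSE, TRUE, FALSE))
```

Because the counter resets on every change, a workflow with quick targets would mostly see the initial interval, while a long idle stretch would drift toward the cap.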

wlandau commented 6 years ago

Do you know of similar tools that use an exponential backoff like this? The idea is new to me, and it would be nice to see other places where it plays out.

kendonB commented 6 years ago

I believe it's in batchtools somewhere. I can look when I'm back at my computer.

wlandau commented 6 years ago

Thanks.

@mschubert, what would you recommend?

wlandau commented 6 years ago

Aha, I see it at https://github.com/mllg/batchtools/blob/3b0b1a9a59e377bb4d827e355d6955d66849c9e6/R/sleep.R#L4. I will probably end up adding an optional sleep argument to make(), but with the default still set to function(i){0.01}. Unlike batchtools, drake needs to accommodate local lightweight parallelism too, and I think the default options should accommodate small workflows with low overhead. In any case, the default is easy to change later.
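For illustration, user-supplied functions of the kind described above might look like the following sketch. It assumes the argument `i` counts consecutive idle checks starting at 1. The name `sleep_backoff` and its constants are hypothetical, not drake or batchtools defaults, and the curve is a saturating one rather than strict doubling.

```r
# Constant interval, matching the default mentioned above.
sleep_constant <- function(i) {
  0.01
}

# Hypothetical saturating backoff: returns `lower` seconds when i = 1
# and approaches `upper` seconds as i grows.
sleep_backoff <- function(i, lower = 0.01, upper = 120, rate = 0.01) {
  lower + (upper - lower) * stats::pexp(i - 1, rate = rate)
}

# Inspect how the wait time grows with the idle counter.
waits <- sapply(c(1, 10, 100, 1000), sleep_backoff)
```

Either function could then be supplied through the optional sleep argument, so small local workflows keep the constant default while long-running cluster jobs opt in to the backoff.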

mschubert commented 6 years ago

Yes, every tool that needs to continuously check if new data is available instead of being notified when that is the case will incur a certain CPU cost based on the check interval.

I'm not sure I fully understand the issue, but I guess that's what's going on here.

Backing off the interval based on a low frequency of positive checks makes sense here if a passive notification of result availability is not possible.

However, if drake processes long calls first and short calls later, this should be handled as well (i.e., the check interval should be able to grow when no results arrive and shrink again when they do).
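One way to picture an interval that moves in both directions is a multiplicative update rule: lengthen the wait after an empty check and shorten it after a check that finds a result. The sketch below is only an assumption about how such a rule could look; `next_interval` and its parameters are hypothetical names, not part of drake or clustermq.

```r
# Hypothetical update rule: lengthen the interval on an empty check,
# shorten it on a check that found a result, and clamp to fixed bounds.
next_interval <- function(current, got_result,
                          min_interval = 0.01, max_interval = 120,
                          grow = 2, shrink = 0.5) {
  proposed <- if (got_result) current * shrink else current * grow
  max(min_interval, min(max_interval, proposed))
}

# Replay a sequence of check outcomes and record the resulting intervals.
outcomes <- c(FALSE, FALSE, FALSE, TRUE, TRUE, FALSE)
interval <- 0.1
history <- numeric(0)
for (got_result in outcomes) {
  interval <- next_interval(interval, got_result)
  history <- c(history, interval)
}
```

Compared with a hard reset, a gradual shrink reacts more slowly to a burst of short calls but avoids jumping back to rapid polling after a single isolated result.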

wlandau commented 6 years ago

Thanks! I will continue thinking about backing off the interval as a default. In the meantime, I will merge #545 after the builds complete so users can insert their own backoff functions.

As far as I know, Shiny's reactivity model uses a passive notification system based on callback functions. If I were to rearchitect drake from scratch, I would try to use something like this. Not only would it cut down unnecessary sleeping while minimizing the CPU load, it might also do away with the need to construct the entire dependency graph. I think this is the crux of drake's overhead issues for ~10000+ targets and the dynamic dependency relationships required for #233 and #304.

mschubert commented 6 years ago

> it might also do away with the need to construct the entire dependency graph

I'm not sure how this would be possible?

FWIW, if you use clustermq as a backend, polling the result socket should not incur any significant CPU cost if no results arrive. In that case you would not want to add a delay between polls: it has no advantage except limiting how fast results are processed when they arrive very quickly.

wlandau commented 6 years ago

I am not sure if it's possible either, but I think it deserves some thought. With a passive notification model, drake could start with targets with no dependencies. When those targets finish, they could broadcast to the rest of the targets, and that could trigger targets that no longer have anything holding them back.
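The notification idea described above can be sketched as a small single-process simulation: each target keeps a count of unfinished dependencies, and a finished target notifies its listeners, which become ready once their count reaches zero. The target names and the `run_by_notification` helper are hypothetical, and this is effectively a Kahn-style topological traversal rather than anything drake implements.

```r
# Hypothetical dependency list: each target maps to the targets it depends on.
deps <- list(
  raw = character(0),
  config = character(0),
  clean = c("raw", "config"),
  model = "clean",
  report = c("model", "config")
)

run_by_notification <- function(deps) {
  # Number of unfinished dependencies holding each target back.
  blocked_by <- vapply(deps, length, integer(1))
  # Reverse map: which targets to notify when a given target finishes.
  listeners <- lapply(names(deps), function(target) {
    names(deps)[vapply(deps, function(d) target %in% d, logical(1))]
  })
  names(listeners) <- names(deps)

  ready <- names(deps)[blocked_by == 0]  # start with dependency-free targets
  finished <- character(0)
  while (length(ready) > 0) {
    target <- ready[1]
    ready <- ready[-1]
    finished <- c(finished, target)  # a real system would build the target here
    for (listener in listeners[[target]]) {
      blocked_by[[listener]] <- blocked_by[[listener]] - 1L
      if (blocked_by[[listener]] == 0L) {
        ready <- c(ready, listener)  # nothing is holding this target back now
      }
    }
  }
  finished
}

build_order <- run_by_notification(deps)
```

Note that this sketch still derives the reverse map from the complete dependency list up front, so on its own it does not show that constructing the graph can be avoided.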

I did not think clustermq would throttle the CPU, and I am glad you confirmed this. Is this taken care of in w$receive_data()? If not, is there anything you think I should change in drake's cmd_master() function?

https://github.com/ropensci/drake/blob/21f0ba1c17e539f821e4163ab26e8e39a9793a11/R/clustermq.R#L36-L52

wlandau commented 6 years ago

Fixed via #545.

mschubert commented 6 years ago

w$receive_data() will block until a result arrives (at negligible cpu cost), so that's all fine; no need to do anything extra