Closed asiripanich closed 4 years ago
It's good that you defined a function to wrap the steps of connecting to the database. However, I would advise against exposing the details as targets in the plan. Targets are meant to be intermediate data objects that make sense to save to storage and retrieve in another session. SSH tunnel objects and database connection objects are ephemeral and thus do not really make great targets. In addition, each of those targets probably runs fast, so there is not much time saved by splitting all that into different targets. The section at https://books.ropensci.org/drake/plans.html#how-to-choose-good-targets touches on these sorts of best practices.
So in your case, I would wrap all the targets you listed above into a single data
target and define functions to wrap it all together. Sketch:
library(drake)
library(mongolite)
library(ssh)
library(callr)
create_ssh_tunnel <- function (ssh_user, ssh_server, ssh_target = "localhost:XXXXX",
ssh_tunnel_port = 27019)
{
callr::r_bg(function(ssh_server, ssh_user, ssh_target, ssh_tunnel_port) {
message("Tunneling")
session <- ssh::ssh_connect(glue::glue("{ssh_user}@{ssh_server}"))
ssh::ssh_tunnel(session, port = ssh_tunnel_port, target = ssh_target)
}, args = list(ssh_server, ssh_user, ssh_target, ssh_tunnel_port),
stdout = nullfile(), stderr = nullfile())
}
get_data <- function(ssh_server = "XXX", ssh_user = "XXX") {
tunnel <- create_ssh_tunnel(ssh_user = ssh_user, ssh_server = ssh_server)
con <- mongolite::mongo(collection = "A_COLLECTION", db = "DB_NAME", url = "localhost:XXXXX")
data = con$find()
)
plan <- drake_plan(data = get_data())
For more efficient storage of the data you eventually get, I would consider the specialized storage formats at https://books.ropensci.org/drake/plans.html#special-data-formats-for-targets, perhaps "fst_tbl" or "fst_dt".
plan <- drake_plan(data = target(get_data(), format = "fst_dt"))
Thanks for the advice. The main problem is with create_ssh_tunnel
when I run the above code using r_make()
on _drake.R
I get the following error:
Error: callr subprocess failed: target data failed.
diagnose(data)error$message:
No suitable servers found (`serverSelectionTryOnce` set): [socket timeout calling ismaster on 'localhost:27019']
diagnose(get_data)error$calls:
1. └─global::get_data(...)
2. └─mongolite::mongo(...)
3. └─mongolite:::mongo_collection_command_simple(col, "{\"ping\":1}")
It seems that the tunnel process disappeared before the database connection could be made.
Do you get the same error running create_ssh_tunnel()
outside drake
? Unfortunately, I do not have experience with MongoDB or the resources to test your code myself.
Yes it works fine outside of drake. Can you think of any reason why it behaves differently in r_make? I can do some testings and report them here.
Maybe try lock_envir = FALSE
in drake_config()
? Maybe your function is trying to modify the global environment somehow and drake
is trying to prevent that.
I still get the exact same error as before with lock_envir = FALSE
in drake_config()
.
One next thing to try is use make()
instead of r_make()
. If the error persists, what happens if you take out r_bg()
and then run make()
with debug(create_ssh_tunnel)
? Are the internal objects like session
and ssh_port_tunnel
what you expect them to be?
Prework
drake
's code of conduct.drake-r-package
tag. (If you anticipate extended follow-up and discussion, you are already in the right place!)Question
Hi, I'm working with a remote database which I need to create a ssh tunnel to before connecting to it. Could you please advice how I can include these steps in a drake plan?
Here are pieces of what I have so far.
Thanks ;)