ropensci / drake

An R-focused pipeline toolkit for reproducibility and high-performance computing
https://docs.ropensci.org/drake
GNU General Public License v3.0
1.34k stars 128 forks source link

Connect to a remote database using SSH tunneling inside a drake plan #1279

Closed asiripanich closed 4 years ago

asiripanich commented 4 years ago

Prework

Question

Hi, I'm working with a remote database which I need to create a ssh tunnel to before connecting to it. Could you please advice how I can include these steps in a drake plan?

Here are pieces of what I have so far.

library(drake)
library(mongolite)
library(ssh)
library(callr)

create_ssh_tunnel <- function (ssh_user, ssh_server, ssh_target = "localhost:XXXXX", 
          ssh_tunnel_port = 27019) 
{
  callr::r_bg(function(ssh_server, ssh_user, ssh_target, ssh_tunnel_port) {
    message("Tunneling")
    session <- ssh::ssh_connect(glue::glue("{ssh_user}@{ssh_server}"))
    ssh::ssh_tunnel(session, port = ssh_tunnel_port, target = ssh_target)
  }, args = list(ssh_server, ssh_user, ssh_target, ssh_tunnel_port), 
  stdout = nullfile(), stderr = nullfile())
}

plan <- drake_plan(
  # settings --------------
  ssh_server = "XXX",
  ssh_user = "XXX",
  tunnel = create_ssh_tunnel(ssh_user = ssh_user, ssh_server = ssh_server),
  con = mongolite::mongo(collection = "A_COLLECTION", db = "DB_NAME", url = "localhost:XXXXX"),
  data = con$find()
)

Thanks ;)

wlandau commented 4 years ago

It's good that you defined a function to wrap the steps of connecting to the database. However, I would advise against exposing the details as targets in the plan. Targets are meant to be intermediate data objects that make sense to save to storage and retrieve in another session. SSH tunnel objects and database connection objects are ephemeral and thus do not really make great targets. In addition, each of those targets probably runs fast, so there is not much time saved by splitting all that into different targets. The section at https://books.ropensci.org/drake/plans.html#how-to-choose-good-targets touches on these sorts of best practices.

So in your case, I would wrap all the targets you listed above into a single data target and define functions to wrap it all together. Sketch:

library(drake)
library(mongolite)
library(ssh)
library(callr)

create_ssh_tunnel <- function (ssh_user, ssh_server, ssh_target = "localhost:XXXXX", 
          ssh_tunnel_port = 27019) 
{
  callr::r_bg(function(ssh_server, ssh_user, ssh_target, ssh_tunnel_port) {
    message("Tunneling")
    session <- ssh::ssh_connect(glue::glue("{ssh_user}@{ssh_server}"))
    ssh::ssh_tunnel(session, port = ssh_tunnel_port, target = ssh_target)
  }, args = list(ssh_server, ssh_user, ssh_target, ssh_tunnel_port), 
  stdout = nullfile(), stderr = nullfile())
}

get_data <- function(ssh_server = "XXX", ssh_user = "XXX") {
  tunnel <- create_ssh_tunnel(ssh_user = ssh_user, ssh_server = ssh_server)
  con <- mongolite::mongo(collection = "A_COLLECTION", db = "DB_NAME", url = "localhost:XXXXX")
  data = con$find()
)

plan <- drake_plan(data = get_data())

For more efficient storage of the data you eventually get, I would consider the specialized storage formats at https://books.ropensci.org/drake/plans.html#special-data-formats-for-targets, perhaps "fst_tbl" or "fst_dt".

plan <- drake_plan(data = target(get_data(), format = "fst_dt"))
asiripanich commented 4 years ago

Thanks for the advice. The main problem is with create_ssh_tunnel when I run the above code using r_make() on _drake.R I get the following error:

Error: callr subprocess failed: target data failed.
diagnose(data)error$message:
  No suitable servers found (`serverSelectionTryOnce` set): [socket timeout calling ismaster on 'localhost:27019']
diagnose(get_data)error$calls:
  1. └─global::get_data(...)
  2.   └─mongolite::mongo(...)
  3.     └─mongolite:::mongo_collection_command_simple(col, "{\"ping\":1}")

It seems that the tunnel process disappeared before the database connection could be made.

wlandau commented 4 years ago

Do you get the same error running create_ssh_tunnel() outside drake? Unfortunately, I do not have experience with MongoDB or the resources to test your code myself.

asiripanich commented 4 years ago

Yes it works fine outside of drake. Can you think of any reason why it behaves differently in r_make? I can do some testings and report them here.

wlandau commented 4 years ago

Maybe try lock_envir = FALSE in drake_config()? Maybe your function is trying to modify the global environment somehow and drake is trying to prevent that.

asiripanich commented 4 years ago

I still get the exact same error as before with lock_envir = FALSE in drake_config().

wlandau commented 4 years ago

One next thing to try is use make() instead of r_make(). If the error persists, what happens if you take out r_bg() and then run make() with debug(create_ssh_tunnel)? Are the internal objects like session and ssh_port_tunnel what you expect them to be?