ropensci / ssh

Native SSH client in R based on libssh
https://docs.ropensci.org/ssh

Make SSH tunnel to query a remote Redshift database in R #55

Open fissehab opened 2 years ago

fissehab commented 2 years ago

I am using the Python code below to get data from a remote Redshift database, calling the function from R via the reticulate package. I want to convert the Python code to R so that I do not have the hassle of setting up an Anaconda environment in the cloud when I push my code (a shiny app) to shinyapps.io. I read about the ssh package and its capabilities, but I could not successfully convert my code using it.

def get_redshift_data(query):
    import pandas as pd
    import paramiko
    from sqlalchemy import create_engine
    from sqlalchemy.engine import url as u
    from sshtunnel import SSHTunnelForwarder

    # Private key used to authenticate against the SSH jump host
    mypkey = paramiko.RSAKey.from_private_key_file('nfw-linux-key.pem')

    # Redshift credentials and endpoint
    sql_username = 'my_sql_username'
    sql_hostname = 'xxxx.us-east-1.redshift.amazonaws.com'
    sql_password = 'my_sql_password'
    sql_main_database = 'my_db'
    sql_port = 5439

    # SSH jump host
    ssh_host = '127.0.0.1'
    ssh_user = 'ubuntu'
    ssh_port = 56714

    # Forward a local port through the SSH host to the Redshift endpoint
    with SSHTunnelForwarder(
            (ssh_host, ssh_port),
            ssh_username=ssh_user,
            ssh_pkey=mypkey,
            remote_bind_address=(sql_hostname, sql_port)) as tunnel:
        conn_url = u.URL(drivername='postgresql+psycopg2', username=sql_username,
                         password=sql_password, host='127.0.0.1',
                         port=tunnel.local_bind_port, database=sql_main_database)
        conn = create_engine(conn_url)
        result = pd.read_sql(query, conn)
    return result

Then I source the Python function above to get data from the database:

library(reticulate)
source_python("get_redshift_data.py")
df <- get_redshift_data("select top 100 * from my_db.my_table")

But I want to avoid the Python part and use R only.

FlorianSchwendinger commented 2 years ago

You can try something like this. Note that ssh::ssh_tunnel() blocks the calling R session while the tunnel is open, so the tunnel has to run in a separate background process:

start_tunnel_process <- function(user, server, port, stderr = nullfile()) {
    proc <- callr::r_bg(
        function(user, server, port) {
            ssh_host <- paste(user, server, sep = "@")
            # ssh_tunnel() serves one connection and returns when the client
            # disconnects, so loop to re-create the tunnel each time.
            while (TRUE) {
                ssh_session <- ssh::ssh_connect(ssh_host, keyfile = "~/.ssh/id_rsa", verbose = FALSE)
                # Forward the local 'port' to the target, which is resolved
                # from the SSH server. Here it is a MongoDB on the server
                # itself; for Redshift it would be the cluster endpoint,
                # e.g. "xxxx.us-east-1.redshift.amazonaws.com:5439".
                ssh::ssh_tunnel(ssh_session, port = port, target = "localhost:27017")
                ssh::ssh_disconnect(ssh_session)
            }
        },
        args = list(user, server, port),
        stdout = nullfile(),
        stderr = stderr
    )
    proc
}
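
For example, to start a tunnel on local port 27017 (the user and server name below are placeholders):

# Hypothetical values; replace with your own jump host and user.
proc <- start_tunnel_process(user = "ubuntu", server = "myserver.example.com", port = 27017)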

I am currently evaluating this for my use case with mongolite:

library(mongolite)

# With the tunnel above, host is "127.0.0.1" and port is the local tunnel port.
url <- sprintf("mongodb://%s:%s@%s:%s", user, password, host, as.integer(port))
con <- mongo(collection, db = database, url = url)
con$find()
con$disconnect()

You need the while loop because the tunnel closes as soon as you disconnect; the loop then creates a new one. This also means it will fail if the time between two connections is too short for the background process to create a new tunnel. You can work around this by putting a fault-tolerant loop around your connect call:

# Retry until the background process has re-created the tunnel.
for (i in seq_len(100)) {
    status <- try(con <- mongo(collection, db = database, url = url), silent = TRUE)
    if (!inherits(status, "try-error")) break
    Sys.sleep(0.2)
}

This works, but it is also kind of a hack, since the background R process runs at 100% CPU the whole time, which is too much.
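
For the original Redshift question, the same pattern should carry over with a Postgres client instead of mongolite. A minimal, untested sketch, assuming the DBI and RPostgres packages, with the tunnel target in start_tunnel_process() changed to the Redshift endpoint (all credentials and hostnames below are placeholders):

library(DBI)

# Start the background tunnel; assumes start_tunnel_process() above with
# target set to "xxxx.us-east-1.redshift.amazonaws.com:5439".
proc <- start_tunnel_process(user = "ubuntu", server = "myserver.example.com", port = 5439)

# Retry until the tunnel is up, then connect through the local port.
con <- NULL
for (i in seq_len(100)) {
    status <- try(con <- DBI::dbConnect(
        RPostgres::Postgres(),
        host = "127.0.0.1",
        port = 5439,
        dbname = "my_db",
        user = "my_sql_username",
        password = "my_sql_password"
    ), silent = TRUE)
    if (!inherits(status, "try-error")) break
    Sys.sleep(0.2)
}

df <- DBI::dbGetQuery(con, "select top 100 * from my_db.my_table")
DBI::dbDisconnect(con)
proc$kill()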