hrbrmstr / sergeant

:guardsman: Tools to Transform and Query Data with 'Apache' 'Drill'
https://hrbrmstr.github.io/sergeant/

Implement write table methods #19

Open palmaresk8 opened 6 years ago

palmaresk8 commented 6 years ago

It would be nice if the sergeant package could create new tables the way I can with other database drivers. Something like this:

library(sergeant)
library(dplyr)
library(DBI)

conn <- dbConnect(Drill())

# Load original data
original_data <- tbl(conn, "`dfs.downloads`.`original_data`")

# Do some operation
edited_data <- original_data %>%
  filter(name == "John")

# Write the result to a table without loading it into R
dbWriteTable(conn, "`dfs.downloads`.`edited_data`", edited_data)
hrbrmstr commented 6 years ago

Agreed, but even the ODBC interface has issues with this, e.g. just try doing:

library(DBI)
library(odbc)
library(tidyverse)

DBI::dbConnect(
  odbc::odbc(), 
  driver = "/Library/mapr/drill/lib/libdrillodbc_sbu.dylib",
  ConnectionType = "Zookeeper",
  AuthenticationType = "No Authentication",
  ZKClusterID = "drillbits1",
  ZkQuorum = "localhost:2181",
  AdvancedProperties = "CastAnyToVarchar=true;HandshakeTimeout=30;QueryTimeout=180;TimestampTZDisplayTimezone=utc;ExcludedSchemas=sys,INFORMATION_SCHEMA;NumberOfPrefetchBuffers=5;"
) -> drill_con

dbCreateTable(drill_con, "dfs.tmp.mtcars", mtcars)

One reason for the ODBC interface error is the syntax Drill needs: Drill creates tables with CREATE TABLE ... AS SELECT (CTAS) rather than the column-definition DDL that DBI generates.
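
One way to see the mismatch (a small illustrative sketch, not from the original thread): ask DBI for the statement dbCreateTable() would send. Drill has no plain column-definition CREATE TABLE, only CTAS, so DDL of this shape is rejected.

library(DBI)

# Preview the ANSI-style column-definition DDL that dbCreateTable() would issue;
# Drill cannot parse this form since it only supports CREATE TABLE ... AS SELECT.
sqlCreateTable(ANSI(), "dfs.tmp.mtcars", mtcars)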

One path I initially went down was to save off a data frame as JSON and then use CREATE TABLE x AS SELECT cols FROM thejsonfile, but that would assume Drill is running on the local host. Trying to generalize that (e.g. adding an option to automatically scp the JSON to a remote Drill server, or an option to point the CTAS call at a specific path if it were, say, an NFS share) would require quite a bit of condition handling and hasn't made it back to the priority list.
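
For reference, that first path looks roughly like this. It's a minimal sketch assuming Drill is running on the local host, that the dfs.tmp workspace maps to a local directory, and that the file and table names are made up; drill_connection() and drill_query() are sergeant's REST helpers.

library(sergeant)
library(jsonlite)

dc <- drill_connection("localhost")

# Save the data frame as newline-delimited JSON somewhere Drill can read it
stream_out(mtcars, file("/tmp/mtcars.json"))

# Then CTAS from that JSON file into a new table
drill_query(dc, "
  CREATE TABLE dfs.tmp.`mtcars_ctas` AS
  SELECT * FROM dfs.tmp.`mtcars.json`
")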

I also went down the path of using CTAS with a VALUES() list, but the REST interface really does not like it when you do that with a ton of columns and values.
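
For completeness, that second approach amounts to inlining the data into the statement itself, roughly like the sketch below. ctas_values_sql() is a made-up helper (not part of sergeant), it only handles numeric and character columns, and as noted the REST interface struggles once the generated statement has many columns and rows.

library(sergeant)

# Hypothetical helper: render a data frame as one CTAS + VALUES() statement.
# Column names aren't carried over in this minimal form (Drill labels them EXPR$0, EXPR$1, ...).
ctas_values_sql <- function(tbl_name, df) {
  quote_val <- function(v) {
    if (is.numeric(v)) as.character(v)
    else sprintf("'%s'", gsub("'", "''", as.character(v)))
  }
  rows <- vapply(seq_len(nrow(df)), function(i) {
    vals <- vapply(df[i, , drop = FALSE], quote_val, character(1))
    sprintf("(%s)", paste(vals, collapse = ", "))
  }, character(1))
  sprintf("CREATE TABLE %s AS SELECT * FROM (VALUES %s)",
          tbl_name, paste(rows, collapse = ", "))
}

dc <- drill_connection("localhost")
drill_query(dc, ctas_values_sql("dfs.tmp.`mtcars_vals`", head(mtcars)))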

I'll keep the issue open, though, in the event I get some time or others find it and have other ideas on how to make it work.