Open nfultz opened 5 years ago
I have a similar anecdote on the impala driver at http://datascience.la/r-and-impala-its-better-to-kiss-than-using-java :D But it turned out the JDBC driver was OK, the R wrapper was slow. Anyway, I think this is a promising idea, especially that we have really cool CSV parsers in R, so tweaking the column classes should be fairly easy. On the other hand, the classes will be auto-guessed by default instead of using the actual colclasses of the table or you plan to write some wrappers to do that properly?
One way I got around this was in https://github.com/hrbrmstr/awsathena/blob/master/R/collect-async.R (tidyverse-oriented) but once you get a query execution ID I have https://github.com/hrbrmstr/awsathena/blob/master/R/download-query-ex-res.R which uses https://github.com/hrbrmstr/awsathena/blob/master/R/s3-download-file.R for the S3 download.
I was overly conservative with this line https://github.com/hrbrmstr/awsathena/blob/master/R/s3-download-file.R#L37 and am going to make it a parameter so it can get on-par with the awscli.
internally we have a pkg with this function:
function(query_execution_id, continuous=TRUE) {
repeat {
awsathena::get_query_execution(
query_execution_id, profile = "awsaml-181646978271"
) -> out
if (!continuous) break
if (out$state[1] %in% c("SUCCEEDED", "FAILED", "CANCELLED")) {
if (out$state[1] == "FAILED") warning("Query failed")
if (out$state %in% c("SUCCEEDED", "CANCELLED")) {
cost <- (out$data_scanned / 1024 / 1024) * 0.000004768
if (cost < 0.00004768) cost <- 0.00004768
message("Cost = ", scales::dollar(cost))
}
break
} else {
message(out$state[1])
Sys.sleep(5)
}
}
return(out)
}
which supports the polling and spits back the S3 CSV in the return object list and I kinda generally just copy that to an awscli cmd #lazy
Since it doesn't get metadata it does mean relying on the intelligence of CSV parsers. That can be remedied with a call to the get-query-results API endpoint tho (and a length 0 max items)
So I've been doing some googling about PyAthenaJDBC while trying to triage #16, came across this by @davoscollective:
https://medium.com/@davedecahedron/ive-tested-both-and-pyathenajdbc-is-a-lot-slower-i-suppose-partly-because-it-is-using-the-athena-fdf56a9b715
this is just one anecdote, and he is writing about the 1.x version of the driver, but it may be worth exploring.