Use async queries API instead

nfultz commented 5 years ago

So I've been doing some googling about PyAthenaJDBC while trying to triage #16, came across this by @davoscollective:

https://medium.com/@davedecahedron/ive-tested-both-and-pyathenajdbc-is-a-lot-slower-i-suppose-partly-because-it-is-using-the-athena-fdf56a9b715

I have tested with various clients (Tableau, DBeaver, and basic Java app) and retrieving data is a lot slower than it should be. When you run a query with the AWS Athena console, a results csv is written very quickly to S3. I did a test with a single table of 15m rows. The Athena query and csv file to S3 completes in less than 2 minutes. I can then download it to my local machine in less than a minute. To do the same operation via the JDBC driver takes over 2 hours. I am now thinking a better strategy is to query the metadata then start the Athena query asynchronously, poll it until completion and then download the csv file directly from the s3 staging directory and combine with metadata for correct types.

this is just one anecdote, and he is writing about the 1.x version of the driver, but it may be worth exploring.

daroczig commented 5 years ago

I have a similar anecdote on the impala driver at http://datascience.la/r-and-impala-its-better-to-kiss-than-using-java :D But it turned out the JDBC driver was OK, the R wrapper was slow. Anyway, I think this is a promising idea, especially that we have really cool CSV parsers in R, so tweaking the column classes should be fairly easy. On the other hand, the classes will be auto-guessed by default instead of using the actual colclasses of the table or you plan to write some wrappers to do that properly?

hrbrmstr commented 5 years ago

One way I got around this was in https://github.com/hrbrmstr/awsathena/blob/master/R/collect-async.R (tidyverse-oriented) but once you get a query execution ID I have https://github.com/hrbrmstr/awsathena/blob/master/R/download-query-ex-res.R which uses https://github.com/hrbrmstr/awsathena/blob/master/R/s3-download-file.R for the S3 download.

I was overly conservative with this line https://github.com/hrbrmstr/awsathena/blob/master/R/s3-download-file.R#L37 and am going to make it a parameter so it can get on-par with the awscli.

internally we have a pkg with this function:

function(query_execution_id, continuous=TRUE) {

  repeat {

    awsathena::get_query_execution(
      query_execution_id, profile = "awsaml-181646978271"
    ) -> out

    if (!continuous) break

    if (out$state[1] %in% c("SUCCEEDED", "FAILED", "CANCELLED")) {
      if (out$state[1] == "FAILED") warning("Query failed")
      if (out$state %in% c("SUCCEEDED", "CANCELLED")) {
        cost <- (out$data_scanned / 1024 / 1024) * 0.000004768
        if (cost < 0.00004768) cost <- 0.00004768
        message("Cost = ", scales::dollar(cost))
      }
      break
    } else {
      message(out$state[1])
      Sys.sleep(5)
    }

  }

  return(out)

}

which supports the polling and spits back the S3 CSV in the return object list and I kinda generally just copy that to an awscli cmd #lazy

Since it doesn't get metadata it does mean relying on the intelligence of CSV parsers. That can be remedied with a call to the get-query-results API endpoint tho (and a length 0 max items)

nfultz / AWR.Athena

Use async queries API instead #18