paws-r / paws

Paws, a package for Amazon Web Services in R
https://www.paws-r-sdk.com

Question: alternative to writeBin when downloading large files from s3 #242

Closed DyfanJones closed 1 year ago

DyfanJones commented 4 years ago

Hi All,

This is more of a question rather than an issue with the paws sdk.

I have run into a slight issue when downloading large data files from S3, and I wondered if either of you knows of an alternative method.

The problem is with base::writeBin, which is documented as follows: "Only 2^31 - 1 bytes can be written in a single call (and that is the maximum capacity of a raw vector on 32-bit platforms)." This means that for large files exceeding this limit it returns an error:

Error in writeBin(obj$Body, con = File) : 
  long vectors not supported yet: connections.c:4418

My current solution is to chunk up the raw connection and call writeBin to append to a file (https://github.com/DyfanJones/noctua/pull/61). However this would cause a slight duplication of the data every time I chunk up the raw connection.

Example

# helper function to chunk up a raw vector and append each chunk to a file
write_bin <- function(
  value,
  filename,
  chunk_size = 2L^20L
) {
  total_size <- length(value)
  start_byte <- 1L
  while (start_byte <= total_size) {
    end_byte <- min(start_byte + chunk_size - 1L, total_size)
    this_chunk <- value[start_byte:end_byte]
    con <- file(filename, "a+b")
    writeBin(this_chunk, con)
    close(con)
    start_byte <- start_byte + chunk_size
  }
}

s3 <- paws::s3()

# Download the file and store the output in a variable
s3_download <- s3$get_object(
  Bucket = "my_bucket,
  Key =  "large_s3_file.csv"
)

# Write output to file
file_name2 <- "large_s3_file.csv"
write_bin(s3_download$Body, file_name2)

If either of you knows an alternative method, that would be amazing. If not, I will keep looking around to see if there is a more elegant solution.

davidkretch commented 4 years ago

We don't have a solution to this either, but we'll be looking into it soon since we're going to be working on large downloads from S3. We'll let you know if we come upon a good solution, and if you find one please let us know as well!

DyfanJones commented 4 years ago

The best method I have found so far is:

write_bin <- function(
  value,
  filename,
  chunk_size = 2L ^ 20L) {

  # create splits for value
  total_size <- length(value)
  split_vec <- seq(1, total_size, chunk_size)

  # create connection to file
  con <- file(filename, "a+b")
  # close connection on exit
  on.exit(close(con))

  # write the raw vector to file in chunks
  sapply(split_vec, function(x) {
    writeBin(value[x:min(total_size, x + chunk_size - 1)], con)
  })
}

This is significantly faster, but I feel there has to be an even faster solution:

Raw vector size: 4.4Gb
while loop method: 
# user  system elapsed 
# 96.00   59.83  214.59 

sapply method:
# user  system elapsed 
# 35.81   12.11   51.45

DyfanJones commented 4 years ago

It looks like this writeBin limitation may be addressed in R 4.0.0: https://github.com/HenrikBengtsson/Wishlist-for-R/issues/97

davidkretch commented 4 years ago

Thank you! We will try your approach.

DyfanJones commented 4 years ago

Just found a possible alternative: readr has a function write_file that allows a raw vector to be written out to a file. For example:

library(readr)

write.csv(iris, "iris.csv")

# Raw object
obj <- readBin("iris.csv", "raw", n = file.size("iris.csv"))

# writing out raw object as a csv file
write_file(obj, "iris2.csv")

I will do a speed test after work to see how it compares to write_bin.

Note: The only problem with this method is that it will only work for flat file outputs.

DyfanJones commented 4 years ago

A quick speed test:

library(readr)

X <- 1e8

df <- 
  data.frame(
             w = runif(X),
             x = 1:X,
             y = sample(letters, X, replace = T), 
             z = sample(c(TRUE, FALSE), X, replace = T))

# write test data.frame
write_csv(df, "test.csv")

write_bin <- function(
  value,
  filename,
  chunk_size = 2L ^ 20L) {

  total_size <- length(value)
  split_vec <- seq(1, total_size, chunk_size)

  con <- file(filename, "a+b")
  on.exit(close(con))

  if (length(split_vec) == 1) {
    writeBin(value, con)
  } else {
    sapply(split_vec, function(x) {
      writeBin(value[x:min(total_size, x + chunk_size - 1)], con)
    })
  }
  invisible(TRUE)
}

system.time(obj <- readBin("test.csv", "raw", n = file.size("test.csv")))
# user  system elapsed 
# 1.141   3.397   7.145 

system.time(obj <- read_file_raw("test.csv"))
# user  system elapsed 
# 2.947   8.147  24.245 

format(object.size(obj), units = "auto")
# 3.3 Gb

system.time(write_file(obj, "test2.csv"))
# user  system elapsed 
# 0.569   2.118   3.440 

system.time(write_bin(obj, "test3.csv"))
# user  system elapsed 
# 30.275  19.424  55.037 

It looks like base R's readBin is fairly quick and doesn't need replacing. However, readr::write_file is really quick, roughly 10x faster than the write_bin loop. I think readr has given us a good alternative. Sadly that would mean an extra dependency for the noctua package, but a speed-up like this can't be ignored :D
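
For reference, here is a minimal sketch of how readr::write_file could slot into the original download example. The bucket and key are placeholders, and it assumes get_object returns the Body as a raw vector, as in the example above:

library(readr)

s3 <- paws::s3()

# download the object into memory as a raw vector
s3_download <- s3$get_object(
  Bucket = "my_bucket",
  Key = "large_s3_file.csv"
)

# write_file() accepts a raw vector directly, so no manual chunking is needed
write_file(s3_download$Body, "large_s3_file.csv")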

DyfanJones commented 4 years ago

Here is the performance test using microbenchmark

library(readr)
library(microbenchmark)

# creating some dummy data for testing
X <- 1e8
df <- 
  data.frame(
             w = runif(X),
             x = 1:X,
             y = sample(letters, X, replace = T), 
             z = sample(c(TRUE, FALSE), X, replace = T))

read_csv(df, "test.csv")

# writeBin looping function
write_bin <- function(
  value,
  filename,
  chunk_size = 2L ^ 20L) {

  # remove if table exists
  if(file.exists(filename)) unlink(filename)
  total_size <- length(value)
  split_vec <- seq(1, total_size, chunk_size)

  con <- file(filename, "a+b")
  on.exit(close(con))

  if (length(split_vec) == 1) {
    writeBin(value, con)
  } else {
    sapply(split_vec, function(x) {
      writeBin(value[x:min(total_size, x + chunk_size - 1)], con)
    })
  }
  invisible(TRUE)
}

# read in text file into raw format
obj <- readBin("test.csv", what = "raw", n = file.size("test.csv"))

format(object.size(obj), units = "auto")
# 3.3 Gb

microbenchmark(R_loop = write_bin(obj, "test2.csv"),
               readr = write_file(obj, "test3.csv"),
               times = 20)

Unit: seconds
   expr       min        lq      mean    median        uq       max neval
 R_loop 40.837055 43.483143 45.752667 45.169562 47.717823 51.129576    20
  readr  2.144268  2.576486  3.067961  2.669331  2.721574  7.622492    20

DyfanJones commented 4 years ago

I reached out to the author of qs, and he kindly wrote some Rcpp code and did a benchmark test: https://github.com/traversc/qs/issues/30. Just looping you guys in, in case you want to pursue his method.
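
For anyone curious about the general idea (this is an illustrative sketch only, not the code from that issue): the raw vector is handed to C++ and written with a single file-stream call, so R never has to slice it into chunks.

# rough sketch of the Rcpp-style approach; write_raw_cpp is a made-up name
Rcpp::cppFunction(
  includes = "#include <fstream>",
  code = '
    void write_raw_cpp(Rcpp::RawVector value, std::string filename) {
      // open the destination in binary mode and write the whole vector at once
      std::ofstream out(filename.c_str(), std::ios::binary);
      if (value.size() > 0) {
        out.write(reinterpret_cast<const char*>(&value[0]), value.size());
      }
    }
  '
)

write_raw_cpp(obj, "test4.csv")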

davidkretch commented 4 years ago

Awesome, thank you so much. These are going to be super helpful. We're going to be trying these in the next few weeks.

DyfanJones commented 1 year ago

Closing, as paws now supports s3$download_file, which downloads the file to disk without going through R.
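
A minimal usage sketch for completeness (bucket, key, and destination are placeholders; the argument names are assumed to mirror get_object plus a local Filename, so check the paws docs for the exact signature):

s3 <- paws::s3()

# stream the object straight to a local file instead of loading it into memory
s3$download_file(
  Bucket = "my_bucket",
  Key = "large_s3_file.csv",
  Filename = "large_s3_file.csv"
)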