We don't have a solution to this either, but we're going to be looking into it soon since we'll be working on large downloads from S3. We'll let you know if we come upon a good solution, and if you find one please let us know as well!
The best method I have found so far is:
write_bin <- function(value,
                      filename,
                      chunk_size = 2L ^ 20L) {
  # create splits for value
  total_size <- length(value)
  split_vec <- seq(1, total_size, chunk_size)

  # create connection to file
  con <- file(filename, "a+b")

  # close connection on exit
  on.exit(close(con))

  # write binary to file chunk by chunk
  sapply(split_vec, function(x) {
    writeBin(value[x:min(total_size, (x + chunk_size - 1))], con)
  })
}
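For reference, a quick usage sketch (the file names here are just placeholders): read an existing file into a raw vector, then write it back out in 1 MB chunks.

# read a file into a raw vector, then copy it back out chunk by chunk
obj <- readBin("large-file.csv", what = "raw", n = file.size("large-file.csv"))
write_bin(obj, "copy-of-large-file.csv")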
This gives a significant increase in speed; however, I feel there has to be a faster solution:
Raw vector size: 4.4 Gb

while loop method:
#    user  system elapsed
#   96.00   59.83  214.59

sapply method:
#    user  system elapsed
#   35.81   12.11   51.45
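The while-loop version isn't reproduced in this thread; a rough sketch of what it might have looked like (an assumption on my part, writing one chunk per iteration):

write_bin_while <- function(value, filename, chunk_size = 2L ^ 20L) {
  total_size <- length(value)
  con <- file(filename, "a+b")
  on.exit(close(con))
  start <- 1
  while (start <= total_size) {
    # write one chunk, then advance to the next offset
    end <- min(total_size, start + chunk_size - 1)
    writeBin(value[start:end], con)
    start <- end + 1
  }
  invisible(TRUE)
}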
It looks like R 4.0.0 may fix writeBin: https://github.com/HenrikBengtsson/Wishlist-for-R/issues/97
Thank you! We will try your approach.
Just found a possible alternative. readr has a function write_file that allows raw vectors to be written out to file. For example:
library(readr)
write.csv(iris, "iris.csv")
# Raw object
obj <- readBin("iris.csv", "raw", n = file.size("iris.csv"))
# writing out raw object as a csv file
write_file(obj, "iris2.csv")
I will do a speed test after work to see how it compares to write_bin.
Note: The only problem with this method is that it will only work for flat file outputs.
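To tie this back to the S3 use case, here is a hedged sketch (bucket and key are placeholders) of downloading an object with paws and writing the raw body straight to disk with readr:

library(paws)
library(readr)

s3 <- s3()
# get_object() returns the object body as a raw vector in resp$Body
resp <- s3$get_object(Bucket = "my-bucket", Key = "large-file.csv")
# write_file() wrote out a >3 Gb raw vector in the benchmarks below
write_file(resp$Body, "large-file.csv")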
A quick speed test:
library(readr)

X <- 1e8
df <- data.frame(
  w = runif(X),
  x = 1:X,
  y = sample(letters, X, replace = TRUE),
  z = sample(c(TRUE, FALSE), X, replace = TRUE))

# write test data.frame
write_csv(df, "test.csv")

write_bin <- function(value,
                      filename,
                      chunk_size = 2L ^ 20L) {
  total_size <- length(value)
  split_vec <- seq(1, total_size, chunk_size)
  con <- file(filename, "a+b")
  on.exit(close(con))
  if (length(split_vec) == 1) {
    writeBin(value, con)
  } else {
    sapply(split_vec, function(x) {
      writeBin(value[x:min(total_size, (x + chunk_size - 1))], con)
    })
  }
  invisible(TRUE)
}
system.time(obj <- readBin("test.csv", "raw", n = file.size("test.csv")))
#    user  system elapsed
#   1.141   3.397   7.145

system.time(obj <- read_file_raw("test.csv"))
#    user  system elapsed
#   2.947   8.147  24.245

format(object.size(obj), units = "auto")
# 3.3 Gb

system.time(write_file(obj, "test2.csv"))
#    user  system elapsed
#   0.569   2.118   3.440

system.time(write_bin(obj, "test3.csv"))
#    user  system elapsed
#  30.275  19.424  55.037
It looks like base R readBin is fairly quick and doesn't need replacing. However, readr::write_file is really quick, roughly 10x faster than the write_bin loop. I think readr has given us a good alternative. Sadly that would mean an extra dependency for the noctua package, but a speed up like this can't be ignored :D
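One way to soften the dependency concern would be an optional fallback; a minimal sketch, assuming readr sits in Suggests and using a hypothetical wrapper name write_raw:

write_raw <- function(value, filename) {
  if (requireNamespace("readr", quietly = TRUE)) {
    # fast path: readr writes large raw vectors in a single call
    readr::write_file(value, filename)
  } else {
    # fallback: the chunked writeBin loop from above
    write_bin(value, filename)
  }
}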
Here is the performance test using microbenchmark
library(readr)
library(microbenchmark)

# creating some dummy data for testing
X <- 1e8
df <- data.frame(
  w = runif(X),
  x = 1:X,
  y = sample(letters, X, replace = TRUE),
  z = sample(c(TRUE, FALSE), X, replace = TRUE))
write_csv(df, "test.csv")

# writeBin looping function
write_bin <- function(value,
                      filename,
                      chunk_size = 2L ^ 20L) {
  # remove the file if it already exists
  if (file.exists(filename)) unlink(filename)
  total_size <- length(value)
  split_vec <- seq(1, total_size, chunk_size)
  con <- file(filename, "a+b")
  on.exit(close(con))
  if (length(split_vec) == 1) {
    writeBin(value, con)
  } else {
    sapply(split_vec, function(x) {
      writeBin(value[x:min(total_size, (x + chunk_size - 1))], con)
    })
  }
  invisible(TRUE)
}
# read the text file in as a raw vector
obj <- readBin("test.csv", what = "raw", n = file.size("test.csv"))
format(object.size(obj), units = "auto")
# 3.3 Gb

microbenchmark(R_loop = write_bin(obj, "test2.csv"),
               readr = write_file(obj, "test3.csv"),
               times = 20)
Unit: seconds
   expr       min        lq      mean    median        uq       max neval
 R_loop 40.837055 43.483143 45.752667 45.169562 47.717823 51.129576    20
  readr  2.144268  2.576486  3.067961  2.669331  2.721574  7.622492    20
I reached out to the author of qs; he kindly wrote some Rcpp code and did a benchmark test: https://github.com/traversc/qs/issues/30. Just looping you in, in case you want to pursue his method.
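For anyone who doesn't want to follow the link, the general idea is to hand the raw vector to C++ and write it in one call, side-stepping writeBin's 2^31 - 1 byte limit. A minimal sketch (my own assumption, not the code from that issue):

library(Rcpp)

cppFunction('
  void write_raw_cpp(Rcpp::RawVector value, std::string filename) {
    // open the file in binary mode and write the whole buffer at once
    std::ofstream out(filename.c_str(), std::ios::out | std::ios::binary);
    out.write(reinterpret_cast<const char *>(value.begin()), value.size());
    out.close();
  }
', includes = "#include <fstream>")

# write_raw_cpp(obj, "test4.csv")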
Awesome, thank you so much. These are going to be super helpful. We're going to be trying these in the next few weeks.
Closing, as paws supports s3$download_file, which downloads the file without going through R.
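For completeness, a usage sketch of that approach (bucket, key, and argument names are my assumptions, so double-check against the paws docs):

library(paws)

s3 <- s3()
# streams the object straight to disk instead of buffering it in an R raw vector
s3$download_file(Bucket = "my-bucket", Key = "large-file.csv",
                 Filename = "large-file.csv")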
Hi All,
This is more of a question than an issue with the paws sdk. I have come into a slight issue around downloading large data files from S3, and I wondered if either of you know an alternative method.

The problem is with base::writeBin, which is restricted to: "Only 2^31 - 1 bytes can be written in a single call (and that is the maximum capacity of a raw vector on 32-bit platforms)."

This means that for large files exceeding this limit it will return an error. My current solution is to chunk up the raw vector and call writeBin to append each chunk to a file (https://github.com/DyfanJones/noctua/pull/61). However, this causes a slight duplication of the data every time I chunk up the raw vector.

Example

If either of you know an alternative method that would be amazing. If not, I will keep looking around to see if there is a more elegant solution.