ropensci / piggyback

:package: for using large(r) data files on GitHub
https://docs.ropensci.org/piggyback
GNU General Public License v3.0

`pb_read` / `pb_write` to read/write directly into memory? #97

Closed: tanho63 closed this issue 8 months ago

tanho63 commented 1 year ago

It could be useful to read/write a single file directly into memory, rather than the currently required user workflow of downloading to a tempfile and then reading it. The backend would perhaps still need to do this via a tempfile?

Problems:

so something like:

pb_read <- function(filename, tag = "latest", repo = guess_repo(), read_function = "autodetect") {

  if (identical(read_function, "autodetect")) {
    read_function <- switch(
      tools::file_ext(filename),
      "csv" = read.csv,
      "rds" = readRDS,
      stop("could not autodetect file type...")
    )
  }

  stopifnot(is.function(read_function))

  # download to tempfile/raw in-memory
  # read_function(tempfile)
}

and implement something similar (`pb_write`) to write a single object
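A rough sketch of that write-side counterpart could look like the following (hypothetical: pb_upload() and guess_repo() are existing piggyback helpers, everything else is illustrative):

pb_write <- function(x, filename, tag = "latest", repo = guess_repo(),
                     write_function = "autodetect", ...) {

  if (identical(write_function, "autodetect")) {
    write_function <- switch(
      tools::file_ext(filename),
      "csv" = utils::write.csv,
      "rds" = saveRDS,
      stop("could not autodetect file type...")
    )
  }

  stopifnot(is.function(write_function))

  # serialize to a tempfile, then hand it off to the usual upload path
  tmp <- file.path(tempdir(), filename)
  on.exit(unlink(tmp), add = TRUE)
  write_function(x, tmp)
  pb_upload(tmp, repo = repo, tag = tag, ...)
}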

cboettig commented 1 year ago

Maybe, but I think it is better for the user to access the download URL and then select their own read function.

First, piggyback assets may be quite large relative to available RAM, and there's a substantial and rapidly growing set of libraries able to work with such files on disk without ever reading the whole thing into memory (specifically, libraries like terra, stars, or sf for large spatial assets, and libraries like arrow for large tabular formats).

Going a step further, such libraries now also make it possible not just to skip the 'read twice' pattern of downloading to disk and then reading into memory, but to skip ever reading the whole data file into R at all. For example, the spatial packages use GDAL's virtual file system, and duckdb can perform a similar trick on parquet (and csv) files, allowing a user to apply functions like dplyr::select() and dplyr::filter() directly to the remote data source and pull only the subset of rows/columns they need. Subsetting data directly from a URL in this way gives the performance benefit of reading directly into memory (as proposed here), plus the added benefit of enabling more efficient, bigger-than-RAM workflows. This is sometimes referred to as 'cloud-native' reads.
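For instance (a sketch, not from the thread: the landcover.tif asset and owner/repo are placeholders), a large GeoTIFF attached to a release can be opened lazily through GDAL's /vsicurl/ virtual file system, so only the header and the tiles actually touched are transferred:

library(piggyback)
library(terra)

# hypothetical release asset; substitute your own file and repo
url <- pb_download_url(file = "landcover.tif", repo = "owner/repo")

# GDAL issues HTTP range requests against the asset URL
r <- terra::rast(paste0("/vsicurl/", url))

# cropping pulls just the overlapping tiles, never the whole file
r_small <- terra::crop(r, terra::ext(0, 10, 0, 10))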

So in general I'm hesitant to hard-code autodetection of filetypes and read functions, since this doesn't generalize well to alternative file types etc. Consider this example, which uses duckdb to establish a remote table connection to a tsv file:

library(piggyback)
library(duckdb)
library(glue)
library(dplyr)

url <- pb_download_url(file = "diamonds.tsv.gz", repo = "cboettig/piggyback-tests")

conn <- DBI::dbConnect(duckdb())
DBI::dbExecute(conn, "INSTALL 'httpfs';")
DBI::dbExecute(conn, "LOAD 'httpfs';")

tblname <- "diamonds"
view_query <- glue("CREATE VIEW '{tblname}' ",
                   "AS SELECT * FROM read_csv_auto('{url}');")
DBI::dbExecute(conn, view_query)

## now we have a lazy table connection, wow!
diamonds <- tbl(conn, tblname)

Okay, so the duckdb syntax is still a little more verbose than ideal, but note that up to this point we haven't had to download more than a few bytes of what could be a giant tsv file (or even one sharded across multiple files). Instead of reading twice, we have reduced this to reading zero times, at least until a query is actually run.
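From there the subsetting can be pushed down to duckdb, so only the collected result ever lands in R's memory (continuing the example above):

# duckdb does the scanning; R receives only the filtered rows/columns
ideal <- diamonds |>
  dplyr::filter(cut == "Ideal") |>
  dplyr::select(carat, price) |>
  dplyr::collect()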

cboettig commented 1 year ago

PS, from the original comment:

... # rest of the owl 

:joy: that made my day

tanho63 commented 1 year ago

😁😁😁 I was in a rush when I first jotted the idea down.

I totally get the performance gain/concern; I was thinking more about wrapping the tempfile pattern I'd otherwise use when reading/writing a smaller, more manageable file.

Great example of using duckdb though, I bet it would make for a nice vignette!

tanho63 commented 1 year ago

to my previous point, if a user wanted to read a parquet file, for instance, they could use pb_read like:

pb_read("myfile.parquet", read_function = arrow::read_parquet)

which would apply the user's read function to the downloaded tempfile.

I, ever lazy, only plan to use it like: pb_read("my_rds_file.rds") and pb_write(my_data_frame, "x.rds")

cboettig commented 1 year ago

Yeah, good point about convenience wrappers! Like you say, using a tempfile has both a performance cost (reads the bits twice) and a convenience cost (requires two commands instead of one). I am generally all for convenience wrappers too; my brain just jumped to the performance part without thinking. Speaking of which, the write method would still have to write to a tempfile internally though, right? Or can you also 'stream' directly to upload? (And at least some serializations aren't streamable, I think.)
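For what it's worth, an rds payload can at least be produced in memory via a raw connection, so the serialization step itself doesn't strictly need a tempfile; whether the upload path could accept a raw vector is a separate question. A minimal sketch:

# serialize an R object straight to an in-memory raw vector
con <- rawConnection(raw(0), "wb")
saveRDS(mtcars, con)             # written uncompressed when the target is a connection
payload <- rawConnectionValue(con)
close(con)
length(payload)                  # number of bytes that would be uploaded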

Obviously the duckdb example is way too verbose for convenient use -- arrow's syntax is better, it just doesn't support arbitrary https sources for lazy reads.

I rather like the idea of convenient helper functions though. I think we just want to document their use in a way that doesn't encourage over-use: when objects are large and/or intended for 'archival' use, users should fall back on the lower-level interface. (In general I think it's best to discourage folks from .rds as an archive format, since it is less portable between languages/versions and doesn't have much performance gain over modern readers working on standard formats like compressed csv or parquet -- but for lightweight use I see the appeal of being able to stash an arbitrary R object as rds.) I'm just a little nervous because I've seen other packages in this space (pins, datastorr, etc.) really lean into the idea of storing "R objects" rather than forcing users to think about serialization, and I've seen that lead to poorly scaling code and sloppy data structures instead of a well-considered data schema with a high-performance serialization like parquet...
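For reference, the lower-level pattern being recommended here for larger or archival data would look roughly like this (file and repo names are placeholders):

library(arrow)
library(piggyback)

# write a portable, high-performance serialization of the data frame...
arrow::write_parquet(my_data_frame, "my_data.parquet")

# ...then attach it to a release with the existing lower-level interface
pb_upload("my_data.parquet", repo = "owner/repo", tag = "v1.0.0")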