rstudio / pins-r

Pin, discover, and share resources
https://pins.rstudio.com
Other
316 stars 62 forks source link

could `pin_write()` support `type = "arrow"` uncompressed? #631

Closed ijlyttle closed 2 years ago

ijlyttle commented 2 years ago

This may also impact pins for Python, but I thought I would start here.

The motivation for this question is that Arrow for JavaScript does not support compression. I know there's a workaround using pin_upload() and pin_download(), but it seems to add friction for R/Python users.

To be clear, I'm not suggesting that any default behavior change; just to ask about the possibility for something like a compression = FALSE argument somewhere.

Here's a larger writeup describing my motivation, with a first hack on how a BoardUrl could work using JavaScript.

Thanks!

machow commented 2 years ago

Ah, that's too bad about compression in js :/ . Since pins stores metadata for everything, it seems like having a strategy for specifying save/load options there could be useful..

Maybe related in long-term to https://github.com/rstudio/pins-python/issues/18

Right now arguments aren't passed to savers when someone does pin_write, so a quick fix would be adding a new type (e.g. "uncompressed_arrow"?)

juliasilge commented 2 years ago

The way we handle the dots now from the user-facing function pin_write() passes them through to pin_store(), which means we can do things as outlined in #628 but we can't use the dots for this which happens in object_write().

I wonder if a new "uncompressed_arrow" type is perhaps the best way to go? Or should we try to support the different compress options that RDS has as well when considering this? We could add a new compression or compress argument to pin_write() after the dots without causing problems.

ijlyttle commented 2 years ago

Could something like write_args, a list with arguments to pass to the writing-function, work here? (I'm certain there's a better name)

object_write <- function(x, path, type = "rds", write_args = NULL) {
  type <- arg_match0(type, setdiff(object_types, "file"))

  switch(type,
    arrow = rlang::exec(arrow::write_feather, x, path, !!!write_args) 
    # you get the idea
  )

  path
}

This might present a problem where write_args wants to override an already-named argument, e.g.:

 json = jsonlite::write_json(x, path, auto_unbox = TRUE)
juliasilge commented 2 years ago

That could be a great option, I think. This would still have to go after the dots to avoid a breaking change; the function signature would end up looking like:

pin_write(
  board,
  x,
  name = NULL,
  type = NULL,
  title = NULL,
  description = NULL,
  metadata = NULL,
  versioned = NULL,
  ...
  write_args = list()
)
ijlyttle commented 2 years ago

I am on board!

At the risk of being pedantic, I wonder if the argument should be named .write_args, as it comes after the dots - I think this is to help avoid collisions with arg names that could be sent using the dots.

Happy to start on a PR, if you like.

machow commented 2 years ago

I think adding the ability for people to customize how pins are saved will be really useful--but would prefer we separate out the immediate value of supporting uncompressed arrow, from the more high risk / high reward task of how to provide general extension for save / load.

Looking at tools like vetiver, it's clear that how things are saved and how they're loaded are coupled to each other. My concern is that addressing save on its own may work for self-describing formats (e.g. Rds, arrow), but I also expect customizing write to create new surprising behaviors.

If we could empower devs to steward the save/load process (e.g. create a new type that can do w/e it wants with something like write_args), it can help identify who is responsible when extension goes awry (e.g. error hit while using writing "some_extension_type", so we know it is not a pins responsibility to fix).

ijlyttle commented 2 years ago

Holding for now :)

juliasilge commented 2 years ago

@ijlyttle Would you be open to doing a PR where we add a new type, perhaps called "custom" (what could be a better name?) here: https://github.com/rstudio/pins-r/blob/debc94529b8b73cda41ab50419d6276cd6c56d24/R/pin-read-write.R#L163

We would expect folks to write code like pin_write(board, x, "my-name", .f = arrow::write_feather(compression = "uncompressed")). The argument .f would be new, after the dots.

And then the entry in switch() for the new custom type: https://github.com/rstudio/pins-r/blob/debc94529b8b73cda41ab50419d6276cd6c56d24/R/pin-read-write.R#L126-L133 would be something like eval(rlang::call_modify(.f, x, path)). That seems to work pretty well:

library(rlang)

call <- quote(arrow::write_feather(compression = "uncompressed"))
path <- tempfile()
x <- mtcars
new_call <- call_modify(call, x, path)
eval(new_call)
head(arrow::read_feather(path))
#>    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Created on 2022-08-18 with reprex v2.0.2

We maybe should do a bit more careful checking of the call and use something like call_match(new_call, new_call[[1]])? And put a nice error message for users who pass in something that won't work.

What do you think?

ijlyttle commented 2 years ago

I like it! And yes, happy to propose a PR!

Just so I don't forget, I'd want to be able to use pin_read() and have it read it like an Arrow file. This may be a bit ambitious, but I figured I'd get the thought out.

github-actions[bot] commented 2 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.