rstudio / pins-r

Pin, discover, and share resources
https://pins.rstudio.com

Be more explicit about how pins saves arrow data #605

Closed sellorm closed 2 years ago

sellorm commented 2 years ago

When people who do not use the pins package see a pin containing an arrow data set, it is not immediately obvious what format that actually is.

For reference, the format used when type = "arrow" is actually 'feather'.

From the pin_write() help:

type    File type used to save x to disk. Must be one of "csv", "rds", "json", "arrow", or "qs".
        If not supplied will use json for bare lists and rds for everything else.

This is confusing for arrow users since there is no formal on-disk format called '.arrow'. Arrow users generally use either '.parquet' or '.feather'.

pins should either:

  1. Change this type name and file extension to 'feather'
  2. Or be more explicit about the fact that what pins calls "arrow" is really "feather" under the hood.

If pins adopts the second approach, it would be nice to highlight that the format is feather in both the package help and somewhere in the metadata.

This would hopefully reduce the support burden and help increase cross-language adoption.

iainmwallace commented 2 years ago

It would also be useful to add an option to store pins as parquet.

nealrichardson commented 2 years ago

For what it's worth, the IANA extension for Arrow data is .arrow, so that's not invalid, though .feather is also commonly used. See also https://arrow.apache.org/faq/.

Parquet is a different format, though the arrow libraries can read and write Parquet.
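
For anyone who wants to check what a pinned file really is regardless of its extension, here is a minimal sketch in Python using only the standard library. It is not part of pins or the Arrow libraries. It relies on the leading magic bytes of each format: `ARROW1` for the Arrow IPC file format (Feather V2), `FEA1` for the legacy Feather V1, and `PAR1` for Parquet. The function name `sniff_format` is hypothetical.

```python
# Leading magic bytes of three on-disk formats that Arrow users commonly meet.
MAGIC_PREFIXES = {
    b"ARROW1": "Arrow IPC file (Feather V2)",
    b"FEA1": "Feather V1 (legacy)",
    b"PAR1": "Parquet",
}


def sniff_format(path):
    """Guess a file's format from its first bytes, ignoring the file extension.

    Hypothetical helper for illustration; it only inspects the header and
    does not validate the rest of the file.
    """
    with open(path, "rb") as fh:
        head = fh.read(8)
    for magic, name in MAGIC_PREFIXES.items():
        if head.startswith(magic):
            return name
    return "unknown"
```

Calling `sniff_format()` on a file written by `pin_write(type = "arrow")` should report the Arrow IPC file format if, as this thread suggests, pins writes Feather V2 under the hood. That expectation is an assumption drawn from the discussion, not something verified here.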

sellorm commented 2 years ago

Thanks for the clarification @nealrichardson. The FAQ you linked recommends the .arrow extension without mentioning .feather. Would your recommendation therefore be that pins stays as it is?

Perhaps we could just tweak the docs a little to make things clearer for users.

nealrichardson commented 2 years ago

I'm not sure of the details, or whether the type parameter directly maps onto file extension. But saving files with a .arrow extension is not a problem.

machow commented 2 years ago

Thanks for all the discussion / resources. If I'm understanding correctly, naming the pin type arrow is in line with how apache-arrow sees things (feather v2 is the Arrow IPC format), and so is using the .arrow extension.

Based on the docs, it seems that if we had to choose between "arrow" and "feather", "arrow" is where people are being steered (e.g. the docs recommend .arrow as the extension).

It seems like this might be the move for pins...

machow commented 2 years ago

I've switched pins-python to support type="arrow" (https://github.com/rstudio/pins-python/releases/tag/v0.5.0). Thanks y'all for ironing this out!
