Consider changing default format for dataframes to arrow or CSV

juliasilge opened 1 year ago

juliasilge commented 1 year ago

We have seen users write pins from R using the default format and then get frustrated when their Python colleagues can't read them. We have considered changing the default to arrow for a long time.

Does arrow have enough usage in the community for this to be reasonable? It would be a much better choice if interoperability is one of the main reasons people use pins (to read with Python).

juliasilge commented 1 year ago

We recently moved arrow to Suggests in #646 so this would likely mean some new users would be prompted to install another package, even when using the defaults.

machow commented 1 year ago

Adding arrow as a requirement seems like it could introduce some friction (maybe?). I wonder if the audience for pins might lean toward CSV (for example, this pins blog post aims at an audience that is emailing CSVs, so maybe emailing CSV -> stashing CSV with pins might feel like a smaller step?).

(This is me mostly thinking of pins as a very early stepping stone for data versioning / sharing, since I'd personally be very into storing everything in arrow/parquet!)

iainmwallace commented 1 year ago

I would suggest CSV as the default. We often share via Connect, and it is frustrating for non-R/Python users when they go to the Connect landing page for that dataset and can't download the file in a format they can understand or open easily.

juliasilge commented 1 year ago

Reading CSV via read.csv() often has downsides, like type guessing that goes wrong, dates that are not parsed as dates, etc. If we consider changing the default to CSV, would it be better (less surprising overall, easier collaboration with Python folks, etc.) to use vroom for reading and writing?

wibeasley commented 1 year ago

I agree about the downsides of csvs, especially the lack of explicit variable types. When pins saves a csv, could it save a second file that stores the variable info? Essentially a serialized/dput-ed readr::col_types object?

I don't like having to redefine (a) integer vs floating, and (b) factor levels.

If the data is later imported by pins, pins would look for the metadata and use it. But the csv is still valid and can be read by other programs that don't know how to interpret the "mtcars.readr_col_types" plain-text file. The metadata file isn't critical; it's just optional gravy.
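A minimal sketch of the sidecar idea described above, using readr only. The helper names `write_csv_with_spec()` and `read_csv_with_spec()` are hypothetical and not part of pins or readr; the ".readr_col_types" suffix follows the file name mentioned in this comment. One caveat: the compact spec string records that a column is a factor but not its level order, so it only partly addresses point (b).

```r
library(readr)

# Hypothetical helper: write the CSV plus a one-line plain-text sidecar
# holding readr's compact column specification (e.g. "ifd").
write_csv_with_spec <- function(df, path) {
  write_csv(df, path)
  spec <- as.character(as.col_spec(df))
  writeLines(spec, paste0(path, ".readr_col_types"))
  invisible(path)
}

# Hypothetical helper: use the sidecar if it exists; otherwise fall back
# to readr's usual type guessing, so the CSV stays readable on its own.
read_csv_with_spec <- function(path) {
  spec_path <- paste0(path, ".readr_col_types")
  col_types <- if (file.exists(spec_path)) readLines(spec_path, n = 1) else NULL
  read_csv(path, col_types = col_types)
}

csv_path <- tempfile(fileext = ".csv")
small <- data.frame(n = 1:3, g = factor(c("x", "y", "x")), v = c(1.5, 2, 3.25))
write_csv_with_spec(small, csv_path)
round_trip <- read_csv_with_spec(csv_path)
```

Programs that ignore the sidecar still see an ordinary CSV, which is the point of keeping the metadata optional.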

juliasilge commented 1 year ago

@wibeasley That is an interesting suggestion! As of now, we would recommend that folks follow this vignette for managing custom formats, like reading CSVs with more control:


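library(pins)
library(palmerpenguins)  # provides the penguins data frame

b <- board_temp()

# Compact readr column specification, one letter per column
penguin_col_spec <- as.character(readr::as.col_spec(penguins))
penguin_col_spec
#> [1] "ffddiifi"

# Store the column spec as user metadata alongside the CSV pin
b %>%
  pin_write(
    penguins,
    "very-nice-penguins",
    type = "csv",
    metadata = list(col_spec = penguin_col_spec)
  )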
#> Creating new version '20230223T212321Z-809e9'
#> Writing to pin 'very-nice-penguins'

new_col_spec <- pin_meta(b, "very-nice-penguins")$user$col_spec
pin_download(b, "very-nice-penguins") %>%
  readr::read_csv(col_types = new_col_spec)
#> # A tibble: 344 × 8
#>    species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
#>    <fct>   <fct>              <dbl>         <dbl>      <int>   <int> <fct> <int>
#>  1 Adelie  Torgersen           39.1          18.7        181    3750 male   2007
#>  2 Adelie  Torgersen           39.5          17.4        186    3800 fema…  2007
#>  3 Adelie  Torgersen           40.3          18          195    3250 fema…  2007
#>  4 Adelie  Torgersen           NA            NA           NA      NA <NA>   2007
#>  5 Adelie  Torgersen           36.7          19.3        193    3450 fema…  2007
#>  6 Adelie  Torgersen           39.3          20.6        190    3650 male   2007
#>  7 Adelie  Torgersen           38.9          17.8        181    3625 fema…  2007
#>  8 Adelie  Torgersen           39.2          19.6        195    4675 male   2007
#>  9 Adelie  Torgersen           34.1          18.1        193    3475 <NA>   2007
#> 10 Adelie  Torgersen           42            20.2        190    4250 <NA>   2007
#> # … with 334 more rows, and abbreviated variable names ¹​flipper_length_mm,
#> #   ²​body_mass_g

Created on 2023-02-23 with reprex v2.0.2

Those last two bits could be wrapped up in a pin_read_col_spec() helper function for an individual to use, if they always wanted to set up their files this way.
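A rough sketch of what such a helper might look like. `pin_read_col_spec()` does not exist in pins; this version assumes the pin was written with `type = "csv"` and a `col_spec` entry in its user metadata, as in the reprex above. The `demo` data frame and the pin name are made up for illustration.

```r
library(pins)

# Hypothetical helper (not part of pins): read a CSV pin using the
# column specification stored in its user metadata under `col_spec`.
pin_read_col_spec <- function(board, name, version = NULL, ...) {
  col_spec <- pin_meta(board, name, version = version)$user$col_spec
  if (is.null(col_spec)) {
    stop("Pin '", name, "' has no `col_spec` entry in its user metadata.")
  }
  path <- pin_download(board, name, version = version)
  readr::read_csv(path, col_types = col_spec, ...)
}

# Usage with a small made-up data frame
b <- board_temp()
demo <- data.frame(id = 1:3, grp = factor(c("a", "b", "a")), x = c(1.5, 2, 3.25))
demo_spec <- as.character(readr::as.col_spec(demo))
pin_write(b, demo, "demo-pin", type = "csv", metadata = list(col_spec = demo_spec))

demo_back <- pin_read_col_spec(b, "demo-pin")
```

Failing loudly when the metadata is missing is a design choice here; falling back to readr's type guessing would be just as reasonable.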

leslem commented 5 months ago

An argument to specify a csv reading function (e.g. read.csv or readr::read_csv or data.table::fread) would be good for the use case that led me to this issue. Or passing arguments on to read.csv would be helpful.

I have a colleague who's writing pins from python as type='csv', and then I want to read them in R, but with read.csv under the hood I get column names modified (to be syntactic) and column types I don't want. For now I'm going to do pin_download() and then readr::read_csv() to get the data read in the way I'd like.

juliasilge commented 5 months ago

> For now I'm going to do pin_download() and then readr::read_csv() to get the data read in the way I'd like.

This is definitely the right thing to do for now.
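For anyone landing here with the same problem, a minimal sketch of that workaround. The data frame is a made-up stand-in for a CSV pin written from Python, with deliberately non-syntactic column names; the pin name is also invented.

```r
library(pins)

b <- board_temp()

# Stand-in for a CSV pin written by a Python colleague:
# column names contain a space and start with a digit.
from_py <- data.frame(
  `unit price` = c(1.5, 2),
  `2023` = c("a", "b"),
  check.names = FALSE
)
pin_write(b, from_py, "from-python", type = "csv")

# Download the raw file and read it with readr instead of pin_read(),
# which keeps the column names as written and lets you control types.
csv_path <- pin_download(b, "from-python")
as_written <- readr::read_csv(
  csv_path,
  col_types = readr::cols(`unit price` = "d", `2023` = "c")
)
names(as_written)
```

Any other reader (for example data.table::fread) can be dropped in the same way, since pin_download() only returns the path to the cached file.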

CSV writing can be so cantankerous, especially if you are using R and Python or something else. Have you talked with your colleague about considering switching to parquet? Is there a particular constraint that makes that not a good move?

juliasilge commented 5 months ago

I was just thinking about the problem reported by @leslem again today, and how it highlights that switching to CSV will not really solve all user pain around this issue.

In rstudio/pins-python#231 @isabelizimm added support for reading .rds files from Python, which means that Python users will be able to read rectangular data written from R with the current default. The rdata package which powers that PR uses the binary types of R objects which means it's kind of like a poor man's arrow. It really improves the situation. I would still recommend that R + Python collaborators use parquet, but with that change on the Python side, maybe we don't want to change the default format for dataframes, at least not without a lightweight option for reading/writing parquet.
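A short sketch of the parquet route for R + Python collaborators, assuming a pins version that supports `type = "parquet"` and that the parquet backend it relies on is installed (which package that is depends on the pins version). The data frame and pin name are invented.

```r
library(pins)

b <- board_temp()
shared <- data.frame(
  id = 1:3,
  when = as.Date("2023-02-23") + 0:2,
  score = c(0.5, 1.25, 2)
)

# Column types travel with the file, so no separate col_spec is needed.
pin_write(b, shared, "shared-data", type = "parquet")

shared_back <- pin_read(b, "shared-data")
str(shared_back)
```

Unlike the CSV examples earlier in this thread, integer and date columns come back with their types intact without any extra metadata.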