rstudio / pins-r

Pin, Discover and Share Resources
https://pins.rstudio.com
Other
301 stars 62 forks source link

arrow 13.0.0 compatibility #761

Closed thisisnic closed 10 months ago

thisisnic commented 11 months ago

We're planning on releasing arrow 13.0.0 to CRAN in the next couple of weeks, and our revdepchecks flagged up some test failures with pins.

-- R CMD check results ----------------------------------------- pins 1.2.0 ----
Duration: 43.1s

> checking tests ...
  See below...

-- Test failures ------------------------------------------------- testthat ----

> library(testthat)
> library(pins)
> 
> test_check("pins")
Guessing `type = 'rds'`
> <http://127.0.0.1:49432/data.txt> is not cacheable
> <http://127.0.0.1:49432/x.rds> is not cacheable
> <http://127.0.0.1:49436/x/20230724T142915Z-c3943/data.txt> is not cacheable
> <http://127.0.0.1:49436/x/20230724T142915Z-c3943/x.json> is not cacheable
> <http://127.0.0.1:49440/y/20230724T142915Z-cba09/data.txt> is not cacheable
> <http://127.0.0.1:49440/y/20230724T142915Z-5026d/data.txt> is not cacheable
[ FAIL 3 | WARN 0 | SKIP 66 | PASS 183 ]

══ Skipped tests (66) ══════════════════════════════════════════════════════════
• On CRAN (52): 'test-board_connect_bundle.R:36:3',
  'test-board_connect_bundle.R:41:3', 'test-board_connect_server.R:22:3',
  'test-board_connect_server.R:31:3', 'test-board_connect_server.R:51:3',
  'test-board_connect_server.R:60:3', ???, ???, 'test-board_folder.R:9:3',
  'test-board_folder.R:34:3', 'test-board_folder.R:42:3',
  'test-board_folder.R:88:3', 'test-board_url.R:54:3',
  'test-board_url.R:154:3', 'test-board_url.R:168:3', 'test-board_url.R:180:3',
  'test-board_url.R:199:3', 'test-board_url.R:218:3', 'test-board_url.R:237:3',
  'test-legacy_board.R:2:3', 'test-legacy_datatxt.R:4:3',
  'test-legacy_datatxt.R:16:3', 'test-legacy_datatxt.R:25:3',
  'test-legacy_datatxt.R:35:3', 'test-legacy_local.R:43:3',
  'test-legacy_packages.R:23:3', 'test-legacy_registry.R:30:3',
  'test-meta.R:13:3', 'test-meta.R:20:3', 'test-meta.R:24:3',
  'test-pin-delete.R:11:3', 'test-pin-meta.R:2:3',
  'test-pin-read-write.R:35:3', 'test-pin-read-write.R:41:3',
  'test-pin-read-write.R:58:3', 'test-pin-read-write.R:80:3',
  'test-pin-read-write.R:88:3', 'test-pin-upload-download.R:17:3',
  'test-pin-upload-download.R:46:3', 'test-pin-upload-download.R:61:3',
  'test-pin.R:76:3', 'test-pin.R:108:3', 'test-pin_info.R:5:3',
  'test-pin_info.R:23:3', 'test-pin_info.R:33:3', 'test-pin_search.R:27:3',
  'test-pin_versions.R:8:3', 'test-pin_versions.R:18:3',
  'test-pin_versions.R:26:3', 'test-pin_versions.R:41:3',
  'test-pin_versions.R:67:3', 'test-pin_versions.R:91:3'
• board_azure() tests require PINS_AZURE_KEY (3):
  'test-board_azure_adls2.R:1:1', 'test-board_azure_blob.R:1:1',
  'test-board_azure_file.R:1:1'
• board_connect() tests requires `creds.rds` (4): 'test-board_connect.R:2:1',
  'test-board_connect_url.R:4:3', 'test-board_connect_url.R:22:3',
  'test-board_connect_url.R:33:3'
• board_gcs() tests require PINS_GCS_PASSWORD (1): 'test-board_gcs.R:1:1'
• board_ms365() tests require PINS_MS365_TEST_DRIVE (1):
  'test-board_ms365.R:1:1'
• board_s3() tests require PINS_AWS_ACCESS_KEY, PINS_AWS_SECRET_ACCESS_KEY (1):
  'test-board_s3.R:1:1'
• legacy_azure() tests require TEST_AZURE_CONTAINER, TEST_AZURE_ACCOUNT,
  TEST_AZURE_KEY (1): 'test-legacy_azure.R:16:1'
• legacy_gcloud() tests require TEST_GOOGLE_BUCKET (1):
  'test-legacy_gcloud.R:12:1'
• legacy_github() tests require TEST_GITHUB_REPO, TEST_GITHUB_BRANCH (1):
  'test-legacy_github.R:1:1'
• legacy_s3() tests require TEST_AWS_BUCKET, TEST_AWS_KEY, TEST_AWS_SECRET,
  TEST_AWS_REGION (1): 'test-legacy_s3.R:14:1'

══ Failed tests ════════════════════════════════════════════════════════════════
── Failure ('test-pin-read-write.R:12:3'): can round trip all types ────────────
pin_read(board, "df-2") (`actual`) not equal to `df` (`expected`).

`class(actual)`:   "tbl_df" "tbl" "data.frame"
`class(expected)`:                "data.frame"
── Failure ('test-pin-read-write.R:15:3'): can round trip all types ────────────
pin_read(board, "df-2") (`actual`) not equal to `df` (`expected`).

`class(actual)`:   "tbl_df" "tbl" "data.frame"
`class(expected)`:                "data.frame"
── Failure ('test-pin-read-write.R:18:3'): can round trip all types ────────────
pin_read(board, "df-3") (`actual`) not equal to `df` (`expected`).

`class(actual)`:   "tbl_df" "tbl" "data.frame"
`class(expected)`:                "data.frame"

[ FAIL 3 | WARN 0 | SKIP 66 | PASS 183 ]
Error: Test failures
Execution halted

1 error x | 0 warnings v | 0 notes v

In short, a change in https://github.com/apache/arrow/pull/35173 constitutes a breaking change - we now return tibble object instead of data.frame objects from our read_* functions. I was going to submit a PR, but I wasn't sure whether it'd be tests or code that would make sense to update.

juliasilge commented 11 months ago

Thanks so much for the detail here! 🙌

I think that what we want to do here is strip off the tibble classes when we read, for consistency with how everything else works in pins. Notice, for example, the tests that will fail with the new arrow release: https://github.com/rstudio/pins-r/blob/main/tests/testthat/test-pin-read-write.R It seems weird in the pins context to get back out a tibble when you stored a data.frame. I'm open to discussion here, though.

machow commented 11 months ago

It seems weird in the pins context to get back out a tibble when you stored a data.frame. I'm open to discussion here, though.

I think from the other side too--if you write a tibble to csv, you'll currently get back a data.frame. I can't really say what R folks want to happen, but if a tibble feels like a "subclass" of a data.frame (e.g. liskov substitutable; w/e that means here, since I know the way they get printed is also important), I wonder if that might make it feel okay to return?

juliasilge commented 11 months ago

I am realizing that if we strip off the tibble subclass when reading, then if someone stores a tibble as parquet, it will come back as a data.frame, which would definitely feel real bad. 😬

(I don't think it's reasonable to keep track of the classes here in pins metadata, like "this was originally a tibble", given how arrow now behaves.)

github-actions[bot] commented 10 months ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.