rstudio / pins-r

Pin, discover, and share resources
https://pins.rstudio.com
Other
312 stars 63 forks source link

support manifest-files for `board_url()` #650

Closed ijlyttle closed 1 year ago

ijlyttle commented 2 years ago

fix #627

This is just a placeholder for now, as I sketch out some ideas.

The current signature for board_url() is:

board_url <- function(urls, cache = NULL, use_cache_on_failure = is_interactive()) 

where urls is a named character vector of URLs ...

I know the proposal was to add a manifest argument - having made some sketches, it seems to me that this would be a complementary argument to urls, i.e. you would use one or the other. That being the case, would you consider overloading the urls argument?

I am also thinking to provide a first version of a manifest-writing function. I can provide one for board_folder(), but I don't know enough (anything, really) about the other back ends to write for any of them.

ijlyttle commented 2 years ago

I could be getting way ahead of myself here, but I had another look at pins-python's board_urls():

def board_urls(path: str, pin_paths: dict, cache=DEFAULT, allow_pickle_read=None):

This seems closer to being able to support a manifest: path could point to a url directory with a pins.txt manifest, and pin_paths could be left unspecified.

I don't know how you might value R <=> Python correspondence vs. backwards compatibility, but I suspect you have to support backwards compatibility.

There does appear to be an escape hatch, though - but it might be a cheap trick. Given that the boards are named a little bit differently: board_url() vs. board_urls(), could there be a new R board: board_urls(), which would support a manifest and versioning, and hew more closely to its Python counterpart?

juliasilge commented 2 years ago

The idea of a new board_urls() is tempting but I know we didn't end up with a difference there on purpose 😆 so it might be better in the long run to make a Python board_url() and gradually deprecate the existing Python function. @machow thoughts?

Now that I am looking at this more closely, I think that adding a manifest argument runs into the same problem of dependencies between arguments that I brought up in #649 so if we decide to keep board_url() only, I am in favor of:

  • unnamed character scalar, i.e. a single URL: this would be the URL to the manifest file.
  • named character vector, i.e. what's there now: no change in behavior (but no support for versions)
  • named list where elements are unnamed character vectors: this would be supplying the manifest directly, including support for versions.
ijlyttle commented 2 years ago

Super!

I will rework things along these lines, and keep an eye open for ideas from @machow

machow commented 2 years ago

so it might be better in the long run to make a Python board_url() and gradually deprecate the existing Python function

Yesss -- sorry, the name difference between R and python pins is one I should have nipped in the bud! 😅

ijlyttle commented 2 years ago

I've got a first draft of the "reading" part of the work - a board_url() can take an URL that points to a manifest file, and support versions:

library("pins")

# looks for pins.txt
url <- "https://raw.githubusercontent.com/ijlyttle/pinsManifest/main/tests/testthat/pins/"

url_board <- board_url(url)

pin_list(url_board)
#> [1] "mtcars-csv"  "mtcars-json"

pin_versions(url_board, "mtcars-json")
#> # A tibble: 2 × 3
#>   version                created             hash 
#>   <chr>                  <dttm>              <chr>
#> 1 20220811T155803Z-c2702 2022-08-11 10:58:03 c2702
#> 2 20220811T155805Z-c2702 2022-08-11 10:58:05 c2702

pin_meta(url_board, "mtcars-json")
#> List of 11
#>  $ file       : chr "mtcars-json.json"
#>  $ file_size  : 'fs_bytes' int 4.05K
#>  $ pin_hash   : chr "c2702fb156ba5c38"
#>  $ type       : chr "json"
#>  $ title      : chr "mtcars-json: a pinned 32 x 11 data frame"
#>  $ description: NULL
#>  $ created    : POSIXct[1:1], format: "2022-08-11 10:58:05"
#>  $ api_version: num 1
#>  $ user       : list()
#>  $ name       : chr "mtcars-json"
#>  $ local      :List of 4
#>   ..$ dir     : 'fs_path' chr "~/Library/Caches/pins/url/2bc8541a31222efa41742167e6878a1e"
#>   ..$ url     : chr "https:/raw.githubusercontent.com/ijlyttle/pinsManifest/main/tests/testthat/pins/mtcars-json/20220811T155805Z-c2702/"
#>   ..$ version : NULL
#>   ..$ file_url: chr "https:/raw.githubusercontent.com/ijlyttle/pinsManifest/main/tests/testthat/pins/mtcars-json/20220811T155805Z-c2"| __truncated__

pin_meta(url_board, "mtcars-json", version = "20220811T155803Z-c2702")
#> List of 11
#>  $ file       : chr "mtcars-json.json"
#>  $ file_size  : 'fs_bytes' int 4.05K
#>  $ pin_hash   : chr "c2702fb156ba5c38"
#>  $ type       : chr "json"
#>  $ title      : chr "mtcars-json: a pinned 32 x 11 data frame"
#>  $ description: NULL
#>  $ created    : POSIXct[1:1], format: "2022-08-11 10:58:03"
#>  $ api_version: num 1
#>  $ user       : list()
#>  $ name       : chr "mtcars-json"
#>  $ local      :List of 4
#>   ..$ dir     : 'fs_path' chr "~/Library/Caches/pins/url/cfcb2a81d2d44dc0e81e85175589fb52"
#>   ..$ url     : chr "https:/raw.githubusercontent.com/ijlyttle/pinsManifest/main/tests/testthat/pins/mtcars-json/20220811T155803Z-c2702/"
#>   ..$ version : NULL
#>   ..$ file_url: chr "https:/raw.githubusercontent.com/ijlyttle/pinsManifest/main/tests/testthat/pins/mtcars-json/20220811T155803Z-c2"| __truncated__

pin_read(url_board, "mtcars-json") |> head()
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Created on 2022-09-15 by the reprex package (v2.0.1)

I wanted to give you a chance to react - there's still a lot to do in terms of documentation, testing, error messages, and also to propose a manifest-writing helper function.

Thanks!

ijlyttle commented 2 years ago

I have added some error-handling:

library("pins")
github_raw <- function(x) paste0("https://raw.githubusercontent.com/", x)

# failed request
board_url("https://not_real_url.posit.co")
#> Error in `board_url()`:
#> ! Error requesting manifest-file from URL
#>   <https://not_real_url.posit.co>:
#>   Received HTTP code 502 from proxy after CONNECT

# file not found
board_url(github_raw("rstudio/pins-r/master/tests/testthat/pin-rds/"))
#> Error in `board_url()`:
#> ! Error requesting manifest-file from URL
#>   <https://raw.githubusercontent.com/rstudio/pins-r/master/tests/testthat/pin-rds/pins.txt>:
#>   Not Found (HTTP 404).

# file not parsable YAML
board_url(github_raw("ijlyttle/pinsManifest/main/tests/testthat/pins/mtcars-csv/20220811T155805Z-48c73/mtcars-csv.csv"))
#> Error in `board_url()`:
#> ! Error parsing manifest-file at URL
#>   <https://raw.githubusercontent.com/ijlyttle/pinsManifest/main/tests/testthat/pins/mtcars-csv/20220811T155805Z-48c73/mtcars-csv.csv>:
#>   Parser error: did not find expected <document start> at line 1, column 6
#> ℹ Manifest file must be text and parsable as YAML.

# not supported format
board_url(3)
#> Error in `board_url()`:
#> ! `urls` must resolve to either:
#> • unnamed character scalar, i.e. a URL
#> • named character vector
#> • named list, where all elements are character scalars or vectors
#> ℹ `urls` is a "numeric".

# unnamed list
board_url(list("a", "b"))
#> Error in `board_url()`:
#> ! `urls` must resolve to either:
#> • unnamed character scalar, i.e. a URL
#> • named character vector
#> • named list, where all elements are character scalars or vectors
#> ℹ `urls` resolves to a list:
#> ℹ - named: FALSE
#> ℹ - all values character: TRUE

# list values not all character
board_url(list(a = "a", b = 2))
#> Error in `board_url()`:
#> ! `urls` must resolve to either:
#> • unnamed character scalar, i.e. a URL
#> • named character vector
#> • named list, where all elements are character scalars or vectors
#> ℹ `urls` resolves to a list:
#> ℹ - named: TRUE
#> ℹ - all values character: FALSE

# unnamed character vector
board_url(c("a", "b"))
#> Error in `board_url()`:
#> ! `urls` must resolve to either:
#> • unnamed character scalar, i.e. a URL
#> • named character vector
#> • named list, where all elements are character scalars or vectors
#> ℹ `urls` resolves to a character vector, but is unnamed.

Created on 2022-09-18 by the reprex package (v2.0.1)

ijlyttle commented 2 years ago

proposed function pin_manifest(), which can write mainfest files for board_folder() - if accepted, could expand to board_s3(), ...

library("pins")

board <- board_temp()
pin_write(board, mtcars, "mtcars-csv", type = "csv")
#> Creating new version '20220919T194404Z-48c73'
#> Writing to pin 'mtcars-csv'
pin_write(board, mtcars, "mtcars-json", type = "json")
#> Creating new version '20220919T194404Z-c2702'
#> Writing to pin 'mtcars-json'

pin_manifest(board)
#> Manifest file written to root-folder of board, as 'pins.txt'.
fs::path(board$path, "pins.txt") %>% readLines() %>% cat(sep = "\n")
#> mtcars-csv:
#> - mtcars-csv/20220919T194404Z-48c73/
#> mtcars-json:
#> - mtcars-json/20220919T194404Z-c2702/

Created on 2022-09-19 by the reprex package (v2.0.1)

Created on 2022-09-19 by the reprex package (v2.0.1)

juliasilge commented 2 years ago

This is looking good, and I think is headed the right direction! 🙌

I would like to see the pin_manifest() function more like pin_write() in that it is a regular function that calls methods for a specific board internally. Perhaps what is currently in pin_manifest_internal() mostly becomes the manifest creation function, and then we have a generic for pins_manifest_uploader. Maybe something like S3 can reuse the s3_upload_yaml() function we already have. That is more consistent with the current interface and also I think will let us more easily extend this and/or consider other arguments, like some that renv::snapshot() has such as prompt or update.

How confident are you that using a new yaml filename is the right call, vs. making a board-level data.txt, in addition to the pin-level data.txt files? Using the name of the file to decide whether we are looking at a board or a pin seems slightly iffy to me maybe?

FYI the legacy API had support for a manifest, and this led Javier to want to work on a data.txt spec. I don't think we are going to pursue that, but I do think it's good to look over that older work and think about how this problem has come up multiple times (obviously a need!).

ijlyttle commented 2 years ago

I need to think a bit more about the form of the function (your first point) before saying anything, but I did want to offer some rationale for why I proposed pins.txt.

I interpreted the name data.txt as a file to describe data, thus proposing pins.txt as a file to describe pins. I also see that data.txt is a reserved name, I didn't want to conflict with that.

For example, the current behavior of board_url() is to treat a directory that has a data.txt file as the directory of a pin-version:

https://github.com/rstudio/pins-r/blob/5c4beade1887200e72f290be1e0b08f4842daa0a/R/board_url.R#L83-L91

So if I called:

# incorrectly supplying the path to a board, rather than to a pin-version
board_url(c(file = "http://pins.com/path/to/board/"))

If data.txt were the name of the pins-manifest, I'd get some weird behavior when read_meta() tried to read such a file. If it were named pins.txt, I imagine I'd get a "cannot find data.txt" error, which could be easier to reason about.

Update: Reading more about Javier's data.txt spec - this looks really neat; I'll need some more coffee to wrap my head around this particular question.

ijlyttle commented 2 years ago

I think we were/are on the same page for extending pins_manifest() for other boards; I also think I got too cute with the implementation. Hopefully this is closer to what you have in mind :)

At some point, I'll want to propose tests - but I don't have accounts on these platforms. I imagine the CI does, though.

ijlyttle commented 1 year ago

Leave this alive (mostly dead?) for the time being?

juliasilge commented 1 year ago

Yeah @ijlyttle let's leave this open and draft for now. 👍

ijlyttle commented 1 year ago

This is all implemented in #661 and #681.

Big thanks to @juliasilge and @machow for being so patient and generous with your talents!

github-actions[bot] commented 1 year ago

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.