[Question] How to identify the latest pin in an s3_board

jeffkeller87 commented 9 months ago

A follow-up to this post and similar to https://github.com/rstudio/pins-r/issues/590.

I love how I can use {pins} instead of maintaining my own artifact management process. It really cuts down on the amount of boilerplate code in my projects!

However, I often have a need to read pins from a system where installing either the R or Python {pins} package is not possible. In my case, these systems are ephemeral continuous integration runners with a limited set of software installed. Specifically, I am grabbing the latest model artifacts from S3 to COPY into a Docker image.

My current solution is to write artifacts to a latest/ prefix (or directory, if S3 is not the storage medium) in addition to a timestamped prefix. If the storage medium is a filesystem, I sym- or hard-link latest/ to the appropriate timestamped directory. The structure looks something like this:

repo
└── artifact
    ├── 2023-09-22T11:41:35Z
    ├── 2023-09-23T11:42:53Z
    └── latest

From a system without {pins}, I can then reference a static path to get the latest artifacts.

aws s3 sync s3://repo/artifact/latest .
cp -R .../repo/artifact/latest .
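For the filesystem case, the "timestamped directory plus latest link" layout described above could be produced with something like the following. This is only a sketch: the function name `publish_artifact` and its two arguments are hypothetical, and it assumes a filesystem that supports symlinks and colons in directory names.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy the contents of a source directory into a new
# timestamped directory under an artifact root, then repoint latest/ at it.
publish_artifact() {
  src="$1"    # directory holding the freshly built artifacts
  root="$2"   # e.g. .../repo/artifact
  stamp="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

  mkdir -p "$root/$stamp"
  cp -R "$src"/. "$root/$stamp"/

  # Relative symlink, replaced in place if latest already exists.
  ln -sfn "$stamp" "$root/latest"
}
```

A consumer without {pins} can then keep using the static latest path from the commands above.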

Without {pins}, is there a straightforward way to identify the latest pin version in a board?

juliasilge commented 9 months ago

Thanks for this question @jeffkeller87! I think the short answer is "no" because we haven't built either R or Python pins with an eye toward being used directly from a shell or similar, but since it's all just files and directories, certainly you have options:

- The versions for an S3 board (or GCS, or Azure, etc) are a timestamp pasted together with a truncated hash, so they will sort in the correct order. You can use that sensible naming scheme to get the latest version with something like ls | tail -1.
- Have you checked out the new-ish manifest file functionality? This lets you be more explicit in which versions you want to track and use. You can read more in this vignette, and I could see reading that YAML file with something like yq to find the latest version, then opening up that directory.
- I think your current approach of copying what you want into a /latest directory is great too! Maybe it is the most straightforward for your situation.
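The first option, sorting version names, might look roughly like this from a shell. It is a sketch only: the helper name `latest_version` and the sample version names are made up, and it assumes every version name starts with a fixed-width UTC timestamp such as 20230926T161828Z.

```shell
#!/usr/bin/env bash
# Print the newest version name from a list of version names on stdin.
# Trailing slashes (as shown when listing S3 prefixes) are stripped first.
latest_version() {
  sed 's:/*$::' | grep -v '^$' | LC_ALL=C sort | tail -n 1
}

# Made-up listing for a single pin, deliberately out of order.
printf '%s\n' \
  '20230922T114135Z-1a2b3/' \
  '20230926T161828Z-9f8e7/' \
  '20230923T114253Z-4c5d6/' | latest_version
# prints 20230926T161828Z-9f8e7
```

With the AWS CLI, the input could come from something like `aws s3 ls s3://my-bucket/my-pin/ | awk '$1 == "PRE" { print $2 }' | latest_version`, where the bucket and pin names are placeholders and the `PRE name/` listing format is assumed.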

I don't think it's likely that pins starts keeping a /latest directory since that's not the main use case we're targeting, but certainly you can use the directory structure (maybe together with a manifest file) to manage this from a shell in a couple of ways.

jeffkeller87 commented 9 months ago

Thank you @juliasilge for the very thoughtful response. I agree that replicating the /latest copy / link within pins probably isn't the right thing to do. However, if there is room to improve the interop surface of pins, I think that would be worth pursuing.

To that end, is the naming scheme sufficient for determining the latest version? I figured the truncated hash would cause issues if more than one pin version were written within the same second, since the hash rather than the write order would then decide which one sorts last. That's probably good enough for what I'm doing, but I can see it causing issues in other scenarios. Do you have a strong preference for the hash over sub-second markers?

The manifest file was the other option I considered. My optimism deflated a bit when I saw it was a YAML file rather than JSON--only because of how long it took me to convince my Infrastructure / IT people to install jq in our runner image. Theoretically, I could get them to install yq too :)
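If yq is not available, a simple manifest could in principle be read with awk alone. The sketch below assumes the manifest is a flat YAML mapping from each pin name to a list of `pin/version/` entries (for example in a file named `_pins.yaml`); that layout, the file name, and the helper name `latest_from_manifest` are assumptions to check against a real board, and this is not a general YAML parser.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a manifest on stdin and print the newest
# "pin/version" entry for one pin. Assumed layout:
#   my-pin:
#   - my-pin/20220811T155803Z-48c73/
#   - my-pin/20220811T155805Z-e586e/
latest_from_manifest() {
  pin="$1"
  awk -v pin="$pin" '
    $0 == (pin ":") { in_pin = 1; next }   # key line of the requested pin
    /^[^ -]/        { in_pin = 0 }          # any other top-level key ends it
    in_pin && /^ *- +/ {
      sub(/^ *- +/, "")                     # drop the list marker
      sub(/\/$/, "")                        # drop the trailing slash
      print
    }
  ' | LC_ALL=C sort | tail -n 1
}
```

Usage would be along the lines of `latest_from_manifest my-pin < _pins.yaml`, after downloading the manifest from the board.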

juliasilge commented 9 months ago

Oh yes, you are definitely correct that the timestamp doesn't distinguish between versions written within the same second. This has come up before and to date, the only time this has been a problem is in kind of "fake" situations, like when building a vignette or when people are writing tests in other packages that use pins. We haven't heard of problems with the timestamp in people's real work, since most folks are pinning, say, a model binary or a summarized dataset coming out of an ETL pipeline. Folks are generally not using pins for super high performance writing, at least so far.

In your use case, would subsecond information be practically important?

## what we do now:
format(Sys.time(), "%Y%m%dT%H%M%SZ", tz = "UTC")
#> [1] "20230926T161828Z"

## we could do something like:
format(Sys.time(), "%Y%m%dT%H%M%OS2Z", tz = "UTC")
#> [1] "20230926T161828.26Z"

Created on 2023-09-26 with reprex v2.0.2

jeffkeller87 commented 9 months ago

In my case, there should be no chance of a sub-second temporal collision like that. But there are always those unexpected scenarios where another writer sneaks in at just the wrong time, and then you're pulling your hair out figuring out why the pin you just wrote isn't the one that gets read immediately after (using the ls | tail method).
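A small illustration of that failure mode, using two made-up version names that share a timestamp: the sort falls through to the hash, so the version written second is not the one picked. The final check is one possible way to at least detect the ambiguity.

```shell
#!/usr/bin/env bash
# Two hypothetical versions written within the same second, in this order.
first='20230926T161828Z-f00d1'    # written first
second='20230926T161828Z-0beef'   # written second, so really the latest

# Same timestamp, so the hash suffix decides the order.
picked="$(printf '%s\n' "$first" "$second" | LC_ALL=C sort | tail -n 1)"
echo "$picked"
# prints 20230926T161828Z-f00d1

# Count how many versions share the newest timestamp and warn on a tie.
newest_stamp="${picked%%-*}"
ties="$(printf '%s\n' "$first" "$second" | grep -c "^${newest_stamp}-")"
if [ "$ties" -gt 1 ]; then
  echo "warning: $ties versions share timestamp $newest_stamp" >&2
fi
```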

Modifying the timestamp format would shrink the probability further, but it would also make specifying an explicit version in pin_read() more onerous.

I think the current behavior is fine as-is. If someone is writing this frequently intentionally, they probably don't want a versioned board anyway.

juliasilge commented 9 months ago

That makes a lot of sense. I'm going to leave this issue open for discussion in case other folks come by with this same need in the near future; we can reevaluate as we hear more on it. Thanks again for the question @jeffkeller87!

juliasilge commented 2 months ago

It sounds like we haven't seen a high need for improvements in this area so I am going to close this issue. We can revisit in the future if we hear more from users on this! 🙌

github-actions[bot] commented 1 month ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

rstudio/pins-r, issue #790