rstudio / pins-r

Pin, discover, and share resources
https://pins.rstudio.com
Other
312 stars 63 forks source link

New `force_identical_write` arg #735

Closed juliasilge closed 1 year ago

juliasilge commented 1 year ago

Addresses #589

This does not solve all our hashing issues (for example, the pin_hash still only looks at the pin contents and not the metadata) but it is a small step toward a better hashing situation.

After talking with users, I think that an opt-in new argument is the best way forward (rather than, say, a new function or new default behavior). This new check_hash arg comes after the dots and defaults to FALSE. It does require another read of pin_meta() to get the old hash.

library(pins)
b <- board_temp(versioned = TRUE)
b %>% pin_write(1:10, "nice-numbers", check_hash = TRUE)
#> Guessing `type = 'rds'`
#> Creating new version '20230426T224520Z-7f884'
#> Writing to pin 'nice-numbers'

## to change timestamp:
Sys.sleep(2)

b %>% pin_write(1:10, "nice-numbers", check_hash = TRUE)
#> Guessing `type = 'rds'`
#> Warning: The hash of pin "nice-numbers" has not changed and will not be stored.

Created on 2023-04-26 with reprex v2.0.2

With this PR as is now, pin_write() will still return the name invisibly but does not store the pin and generates a warning. Thoughts?

kmasiello commented 1 year ago

Tested and works as expected. If I'm a user, I can work with approach well, though I think the documentation should be explicit in noting that the hash is not affected by changed metadata. I like that it's an opt-in argument, which means I will be reading the docs to understand what I'm getting into. There might be situations where the user would want the status (pinned or didn't pin) to be returned in addition to the pin name. Every use case I envision though, I can noodle through how I'd determine the pinning status by other means. I think it would be wise to keep our ears open as to whether this is a need from the community that we'd implement in the future.

juliasilge commented 1 year ago

Thank you so much @kmasiello! 🙌

There might be situations where the user would want the status (pinned or didn't pin) to be returned in addition to the pin name.

This is a good point. Right now, pin_write() only returns the name (like "julia.silge/nice-numbers") but we have been discussing in #592 how to better organize what should be returned. I'll add this info to that issue and we can see what would work best.

Do we think that now in this PR pin_write() should return something different than the name if it does not update the pin? Or should we wait for broader changes to what is returned?

juliasilge commented 1 year ago

Refined the warning message a smidge:

library(pins)
b <- board_connect()
#> Connecting to Posit Connect 2023.03.0 at <https://colorado.posit.co/rsc>
b %>% pin_write(1:10, "julia.silge/really-nice-numbers", check_hash = TRUE)
#> Guessing `type = 'rds'`
#> Writing to pin 'julia.silge/really-nice-numbers'
b %>% pin_write(1:10, "julia.silge/really-nice-numbers", check_hash = TRUE)
#> Guessing `type = 'rds'`
#> Warning: 
#> ! The hash of pin "julia.silge/really-nice-numbers" has not changed.
#> • Your pin will not be stored.
b %>% pin_write(1:11, "julia.silge/really-nice-numbers", check_hash = TRUE)
#> Guessing `type = 'rds'`
#> Writing to pin 'julia.silge/really-nice-numbers'

Created on 2023-05-01 with reprex v2.0.2

kmasiello commented 1 year ago

Do we think that now in this PR pin_write() should return something different than the name if it does not update the pin? Or should we wait for broader changes to what is returned?

My gut says this should be handled with the broader considerations of what we return. I don't want to see a bespoke solution here that isn't compatible with future work.

kmasiello commented 1 year ago

Additional suggestion and comments after reviewing with users - check_hash may be better named as check_data_hash to emphasize that metadata is not checked.

Checking metadata for changes is an outstanding ask. A sample scenario is using check_hash = TRUE but wanting to update the metadata to reflect a "current as of" date. You would want to be able to indicate the pin is still fresh, even if the data hasn't changed.

juliasilge commented 1 year ago

A sample scenario is using check_hash = TRUE but wanting to update the metadata to reflect a "current as of" date.

Can I make sure I am understanding this? It sounds like users are expressing a desire to:

This would require a more extensive change than in this PR, will be pretty tough in the near term, and likely would require special handling of different backends (S3 vs. Connect vs. folders vs. ...). As of now, updating the metadata but not the pin contents isn't something straightforward to implement. We could consider how to approach this problem in the future.

I tend to think we should stick with check_hash as the argument name since we will likely think through checking portions of the metadata as well (title, description, tags, and user metadata could all currently change without changing the pin hash).

I tend to think we should still go with the change as scoped here because it lets us offer a solution to the storing-many-copies problem. It does mean folks have to choose between having the metadata reflect the last time the pin contents really changed and the last time a write was attempted; this still seems better than the situation as it is now.

kmasiello commented 1 year ago

Agreed. As currently scoped, this helps the main use case tremendously. If I'm very keen to only store the changed data frame but have a record of the last write attempt, I can bake this logging info into my upstream ETL process. Similar to the other comment about returning the "wrote / didn't write" status, it will be good to keep ears open to use cases around metadata changes.

github-actions[bot] commented 1 year ago

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.