rstudio / pins-r

Pin, Discover and Share Resources
https://pins.rstudio.com

Only create a new pin if no other previous version matches? #826

Open venpopov opened 2 months ago

venpopov commented 2 months ago

I was surprised by this behavior:

library(pins)

board <- board_temp(versioned = TRUE)

board |> pin_write(mtcars, "mtcars")
#> Guessing `type = 'rds'`
#> Creating new version '20240422T010932Z-a4a15'
#> Writing to pin 'mtcars'
Sys.sleep(1)
board |> pin_write(mtcars[1:5, ], "mtcars")
#> Guessing `type = 'rds'`
#> Creating new version '20240422T010933Z-45238'
#> Writing to pin 'mtcars'
Sys.sleep(1)
board |> pin_write(mtcars, "mtcars")
#> Guessing `type = 'rds'`
#> Creating new version '20240422T010934Z-a4a15'
#> Writing to pin 'mtcars'
Sys.sleep(1)
board |> pin_write(mtcars[1:5, ], "mtcars")
#> Guessing `type = 'rds'`
#> Creating new version '20240422T010935Z-45238'
#> Writing to pin 'mtcars'

board |> pin_versions("mtcars")
#> # A tibble: 4 × 3
#>   version                created             hash 
#>   <chr>                  <dttm>              <chr>
#> 1 20240422T010932Z-a4a15 2024-04-22 03:09:32 a4a15
#> 2 20240422T010933Z-45238 2024-04-22 03:09:33 45238
#> 3 20240422T010934Z-a4a15 2024-04-22 03:09:34 a4a15
#> 4 20240422T010935Z-45238 2024-04-22 03:09:35 45238

Created on 2024-04-22 with reprex v2.1.0

In the third step, I am saving the same object as in the first (note the identical hash, a4a15), yet a new version is created. pin_write() only skips the write when the object is the same as the most recent version.

I would like to use pins to track data objects produced by a research pipeline, in which I might change branches to try out different features. With the current behavior, a new version is saved every time I rerun the pipeline after switching branches, even when an identical one is already stored, which duplicates files unnecessarily.

To fix this (perhaps behind an option setting), pin_write() should check the hash not only against the last version, but against all existing versions. Then, to make sure that pin_read() works correctly, it would need to update the "created" field of the matching version (or perhaps a new "reactivated" field) so that this version is considered the most recent.
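As a small illustration of the comparison described above, the hash column returned by pin_versions() already exposes what is needed to spot versions with identical content after the fact. This is only a sketch built on the reprex; the variable names `versions` and `dup` are illustrative, and it shows detection of duplicates, not the proposed change to pin_write().

```r
library(pins)

board <- board_temp(versioned = TRUE)

# Same write sequence as the reprex: the first and third objects are identical
board |> pin_write(mtcars, "mtcars")
Sys.sleep(1)
board |> pin_write(mtcars[1:5, ], "mtcars")
Sys.sleep(1)
board |> pin_write(mtcars, "mtcars")

versions <- board |> pin_versions("mtcars")

# Rows whose short hash already appeared in another stored version
dup <- versions[duplicated(versions$hash), ]
dup
```

Any row in `dup` is a version whose content hash matches another version of the same pin.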

juliasilge commented 2 months ago

Thanks for this feedback @venpopov!

Let's use this issue to collect thoughts on this type of change. There isn't a great workaround right now for your particular use case because it is hard for folks to check the hash themselves before writing: pins:::pin_hash() is unexported, and it takes file paths as its argument rather than an R object, and those paths aren't easy for a user to get at.
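One possible user-side approach, sketched here under stated assumptions, is to avoid pins' internal file hash entirely: hash the R object with rlang::hash(), store that value in the pin's user metadata, and compare it against every stored version before writing. The helper name `pin_write_if_new` and the metadata key `object_hash` are hypothetical and not part of pins. Versions written without this key are never treated as matches.

```r
library(pins)

# Hypothetical helper (not part of pins): write `x` only if no stored version
# carries the same object hash in its user metadata. Returns the version name
# that holds `x`, invisibly.
pin_write_if_new <- function(board, x, name, ...) {
  # rlang::hash() hashes the R object itself, not the file that pins writes
  obj_hash <- rlang::hash(x)

  if (pin_exists(board, name)) {
    versions <- pin_versions(board, name)
    for (v in versions$version) {
      meta <- pin_meta(board, name, version = v)
      if (identical(meta$user$object_hash, obj_hash)) {
        # Match found: skip the write and report the existing version
        return(invisible(v))
      }
    }
  }

  # No match: write a new version and record the object hash (assumed key name)
  pin_write(board, x, name, metadata = list(object_hash = obj_hash), ...)

  # Report the most recently created version, i.e. the one just written
  versions <- pin_versions(board, name)
  invisible(versions$version[which.max(as.numeric(versions$created))])
}

board <- board_temp(versioned = TRUE)
v1 <- pin_write_if_new(board, mtcars, "mtcars")
Sys.sleep(1)
v2 <- pin_write_if_new(board, mtcars[1:5, ], "mtcars")
Sys.sleep(1)
v3 <- pin_write_if_new(board, mtcars, "mtcars")  # same object as the first call

# Read back the matching version explicitly
head(pin_read(board, "mtcars", version = v3))
```

This sketch does not change which version counts as the latest, so pin_read() without a version argument still returns the most recently written one; the caller has to pass the returned version explicitly. It also reads the metadata of every stored version, which may be slow on remote boards with many versions.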