rstudio / pins-r

Pin, discover, and share resources
https://pins.rstudio.com

Steps towards Data Version Control for data / modelling sources - idea / feature request #766

Closed iandarbeynhiu closed 1 year ago

iandarbeynhiu commented 1 year ago

Hi,

I would like to suggest a feature that I've begun to implement by wrapping the pin reading and writing functions in broader functions, but which could potentially be implemented directly in pins to support comprehensive version control.

For each pin read via the wrapper function, an entry is added to a named list (created on the first read) that captures the pin name, the board path and the version.

For each pin written by the wrapper function, that named list is then added as metadata. Along with this, I've also included steps to check whether the script has been saved and whether a commit is needed, so that the pin write can be rejected if not (the script name and commit details are also added to the metadata at writing).

The end result for any pin write using my wrapper function is a pin with the source data pin name/s, the board the source data came from and the version of the pin used, along with the script name and associated git details.

While the script and git elements may be considered much less generalisable (they rely on RStudio and git), I think there's some merit to an internal data item which could optionally be populated as pins are read and optionally written into the metadata on a pin write.

At the moment my metadata list is returned and updated in the global environment on each pin read, but it may be better kept as an internal object in the pins package.

I can share the code for my two wrapper functions when I get back to my work laptop on Monday.

juliasilge commented 1 year ago

Happy to take a look at some specifics when you get a chance!

iandarbeynhiu commented 1 year ago

Hi Julia,

The two functions are really a pair that go hand in hand (I've done some further work on them since the above comment too).

The "read" pin function:

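NHIU_pin_read <- function(board_pin = NULL, pin_name = "Data Dictionary", pin_version = NULL, data_version_inputs_list = data_version_inputs) {

  data <- pins::pin_read(board = board_pin, name = pin_name, version = pin_version)

  # If no version was requested, record the version pin_meta() reports (the latest)
  if (is.null(pin_version)) pin_version <- pins::pin_meta(board_pin, pin_name)$local$version

  # Start a fresh list on the first read; the default argument is lazy, so it is not evaluated before this check
  if (!exists("data_version_inputs")) data_version_inputs_list <- list()

  # board_pin$path assumes a board that has a path element (e.g. a folder board)
  data_version_inputs_list[[pin_name]] <- c("Path" = board_pin$path, "Source Pin Version" = pin_version)

  # Store the updated list in the global environment so the write wrapper can pick it up
  data_version_inputs <<- data_version_inputs_list

  return(data)
}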

The "write" pin function

library(dplyr) # provides %>%, tibble(), bind_rows(), bind_cols()
library(readr) # provides write_csv()

NHIU_pin_write <- function(board_pin, data, pin_name, metadata_list, file_format, force_identical_write = TRUE,
                           git_check = TRUE) {

  # Save the active script, then refuse to write while there are uncommitted changes
  rstudioapi::documentSave(id = NULL)
  if (isTRUE(git_check) && !identical(system("git status -s", intern = TRUE), character(0))) stop("SAVE AND COMMIT YOUR CODE")

  # NOTE: this overwrites the metadata_list argument, so any value passed in is ignored
  commit <- system("git rev-parse HEAD", intern = TRUE)
  metadata_list <- list("Script" = basename(rstudioapi::getActiveDocumentContext()$path),
                        "Commit" = commit,
                        "SHA" = substr(commit, 1, 7),
                        "R Version" = version$version.string)

  package_dependency <- sessioninfo::package_info() %>% # option to add "attached" to filter to just attached packages
    # currently shows attached packages and their associated dependencies
    dplyr::select("Package Name" = package, "Package Version" = loadedversion,
                  "Package Date" = date, "Package Source" = source, "Attached" = attached)

  # data_version_inputs is the global list maintained by NHIU_pin_read()
  pins::pin_write(board = board_pin,
            x = data, name = pin_name,
            metadata = c(data_version_inputs, metadata_list, package_dependency),
            type = file_format,
            force_identical_write = force_identical_write)

  ### Ultimately output_dependency should be written to a DB or agreed location
  ### Similar for script_dependency
  output_dependency <- dplyr::tibble("Data Pin" = pin_name,
                               "Data Version" = pins::pin_meta(board_pin, pin_name)$local$version,
                               bind_rows(metadata_list),
                               "Source Data Pin" = names(data_version_inputs)) %>%
    bind_cols(bind_rows(data_version_inputs))
  script_dependency <- tibble(bind_rows(metadata_list),
           package_dependency)

  # Append to the CSV logs if they already exist, otherwise create them
  if (git_check) {
    write_csv(script_dependency, file = "script_dependencies.csv", append = file.exists("script_dependencies.csv"))
    write_csv(output_dependency, file = "output_dependencies.csv", append = file.exists("output_dependencies.csv"))
  }

  return(list("Script Dependencies" = script_dependency, "Data Dependencies" = output_dependency))
}

Ideally the data_version_inputs list would be stored away from the global environment to prevent user editing, but that's a nice-to-have I suppose. Whether the script dependencies piece is appropriate for inclusion in the pins package is up for consideration too. This is under active development, so at the moment the write function writes the pin to the board and then returns a list of the two dependency data frames. Ultimately, for our purposes, we'd look to write the dependencies to either a DB or a specific file for our script / data dependencies as appropriate.
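
One way to keep that list out of the global environment is a private environment, which is the usual pattern for package-level state in R. The following is only a rough sketch: the names .lineage, lineage_record(), lineage_get() and lineage_reset() are hypothetical and not part of pins, and the path and version string are made up for illustration.

```r
# Hypothetical sketch: hold the lineage list in a private environment
# rather than in the global environment. None of these names exist in pins.
.lineage <- new.env(parent = emptyenv())
.lineage$inputs <- list()

# Record one pin read (a read wrapper would call this instead of using `<<-`)
lineage_record <- function(pin_name, board_path, pin_version) {
  inputs <- .lineage$inputs
  inputs[[pin_name]] <- c("Path" = board_path, "Source Pin Version" = pin_version)
  .lineage$inputs <- inputs
  invisible(NULL)
}

# Return a copy of what has been recorded so far (a write wrapper would call this)
lineage_get <- function() .lineage$inputs

# Clear the record, e.g. at the start of a script
lineage_reset <- function() {
  .lineage$inputs <- list()
  invisible(NULL)
}

# Illustrative values only
lineage_record("Data Dictionary", "/path/to/board", "20240101T000000Z-abc12")
str(lineage_get())
```

Inside a package, .lineage would sit in the package namespace, so users could only change it through the accessor functions.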

The specific avenue taken here obviously caters to our needs, but I think there's some potential for generalisation within the pin_read and pin_write functions and a data versioning list within the pins package (if you think the concept is worth adding to pins; I do!). Naturally, it relies on boards with versioning enabled.

juliasilge commented 1 year ago

Thanks for sharing this! This is definitely along the lines of our advice for storing custom metadata with pins. Since this is mostly about metadata and not a specific kind of board, this is functionality that we want to make possible in a flexible way, but not provide really specific functions for in pins itself. I can see folks writing functions like this in internal R packages at their company or for their own use, for sure.

I like the idea of suggesting these kinds of metadata options for folks (filename, package versions, git SHA, etc). Let's update that metadata vignette with some "real world" examples of metadata to be stored.
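
As a rough illustration of what such "real world" metadata might look like, here is a minimal sketch. The helper provenance_metadata(), the script name and the pin name are hypothetical and not part of pins; only the metadata argument of pin_write() and the user field of pin_meta() are existing pins features. The pins part runs only if the package is installed.

```r
# Hypothetical helper: collect provenance details as a plain named list
provenance_metadata <- function(script = NA_character_, packages = character()) {
  # Git SHA of the current repository, or NA if git is unavailable or this is not a repo
  sha <- tryCatch(
    suppressWarnings(system2("git", c("rev-parse", "HEAD"), stdout = TRUE, stderr = FALSE)),
    error = function(e) character(0)
  )
  if (length(sha) != 1 || !is.null(attr(sha, "status"))) sha <- NA_character_

  list(
    script = script,
    git_sha = sha,
    r_version = R.version.string,
    # Named list of installed versions for the packages of interest
    packages = lapply(stats::setNames(packages, packages),
                      function(p) as.character(utils::packageVersion(p)))
  )
}

meta <- provenance_metadata(script = "analysis.R", packages = c("stats", "utils"))

# Store the list as custom metadata on a temporary, versioned board
if (requireNamespace("pins", quietly = TRUE)) {
  board <- pins::board_temp(versioned = TRUE)
  pins::pin_write(board, mtcars, name = "mtcars-demo", type = "rds", metadata = meta)
  str(pins::pin_meta(board, "mtcars-demo")$user)
}
```

Keeping the helper separate from the pin_write() call means the same list can also be logged elsewhere, for example to a file or a database table.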

iandarbeynhiu commented 1 year ago

Yes, the aim is for these to be incorporated into an internal R package. I read that metadata vignette and decided to see if I could trace data sources from input to output and store everything that "made" the output in its metadata. I initially focused on the data (i.e. the pin reads) but then realised a string of text is a string of text, so I should really store the script and git commit details too. The goal internally will be to write the two dependencies to two database tables once we are happy with the core concept and have all the variables we want to capture.

Happy to contribute to a vignette.

github-actions[bot] commented 1 year ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.