ropensci / software-review

rOpenSci Software Peer Review.
292 stars 104 forks source link

gittargets: data version control for the targets package #486

Closed wlandau closed 2 years ago

wlandau commented 3 years ago

Date accepted: 2022-01-12 Submitting Author Name: Will Landau Submitting Author Github Handle: !--author1-->@wlandau<!--end-author1-- Repository: https://github.com/wlandau/gittargets Version submitted: 0.0.0.9000 Submission type: Standard Editor: !--editor-->@adamhsparks<!--end-editor-- Reviewers: @smwindecker, @mdneuzerling

Due date for @smwindecker: 2021-12-20 Due date for @mdneuzerling: 2021-12-28

Archive: TBD Version accepted: TBD Language: en

Package: gittargets
Title: Data Version Control for the Targets Package
Description: Version control systems such as Git help researchers
  track changes and history in data science projects,
  and the 'targets' package (2021, <doi:10.21105/joss.02959>)
  minimizes the computational cost of keeping the latest results
  reproducible and up to date. The 'gittargets' package
  combines these two capabilities. The 'targets' data store
  becomes a version control repository and stays synchronized
  with the Git repository of the source code.
  Users can switch commits and branches
  without invalidating the 'targets' pipeline.
Version: 0.0.0.9000
License: MIT + file LICENSE
URL: https://wlandau.github.io/gittargets/, https://github.com/wlandau/gittargets
BugReports: https://github.com/wlandau/gittargets/issues
Authors@R: c(
  person(
    given = c("William", "Michael"),
    family = "Landau",
    role = c("aut", "cre"),
    email = "will.landau@gmail.com",
    comment = c(ORCID = "0000-0003-1878-3253")
  ),
  person(
    family = "Eli Lilly and Company",
    role = "cph"
  ))
Depends:
  R (>= 3.5.0)
Imports:
  cli (>= 3.0.0),
  data.table (>= 1.12.8),
  gert (>= 1.0.0),
  processx (>= 3.0.0),
  stats,
  targets (>= 0.8.1.9000),
  tibble (>= 3.0.0),
  utils,
  uuid (>= 1.0.0)
Suggests:
  knitr (>= 1.30),
  markdown (>= 1.1),
  rmarkdown (>= 2.4),
  testthat (>= 3.0.0)
Remotes:
  ropensci/targets
SystemRequirements: Git (>= 2.0.0)
Encoding: UTF-8
Language: en-US
VignetteBuilder: knitr
Config/testthat/edition: 3
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.2

Scope

Version control systems such as Git help researchers track changes and history in data science projects, and the targets package minimizes the computational cost of keeping the latest results reproducible and up to date. The gittargets package combines these two capabilities. The targets data store becomes a version control repository and stays synchronized with the Git repository of the source code. Users can switch commits and branches without invalidating the targets pipeline.

gittargets is for people who use targets for reproducible analysis pipelines and Git for tracking the code files of those pipelines.

Packages gh, gert, git2r, and git2rdata all work with Git from R, but in a way that is more general, less opinionated, and less targets-specific than gittargets. gittargets establishes a separate repository for the targets data that is linked to the project's code repository. This allows the code and data to synchronize without affecting the efficiency or the size of the code repository.

N/A

N/A

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

MEE Options - [n/a] The package is novel and will be of interest to the broad readership of the journal. - [n/a] The manuscript describing the package is no longer than 3000 words. - [n/a] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html)) - (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*) - (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*) - (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

ropensci-review-bot commented 3 years ago

Thanks for submitting to rOpenSci, our editors and @ropensci-review-bot will reply soon. Type @ropensci-review-bot help for help.

ropensci-review-bot commented 3 years ago

:rocket:

Error: Issue template has no 'repourl'

:wave:

ropensci-review-bot commented 3 years ago

Missing values: author1, repourl

wlandau commented 3 years ago

Sorry about that, should be fixed now.

ldecicco-USGS commented 3 years ago

@ropensci-review-bot check package

ropensci-review-bot commented 3 years ago

Thanks, about to send the query.

ropensci-review-bot commented 3 years ago

:rocket:

Editor check started

:wave:

ropensci-review-bot commented 3 years ago

Checks for gittargets (v0.0.0.9000)

git hash: b6eadc4b

Important: All failing checks above must be addressed prior to proceeding

Package License: MIT + file LICENSE


1. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has: - code in R (100% in 15 files) and - 1 authors - 1 vignette - no internal data file - 9 imported packages - 9 exported functions (median 23 lines of code) - 94 non-exported functions in R (median 6 lines of code) --- Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages The following terminology is used: - `loc` = "Lines of Code" - `fn` = "function" - `exp`/`not_exp` = exported / not exported The final measure (`fn_call_network_size`) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile. |measure | value| percentile|noteworthy | |:-----------------------|-----:|----------:|:----------| |files_R | 15| 70.6| | |files_vignettes | 0| 0.0|TRUE | |files_tests | 17| 94.1| | |loc_R | 663| 53.5| | |loc_tests | 417| 68.5| | |num_vignettes | 1| 60.7| | |n_fns_r | 103| 71.7| | |n_fns_r_exported | 9| 38.9| | |n_fns_r_not_exported | 94| 76.9| | |n_fns_per_file_r | 4| 51.7| | |num_params_per_fn | 4| 54.3| | |loc_per_fn_r | 7| 23.4| | |loc_per_fn_r_exp | 23| 54.1| | |loc_per_fn_r_not_exp | 6| 29.1| | |rel_whitespace_R | 6| 24.3| | |rel_whitespace_tests | 7| 76.1| | |doclines_per_fn_exp | 62| 74.1| | |doclines_per_fn_not_exp | 0| 0.0|TRUE | |fn_call_network_size | 123| 80.4| | ---

1a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


2. goodpractice and other checks

Details of goodpractice and other checks (click to open)

#### 3a. Continuous Integration Badges [![github](https://github.com/wlandau/gittargets/workflows/check/badge.svg)](https://github.com/wlandau/gittargets/actions) [![github](https://github.com/wlandau/gittargets/workflows/lint/badge.svg)](https://github.com/wlandau/gittargets/actions) **GitHub Workflow Results** |name |conclusion |sha |date | |:-------|:----------|:------|:----------| |check |success |b6eadc |2021-11-17 | |cover |success |b6eadc |2021-11-17 | |lint |success |b6eadc |2021-11-17 | |pkgdown |success |b6eadc |2021-11-17 | --- #### 3b. `goodpractice` results #### `R CMD check` with [rcmdcheck](https://r-lib.github.io/rcmdcheck/) rcmdcheck found no errors, warnings, or notes #### Test coverage with [covr](https://covr.r-lib.org/) Package coverage: 13.35 The following files are not completely covered by tests: file | coverage --- | --- R/tar_git_checkout.R | 0% R/tar_git_init.R | 0% R/tar_git_log.R | 0% R/tar_git_ok.R | 0% R/tar_git_snapshot.R | 0% R/tar_git_status_code.R | 0% R/tar_git_status_data.R | 0% R/tar_git_status_targets.R | 0% R/tar_git_status.R | 0% R/utils_assert.R | 0% R/utils_git.R | 30.77% #### Cyclocomplexity with [cyclocomp](https://github.com/MangoTheCat/cyclocomp) No functions have cyclocomplexity >= 15 #### Static code analyses with [lintr](https://github.com/jimhester/lintr) [lintr](https://github.com/jimhester/lintr) found no issues with this package!


Package Versions

|package |version | |:--------|:---------| |pkgstats |0.0.2.72 | |pkgcheck |0.0.2.107 |


Editor-in-Chief Instructions:

Processing may not proceed until the items marked with :heavy_multiplication_x: have been resolved.

wlandau commented 3 years ago

gittargets requires command line Git in order for git-lfs to work (related: https://github.com/r-lib/gert/issues/160). The majority of tests will automatically be skipped if tar_git_ok() returns FALSE. Is there still a path forward?

ldecicco-USGS commented 3 years ago

Yeah, that makes sense. I'll get started on getting this moving.

mpadge commented 3 years ago

And I'll check out why coverage might have failed tomorrow, although git is definitely installed. @wlandau Nothing you need to worry about, thanks.


Edit: For the record, @wlandau helped debug there, and the bot checks now give coverage of 100% :rocket: :100:

wlandau commented 3 years ago

Thank you both!

Related: I have not been able to correctly set up Git on the Windows GitHub Actions for gittargets. I tried a bunch of different approaches, but I still get errors claiming I need to configure git config --global user.name and git config --global user.email. The other runners appear to work just fine though, including the one that checks coverage.

ldecicco-USGS commented 3 years ago

@ropensci-review-bot assign @adamhsparksas editor

ldecicco-USGS commented 3 years ago

@ropensci-review-bot assign @adamhsparks as editor

ropensci-review-bot commented 3 years ago

Assigned! @adamhsparks is now the editor

adamhsparks commented 3 years ago

@ropensci-review-bot check package

ropensci-review-bot commented 3 years ago

Thanks, about to send the query.

ropensci-review-bot commented 3 years ago

:rocket:

Editor check started

:wave:

ropensci-review-bot commented 3 years ago

Checks for gittargets (v0.0.0.9000)

git hash: 5c046f1d

Package License: MIT + file LICENSE


1. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has: - code in R (100% in 15 files) and - 1 authors - 1 vignette - no internal data file - 9 imported packages - 9 exported functions (median 23 lines of code) - 96 non-exported functions in R (median 6 lines of code) --- Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages The following terminology is used: - `loc` = "Lines of Code" - `fn` = "function" - `exp`/`not_exp` = exported / not exported The final measure (`fn_call_network_size`) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile. |measure | value| percentile|noteworthy | |:------------------------|-----:|----------:|:----------| |files_R | 15| 70.6| | |files_vignettes | 1| 64.8| | |files_tests | 17| 94.1| | |loc_R | 669| 53.8| | |loc_vignettes | 143| 62.1| | |loc_tests | 503| 72.2| | |num_vignettes | 1| 60.7| | |n_fns_r | 105| 72.3| | |n_fns_r_exported | 9| 38.9| | |n_fns_r_not_exported | 96| 77.3| | |n_fns_per_file_r | 4| 52.4| | |num_params_per_fn | 4| 54.3| | |loc_per_fn_r | 7| 23.4| | |loc_per_fn_r_exp | 23| 54.1| | |loc_per_fn_r_not_exp | 6| 23.5| | |rel_whitespace_R | 6| 24.9| | |rel_whitespace_vignettes | 43| 80.0| | |rel_whitespace_tests | 7| 78.4| | |doclines_per_fn_exp | 61| 73.3| | |doclines_per_fn_not_exp | 0| 0.0|TRUE | |fn_call_network_size | 132| 81.5| | ---

1a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


2. goodpractice and other checks

Details of goodpractice and other checks (click to open)

#### 3a. Continuous Integration Badges [![github](https://github.com/wlandau/gittargets/workflows/check/badge.svg)](https://github.com/wlandau/gittargets/actions) [![github](https://github.com/wlandau/gittargets/workflows/lint/badge.svg)](https://github.com/wlandau/gittargets/actions) **GitHub Workflow Results** |name |conclusion |sha |date | |:------------|:----------|:------|:----------| |check |success |5c046f |2021-11-18 | |cover |success |5c046f |2021-11-18 | |cover-ubuntu |success |5c046f |2021-11-18 | |lint |success |5c046f |2021-11-18 | |pkgdown |success |5c046f |2021-11-18 | --- #### 3b. `goodpractice` results #### `R CMD check` with [rcmdcheck](https://r-lib.github.io/rcmdcheck/) rcmdcheck found no errors, warnings, or notes #### Test coverage with [covr](https://covr.r-lib.org/) Package coverage: 100 #### Cyclocomplexity with [cyclocomp](https://github.com/MangoTheCat/cyclocomp) No functions have cyclocomplexity >= 15 #### Static code analyses with [lintr](https://github.com/jimhester/lintr) [lintr](https://github.com/jimhester/lintr) found no issues with this package!


Package Versions

|package |version | |:--------|:---------| |pkgstats |0.0.3.51 | |pkgcheck |0.0.2.130 |


Editor-in-Chief Instructions:

This package is in top shape and may be passed on to a handling editor

adamhsparks commented 2 years ago

Hi @wlandau, sorry for the delays. I'll endeavour to complete my checks today and start looking for reviewers so we can at least have some assigned before Christmas.

adamhsparks commented 2 years ago

Editor checks:

Editor comments

Thanks for this submission, Will. There's a few of us that were excited to see this one submitted for review. Everything looks good from here with my editor checks. I don't have any major suggestions so I'll start looking for reviewers now.


wlandau commented 2 years ago

Thanks, @adamhsparks! I am excited too.

adamhsparks commented 2 years ago

@ropensci-review-bot add @smwindecker to reviewers

ropensci-review-bot commented 2 years ago

@smwindecker added to the reviewers list. Review due date is 2021-12-20. Thanks @smwindecker for accepting to review! Please refer to our reviewer guide.

ropensci-review-bot commented 2 years ago

@smwindecker: If you haven't done so, please fill this form for us to update our reviewers records.

adamhsparks commented 2 years ago

@ropensci-review-bot add @mdneuzerling to reviewers

ropensci-review-bot commented 2 years ago

@mdneuzerling added to the reviewers list. Review due date is 2021-12-28. Thanks @mdneuzerling for accepting to review! Please refer to our reviewer guide.

ropensci-review-bot commented 2 years ago

@mdneuzerling: If you haven't done so, please fill this form for us to update our reviewers records.

adamhsparks commented 2 years ago

@smwindecker and @mdneuzerling, thank you both for agreeing to review {gittargets} for us. I’m aware that this review and the three weeks may have competing priorities for your time with the holidays and leave at the end of the year/beginning of the new year. With that in mind, please just keep me posted on your progress as you have time. If you get these finished by sometime in January that would be fine, but we’re pretty flexible here and understand that life happens so please do ask if you need more time for any reason.

wlandau commented 2 years ago

Yeah, I definitely don't expect any reviews until 2022. Please take all your time off!

By the way, I am actively using gittargets for a pipeline at work. I am developing a large Bayesian model for clinical data, and I have multiple Git branches to try different versions of the model and data preprocessing. The model takes hours to run, and gittargets lets me switch among multiple up-to-date pipelines when I switch branches.

smwindecker commented 2 years ago

thanks @adamhsparks. My goal was to spend the next few days tinkering so will make a start in any case. I'm using targets for my main work project but still feel a bit junior league on it, particularly related to how best to incorporate the models that are sent to HPC into my targets workflow. Look forward to spending the time on this, but it may be first weeks of Jan.

mdneuzerling commented 2 years ago

Hello @adamhsparks @wlandau. I've made a start on this. Apologies for the delay --- I had to get Christmas out of the way first!

I've started by looking through the code structure to get a rough idea of things. I'm making notes on a pull request, which I'll then compile into a formal review. This is a trick I've used before — check out the first commit of a repo into a separate branch, and then create a pull request onto it.

adamhsparks commented 2 years ago

Thanks for the update @mdneuzerling. No hurry, we’ve still another holiday to go and it is summer here in Oz after all.

mdneuzerling commented 2 years ago

Package Review

Documentation

The package includes all the following forms of documentation:

Functionality

Estimated hours spent reviewing: 5


Review Comments

The code in this package is to a very high standard. I have a few comments around documentation that I believe could help a new user approaching this package for the first time. However, these are only suggestions, and I strongly recommend that this project be submitted to CRAN. I have some other comments that are nitpicks and can definitely be ignored.

I created a very simple project to test the usage of this package. My code is minimal, and can be found at mdneuzerling/gittargets-peerreview.

General comments

Recommendations

Nitpicks

Session info

Click to expand! ``` ─ Session info 🤲🏾 🈁 🤯 ────────────────────────────────────────────────────────────────────────── hash: palms up together: medium-dark skin tone, Japanese “here” button, exploding head setting value version R version 4.1.0 (2021-05-18) os macOS Big Sur 11.3 system aarch64, darwin20 ui RStudio language (EN) collate en_AU.UTF-8 ctype en_AU.UTF-8 tz Australia/Melbourne date 2022-01-09 rstudio 1.4.1717 Juliet Rose (desktop) pandoc NA ─ Packages ──────────────────────────────────────────────────────────────────────────────────────────── package * version date (UTC) lib source askpass 1.1 2019-01-13 [1] CRAN (R 4.1.0) backports 1.4.1 2021-12-13 [1] CRAN (R 4.1.1) base64url 1.4 2018-05-14 [1] CRAN (R 4.1.0) cachem 1.0.4 2021-02-13 [1] CRAN (R 4.1.0) callr 3.7.0 2021-04-20 [1] CRAN (R 4.1.0) cli 3.1.0 2021-10-27 [1] CRAN (R 4.1.1) codetools 0.2-18 2020-11-04 [1] CRAN (R 4.1.0) crayon 1.4.2 2021-10-29 [1] CRAN (R 4.1.1) credentials 1.3.2 2021-11-29 [1] CRAN (R 4.1.1) data.table 1.14.2 2021-09-27 [1] CRAN (R 4.1.1) desc 1.4.0 2021-09-28 [1] CRAN (R 4.1.1) devtools 2.4.0 2021-04-07 [1] CRAN (R 4.1.0) digest 0.6.29 2021-12-01 [1] CRAN (R 4.1.1) ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0) fansi 0.5.0 2021-05-25 [1] CRAN (R 4.1.0) fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.0) fs 1.5.0 2020-07-31 [1] CRAN (R 4.1.0) gert * 1.5.0 2022-01-03 [1] CRAN (R 4.1.1) gittargets * 0.0.0.9000 2022-01-08 [1] Github (wlandau/gittargets@6b3dada) glue 1.6.0 2021-12-17 [1] CRAN (R 4.1.1) igraph 1.2.11 2022-01-04 [1] CRAN (R 4.1.1) knitr 1.37 2021-12-16 [1] CRAN (R 4.1.1) lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.1.1) magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.1.0) memoise 2.0.0 2021-01-26 [1] CRAN (R 4.1.0) openssl 1.4.6 2021-12-19 [1] CRAN (R 4.1.1) pillar 1.6.4 2021-10-18 [1] CRAN (R 4.1.1) pkgbuild 1.2.0 2020-12-15 [1] CRAN (R 4.1.0) pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.0) pkgload 1.2.3 2021-10-13 [1] CRAN (R 4.1.1) prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.1.0) processx 3.5.2 2021-04-30 [1] CRAN (R 4.1.0) ps 1.6.0 2021-02-28 [1] CRAN (R 4.1.0) purrr 0.3.4 2020-04-17 [1] CRAN (R 4.1.0) R6 2.5.1 2021-08-19 [1] CRAN (R 4.1.1) remotes 2.3.0 2021-04-01 [1] CRAN (R 4.1.0) rlang 0.4.12 2021-10-18 [1] CRAN (R 4.1.1) rprojroot 2.0.2 2020-11-15 [1] CRAN (R 4.1.0) sessioninfo 1.2.1 2021-11-02 [1] CRAN (R 4.1.1) sys 3.4 2020-07-23 [1] CRAN (R 4.1.0) targets * 0.10.0 2022-01-07 [1] CRAN (R 4.1.0) testthat 3.1.0 2021-10-04 [1] CRAN (R 4.1.1) tibble 3.1.6 2021-11-07 [1] CRAN (R 4.1.1) tidyselect 1.1.1 2021-04-30 [1] CRAN (R 4.1.0) usethis 2.0.1 2021-02-10 [1] CRAN (R 4.1.0) utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.0) uuid 1.0-3 2021-11-01 [1] CRAN (R 4.1.1) vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.1.0) withr 2.4.3 2021-11-30 [1] CRAN (R 4.1.1) xfun 0.29 2021-12-14 [1] CRAN (R 4.1.1) yaml 2.2.1 2020-02-01 [1] CRAN (R 4.1.0) [1] /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/library ```
adamhsparks commented 2 years ago

Thanks for the comprehensive review @mdneuzerling.

smwindecker commented 2 years ago

Package Review

Documentation

The package includes all the following forms of documentation:

Functionality

Estimated hours spent reviewing: 7


Review Comments

Thank you for creating another useful addendum package to the targets family. I look forward to seeing this tool implemented in the suggested workflows for targets pipelines more broadly.

Statement of need:

tar_git_status()

Remotes

Vignette/instructions

image

Internal functions

Overall, my notes are quite minor and mostly related to improving user experience and uptake rather than real issues with the package. I love the idea and the execution is elegant, especially due to the function name parallels to the functions in gert (which I didn't know about before, so thank you!). I think developing a little more content about suggested best practices of how a targets user should integrate these steps into their workflow would make this contribution to the wider package family even more meaningful.

PS. the guitar in the hex is a very cute touch.

wlandau commented 2 years ago

Thank you @mdneuzerling and @smwindecker for your thoughtful reviews! I am working on responses and improvements to the package.

adamhsparks commented 2 years ago

@smwindecker, thank you for your thoughtful review.

wlandau commented 2 years ago

Response to @mdneuzerling

I feel like the first paragraph of the README doesn't adequately capture the use case of gittargets. What I'm looking to see in the first few paragraphs is something that covers what the targets data store is, and what you want to have happen to it when you switch branches with git.

I expanded the README in https://github.com/wlandau/gittargets/commit/11943f11b1f719e0f815f10e31f2290c8b6c514c to describe those assumptions.

In the project README or elsewhere, I recommend describing the intended workflow of the package. I settled on the below when checking out a different branch, which I think is right:

https://github.com/wlandau/gittargets/commit/ddc1e59f9560131e2aef971d1dfe122568762930 replaces man/figures/workflow.png with a simpler description of the workflow in the Usage section of the README and at the top of the git.Rmd vignette.

The assertions in tar_git_log allow for a value of max = Inf, which is a reasonable value that someone might use. This value is passed to gert::git_log, which will give a confusing error on max = Inf. Consider using is.finite or an equivalent assertion.

Added an assertion in https://github.com/wlandau/gittargets/commit/f12ee85401f607b582bd321283b6ecea21e95101.

I notice many tests are being skipped on Windows. I don't have ready access to a Windows machine. Can the author provide some confidence that this will work on Windows, or is the package intended to be Linux/Mac only? (I think we can safely disregard Solaris...)

When I unskip those tests on my local Windows machine, they all pass. But I have not been able to get them to work on GitHub Actions because the Windows runner claims the user name and email are not set, despite multiple attempted workarounds. As of https://github.com/wlandau/gittargets/commit/5acd4ecf648986b327d056abe4ed869aaf23b465 these tests are only skipped on CI Windows, not local Windows.

I would have preferred to use gert for all Git-related functionality because it installs well and just works on all platforms without any config. But gert relies on libgit2, which still silently fails with Git LFS unless the user makes some custom modifications: https://github.com/git-lfs/git-lfs/issues/375, https://github.com/libgit2/libgit2/issues/1473. So now gittargets requires a full Git client for operations that modify local Git repos. For reading stuff from Git, gittargets still uses gert because it gets the output into R quite nicely.

Regarding the vignette, the "Git data backend" is a local concept as far as I can tell. There's no use of a git remote here. I think for many people, git and GitHub are entirely synonymous, so it would be good to distinguish between the local and remote.

https://github.com/wlandau/gittargets/commit/bfe062009873cb20efe37700ee29d05cbbd9c610 adds a section on remotes and emphasizes that gittargets uses local Git only.

The run function from the processx package is in the NAMESPACE of gittargets. However in R/utils_git.R the author uses processx::run. Normally I wouldn't care, but run is a pretty commonly used function name, so I'd recommend keeping it out of the NAMESPACE if possible by using only processx::run.

I see what you mean, but importing process::run() in the NAMESPACE file only has consequences inside the gittargets private namespace, which does not have a conflicting internal function called "run". And it should not conflict with external code unless I re-export it. And think it is good practice to have NAMESPACE entries for DESCRIPTION Imports: packages (even with namespaced calls in the code, which I also do for the sake of defensive programming.)

tar_git_branch_snapshot and tar_git_commit_code refer to a "^code=" string; the meaning of this isn't entirely obvious (it's the prefix of the data branches). A comment in the code could clear this up.

Added comments in https://github.com/wlandau/gittargets/commit/5b7b1425fe5cb916c67dfd2283d14774f7c4b59a.

While it's not usually necessary to document internal functions, I think it would be a good idea to provide some Roxygen documentation for tar_git_snapshot_menu due to the large number of arguments it accepts. Recall that internal functions can be tagged with @keywords internal to prevent the help pages from being indexed.

Added a docstring in https://github.com/wlandau/gittargets/commit/541168f1c405dda019a4c05e0025a210f92a5327.

There are many SHA fingerprints in the diagrams which may make them harder to understand.

It may look like a lot to take in at first glance, but SHAs are important for understanding what is going on because of how the "code=" data branches match up with code commits. I kept them but simplified the figure in https://github.com/wlandau/gittargets/commit/37a20a75d7149bed889071570a308c4d56972aa9.

The function name tar_git_status_code confuses me. "Status code" makes me think of something like a numeric HTTP status code or Unix exit status, but this returns a tibble.

I get what you mean. I guess I could change it to a name like tar_git_status_code_repo(), but I am not 100% sold at the moment because it would interrupt the internal consistency of the function name convention.

wlandau commented 2 years ago

Response to @smwindecker

With that said, it took reading through most of the vignette and seeing the function outputs to really understand what the package was doing. The README and DESCRIPTION could use clearer statements, for example: "gittargets allows users to create version control snapshots of the data objects created in their pipeline." ... or something simple like that.

I rewrote the README and DESCRIPTION in https://github.com/wlandau/gittargets/commit/11943f11b1f719e0f815f10e31f2290c8b6c514c, hopefully they are clearer now.

Small note, at the first use (before a data snapshot has been made but after the git repo was initialised) we see “! Create the code repository with gert::git_init().” under data git status header, but it shouldn’t be there as a code repo already exists.

https://github.com/wlandau/gittargets/commit/ef8c0e07f87802ba3c6580b27b10377c514b7a83 improves the tar_git_status() messages.

Can a data snapshot be pushed to github, or some other place where it can be assigned, for example a "release" and downloaded? I imagine it could be useful in the scenario of a published/submitted manuscript to be able to push a single version of the snapshot associated with the final code push. Although it is naturally preferable that a reviewer recreate the pipeline from scratch for reproducibility purposes, for many slow pipelines that is unfeasible, and having a snapshot there for folks to examine and play with intermediate outputs would be fantastic as an option.

I think something like this should be possible if you create a Git-LFS-backed repo on GitHub or similar. Data snapshots are just branches, and you could either selectively push them to the remote repo, or you could push them all with git push --all; git push --tags. Then a code release could refer to a branch and commit in the remote data repo. gittargets itself tries not to go beyond local Git. But ultimately, this is the user's responsibility to manage outside gittargets.

Checking out a data snapshot defaults to checking out the latest in a given branch (yes?). I found myself more explicitly using the name of the data snapshot from the tar_git_log when checking it out to ensure I was selecting the appropriate one. Given this, it should be more obvious to users that data snapshots will, by default, be given the same name as the commit message of the code commit they are tied to, and that you can give a name to the snapshot.

As of https://github.com/wlandau/gittargets/commit/40ae23469486196628a426b7210fdf15beb8697c, the docstring of tar_git_checkout() is clearer. By default, tar_git_checkout() checks out the snapshot from the currently checked out code commit ("HEAD"), which is not necessarily the latest in the current branch (my mistake in the previous docstring). In other words, if you do a git reset --hard old_commit or a git checkout old_branch and then call tar_git_checkout(), then you could get a very old (but still correct) data snapshot without supplying the ref argument. The idea is to let the user focus on the code repo and trust tar_git_snapshot() to automatically find the right data. So ideally, the user should almost never need the ref argument. Does that make sense?

The idea of names of data snapshots is also a bit clouded by the "snapshot model" illustration, which shows the SHA instead of the commit name.

I added commit messages to the new figure in https://github.com/wlandau/gittargets/commit/37a20a75d7149bed889071570a308c4d56972aa9.

The instructions could use another step to clarify what happens in a merge. Specifically that after merging a branch, the data snapshot does not get overwritten with the branch snapshot, and if desired, should be checked out at that point. Similarly, clarifying that deleting a branch does not delete its associated data commit.

Yes, ordinary Git operations on the code repo do not change the data repo. https://github.com/wlandau/gittargets/commit/b26a7f5e64be8a728b8e0320ee94cc32f481a0dc adds a small section to the vignette on merges specifically.

Is there a way to clean up outdated data snapshots? tar_git_remove_snapshot? Or is this unnecessary because the point is a version control record? I just imagine it could get too large very fast.

I would have liked to implement this, but unfortunately it seems like an uphill battle against the intended behavior and design of Git. Because gittargets creates a large number of single-commit branches, the usual squash/rebase approach for deleting commits appears to breaks down. (I suppose git merge --squash comes close, but then that creates a new squash commit, and we would need to put that in a different data branch somehow, which seems messy.) An alternative I explored in https://github.com/wlandau/gittargets/issues/9 is to create an orphan branch for each data snapshot, but then we would lose the automatic audit trail of which targets changed between data snapshots, which I would prefer to keep for the sake of reproducibility.

In the gittargets documentation, I do recommend that snapshots only be created at important strategic milestones. This is the opposite of the usual advice for code version control (i.e. to commit early and often). The code and the data have such different needs when it comes to version control, and this fact was actually a major motivation for gittargets, which operates on entirely different Git repos for the code vs the data.

The "snapshot model" illustration shows a linear model of working with gittargets that does not match the vignette steps, which delves into branches. I think the addition of a simple image, in the style of the one below that many of us are familiar with, could attempt to illustrate the separate pathways of the code and data in a branching workflow.

There is a new figure in https://github.com/wlandau/gittargets/commit/37a20a75d7149bed889071570a308c4d56972aa9 which more closely aligns with the vignette and the style of https://git-scm.com/book/en/v2/Git-Branching-Branches-in-a-Nutshell.

When I write into my console targit and hit tab, it shows me a range of functions that are not included in the exported functions (eg. tar_git_commit()) I found this confusing as I searched for the correct exported function to use, especially because these non exported but visible functions could not be searched with the ? command (naturally, they're not exported!) It gave me pause that I was missing some functionalities. I think these show up from the way the functions are documented as they sit in the same file as an exported function (I'm not at all sure this is the explanation).

This happens to me if I load gittargets with pkgload::load_all(), but not if I restart my R session and load it with library(gittargets). Does that help?

mdneuzerling commented 2 years ago

I think @wlandau's responses to my recommendations are more than adequate. In particular, the improvements to the README are superb. I don't have anything further to say about this package, other than that I look forward to seeing it on CRAN and using it in my own projects. Thank you so much for your contributions, @wlandau.

smwindecker commented 2 years ago

Thanks @wlandau for your responses.

The changes and explanations made much clearer the concept of the data snapshots as branches themselves. The relevance of this to understanding the snapshot 'ref's and how to call them is now clear. I appreciate the new figure and especially the accompanying explanation. And yes I had a local copy of the whole package and that's why I was seeing all the functions!

I am satisfied my review has been fully addressed. Look forward to sharing this with others.

adamhsparks commented 2 years ago

@ropensci-review-bot submit review https://github.com/ropensci/software-review/issues/486#issuecomment-1008248927 time 5

ropensci-review-bot commented 2 years ago

Logged review for mdneuzerling (hours: 5)

adamhsparks commented 2 years ago

@ropensci-review-bot submit review time 7

ropensci-review-bot commented 2 years ago

Logged review for smwindecker (hours: 7)

adamhsparks commented 2 years ago

@ropensci-review-bot approve gittargets

ropensci-review-bot commented 2 years ago

Approved! Thanks @wlandau for submitting and @smwindecker, @mdneuzerling for your reviews! :grin:

To-dos:

Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them "rev"-type contributors in the Authors@R field (with their consent).

Welcome aboard! We'd love to host a post about your package - either a short introduction to it with an example for a technical audience or a longer post with some narrative about its development or something you learned, and an example of its use for a broader readership. If you are interested, consult the blog guide, and tag @stefaniebutland in your reply. She will get in touch about timing and can answer any questions.

We maintain an online book with our best practice and tips, this chapter starts the 3d section that's about guidance for after onboarding (with advice on releases, package marketing, GitHub grooming); the guide also feature CRAN gotchas. Please tell us what could be improved.

Last but not least, you can volunteer as a reviewer via filling a short form.

adamhsparks commented 2 years ago

@mdneuzerling, I noted after reading your approval that you still hadn't ticked the "Statement of need" in your checklist. I'm assuming that this has been met since you've approved the package revisions in your most recent post regarding updates to the README. If this is correct, could you please just tic that box to complete the review for us? 🙏

mdneuzerling commented 2 years ago

@adamhsparks Sorry about that! Done.