ropensci / software-review

rOpenSci Software Peer Review.

submission: piggyback #220

Closed cboettig closed 5 years ago

cboettig commented 6 years ago

Summary

Package: piggyback
Version: 0.0.0.9000
Title: Managing Larger Data on a GitHub Repository
Description: Because larger (> 50 MB) data files cannot easily be committed to git,
  a different approach is required to manage data associated with an analysis in a
  GitHub repository.  This package provides a simple work-around by allowing larger
  (up to 2 GB) data files to piggyback on a repository as assets attached to individual
  GitHub releases.  These files are not handled by git in any way, but instead are
  uploaded, downloaded, or edited directly by calls through the GitHub API. These
  data files can be versioned manually by creating different releases.  This approach
  works equally well with public or private repositories.  Data can be uploaded
  and downloaded programmatically from scripts. No authentication is required to
  download data from public repositories.
Authors@R: person("Carl", "Boettiger",
                  email = "cboettig@gmail.com",
                  role = c("aut", "cre", "cph"),
                  comment=c(ORCID = "0000-0002-1642-628X"))
URL: https://github.com/cboettig/piggyback
BugReports: https://github.com/cboettig/piggyback/issues
License: GPL-3
Encoding: UTF-8
LazyData: true
ByteCompile: true
Imports:
    gh,
    httr,
    jsonlite,
    git2r,
    fs,
    usethis,
    crayon,
    clisymbols
Suggests:
    readr,
    covr,
    testthat,
    datasets,
    knitr,
    rmarkdown
VignetteBuilder: knitr
RoxygenNote: 6.0.1.9000
Roxygen: list(markdown = TRUE)

https://github.com/cboettig/piggyback

Reproducibility: accessing the data being analyzed is essential for reproducible workflows, yet we have no good solution for workflows with unpublished data, or for private workflows, once the data is too large for version control (e.g. files > 50 MB).

The target audience is anyone working with data files on GitHub.

datastorr on ropenscilabs is the closest match; it takes a very different approach (from the user perspective -- on the back end both store data as GitHub release assets) to essentially the same problem. The intro vignette discusses at greater length many of the alternative strategies and why I feel they have all fallen short of my needs, which led me to create this package.

Requirements

Confirm each of the following by checking the box. This package:

Publication options

Detail

No errors, notes, or warnings.

Rich FitzJohn, @richfitz, would be great based on his experience in this area and with datastorr. Jenny Bryan, @jennybc, since this package makes heavy use of usethis and GitHub interactions.

annakrystalli commented 6 years ago

Editor checks:


Editor comments

Hi @cboettig 👋, thanks for the submission and great to be working with another of your packages. I believe I remember you talking about this problem. A nice solution to it!

Anyway, see below for the results of my initial checks. I'm sure it's nothing that can't be fixed, so I'm going to go ahead and approach reviewers. I'll take your suggestions on board.

One small suggestion on the README: for storing the GITHUB_PAT environment variable, usethis::edit_r_environ() is also a nice option to point users to.
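
For example, a minimal sketch of that workflow (the token value is a placeholder):

# Open the user-level .Renviron for editing (usethis helper)
usethis::edit_r_environ()

# Then add a line like the following to .Renviron and restart R:
# GITHUB_PAT=<your-personal-access-token>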

checks

OK apart from 1 failing test.

tests

Testing piggyback
✔ | OK F W S | Context
✖ |  2 1   1 | Requiring Authentication [1.5 s]
──────────────────────────────────────────────────────────────────────────
test-with-auth.R:10: error: We can upload data
GitHub API error (404): 404 Not Found
  Not Found
1: pb_upload(repo = "cboettig/piggyback", file = "iris.tsv.gz", tag = "v0.0.1",
       overwrite = TRUE) at /Users/Anna/Documents/workflows/rOpenSci/editorials/piggyback/tests/testthat/test-with-auth.R:10
2: gh("DELETE /repos/:owner/:repo/releases/assets/:id", owner = x$owner, repo = x$repo,
       id = ids[i], .token = .token) at /Users/Anna/Documents/workflows/rOpenSci/editorials/piggyback/R/gh.R:157
3: gh_process_response(raw)

══ Results ═══════════════════════════════════════════════════════════════
Duration: 9.2 s

OK:       15
Failed:   1
Warnings: 0
Skipped:  1

I believe in you!

spell check

flagging potential typos:

  WORD               FOUND IN
commitish          pb_new_release.Rd:17
delimeters         pb_list.Rd:26
gess               pb_delete.Rd:10, pb_upload.Rd:13
prerelease         pb_new_release.Rd:32
transfered         pb_pull.Rd:27

coverage

Unable to check because of test failure.

annakrystalli commented 6 years ago

Also, could you also add the review badge to the README? 👍

[![](https://badges.ropensci.org/220_status.svg)](https://github.com/ropensci/onboarding/issues/220)
cboettig commented 6 years ago

@annakrystalli Thanks!

I'm not sure how to change the skip command to handle that case; currently it just skips the authentication-requiring tests if the token is undefined (""), i.e. so it will skip on CRAN; Travis runs these tests since I can give Travis an encrypted token. Note the coverage will be lower than what Codecov shows if you test locally, due to these skips.

annakrystalli commented 6 years ago

Aha 🤔, not sure how feasible it is, but it feels like the only people that test should run for are the repo owner and any collaborators. So perhaps the test condition could include a check that the gh user matches the owner or a collaborator?

Also, all tests passing after disabling GITHUB_PAT 👌

annakrystalli commented 6 years ago

Also, current coverage,

piggyback Coverage: 0.36%
R/push-pull.R: 0.00%
R/utils.R: 0.00%
R/gh.R: 0.77%

As you mentioned, some tests didn't run. What sort of coverage are you getting?

annakrystalli commented 6 years ago

Sorry just found link to codecov. Ignore comment above.

richfitz commented 6 years ago

I'm not sure how to change the skip command to handle that case; currently it just skips the authentication-requiring tests if the token is undefined ("") (i.e. so it will skip on CRAN; travis runs these tests since I can give travis an encrypted token

@cboettig - I use "opt in" environment variables for tests like this, something like <PACKAGENAME>_USE_<FEATURE>, and set that in my .Renviron and .travis.yml; then, in the testthat helper functions, I have an equivalent skip_if_not_<feature> function that skips the test if the variable is not set to true.
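
A minimal sketch of that pattern, with illustrative names (PIGGYBACK_USE_AUTH and the helper name are assumptions, not existing code):

# tests/testthat/helper-auth.R (hypothetical file)
skip_if_not_auth <- function() {
  if (!identical(Sys.getenv("PIGGYBACK_USE_AUTH"), "true")) {
    testthat::skip("PIGGYBACK_USE_AUTH is not set to 'true'; skipping authenticated tests")
  }
}

# Then call skip_if_not_auth() at the top of any test needing a real token.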

mpadge commented 6 years ago

Or universal local ones: I just set MPADGE_LOCAL=true. If I've got that, then GITHUB_PAT must be mine.

cboettig commented 6 years ago

Thanks all, that's brilliant. And after a few boneheaded attempts I managed to pass the secure + public env vars to Travis correctly this time.

coatless commented 6 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide.

Documentation

The package includes all the following forms of documentation:

For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • [x] A short summary describing the high-level functionality of the software
  • [x] Authors: A list of authors with their affiliations
  • [x] A statement of need clearly stating problems the software is designed to solve and its target audience.
  • [x] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

Final approval (post-review)

Estimated hours spent reviewing: 1.5


Review Comments

Environment Variable Access

The environment variable for the Personal Access Token (PAT), GITHUB_TOKEN, is not the name commonly used to store the PAT across development packages. Generally, the PAT environment variable is named GITHUB_PAT, so an additional environment-variable check is warranted.

Note that gh and devtools both prefer GITHUB_PAT.

https://github.com/r-lib/gh/blob/8677f7d287ef76c33f33831ea09a6df89b31f88e/R/gh_request.R#L143-L146

https://github.com/r-lib/devtools/blob/26c507b128fdaa1911503348fedcf20d2dd30a1d/R/github.R#L86-L106

I think it would also be advantageous to mimic the github_pat function from devtools in place of get_token():

https://github.com/cboettig/piggyback/blob/5c92c8753eda0c50f721ee2b39d0f03c8bc24b80/R/gh.R#L281-L283
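
A rough sketch of such a fallback lookup (illustrative only, not piggyback's actual get_token(), which the permalink above shows):

get_token <- function() {
  token <- Sys.getenv("GITHUB_TOKEN", unset = Sys.getenv("GITHUB_PAT", unset = ""))
  if (identical(token, "")) {
    warning("No GitHub token found in GITHUB_TOKEN or GITHUB_PAT; ",
            "unauthenticated requests are subject to much lower rate limits.")
  }
  token
}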

Package API

The API for the package dances back'n'forth between treating the repository as the data and the data as a file. I can understand the use cases for both; but would it be better practice to emphasize the repository as the data? This would simplify the interface to mimic pre-existing APIs, e.g. devtools::install_* and download.file(url, destfile). Alternatively, consider making the API more verbose, e.g. pb_download_repo(repo) or pb_download_repo_file(repo, file)

Package API:

  library("piggyback")
  lsf.str("package:piggyback")
#> pb_delete : function (repo = guess_repo(), tag = "latest", file, .token = get_token())  
#> pb_download : function (file = NULL, dest = ".", repo = guess_repo(), tag = "latest", 
#>     overwrite = TRUE, ignore = "manifest.json")  
#> pb_download_url : function (file = NULL, repo = guess_repo(), tag = "latest")  
#> pb_list : function (repo = guess_repo(), tag = "latest", ignore = "manifest.json")  
#> pb_new_release : function (repo = guess_repo(), tag, commit = "master", name = tag, 
#>     body = "Data release", draft = FALSE, prerelease = FALSE, .token = get_token())  
#> pb_pull : function (repo = guess_repo(), tag = "latest", overwrite = TRUE, manifest = ".manifest.json")  
#> pb_push : function (repo = guess_repo(), tag = "latest", overwrite = TRUE, manifest = ".manifest.json")  
#> pb_track : function (glob)  
#> pb_upload : function (file, repo = guess_repo(), tag = "latest", name = NULL, 
#>     overwrite = FALSE, .token = get_token())

Function Parameters

Another notable issue is the placement of positional parameters after parameters with defaults.

e.g. file in pb_delete

pb_delete(repo = guess_repo(), tag = "latest", file, .token = get_token())

Consider putting file first or indicating missingness with file = NA.
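
For instance, the reordered signature could look like this (body elided):

pb_delete <- function(file, repo = guess_repo(), tag = "latest", .token = get_token()) {
  # ... unchanged body ...
}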

Function Exports

It may be beneficial to also export and document guess_repo() as that is prominently featured in function signatures.

https://github.com/cboettig/piggyback/blob/5c92c8753eda0c50f721ee2b39d0f03c8bc24b80/R/push-pull.R#L261-L268
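
For reference, a rough sketch of what such a helper can look like when built on git2r (an illustration only, not the actual implementation linked above):

guess_repo <- function(path = ".") {
  repo <- git2r::repository(git2r::discover_repository(path))
  url <- git2r::remote_url(repo, "origin")
  # reduce "https://github.com/owner/repo.git" or "git@github.com:owner/repo.git"
  # to "owner/repo"
  sub("\\.git$", "", sub("^.*[:/]([^/]+/[^/]+)$", "\\1", url))
}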

Documentation

Outside of that, it seems like there are a few artifacts of an older API in the documentation examples, and some exported functions are missing documentation:

Missing examples:

Outdated:

annakrystalli commented 6 years ago

Wow! I'm sure this might be some sort of record @coatless! 👏 Thanks so much for such a super speedy response!

@cboettig I'm still looking for the second reviewer. Feel free to wait and respond to both reviews together, or respond to @coatless anytime before.

coatless commented 6 years ago

@annakrystalli apologies for jumping the gun. I had a use case in mind when I saw the package. So, I just quickly took down a few notes.

annakrystalli commented 6 years ago

No worries at all @coatless! I'm not going to complain about having your review in! In fact, having a relevant use case handy often makes for a really useful review.

cboettig commented 6 years ago

@coatless thanks, this is great! How'd your use case go?

I've fixed the missing and outdated examples. I've made most of the examples into \donttest blocks and made sure they run successfully with devtools::check(cran = FALSE, run_dont_test = TRUE), though that requires my PAT since the tests upload data to cboettig/piggyback. Not sure if there's a better strategy for documentation than that -- obviously this means copy-pasting examples won't work, but examples need a repo they can upload to. Ideas / suggestions welcome.

Re the "file" vs "repo" interface, that's very intentional -- but perhaps I need to document it better. If I'm just trying to debug an example on some data, I'd rather use a file-based interface to quickly share the file over, say, a generic "scratch" repository (regardless of my current working repo, or if I'm not "in" any repo). Locally I can do:

pb_upload("mtcars.tsv.gz", repo  = "cboettig/scratch")
pb_download_url("mtcars.tsv.gz", repo  = "cboettig/scratch")

Then I can post a question like: "Hey Jim, why does this fail:"

read.table("https://github.com/cboettig/scratch/releases/download/data/mtcars.tsv.gz")

This needs the more file-based workflow of upload/download/download_url/list/delete. Other times, I'm working in a particular repository with data connected to that project, and I really want the more "git-lfs" style workflow with the ability to track, say, all *.tif or *.tsv.gz files. Does that make sense? Can I do a better job of explaining that somewhere?
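
For concreteness, a sketch of that second, repo-centric workflow using the functions as they exist at this point in the review (the glob patterns are illustrative):

# From inside the project's git repository:
pb_track("*.tif")       # record glob patterns in .pbattributes
pb_track("*.tsv.gz")
pb_push()               # upload everything tracked to the latest release
pb_pull()               # later, fetch any tracked files that changed upstream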

Re GITHUB_PAT -- as you probably saw, piggyback will recognize that name too. But I do think that choice was one of Hadley's rare mistakes: GITHUB_TOKEN appears in some 86K code files on GitHub vs only 2K that use GITHUB_PAT, and the gh package also recognizes both. I probably should do something like devtools does to provide a default public token, though. I would import the devtools method directly, but I'm hesitant to do so while the devtools diaspora still seems underway; not sure where that routine will end up. I've tried instead to build on usethis routines here whenever possible. Perhaps @jennybc has insight here?

Re exporting guess_repo(): I agree that's a useful pattern (I wish I could import it from usethis or git2r myself!), but I think I should keep it internal precisely because it's quite general -- I really wouldn't want people introducing a revdep on piggyback just for that function, and I don't want to guarantee that the method is robust and general, rather than an internal method whose behavior might change.

jennybc commented 6 years ago

Re: GitHub PAT

FWIW in usethis I have basically inlined gh::gh_token() (which is now exported by dev gh):

https://github.com/r-lib/usethis/blob/3cc50ae7c4877d5f10747d0e8f25469897206cfe/R/github.R#L261-L266

It's not great but I don't know what else to do.

Re: guess_repo()

FWIW the tidyverse/r-lib team has Git remote URL parsing all over the place, which we would dearly love to centralize. We plan to make a mini-package for this (https://github.com/r-lib/ghurl/issues). It seems incredibly specific, but actually many other languages have such a thing. It comes up a lot! I think you could use your own function for now, but not encourage external use, and hopefully we'll make this other thing available to ourselves and all.

cboettig commented 6 years ago

Thanks @jennybc, this is great. Yeah, I have almost the same inline call in

https://github.com/cboettig/piggyback/blob/ac5acf06f85e768347407c46b505792621fdb591/R/gh.R#L304-L306

I wonder if it would be worth including a default "public" token, the way devtools::github_pat() does, so that tasks not requiring auth don't run afoul of rate limiting on tokenless calls, or if that would create more confusion than benefit...

I'm :100:% for mini-packages for common utilities -- much better than duplicating the same utility code internally or exporting that functionality from a package in which it's orthogonal to the actual purpose/API of the package. ghurl 👀

annakrystalli commented 6 years ago

Thanks to @mpadge for volunteering and using the package already, we now have both reviewers! 🎉


Reviewers: @coatless @mpadge Due date: 2018-07-12

annakrystalli commented 6 years ago

@cboettig also, I can't see the Under Review badge in the README anymore?

cboettig commented 6 years ago

@annakrystalli Karthik says the badge server got restarted for upgrades; it looks like it is back now.

mpadge commented 6 years ago

Package Review

Documentation

The package includes all the following forms of documentation:

The package does not seem to have contributing guidelines in the README or as a separate file.

For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • [x] A short summary describing the high-level functionality of the software
  • [x] Authors: A list of authors with their affiliations
  • [x] A statement of need clearly stating problems the software is designed to solve and its target audience.
  • [ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

There are no references. Note that I previously submitted a reference-less paper.md to JOSS and was asked to please provide at least one reference.

Functionality

Final approval (post-review)

Estimated hours spent reviewing: 5


Review Comments

This is a great and (for me at least) highly necessary package that makes my life so much easier. I am currently collaborating on a large project with multiple repos hosting code to generate very large data sets, for which this package has provided an ideal solution to prevent repo bloat. While this could all be done manually and laboriously, it is now possible to simply send links to collaborators who've got better things to do than learn the intricacies of git(hub), and write in the README:

library (our_data_package)
download_all_data ()

That's awesome! Thanks @cboettig!

My major criticism

To get the negative out of the way first: The package as it stands is in many ways likely to be barely usable by non-coders because it issues almost no informative error messages. Almost all arguments appear to be passed unchecked directly to the GitHub API, which then returns a variety of largely unexpected and unintelligible HTTP request errors.

Providing useful error messages is crucial for productive use of any package, and would in this case have made the job of reviewing it considerably easier. Let me illustrate what I consider to be some inadequately processed errors:

pb_new_release(repo = "mpadge/myrepo", tag = "v1", .token = NULL)
# Error in pb_new_release(repo = "mpadge/junk", tag = "v1", .token = NULL) :
#   Not Found (HTTP 404).

Please issue a useful message about the necessity of local tokens.
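
A minimal sketch of the kind of check meant here (the wording and placement are illustrative):

if (is.null(.token) || identical(.token, "")) {
  stop("No GitHub token found. Please set the GITHUB_TOKEN (or GITHUB_PAT) ",
       "environment variable to a personal access token.", call. = FALSE)
}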

pb_upload ("non-existent-file.Rds", repo = "mpadge/myrepo")
# Error in httr::upload_file(file) : file.exists(path) is not TRUE

Please intercept such errors prior to actually calling httr, rather than relying on that package to issue your errors.

pb_upload ("existent-file.Rds", repo = "mpadge/myrepo",
           tag = "non-existent-tag")
# Error in gh("/repos/:owner/:repo/releases/tags/:tag", owner = r[[1]],  :
#               GitHub API error (404): 404 Not Found
#                 Not Found

That last one is particularly unhelpful, and would be much more useful with perhaps a dialog starting

# Tag 'v2' does not exist in that repository. Would you like it to be created?

(With suitable control over, or default values for, draft and pre-release options.)

pb_upload (file = "this-file-already-uploaded.Rds")
# Error in pb_upload(file = "data/x.Rds") :
#   Unprocessable Entity (WebDAV; RFC 4918) (HTTP 422).
pb_upload (file = "this-file-already-uploaded.Rds", overwrite = TRUE) # okay

The first of these should generate a message/warning about the overwrite parameter, rather than a 422 error. Finally,

pb_new_release (tag = "this-tag-already-exists")
# Error in pb_new_release(tag = "this-tag-already-exists") :
#   Unprocessable Entity (WebDAV; RFC 4918) (HTTP 422).

In short: Please pre-empt and process all errors within this package, rather than relying on external packages to detect and issue your error messages.

Back to the positive

Okay, now that that's out of the way, let me return to the positive tone in which I began. This really is a great and hugely useful package, and I hope that the remainder of this review can somehow contribute to making it even better.

Before proceeding, I note a general issue raised by @coatless regarding (in his words) the dance "between the repository as the data and the data as a file." I would extend that dance to a parallel and unavoidable one between git and GitHub, and specifically between tags and releases. The package utilises the mechanism of GitHub releases to enable clean interfaces with locally-hosted git repositories, which only understand tags. I am unsure whether this is a package designed to interface with a GitHub repo (via releases) or a local repo (via tags), although I do note the useful section of the README/vignette in this context. Consider nevertheless that pb_new_release() will construct both a new tag on GitHub and a corresponding new release, yet this tag is not fetched locally, and so my local version of the repo will not know about this tag.

While the package clearly has a primary GitHub (rather than git) focus, it is nevertheless largely built around the guess_repo() function -- which, in my opinion, to strongly concur with @coatless, really ought to be exported -- and this function is actually built on entirely local (git) queries (via git2r::discover_repository()). Given this, I suggest it would be useful to at least have some kind of optional (if not default) trigger to fetch all remote tags created with pb_new_release() to the local instance.

@cboettig also has a comment in the current code of the gh.R/pb_list() function:

Should we report the tag name too? x$tag_name?

To which I absolutely suggest yes. It is perhaps useful to describe my current project to indicate a kind of use case not directly envisioned here. We have a variety of data representing a variety of stages from raw data to final output. There are several processing stages, and each of these now has -- thanks to the ease and brilliance of this package -- its own tag and corresponding GitHub release. We therefore generally need to know at all times what the tags actually are, and with which tags each file is associated. This currently requires a command-line git interface to discern the tags, and the piggyback interface to list the files. Having these two combined within a single pb_list() call would be supremely useful here.

I also feel that pb_list() should by default list all files, enabling the kind of insight I just mentioned, with tag then restricting the view to only those files associated with that tag (whether latest or not). It would also be very useful to have a simple pb_tags() function that lists all tags on the GitHub repo; see the sketch below.
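
A rough sketch of what such a pb_tags() could look like via the REST API (pb_tags() is hypothetical; guess_repo() and get_token() are the package internals discussed above):

pb_tags <- function(repo = guess_repo(), .token = get_token()) {
  r <- strsplit(repo, "/")[[1]]
  releases <- gh::gh("/repos/:owner/:repo/releases",
                     owner = r[[1]], repo = r[[2]], .token = .token)
  vapply(releases, function(x) x$tag_name, character(1))
}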

Primary functionality

The package is divided between two modes, the explicit pb_up/download() and pb_push/pull(). I find both of these very useful, yet suspect that a merging of the two may ultimately be even more useful. The main difference actually seems to be the creation by the latter of a local .pbattributes file which is then used to track all files that are or can be pushed and pulled, enabling wholesale calls to pb_push/pull() without having to explicitly specify file names. On the other hand, pb_up/download() always require explicit filenames, and these calls never interact with .pbattributes at all.

Again, I do not think my current use case is likely to be particularly unusual, but we simply have several files of which only some are to be push/pull-ed or up/download-ed. This currently requires the latter, explicit interfaces, yet relying on these then prevents a simple pb_pull() call to download all data, which is a shame. One easy solution would be for pb_up/download() to also add entries to the .pbattributes used by pb_push/pull(), so that pb_pull() could be used, but I suspect a nicer solution might ultimately be to have a single pair of calls (say, pb_up/download(), although I have no preference for nomenclature either way), such that pb_upload("*.dat") functions as pb_push() currently does; pb_upload("x.dat") uploads that file only and adds the name to .pbattributes; and pb_upload() uploads all files listed in (or matching globs in) .pbattributes. Same for downloads.

Miscellaneous other notes

pb_delete

The following call simply and mysteriously completes silently:

pb_delete ("these-data-do-not-exist.Rds")

As does this one:

pb_delete ("these-data-do-exist.Rds")

Please issue a message in the first instance. It would also be useful to be able to:

pb_delete (tag = "v1")

with the expected results then manifest both locally and on github.

pb_push/pull

Note that pb_push/pull() generally behave as expected, yet if a file is deleted locally, then calling pb_pull() no longer displays the name of the file prior to displaying the progress bar. This inconsistency ought to be rectified.

Further note that if that file is again removed locally, then calling pb_push() simply completes silently. It would be useful in this instance to check any files explicitly listed in .pbattributes, and to issue some kind of informative message that there are no local files to push.

I also suspect that some more thought might be productively put into exactly what role the .pbattributes file plays. If files are deleted locally, should they also be removed from .pbattributes? In this case, a call to pb_delete() could easily do all of these.

GITHUB_TOKEN

See discussion of nomenclature here, and my personal, opinionated view that "GITHUB_PAT" is obfuscatory -- "TOKEN" is a word; "PAT" is not.

tests

I note simply that when this package goes to CRAN, all tests will be skipped. I'm not sure whether their current methods actually examine the structure of tests, but it is certainly no longer acceptable to have a package without tests. If they do examine whether tests are run (rather than just written), then the tests may need to be modified (purely in regard to CRAN, not at all in regard to this review).


Thanks @cboettig for a supremely useful and (lack of error messages notwithstanding) easy-to-use package. I look forward to it only becoming even easier to use through this review process.

annakrystalli commented 6 years ago

Thanks for the super timely review @mpadge, and so great to see the package being put to good use already, with some decent use cases thrown at it too.

@cboettig over to you!

cboettig commented 6 years ago

@mpadge Thanks for this, fantastic review.

annakrystalli commented 6 years ago

Hello @cboettig and @mpadge 👋.

Just wondering if any progress had been made with the issues highlighted in the review and feedback requested by @cboettig?

cboettig commented 6 years ago

Hey @annakrystalli , thanks for the ping. I think this is mostly waiting for me to finish out https://github.com/cboettig/piggyback/issues/8, which is about half-done on the timestamps branch (along with about 40% of the error message handling added there as well).

I take the thumbs-up from @mpadge to mean he's okay with me leaving in push/pull and upload/download as separate APIs, and not having the low-level functions wade into the additional complexity and possible failure points of dealing with a manifest, but maybe I'm just wishing that interpretation on it.

Though timestamps may not be an ideal solution, support for them will work on the low-level API (as well as the high-level one); it might even be better than the manifest/hash approach of the push/pull API, effectively making it the simplification Mark is looking for?

I'll ping again once I've merged that and dealt with error messages, or if I have any additional questions about the error messages. Mark's list of specific failure modes is super helpful so I think I can at least improve on the errors and have a revised version to review by, say, the end of the month.

cboettig commented 6 years ago

@mpadge Okay, I think I've hit all of these issues at this point.

You should get more helpful error messages now, see test-errors.R for examples.

Regarding the unified interface, I think the support for timestamps by default in pb_upload / pb_download accomplishes much of the interface you might want out of pb_pull() / pb_push(). To take this one step further towards making pb_push/pb_pull dispensable, I've also modified pb_track: pb_track() now invisibly returns a list of all tracked files, and can optionally be called with no arguments. This might facilitate common workflows such as:

pb_track("*.csv") %>% pb_upload()

Since it still writes to .pbattributes it will remember previously tracked patterns:

pb_track("*.csv")
pb_track("data/*")
write_csv(mtcars, "data/mtcars.csv") # note file doesn't have to exist when pattern was added

pb_track() %>% pb_upload()

So the only thing pull and push do that you cannot get with the above workflow is the manifest hashes, which I'm still not sure is sufficiently robust to be that useful. Maybe pull and push should be dropped at this point, but I'm still on the fence. Feedback from you, @annakrystalli , @coatless etc would be great on that call as well.

I think I've hit all the little things mentioned as well. I have a few references in the paper.md (see paper.bib), using pandoc.

mpadge commented 6 years ago

Thanks @cboettig. I'll take it out for a test ride in the coming week and report back then.

coatless commented 6 years ago

@cboettig thanks for the update. I'll check it out this weekend.

annakrystalli commented 6 years ago

Hello @mpadge and @coatless! Any news on your test drives of the changes? Please let me know if you are happy that the changes address all your concerns.

@cboettig my personal feeling on:

So the only thing pull and push do that you cannot get with the above workflow is the manifest hashes, which I'm still not sure is sufficiently robust to be that useful. Maybe pull and push should be dropped at this point, but I'm still on the fence.

is to remove pull and push if they are largely redundant, and maybe return them later on if you are happy they are robust enough. If not, perhaps noting this background info somewhere in the documentation (so users can easily make an informed choice on which workflow best suits their needs) could also work. Ultimately I'd weigh @mpadge's and @coatless's feedback on this more, though.

mpadge commented 6 years ago

Thanks @cboettig for the great improvements to the package, and sorry for the slight delay getting back to you. The following comments are mere suggestions that, in the subjective opinion of this reviewer, might just help to make it even more useful and/or easy to use.

Error messages and tags

As indicated by @cboettig, most previously hard-to-decipher error messages have been addressed, and the package is as a result considerably more user friendly. There nevertheless remain a couple of unhelpful error messages prompted by passing inappropriate arguments through to package dependencies, primarily httr. One simple one of these is:

pb_upload ("non-existent-file.Rds", repo = "mpadge/myrepo")

This still passes the non-existent file name through to httr and returns file.exists(path) is not TRUE. Please insert a simple catch at the start of pb_upload():

if (!file.exists (file))
    stop ("file ", file, " does not exist")

The remainder may be more effectively addressed by first extracting all tags for a repository and then using those throughout to check each tag parameter. The extraction of tags is very simple using the GraphQL interface. My suggestion would be to ask @maelle if you might kindly borrow her create_client function from the ghrecipes package:

create_client <- function(){
  token <- Sys.getenv("GITHUB_GRAPHQL_TOKEN")
  if (token == "")
  {
    stop ("Please set your GITHUB_GRAPHQL_TOKEN environment variable.")
  } else
  {
    if (!exists ("ghql_gh_cli")) {
      ghql_gh_cli <- ghql::GraphqlClient$new (
        url = "https://api.github.com/graphql",
        headers = httr::add_headers (Authorization =
                                     paste0 ("Bearer ", token))
      )
      ghql_gh_cli$load_schema()
      ghql_gh_cli <<- ghql_gh_cli
    }
    return (ghql_gh_cli)
  }
}

A get_tags function could then simply be written like this:

get_tags <- function (owner, repo)
{
    qtxt <- paste0 ('{
                    repository(owner:"', owner, '", name:"', repo, '") {
                        tags:refs(refPrefix:"refs/tags/", last:50) {
                            edges {
                                tag:node {
                                    name
                                }
                            }
                        }
                    }
    }')
    qry <- ghql::Query$new()
    qry$query('gettags', qtxt)
    dat <- create_client()$exec(qry$queries$gettags) %>%
        jsonlite::fromJSON ()
    if (is.null(dat$data$repository)) stop(dat$errors$message)

    dat$data$repository$tags$edges
}
get_tags ("mpadge", "junk")

This obviously introduces another dependency (ghql), but I suspect this package will likely be improved at some stage through incorporating ghql calls anyway. These tags can be cached somehow on the first call for a given repo, and then re-used throughout a session. A simple

get_tags <- memoise::memoise (get_tags)
tags <- get_tags (owner, repo)

would do the trick (sorry, requiring yet another dependency, but I also really think the package will at some stage require some more sophisticated caching like this).

This would then allow the kind of prompt I suggested in the initial review along the lines of

pb_upload (..., tag = "non-existent-tag")
# Tag 'non-existent-tag' does not exist in that repository. Would you like it to be created?

And it would happen instantly, without recourse to any API calls.

This would also enable error messages for the following call:

pb_new_release(repo = "mpadge/junk", tag = "this-tag-already-exists")

At the moment this still returns the entirely illegible:

# Error in pb_new_release(repo = "mpadge/junk", tag = "this-tag-already-exists") :
#   Unprocessable Entity (WebDAV; RFC 4918) (HTTP 422).

other stuff

pb_list

This function still does not return tag information - did you intend to incorporate that? I would reiterate most of the comments I previously made regarding this function.

pb_delete

This call now returns an appropriate error message:

pb_delete (file = "non-existent-data", repo = "mpadge/junk", tag = "v1")
# non-existent-data not found on GitHub

As previously stated, it would still be nice to be able to delete an entire release. At the moment this is not possible:

pb_delete (repo = "mpadge/junk", tag = "v1")
# Error in asset_filename(file) :
#   argument "file" is missing, with no default

Any comments on potentially enabling that?

pb_track

This does indeed work as @cboettig suggests, but I note that pb_upload is not vectorised:

pb_track ("data/*") # where * matches >= 2 files
pb_track () %>% pb_upload (repo = "mpadge/junk", tag = "v1")
# Error in vapply(elements, encode, character(1)) :
#   values must be length 1,
#  but FUN(X[[1]]) result is length 2

It would be very useful to simply make the current pb_upload some kind of pb_upload_one function, and then have pb_upload be a simple vapply (or lapply) call to that; see the sketch below.
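
A minimal sketch of that vectorisation (pb_upload_one is hypothetical, standing in for the current single-file logic; lapply is used since each upload returns an API response rather than an atomic value):

pb_upload <- function(file, repo = guess_repo(), tag = "latest",
                      name = NULL, overwrite = FALSE, .token = get_token()) {
  invisible(lapply(file, pb_upload_one,
                   repo = repo, tag = tag, name = name,
                   overwrite = overwrite, .token = .token))
}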

I'll leave my comments there for the moment, and await feedback from @cboettig

cboettig commented 6 years ago

Thanks @mpadge, really appreciate the detailed feedback, and I agree these things should be done. My apologies that most of the things in "other stuff" look like oversights on my part; they should be straightforward to implement.

The only thing I'm on the fence about is the tags etc., not because I disagree about doing it, but because I think it's going to require more thought on the right way to do this. Like you say, ultimately this package should really be caching some relatively detailed metadata about the repository. Of course this will require some more invasive surgery to implement...

I'm not sure memoise is the way to do this, since memoise invalidates the cache only when the function arguments change, and here the repo information is likely to change without that happening. Still, it would be a convenient way to make the cache persist through a session, so it might be a decent starting point.

I think I can get all the tags easily enough from the standard REST API, instead of ghql. Will have to dig into this...

Mark, I'd also like to ask your permission to list you as a contributor to the package, if that's okay with you? Your detailed input has made a substantial contribution to the package.

mpadge commented 6 years ago

:+1: Re: "ctb": :+1: of course! Thanks!

Re: tags + caching: Yeah, some invasive surgery will be required, and yeah, memoise is perhaps not ideal for the reasons you state, but then again pb_new_release is the only function that should change tags (at least until, or if, pb_delete(tag = ...) is incorporated), so consistent cache updating would just require re-running the memoised get_tags() within pb_new_release.

And yeah, API v3 would also be entirely sufficient, because tag info will likely never run into rate limits. I just got personally converted to API v4 through @maelle's GraphQL evangelising, and don't plan on going back.

I hope all this back-and-forthing is okay with you, and I hope this doesn't come across as me being too picky. I really do just want to help make an already supremely useful package even cleaner and easier to use, and I strongly suspect that at least finding an initial system for interim caching will help make future package development much easier and smoother.

annakrystalli commented 6 years ago

Hi @cboettig! Really happy to see the discussions and development of this package so far. It feels like it is on a good road. I was just wondering whether you and @mpadge feel that an acceptable version of the package has been reached, so we can complete the review process, with further discussed development noted for future versions?

cboettig commented 6 years ago

@annakrystalli I think I still need to tick off the above items, at least the easy ones, and would like to run them by @mpadge one more time. Shouldn't be too much work, I've just been 🌊 by the start of the semester and need to get 🏊 again first!

annakrystalli commented 6 years ago

OK, not to worry. I'll keep an eye on the thread for progress. 💪 🏊‍♂️

cboettig commented 6 years ago

@mpadge Okay, think I've hit all of these. A few strategic questions remain before we close this out.

First, I now have just about all functions call the internal method, pb_info(), which can get all assets associated with all tags in a single API call. Still, I haven't done any caching on this, because I cannot figure out a robust strategy for that. The only time I think caching this would be safe is within a vectorized version of the functions, and maybe I should handle that case explicitly? Otherwise, one can never be sure if a user, for instance, creates or deletes a release using the GitHub web interface in between two consecutive function calls in an interactive session. Thoughts on this?

Second, I'm warming more and more to the idea of just dropping the whole pb_push, pb_pull functions (while maybe keeping pb_track()?) e.g. this would eliminate the use of a manifest.json and hashes to know if a file had changed -- relying totally on timestamps instead. Yes / no?

@coatless , any thoughts on these?

coatless commented 6 years ago

@cboettig a few. I need to sit down and type them up. Unfortunately, I've been slammed with the semester start and meetings. I'm going to try to carve out time on Friday.

coatless commented 6 years ago

In response to the latest points raised...


Caching

Regarding caches, we must mention them in the frame of:

There are only two hard things in Computer Science: cache invalidation and naming things.

-- Phil Karlton

Having said this, I'm not sure caching needs to be attempted given the current rate limits. If the user is authenticated, which is a key requirement of many of piggyback's functions, the API can be queried 5000 times per hour. Users who do not supply a GH_TOKEN have 60 queries per hour (c.f. Rate Limiting Overview, Rate Limit Rules).
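
For reference, the remaining quota can be inspected directly via the REST API (a one-line gh sketch):

# Remaining requests in the current rate-limit window for the active token
gh::gh("GET /rate_limit")$rate$remaining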

If you feel you must do some level of caching, the two heavy users of pb_info() are gh_download_asset() used in pb_download() and pb_upload_file() called by pb_upload(). The easiest way to cache results would be with memoise as @mpadge initially indicated. I say this because an option does exist to supply a timeout (c.f. timeout()).

As an example consider:

# Create a memoized pb_info function that expires after 10 seconds
# Note: The presently exposed pb_info is renamed to local_pb_info in this case
library(memoise)
pb_info <- memoise(local_pb_info, ~timeout(10))

# Grab repo information
pb_info("cboettig/piggyback")

Though, to avoid another package dependency, the cache setup could be done by maintaining your own closure with a timeout procedure. Slight code sketch:

# Makeshift caching via a closure
repo_cache = function(timeout = getOption("pb.cache_timeout", 120)) {

  # Create internal tracking variables
  start_time <- repo_data <- repo <- tag <- .token <- NULL

  # Initialize Clock
  start = function() { start_time <<- Sys.time() }

  # Split Time
  time_point <- function() {
    # Difference between NOW and start in seconds
    difftime(Sys.time(), start_time, units = "secs")
  }

  # Store the repo data
  cache_data <- function() {
    repo_data <<- piggyback:::pb_info(repo = repo, tag = tag, .token = .token)
    start()

    repo_data
  }

  # Create initial cache (public-facing)
  set_cache_info <- function(repo = guess_repo(), tag = NULL, .token = piggyback:::get_token()) {
    repo <<- repo; tag <<- tag; .token <<- .token

    cache_data()
  }

  # Compare to time stamp
  # If bad, then flush cache
  check_cache_state <- function() {
    if( time_point() > timeout ){
       cache_data()
    }
  }

  # Retrieve value and reset if need be
  get_cache <- function() {
    check_cache_state()
    repo_data
  }

  # Return functions
  list(set_cache_info = set_cache_info, get_cache = get_cache )
}

# Allocate the closure
pb_cache = repo_cache()

# Setup repo caching
pb_cache$set_cache_info("cboettig/piggyback")
pb_cache$get_cache()

Misc note: Glancing over memoise, it provides an impressive cloud backend for storing information on S3 + GCP that could potentially be harnessed; however, I feel that is out of the initial scope for piggyback.


Narrowing the API

I'm largely in agreement with sunsetting both pb_push() and pb_pull(). These functions heavily rely on an underlying mental model of git. At that point, I think the user would likely just commit into git already, since the barrier to entry is significantly higher due to the need for a manifest.json file.


Package style

Style-wise, styler::style_pkg() found a few issues. I've spun up a PR for you to include the changes in cboettig/piggyback#14.

Examples

Part of the piggyback repository is serving as a "checkpoint" for data (c.f. the data tag). I wonder if such an example would be better suited to a secondary repository. The secondary repository could then be linked from a code example in the vignette and/or README.md as an example of what is possible.


I have a few more remarks, but the night is no longer young. I'll aim to finish round 2 later in the week.

cboettig commented 6 years ago

Thanks @coatless for the detailed review.

Let me know your remaining remarks, looking forward to it!

cboettig commented 6 years ago

Okay, caching turned out to be a bit more of an adventure than I expected with memoise.

A 1-second cache on pb_info breaks a good number of tests and vignettes. Many of these issues were fixed as soon as I made sure functions that update the GitHub release (pb_upload, pb_release) would also break the cache automatically.

I also wanted a way to turn off caching; it turns out memoise::timeout(0) doesn't do what you expect, so I hacked around that. Now the cache duration can be set as an env var, piggyback_cache_duration, and setting it to 0 will turn caching off.
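
A hedged sketch of how that env-var gating might look (an assumption of the shape, not the package's actual code; piggyback_cache_duration is the variable named above):

cache_secs <- as.numeric(Sys.getenv("piggyback_cache_duration", unset = "600"))
if (cache_secs > 0) {
  # memoise the metadata lookup; cache entries expire after cache_secs seconds
  pb_info <- memoise::memoise(pb_info, ~memoise::timeout(cache_secs))
}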

I've also sunset pb_push and pb_pull. I've updated the docs to reflect this, also making the README briefer and creating two separate vignettes: one with the longer intro to the functions that was largely in the README, and a separate vignette on alternative approaches to piggyback.

Remaining questions

One possibly significant issue I worry about for CRAN onboarding is the "write to temporary files" complaint. This is particularly likely to bite on pb_download(), and I'm not sure how best to address it without losing the convenient default behavior based on the git repo root. Maybe since it defaults to usethis::proj_get() and not "." they will be okay with it, or won't notice? (Arguably, by setting an RStudio Project the user has declared a location? proj_get() will throw an error if the user is not in a project, so my argument is that this is not an actual default location, unlike here(), which doesn't error.) Thoughts on this?

mpadge commented 6 years ago

Thanks @cboettig for doing the hard work for us all on memoise::timeout() - very useful to know! And as you know, I think the retirement of pb_push/pull is a strong boost for ease and consistency of use - thanks.

As for file writing, I don't have any great insights, and would love to be assured that what you do is permissible. I note only that:

  1. I just scoured the code for usethis, and it seems they exclusively use file_temp() for writing with default tmp_dir = tempdir() at all times (yes, even when testing proj_get(), as in this example); and
  2. CRAN have very strong auto-checks for any attempts to write to non-standard locations which have caught me out previously, so current usage will definitely be noticed. Have you tried testing on all rhub platforms to see whether any of those flag anything?

An after-thought: How about having the default location be usethis::proj_get(), but enabling manual override so all tests could then use tempdir()?

cboettig commented 6 years ago

@mpadge Thanks for your thoughts on this. Yeah, pb_download already supports specifying dest = tempdir() while defaulting to dest = usethis::proj_get().
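
For example (cboettig/piggyback-tests is the test repository mentioned in the next paragraph):

# Default destination, per the discussion above: the project root from usethis::proj_get()
pb_download("iris.tsv.gz", repo = "cboettig/piggyback-tests")

# Manual override for tests, examples, and vignettes: write to the session tempdir
pb_download("iris.tsv.gz", dest = tempdir(), repo = "cboettig/piggyback-tests")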

I've tried to consistently make sure tests, examples, and vignettes use tempdir in the calls (well, at least tests; I probably need another pass over this to add it to examples and vignettes). I don't really like this because it makes the examples look more cumbersome than they really are, but that is somewhat unavoidable I guess. (It's not just tempdir(); following @coatless's very excellent point that I should use a different repo for testing, all functions now include a manual declaration of repo = "cboettig/piggyback-tests" as well.)

My question is, though: do you think that's good enough? I'm worried that having any default location at all is technically a violation of policy (e.g. in the base function download.file(), destfile has no default).

(Though I still feel this is not technically a default location, since proj_get() throws an error unless the user has set a project).

cboettig commented 6 years ago

@annakrystalli @mpadge @coatless

Okay, I think I basically have things where I want them now in response to all the super excellent feedback. You have all made this a way more usable, useful, and robust package. I'm sure it's not perfect, but I think we've hammered out a nice API and functionality to get people started.

I think we have the :+1: from Mark to onboard, but still waiting final confirmation from James, yes?

@annakrystalli Remind me of any additional steps I need to do for the JOSS part? I just go over there and open an issue pointing back here? Mark/Jim, are you :+1: with the JOSS submission?

coatless commented 6 years ago

👍 to onboarding.

annakrystalli commented 6 years ago

Approved! So happy to see so much great work done on this package through review. Many thanks @cboettig for submitting and @coatless & @mpadge for your reviews! 😺

Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them "rev"-type contributors in the Authors@R field (with their consent). More info on this here.

To-dos:

As for more details on the JOSS process, once you've got your DOI from Zenodo, use it to submit to JOSS here http://joss.theoj.org/papers/new.

I believe this triggers a [PRE-REVIEW] issue. I'll flag it as accepted to ropensci once the issue is created. 👍

Welcome aboard! We'd also love a blog post about your package, either a short-form intro to it (https://ropensci.org/tech-notes/) or long-form post with more narrative about its development. (https://ropensci.org/blog/). If you are interested, @stefaniebutland will be in touch about content and timing.

We've started putting together a gitbook with our best practices and tips; this chapter starts the third section, which is about guidance for after onboarding. Please tell us what could be improved; the corresponding repo is here.

mpadge commented 6 years ago

Thanks so much for the outstanding improvements @cboettig - I'm really looking forward to more tightly integrating this package within my workflow, and to contributing further issues to continue the constructive conversation. Congratulations!

stefaniebutland commented 6 years ago

@cboettig please do let me know if you'd like to do a post or tech note on this or arkdb!

cboettig commented 6 years ago

@stefaniebutland Thanks, yes I'd love to do a blog post on piggyback, maybe combine a bit of arkdb. I was thinking I might focus on why I wrote these two little packages, more than the technical details (each package basically has only two functions, essentially data_in and data_out), so they are conceptually simple anyway.

stefaniebutland commented 6 years ago

why I wrote these two little packages

yes!!

right now Tuesday 2018-10-16 or subsequent Tuesdays are available. Take your pick and then please submit via pull request a week prior to publication date. Some guidelines are here: https://github.com/ropensci/roweb2#contributing-a-blog-post