Closed peterhurford closed 7 years ago
Shouldn't ref just supercede version entirely? Is version particularly meaningful in the face of a ref?
@drknexus Version is useful for explicit declaration, especially if the ref does not identify a version. Think Gemfile.lock. In a future version we can auto-compute this to do full dependency resolution.
I'm sorry, I don't quite understand (I'm reading up on bundler, but I'm not familiar with it/Ruby; please forgive my lack of understanding) How could a valid ref not specify a version? DESCRIPTION requires a version field... if DESCRIPTION isn't valid then the target isn't a valid R package, is it? Thanks for your time and the pretty great package.
@robertzk - I would agree with @drknexus that a ref would always be more specific than a version and could therefore override / replace version.
Maybe some of the disconnect is that the examples for this package tend to point towards ref as a specific tag (usually a version specifier)? However, the code (at present) seems to be entirely happy with 'ref' pointing towards an arbitrary git commit hash (provided you know the Version of the package at that hash).
versions should ideally be immutable. and there's always lockbox::emptybox()
version is how we track whether a package is already cached or not
Perhaps we could shift for github packages and cache by ref rather than version? devtools::install_github uses that approach.
I started hacking a bit on caching by ref rather than version (https://github.com/robertzk/lockbox/compare/master...zapier:prod); only applicable when the ref is explicitly specified. Things went pretty well, then I hit a snag that I want to collect some thoughts/opinions on.
If everyone behaves and there is only one copy of a given package at a given version, then dependency resolution is tractable. The issue is that people aren't great at bumping version numbers - so we could very well have two refs of a package cached (from different lockfiles/versions of the same lockfile) at the same version. When we are resolving dependencies, which of those two refs should be loaded to fulfil the dependency?
As a solution, I'm leaning towards...
Thoughts, comments, concerns I haven't considered?
I think I've begun to come around to agreeing with @kirillseva on this that packages are versioned for a reason and if other people are not versioning correctly, that isn't really our fault. We should support versions, tags, and branches, but I think supporting refs correctly would just be too much gymnastics.
I guess I'm not quite there yet. I do agree that gymnastics would be required and that other people are not versioning correctly. On the other hand, that problem is so pervasive that it seems almost as if the caching or the github features of the package are rendered worthless.
Are the Ruby and Python equivs of this package saddled with the same kind of 'issue'? My sense is that Python's virtualenv doesn't cache, each virtualenv is built from a given requirements.txt on a per invocation basis. I don't have any experience with the Ruby version of this.
Would a preferable solution be to try to enable autoinstall as an option for github packages? I know I'd hate to have that set to TRUE for dplyr.
@russellpierce Would you be able to elaborate a bit more on the use case we're looking to solve with refs? Do you have examples of packages that update significantly enough without changing their package version? The more we understand the use case, the better we can tailor the solution.
I guess it depends on your threshold for 'significantly enough'. Folks don't always know the consequences of their actions when they change code. Certainly with many packages not every minor code change results in a change of version number (e.g. https://github.com/tidyverse/dplyr/commit/aeedbab5d503fa24b6e41c1d28d77161bce5cd24 ; just an example of a code change without a verison # change).
Within a dev version some things may be bugged or resolved depending on the ref I point to. In particular I remember something like that cropping up with magrittr - the github version # matched CRAN, but had a fix I needed.
In my case I'm particularly interested in dplyr. It takes a long time to compile, so it is quite nice to cache. I could point to my own version of dplyr and increment build numbers as they happen. That seems pretty unwieldy though. Moreover, if I fail to update version in the config when I update the ref, lockbox will fail to use the correct ref (that particular issue with lockbox seems intractable under the current 'version only' caching scheme).
The logic I described above perhaps sounded difficult. However, it seems like the implementation might not be too bad because the logical sequence of dependency resolution lockbox is already using does almost exactly what I said. I made a minor change to prevent the ref dir from choking the process and everything seems to be behaving as desired.
@peterhurford If these are all the gymnastics that are really required, would you be okay with this (https://github.com/robertzk/lockbox/compare/master...zapier:prod) as a PR? @robertzk any take?
@russellpierce @peterhurford Thank you for the thoughtful discussion so far, and apologies about the delay.
I rescind my earlier comment about versions supplanting refs. I agree that refs are by definition more specific and that the current behavior of ignoring changes to refs that are not accompanied by a version update requires a solution.
I like your approach so far but I am hesitant to support the behavior for two reasons.
That being said, these are relatively minor concerns. I like your solution of using the ref when available and otherwise falling back to the version. What do you think of the following approach that balances against the above concerns?
lockbox_directory/.registry
, containing metadata about each package, specifically the ref used if available. For now a serialized list will do, e.g., list(packages = list(dplyr = list("3.1.0" = list(ref = "abcdef"), "3.2.0" = list())))
. This file can be read and written to during the lockbox()
call.This retains uniformity of the schema (using package_name/version in the lockbox directory) and prevents proliferation of unnecessary refs while addressing your primary concern. Thoughts?
@robertzk Thanks for taking a look at this.
I think your point re: being version control system agnostic makes a lot of sense. I wonder if it is something we can handle when the problem comes up. As of right now, if a package is transitioned from git to mercurial don't we have a bunch of work to do anyway because we don't have an install from mercurial option? Seems like when we make those additions we could recommend an emptybox() and then apply whatever new paradigm we come up with. Does seem like a good argument for not just using the ref but something in front of it like 'git-' so that it is obvious where the package came from.
I see what you mean about unnecessary disk space. I think I take a bit of a prodigal son view of things there... disk space is so cheap these days. If things get crusty, folks can emptybox() to recover space from the really old refs. My view here is probably heavily skewed by not keeping a development environment for more than a week at a time and being more focused on deploying the code on fresh boxes. That probably isn't a suitable perspective. 🤔
One of my hopes is to avoid building up unexpected state as a result of separate calls to lockbox. Consider a first call to lockbox where I specify ref
The .registry pattern seems good in theory. In practice I wonder if it would put us back on a collision course with (nearly) simultaneous runs of lockbox. How do we maintain a sane registry in the face of multiple copies of lockbox running at the same time?
@russellpierce Great points. I was also worried about contention to the registry. I'll close this issue for now and merge your PR but at least we have something to refer back to if anyone inquires about those concerns.
Similar issue to https://github.com/robertzk/lockbox/issues/97 but potentially more serious.