If you update the ref but not the package version, the package will not update

peterhurford commented 8 years ago

Similar issue to https://github.com/robertzk/lockbox/issues/97 but potentially more serious.

russellpierce commented 7 years ago

Shouldn't ref just supercede version entirely? Is version particularly meaningful in the face of a ref?

robertzk commented 7 years ago

@drknexus Version is useful for explicit declaration, especially if the ref does not identify a version. Think Gemfile.lock. In a future version we can auto-compute this to do full dependency resolution.

russellpierce commented 7 years ago

I'm sorry, I don't quite understand (I'm reading up on bundler, but I'm not familiar with it/Ruby; please forgive my lack of understanding) How could a valid ref not specify a version? DESCRIPTION requires a version field... if DESCRIPTION isn't valid then the target isn't a valid R package, is it? Thanks for your time and the pretty great package.

peterhurford commented 7 years ago

@robertzk - I would agree with @drknexus that a ref would always be more specific than a version and could therefore override / replace version.

russellpierce commented 7 years ago

Maybe some of the disconnect is that the examples for this package tend to point towards ref as a specific tag (usually a version specifier)? However, the code (at present) seems to be entirely happy with 'ref' pointing towards an arbitrary git commit hash (provided you know the Version of the package at that hash).

kirillseva commented 7 years ago

versions should ideally be immutable. and there's always lockbox::emptybox()

kirillseva commented 7 years ago

version is how we track whether a package is already cached or not

russellpierce commented 7 years ago

Perhaps we could shift for github packages and cache by ref rather than version? devtools::install_github uses that approach.

russellpierce commented 7 years ago

I started hacking a bit on caching by ref rather than version (https://github.com/robertzk/lockbox/compare/master...zapier:prod); only applicable when the ref is explicitly specified. Things went pretty well, then I hit a snag that I want to collect some thoughts/opinions on.

If everyone behaves and there is only one copy of a given package at a given version, then dependency resolution is tractable. The issue is that people aren't great at bumping version numbers - so we could very well have two refs of a package cached (from different lockfiles/versions of the same lockfile) at the same version. When we are resolving dependencies, which of those two refs should be loaded to fulfil the dependency?

As a solution, I'm leaning towards...

If that package is listed in the lockfile with a specific ref, then I use the copy specified by the lockfile.
If that package is specified in the lockfile without a ref (either CRAN or github w/o ref), then I won't look at the ref copies when considering dependency resolution (this matches how I treat github w/o ref for caching, it just saves by version number, so no resolution for https://github.com/robertzk/lockbox/issues/97 here)
If there is no line specifying that package by ref in the lockfile, then I won't look at the ref copies when considering dependency resolution.

Thoughts, comments, concerns I haven't considered?

peterhurford commented 7 years ago

I think I've begun to come around to agreeing with @kirillseva on this that packages are versioned for a reason and if other people are not versioning correctly, that isn't really our fault. We should support versions, tags, and branches, but I think supporting refs correctly would just be too much gymnastics.

russellpierce commented 7 years ago

I guess I'm not quite there yet. I do agree that gymnastics would be required and that other people are not versioning correctly. On the other hand, that problem is so pervasive that it seems almost as if the caching or the github features of the package are rendered worthless.

Are the Ruby and Python equivs of this package saddled with the same kind of 'issue'? My sense is that Python's virtualenv doesn't cache, each virtualenv is built from a given requirements.txt on a per invocation basis. I don't have any experience with the Ruby version of this.

russellpierce commented 7 years ago

Would a preferable solution be to try to enable autoinstall as an option for github packages? I know I'd hate to have that set to TRUE for dplyr.

peterhurford commented 7 years ago

@russellpierce Would you be able to elaborate a bit more on the use case we're looking to solve with refs? Do you have examples of packages that update significantly enough without changing their package version? The more we understand the use case, the better we can tailor the solution.

russellpierce commented 7 years ago

I guess it depends on your threshold for 'significantly enough'. Folks don't always know the consequences of their actions when they change code. Certainly with many packages not every minor code change results in a change of version number (e.g. https://github.com/tidyverse/dplyr/commit/aeedbab5d503fa24b6e41c1d28d77161bce5cd24 ; just an example of a code change without a verison # change).

Within a dev version some things may be bugged or resolved depending on the ref I point to. In particular I remember something like that cropping up with magrittr - the github version # matched CRAN, but had a fix I needed.

In my case I'm particularly interested in dplyr. It takes a long time to compile, so it is quite nice to cache. I could point to my own version of dplyr and increment build numbers as they happen. That seems pretty unwieldy though. Moreover, if I fail to update version in the config when I update the ref, lockbox will fail to use the correct ref (that particular issue with lockbox seems intractable under the current 'version only' caching scheme).

russellpierce commented 7 years ago

The logic I described above perhaps sounded difficult. However, it seems like the implementation might not be too bad because the logical sequence of dependency resolution lockbox is already using does almost exactly what I said. I made a minor change to prevent the ref dir from choking the process and everything seems to be behaving as desired.

@peterhurford If these are all the gymnastics that are really required, would you be okay with this (https://github.com/robertzk/lockbox/compare/master...zapier:prod) as a PR? @robertzk any take?

robertzk commented 7 years ago

@russellpierce @peterhurford Thank you for the thoughtful discussion so far, and apologies about the delay.

I rescind my earlier comment about versions supplanting refs. I agree that refs are by definition more specific and that the current behavior of ignoring changes to refs that are not accompanied by a version update requires a solution.

I like your approach so far but I am hesitant to support the behavior for two reasons.

I'm not aware of a dependency management system that stores versions of libraries by version control refs. For example, bundler stores packages as "-" in its gems directory. If a package is translated from git to mercurial, it should be immaterial to how it is treated by a dependency manager. This is more of a philosophical point but I hesitate to tie us to git or github on an implementation level.
Versioning takes a little bit of effort to ensure you publish a material change. I worry that this encourages updating on every ref. Each copy of dplyr is 3.8MB and I have some concern here about unnecessary disk space use if we encourage very frequent updates that bypass versioning.

That being said, these are relatively minor concerns. I like your solution of using the ref when available and otherwise falling back to the version. What do you think of the following approach that balances against the above concerns?

Place a file in the lockbox directory, say lockbox_directory/.registry, containing metadata about each package, specifically the ref used if available. For now a serialized list will do, e.g., list(packages = list(dplyr = list("3.1.0" = list(ref = "abcdef"), "3.2.0" = list()))). This file can be read and written to during the lockbox() call.
If any ref conflicts with the ref found in the metadata file for that given version, erase the version and download it as usual. Update the metadata file with the new ref.

This retains uniformity of the schema (using package_name/version in the lockbox directory) and prevents proliferation of unnecessary refs while addressing your primary concern. Thoughts?

russellpierce commented 7 years ago

@robertzk Thanks for taking a look at this.

I think your point re: being version control system agnostic makes a lot of sense. I wonder if it is something we can handle when the problem comes up. As of right now, if a package is transitioned from git to mercurial don't we have a bunch of work to do anyway because we don't have an install from mercurial option? Seems like when we make those additions we could recommend an emptybox() and then apply whatever new paradigm we come up with. Does seem like a good argument for not just using the ref but something in front of it like 'git-' so that it is obvious where the package came from.

I see what you mean about unnecessary disk space. I think I take a bit of a prodigal son view of things there... disk space is so cheap these days. If things get crusty, folks can emptybox() to recover space from the really old refs. My view here is probably heavily skewed by not keeping a development environment for more than a week at a time and being more focused on deploying the code on fresh boxes. That probably isn't a suitable perspective. 🤔

One of my hopes is to avoid building up unexpected state as a result of separate calls to lockbox. Consider a first call to lockbox where I specify ref for package version 1.0.0. Then consider a second call to lockbox where I specify package version 1.0.0 but no ref (or don't mention it at all but it is loaded implicitly as the dependency of another package). If I wanted I'd have asked for blah. I think what I'd hope for in a stateless system that I get version 1.0.0 just as it would have been if I'd not made the first call (e.g. the CRAN version). Do you agree?

The .registry pattern seems good in theory. In practice I wonder if it would put us back on a collision course with (nearly) simultaneous runs of lockbox. How do we maintain a sane registry in the face of multiple copies of lockbox running at the same time?

robertzk commented 7 years ago

@russellpierce Great points. I was also worried about contention to the registry. I'll close this issue for now and merge your PR but at least we have something to refer back to if anyone inquires about those concerns.

robertzk / lockbox

If you update the ref but not the package version, the package will not update #99