ropensci / drake

An R-focused pipeline toolkit for reproducibility and high-performance computing
https://docs.ropensci.org/drake
GNU General Public License v3.0

External packages as dependencies #6

Closed wlandau-lilly closed 6 years ago

wlandau-lilly commented 7 years ago

Sometimes I write custom one-off packages and develop them alongside the drake workflows I am working on. So maybe the code analysis should walk deeper into functions from packages. The current behavior is to walk all the way through the functions in the environment (to discover and track any functions nested in user-defined functions) but stop at functions from packages (including base).

AlexAxthelm commented 6 years ago

Would it make sense to have a "package import" list as part of the import step? That is, a single target for all of a package's internal functions, in addition to the functions actually called in plan$command?

wlandau-lilly commented 6 years ago

I tried the latter at one point, but I decided to keep imports totally separate since v2.0.0. I was thinking of maybe

  1. a recursive version of import_dependencies() that dives into non-base functions, and/or
  2. appending all non-base package functions to the candidate imports and letting build_graph() figure out the rest.

I am on the fence about whether to track base, and to be honest, whether to dive into packages at all.

AlexAxthelm commented 6 years ago

It might be simplest, code-wise, to have the package itself be the import target. This would err on the side of rebuilding too much, though, since an update to a package might not touch the few functions that I care about.

wlandau-lilly commented 6 years ago

That does seem like the most practical option because it avoids the problem of having a massive import list. We could

  1. Look at all the imports.
  2. Look at where each imported object/function is defined (e.g., environment(knit)).
  3. If the import is from a package, list that whole package as a dependency.

import_dependencies() could take care of this (making sure to check both imports and the dependencies of those imports), but other things need to be adjusted, including graphing and build.R. All package targets should probably be named something like package:xyz, and we will probably need an import_package() function in build.R.
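
For concreteness, here is a minimal sketch of steps 2 and 3 (package_dependency() is a hypothetical helper for illustration, not actual drake code):

# Map an imported function to the package that defines it.
# Returns character(0) for user-defined functions so they keep
# being tracked individually.
package_dependency <- function(fun) {
  if (is.primitive(fun)) {
    return("package:base") # primitives carry no environment
  }
  env_name <- environmentName(environment(fun))
  if (env_name %in% loadedNamespaces()) {
    paste0("package:", env_name)
  } else {
    character(0)
  }
}

package_dependency(lm)  # "package:stats"
package_dependency(sum) # "package:base"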

Concerns:

  1. Back-compatibility will be affected. Old workflows would need to rerun from scratch. (Would we need to increase the major version to 5? I would rather not have to do that.)
  2. Hashing packages could be time-consuming. Not all packages respect the 5 MB limit. (This is minor.)

AlexAxthelm commented 6 years ago

The cheap way to get around [2] would be to track the version number (for CRAN) or the commit hash (GitHub). That would offload the hashing work to the maintainers rather than the users. At that point, the main danger would be a user happily hacking away at their local version of a package and not making a commit before make().

I feel this is a situation which, at least in the short term, could be dealt with using a section in the caution vignette.

wlandau-lilly commented 6 years ago

That makes sense, and I believe this is what packrat does. For now, we could just assume local packages do not have a CRAN/Bioconductor presence or a GitHub commit hash, and as you said, put the obvious pitfalls in the caution vignette.

I should say that I am holding off development on this and other issues to see if my company's open source procedure changes soon. We have a solid draft of the new procedure, and it is en route to final approval. I will meet this afternoon with a gatekeeper from "Quality", who will hopefully give me a better idea of when I can just push to drake without bureaucratic delays.

AlexAxthelm commented 6 years ago

That is exciting! I've had a few other projects pop up lately, but I'm hoping to put away a few files on #40 today, since that seems like the most reasonable step at this point. I think that #77 / #79 is complex enough that having clear direction on push/pull requirements will help a lot.

wlandau-lilly commented 6 years ago

It is exciting! The gatekeeper I met today was extremely supportive and willing to work fast. There is barely any editing left. On the other hand, many people have to see it and sign off, so I do not know exactly what to expect.

I look forward to seeing what you have done for #40. I think that will also help to free up more development without fear of merge conflicts.

wlandau-lilly commented 6 years ago

Regarding the original issue, tracking overall packages strikes me as a straightforward way to reproducibly track compiled C/C++/Fortran code (for use in .Call(), etc.). I do not know how else we could do this easily.

wlandau-lilly commented 6 years ago

I thought more about this, and I am leaning more and more toward tracking the versions/hashes of CRAN/Bioconductor/GitHub/Bitbucket packages. However, we should suppress this feature for versions 4.1.0 and prior. Otherwise, a drake update will trigger rebuilds in everyone's workflows. We're already in major version 4, and I need to be better about respecting back-compatibility.

For custom local packages, things get a little trickier. I think we should stick to a simple, straightforward fix that is unlikely to surprise users. Maybe that means we do not track them at all, which would make sense because local packages are an edge case anyway. The user could still recursively walk through package functions by appending them to the execution environment for make().

# Append the custom packages' functions to make()'s execution
# environment so drake's code analysis walks through them.
envir <- as.environment(
  c(
    as.list(envir),
    as.list(getNamespace("custom_package_1")),
    as.list(getNamespace("custom_package_2"))
  )
)

This hack belongs in the caution vignette anyway.

Also, we will need to update tracked() and deps() to include all the tracked packages.

AlexAxthelm commented 6 years ago

I wonder if it might be worth it to add a drake_version field to config, so that we can make drake aware of situations like this?

More relevant, though: it might be best to wrap this up in plan() rather than make(), so that previously created plans run without issue, but any new plans generated going forward would have packages included. That would also make the expected package version explicit in the plan, rather than hiding it in environment data.

   target                              command
1   dplyr   packageVersion('dplyr') == '0.7.2'
2 ggplot2 packageVersion('ggplot2') == '2.2.1'

I'm still trying to figure out how to make the import have a dependency on the package version though. It would make sense to only re-run objects that have been affected by updates.
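
Something along these lines might work (plan_with_packages() is a hypothetical wrapper, not a real drake function; a workflow plan is just a data frame with target and command columns):

plan_with_packages <- function(plan, packages) {
  versions <- vapply(
    packages,
    function(pkg) as.character(packageVersion(pkg)),
    character(1)
  )
  package_targets <- data.frame(
    target = packages,
    command = sprintf("packageVersion('%s') == '%s'", packages, versions),
    stringsAsFactors = FALSE
  )
  rbind(package_targets, plan)
}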

AlexAxthelm commented 6 years ago

Not unrelated: Should we also track R_version in the same way?

wlandau-lilly commented 6 years ago

Thankfully, session() already covers the versions of drake and R.

load_basic_example()
make(my_plan, verbose = FALSE)
s <- session()
s$otherPkgs$drake$Version
## "4.1.1.9000"
s$R.version$major
## "3"
s$R.version$minor
## "4.0"

It is a good thought to consider package dependencies via plan(). However, I foresee a few issues.

  1. Currently, imports cannot depend on targets in the workflow plan.
  2. Users might create workflow plan data frames with functions other than plan().
  3. Not all package dependencies can be detected this way unless all the imports are also listed as targets in the plan.

I do like your thinking on this, though. Ultimately, the fix should look this clean and elegant.

wlandau-lilly commented 6 years ago

Here are some of the changes I am thinking about.

deps()

Current behavior:

head(deps(lm))
## [1] ".getXlevels" "as.vector"   "attr"        "c"           "eval"       
## [6] "gettextf"  

This is already wrong because drake does not do code analysis on lm(). Desired behavior:

deps(lm)
## "package:stats"
deps(knitr::knit)
## "package:knitr"

No back compatibility concerns here.

New function is_package()
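
A minimal sketch, assuming the package:xyz naming convention for targets proposed above:

is_package <- function(targets) {
  grepl("^package:", targets)
}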

build_graph()

build.R

store_package <- function(target, value) {
  config$cache$set(
    key = target,
    value = list(type = "package", value = value)
  )
}

plot_graph(), etc.

Testing: test-packages.R

It would be impractical to install packages inside a unit test. I think a reasonable workaround is to

Testing for back compatibility

I am not sure if we can, or even should, build this into a unit test. However, it is worth building a project that uses functions from packages under drake 4.1.0, then checking that everything is still up to date under later versions of drake.

Documentation

The caution vignette should have a note about how packages are tracked and the dangers of custom local packages installed from source.

By the way, for packages installed from source, I think version numbers should be fine. Such packages could be personal projects by the user, or they could just be uncooperative external non-CRAN packages that need to be installed this way. Most installations from source are probably the latter.

For packages loaded with devtools::load_all(), I do not know what will happen. In fact, the caution vignette should advise against loading packages this way for make().

Lingering thoughts

Am I missing anything? I think this is simple enough to implement, and I think it is the right way to go. It did just occur to me that users who build custom workflows as packages will have an even harder time, but the change is still worth it.

AlexAxthelm commented 6 years ago

Did a little digging this morning, and we might want to make use of devtools:::local_sha() (not exported).

I don't have any Bioconductor or Bitbucket repos installed, but it has comparable performance for packages installed from CRAN or GitHub.

Mostly making a note here, before I start work on other stuff, but later this week, I hope to look at how to integrate this into function_dependencies().

> microbenchmark::microbenchmark(CRAN1 = devtools:::local_sha("ggplot2"), CRAN2 = devtools:::local_sha("drake"), GH1 = devtools:::local_sha("lintr"), GH2 = devtools:::local_sha("hrbrthemes"))
Unit: milliseconds
  expr   min     lq    mean median     uq   max neval
 CRAN1 2.662 2.7565 2.93691 2.8550 3.0195 3.563   100
 CRAN2 2.606 2.7025 2.87879 2.7925 2.9690 4.465   100
   GH1 2.562 2.6680 2.86591 2.7555 2.9815 3.673   100
   GH2 2.607 2.7170 3.00867 2.8620 3.0770 5.770   100
> list(CRAN1 = devtools:::local_sha("ggplot2"), CRAN2 = devtools:::local_sha("drake"), GH1 = devtools:::local_sha("lintr"), GH2 = devtools:::local_sha("hrbrthemes"))
$CRAN1
[1] "2.2.1"

$CRAN2
[1] "4.1.0"

$GH1
[1] "15fd8ee6685248b7f477c949c1e78dc13350cd15"

$GH2
[1] "1fd0301ce07f3e025f1a259008ea74f251f9d48b"

AlexAxthelm commented 6 years ago

Also a note: getNamespaceName(environment(dplyr::lag)) works where find("dplyr::lag") does not. find() requires the package to be attached beforehand and cannot resolve names qualified with ::. Also, find() returns every attached package containing a given name, whereas the getNamespaceName() approach shows the namespace that actually supplies the function.

> getNamespaceName(environment(dplyr::lag))
   name
"dplyr"
> getNamespaceName(environment(stats::lag))
   name
"stats"
> getNamespaceName(environment(base::lag)) #Does not exist
Error in get(name, envir = ns, inherits = FALSE) : object 'lag' not found
> getNamespaceName(environment(lag)) #Default Search Path
   name
"stats"
> getNamespaceName(environment(mutate)) # Not on Search Path Yet.
Error in environment(mutate) : object 'mutate' not found

> find("lag")
[1] "package:stats"
> find("mutate")
character(0)
> find("dplyr::lag")
character(0)
> find("dplyr::mutate")
character(0)
> suppressPackageStartupMessages(library("dplyr"))
> getNamespaceName(environment(dplyr::lag))
   name
"dplyr"
> getNamespaceName(environment(stats::lag))
   name
"stats"
> getNamespaceName(environment(lag)) #Default Search Path
   name
"dplyr"
> find("lag")
[1] "package:dplyr" "package:stats"
> find("mutate")
[1] "package:dplyr"
> find("dplyr::lag")
character(0)
> find("dplyr::mutate")
character(0)

wlandau-lilly commented 6 years ago

I just submitted an SO post in hopes of finding an already-exported version of devtools:::local_sha(). Good call on getNamespaceName() vs find(). We may want to wrap getNamespaceName() in a tryCatch() statement; I think it should return empty-handed instead of throwing an error.
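
Something like this sketch (safe_namespace_name() is a hypothetical name):

safe_namespace_name <- function(fun) {
  tryCatch(
    getNamespaceName(environment(fun)),
    error = function(e) character(0)
  )
}

safe_namespace_name(stats::lag)    # "stats"
safe_namespace_name(function(x) x) # character(0) instead of an error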

AlexAxthelm commented 6 years ago

Looks good. If there aren't any bites on that, the answers to this SO question are informative. The main options are:

  1. Copy the code itself directly into drake and provide attribution
  2. Use fun = getFromNamespace("fun", "pkg") to get around CRAN's restriction on the ::: operator

Neither would add anything to drake's dependencies that isn't actually there, but [2] would mean that a change to devtools:::local_sha() might change the local hashes that drake uses. I don't know how you feel about maintaining a drake copy of that function.
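
For reference, option 2 would look something like this (a sketch; since local_sha() is unexported, its return values could change under our feet):

# Bind devtools' unexported helper without writing ::: in package code.
local_sha <- utils::getFromNamespace("local_sha", "devtools")
local_sha("drake") # e.g. "4.1.0" for a CRAN install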

wlandau-lilly commented 6 years ago

I vote for (1) if it comes to that, but it will be difficult to extricate local_sha() from the rest of the devtools internals.

AlexAxthelm commented 6 years ago

Agreed. If we do end up going route 1, I wonder if it would be preferable to make it its own package, rather than a part of drake.

wlandau-lilly commented 6 years ago

From @gaborcsardi and @jimhester:

devtools::install_github("hadley/devtools")
packageDescription("devtools")$RemoteSha
## [1] "a0c8d7333be185f979a60823dcfad66c9f3ea35e"
packageDescription("devtools")$Version
## [1] "1.13.3.9000"

wlandau-lilly commented 6 years ago

One note: neither this solution nor devtools:::local_sha() accounts for packages installed with install.packages("package", type = "source", repos = NULL). We may want to compute a hash of those packages.

AlexAxthelm commented 6 years ago

Worth noting that packageDescription("devtools")$RemoteSha doesn't exist for packages installed from CRAN (or from local source), and that $Repository doesn't exist for ones built from local source.

A cheap alternative that would work for all packages, regardless of origin, would be to track packageDescription("dplyr")$Built instead of a hash or version. However, this would force a full re-run of the project for each new version of R (at least).
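
For illustration (output reconstructed from memory; exact values vary by platform), the Built field embeds the R version used at install time, which is why every R upgrade would invalidate everything:

packageDescription("dplyr")$Built
## "R 3.4.1; x86_64-w64-mingw32; 2017-06-23 03:01:38 UTC; windows"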

One way or the other, we need to put a note in the caution vignette about using devtools::load_all(), which bypasses the build/install process entirely.

wlandau-lilly commented 6 years ago

So many edge cases and caveats...

I think there is a need for a new, separate tool that computes the same hash for a package regardless of origin or installation method. Ideally, the hash should change only when the functionality changes. Just as drake itself ignores changes to whitespace and comments, a package hash should react only to meaningful changes in R/, src/, data/, and possibly Makefiles. (Or just focus on the parts of the package loaded into R rather than external files.) Version numbers are too lax, and GitHub SHA keys are too conservative.

Possible names for this hypothetical new tool: packhash, SHAckage, ...

AlexAxthelm commented 6 years ago

Related, have you seen available?

jimhester commented 6 years ago

FWIW, if you want to go this route, the way to do it would be to hash a list of all the objects in the namespace. You will need to strip srcrefs from the functions, but otherwise that should get you pretty robust signatures regardless of provenance.

AlexAxthelm commented 6 years ago

I like that idea, but digest::digest(ls("package:dplyr")) would only tell us when a function is added or removed. It wouldn't actually tell us when a function's internals change.

Unless I'm misunderstanding?

wlandau-lilly commented 6 years ago

@AlexAxthelm I think the idea is to hash the content:

library(dplyr)
env <- getNamespace("dplyr")
# Strip srcref attributes so the hash ignores source-file metadata.
objects <- lapply(ls(env), function(name){
  contents <- env[[name]]
  attr(contents, "srcref") <- NULL
  contents
})
package_fingerprint <- digest::digest(objects)

@jimhester is this more or less what you mean? It does seem to meet the need.

On the other hand, I just realized that a package needs to be loaded for getNamespace() to work. If :: is used to reference a dependency, this behavior may slow down drake or load packages in the wrong order. For example, the following would load plyr after dplyr, possibly against the intentions of the user.

library(dplyr)
my_plan <- plan(
  cars_per_cylinder = plyr::ddply(mtcars, "cyl", nrow),
  strings_in_dots = "literals"
)
my_plan
##              target                          command
## 1 cars_per_cylinder plyr::ddply(mtcars, "cyl", nrow)
plot_graph(my_plan) # loads plyr after dplyr

make(my_plan) # loads plyr after dplyr
## check 1 item
## package plyr
## check 3 items
## import mtcars
## import nrow
## import plyr::ddply
## check 1 item
## target cars_per_cylinder

Some possible workarounds:

  1. Find a version of getNamespace() that does not require the package to be loaded.
  2. Detach the extra packages after the dependencies are resolved but before any calls to build().
  3. Find a way to make use of the file system after all.

wlandau-lilly commented 6 years ago

Also: @AlexAxthelm, I had not heard of available, but I am glad I know now. It would have been really handy before I started this project. The name conflict with Factual's Drake is accidental and unfortunate.

wlandau-lilly commented 6 years ago

And I can't believe I forgot: loading packages with devtools::load_all() is likely to cause problems for drake anyway, particularly with the parLapply and Makefile parallel backends. That means all package dependencies should be properly installed. This deserves a mention in the caution vignette, and it makes solving this issue easier.

jimhester commented 6 years ago

The order in which namespaces are loaded doesn't really matter, FWIW; the only time ordering is an issue is when packages are attached. Also, if you want to loop over an environment, eapply() is the function you are looking for.

digest::digest(eapply(asNamespace("plyr"), function(x) { attr(x, "srcref") <- NULL; x }, all.names = TRUE))
#> [1] "d15575517a659959089d7ab5094fc63d"

There really isn't a way to get the objects in a namespace without loading the namespace. If you are really concerned about loading namespaces, do the fingerprinting in a separate R process; there are some namespaces that cannot be unloaded cleanly.
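
A sketch of the separate-process idea, assuming the callr package (an assumption; any subprocess mechanism would do):

# Hash a package's namespace in a fresh R session so the current
# session never has to load or unload that namespace.
fingerprint <- callr::r(
  function(pkg) {
    digest::digest(eapply(asNamespace(pkg), function(x) {
      attr(x, "srcref") <- NULL
      x
    }, all.names = TRUE))
  },
  args = list(pkg = "plyr")
)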

AlexAxthelm commented 6 years ago

Are we way overthinking this? If a package is built and installed (i.e., it can be loaded with library()), there is a nice rds file that we can hash (on the order of a few milliseconds), and it will be there regardless of the package's provenance. The hash shouldn't change until the user re-installs with install.packages() or one of the many devtools::install_* functions. As a bonus, this might give us a handle on load_all(): in that case, the file behind packageDescription() is the package's source DESCRIPTION file, so we could check for that and then hash the contents of the package's R/ directory.

> install.packages("secret")
Installing package into 'C:/Users/aaxthelm/R_LIBS'
(as 'lib' is unspecified)
--- Please select a CRAN mirror for use in this session ---
trying URL 'https://cloud.r-project.org/bin/windows/contrib/3.4/secret_1.0.0.zip'
Content type 'application/zip' length 322663 bytes (315 KB)
downloaded 315 KB

package 'secret' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
        C:\Users\aaxthelm\AppData\Local\Temp\RtmpyidDW1\downloaded_packages
> desc <- packageDescription("secret")
> names(desc)
 [1] "Package"          "Title"            "Version"          "Authors@R"
 [5] "Description"      "License"          "LazyData"         "URL"
 [9] "BugReports"       "RoxygenNote"      "Imports"          "Suggests"
[13] "Encoding"         "VignetteBuilder"  "NeedsCompilation" "Packaged"
[17] "Author"           "Maintainer"       "Repository"       "Date/Publication"
[21] "Built"
> file_path <- attr(desc, "file")
> file_path
[1] "C:/Users/aaxthelm/R_LIBS/secret/Meta/package.rds"
> microbenchmark::microbenchmark(
+ cran_hash <- digest::digest(file_path, file = TRUE))
Unit: milliseconds
                                                expr      min       lq     mean
 cran_hash <- digest::digest(file_path, file = TRUE) 1.195488 1.208411 1.669178
   median       uq      max neval
 1.237129 1.332308 38.56166   100
> cran_hash
[1] "83b11245a73940d244b59835ad38f41f"
> devtools::install_github("gaborcsardi/secret")
Downloading GitHub repo gaborcsardi/secret@master
from URL https://api.github.com/repos/gaborcsardi/secret/zipball/master
Installing secret
"C:/PROGRA~1/R/R-34~1.1/bin/x64/R" --no-site-file --no-environ --no-save  \
  --no-restore --quiet CMD INSTALL  \
  "C:/Users/aaxthelm/AppData/Local/Temp/RtmpyidDW1/devtools67cc34106644/gaborcsardi-secret-bf19024"  \
  --library="C:/Users/aaxthelm/R_LIBS" --install-tests

* installing *source* package 'secret' ...
** R
** inst
** tests
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
*** arch - i386
*** arch - x64
* DONE (secret)
> desc <- packageDescription("secret")
> names(desc)
 [1] "Package"         "Title"           "Version"         "Authors@R"
 [5] "Description"     "License"         "LazyData"        "URL"
 [9] "BugReports"      "RoxygenNote"     "Roxygen"         "Imports"
[13] "Suggests"        "Encoding"        "VignetteBuilder" "Author"
[17] "Maintainer"      "Built"           "RemoteType"      "RemoteHost"
[21] "RemoteRepo"      "RemoteUsername"  "RemoteRef"       "RemoteSha"
[25] "GithubRepo"      "GithubUsername"  "GithubRef"       "GithubSHA1"
> file_path <- attr(desc, "file")
> file_path
[1] "C:/Users/aaxthelm/R_LIBS/secret/Meta/package.rds"
> microbenchmark::microbenchmark(
+ gthb_hash <- digest::digest(file_path, file = TRUE))
Unit: milliseconds
                                                expr      min       lq   mean
 gthb_hash <- digest::digest(file_path, file = TRUE) 1.334975 1.792001 1.9355
   median      uq      max neval
 1.964719 2.10277 2.665027   100
> gthb_hash
[1] "850280861ca6e17455c0686b2582360c"
> identical(cran_hash, gthb_hash)
[1] FALSE

wlandau-lilly commented 6 years ago

@AlexAxthelm I think you are on to the right idea. However, I do have a preference for functionality rather than metadata. What about the rdb and rdx files?

get_from_rdb <- function(symbol, filebase, envir = parent.frame()){
  lazyLoad(filebase = filebase, envir = envir, filter = function(x) x == symbol)
}
get_from_rdb(symbol = "load_all", filebase = "~/.R/library/devtools/R/devtools")
## NULL
load_all
# body of devtools::load_all()

wlandau-lilly commented 6 years ago

In fact, I think we can get away with just the rdb file. I checked on a local package, and this file responds to changes in functionality but not formatting (probably because the functions are deparsed as well as serialized). For packages loaded with devtools::load_all(), we could use @jimhester's one-liner with eapply() and/or throw a warning.

We just need to keep in mind that we would no longer be using someone else's API, so this could suddenly break if the R core team decides to implement lazy loading differently. Maybe we want to hash the less sensitive components of packageDescription(), such as the version, if the rdb file cannot be found.
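
A minimal sketch of that fallback logic (package_fingerprint() is a hypothetical name):

package_fingerprint <- function(pkg) {
  rdb <- file.path(find.package(pkg), "R", paste0(pkg, ".rdb"))
  if (file.exists(rdb)) {
    digest::digest(rdb, file = TRUE)   # hash the lazy-load database
  } else {
    as.character(packageVersion(pkg))  # fall back to the version
  }
}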

jimhester commented 6 years ago

The rdb file is going to contain source references if the package is installed with --with-keep.source, if options(keep.source.pkgs = TRUE) is set, or if the system environment variable R_KEEP_PKG_SOURCE=yes. If the functions have source references, they will likely contain directory information, which can differ between installations.

If you don't need coherency across systems then this isn't a problem, but something you should keep in mind if you do.

wlandau-lilly commented 6 years ago

Thank you @jimhester, I would not have known otherwise. Coherency across systems is important here. It will take a little extra work to scrape the source reference data.

It might also be important to recursively hash the files in include/ to account for compiled code. I just noticed that packages like BH have include/ but not R/ in the installed file system.
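
That could be as simple as this sketch (hash_include_dir() is hypothetical):

# Hash all files under include/ (e.g. for BH) to fingerprint
# header-only compiled-code dependencies.
hash_include_dir <- function(pkg) {
  dir <- file.path(find.package(pkg), "include")
  files <- sort(list.files(dir, recursive = TRUE, full.names = TRUE))
  digest::digest(lapply(files, digest::digest, file = TRUE))
}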

I think this is nuanced enough to offload to a separate package. The functionality could generalize beyond the needs of drake and fingerprint multiple aspects of installed/loaded packages.

AlexAxthelm commented 6 years ago

Agreed. If you want, I can start work on this, but it will probably be some time next week before I have a beta ready.

wlandau-lilly commented 6 years ago

@AlexAxthelm If it sounds fun to you, absolutely! Whatever works for you. I am super grateful for the work you have already done, especially while my hands are still tied.

wlandau-lilly commented 6 years ago

The proposed unit test in #53 will suffice to check for back-compatibility here too.

wlandau-lilly commented 6 years ago

Maybe @trinker would know if there is a way to hash the functionality of a package without having to hack into the *.rdb file.

wlandau-lilly commented 6 years ago

I did some more work on this, and @jimhester's eapply() one-liner is turning out to be the most practical way of doing things. As much as I would like to hash compiled code, we only reliably have the shared library in the package's file system, and I think it would be unwise to hash that. And hashing a package's datasets would be time-consuming and usually unnecessary. Let's stick to just the functions. I no longer think we need a separate tool.

I also noticed that the base package throws a warning when I try to hash it, so for that one, I am hashing R.version instead.

At this point, I need to fix some difficult bugs, enforce back compatibility, and see how much useful testing I can add.

wlandau-lilly commented 6 years ago

Correction: I am using R.version$version.string, not R.version (and for all of the base packages, not just "base"). I do not want projects to rerun just because the machine changes. Even depending on the R version is a bit much.

Also, it may not be enough to remove the srcrefs. When I try installing the same package to two local libraries, I am getting different hashes (see the MWE below). For now, I have decided to deparse the functions instead, which is what drake does with the user's functions anyway. Unfortunately, deparsing is slower, so I would rather find some way to clean and serialize loaded code.

withr::with_dir(new = tempdir(), code = {

# Remove srcrefs from all the functions,
# then hash them all together.
hash_without_srcref <- function(pkg){
  digest::digest(
    eapply(
      asNamespace(pkg),
      function(x) {
        attr(x, "srcref") <- NULL
        x
      },
      all.names = TRUE
    )
  )
}

# Install a dummy package in a local library.
lib <- "local_lib"
dir.create(lib)
pkgenv <- new.env()
pkgenv$newfunction <- function(x){
  x + 1
}
withr::with_message_sink(
  new = tempfile(),
  code = {
    package.skeleton(name = "newpkg", environment = pkgenv)
  }
)
unlink(file.path("newpkg", "man"), recursive = TRUE)
install.packages("newpkg", type = "source", repos = NULL,
  lib = lib, quiet = TRUE)

# Install the same package in a different local library.
lib2 <- "local_lib2"
dir.create(lib2)
install.packages("newpkg", type = "source", repos = NULL,
  lib = lib2, quiet = TRUE)

# Compute the hash of newpkg in each local library.
withr::with_libpaths(
  new = c(lib, .libPaths()),
  code = {
    unloadNamespace("newpkg")
    hash1 <- hash_without_srcref("newpkg")
  }
)
withr::with_libpaths(
  new = c(lib2, .libPaths()),
  code = {
    unloadNamespace("newpkg")
    hash2 <- hash_without_srcref("newpkg")
  }
)

# I want them to be equal, but they are different.
testthat::expect_equal(hash1, hash2)
})

wlandau-lilly commented 6 years ago

Another concern about this whole issue: even if I deparse all the functions before hashing, some package hashes turn out different on different platforms. Example: knitr. On the issue6 branch of drake:

pkg_hash <- function(pkg){
  digest::digest(eapply(asNamespace(pkg),
    FUN = deparse, all.names = TRUE))
}

On Windows R-3.4.0:

packageVersion("knitr")
## [1] ‘1.17’
pkg_hash("knitr")
## [1] "70d9000607afbf3b1bf87651d73ce55d"

On Red Hat Linux R-3.4.0:

packageVersion("knitr")
## [1] ‘1.17’
pkg_hash("knitr")
## [1] "ae52c9275b69186123ef062f98d42b15"

In my opinion, these hashes overreact to differences in installed instances of packages. I think we will have to go with the most transparent thing: just depend on package versions. It's not ideal, but the users will know exactly what they are getting.
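
In spirit, the fingerprinting then becomes something like this sketch (version_fingerprint() is a hypothetical name; the base-package special case follows the earlier comments):

version_fingerprint <- function(pkg) {
  base_pkgs <- rownames(installed.packages(priority = "base"))
  if (pkg %in% base_pkgs) {
    R.version$version.string          # base packages track the R release
  } else {
    as.character(packageVersion(pkg)) # everything else tracks its version
  }
}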

wlandau-lilly commented 6 years ago

The current solution to this issue is sufficiently implemented, tested, and documented for the master branch. For anyone just joining the thread, you might have a look at the new addendum to the reproducibility section of the README, or the analogous mentions in the drake.Rmd vignette or quickstart vignette.

There are obvious flaws in exclusively watching the version number of a package, but the other solutions are not much better. It is time to prioritize something straightforward. That way, users know exactly what they are getting and thus have complete control. The extra documentation and the additional functionality in plot_graph() should help. Now, when you hover over a package node, it shows the version number for a regular package and the R version string for a base package, which is exactly where drake watches for changes.

AlexAxthelm commented 6 years ago

Got bogged down with other work, but I think I might make working towards this a big part of my Hacktober projects. My current line of thought is to create a new package that would be able to detect changes in other packages, probably by doing most of the heavy lifting once when the package loads, and then looking for incremental changes after that.

AlexAxthelm commented 6 years ago

A nifty thing that I found re: load_all() and customized functions in the workspace, though: direct comparison to the package function is faster than hashing. I need to check more to ensure that this holds for most functions, but it has been true for the ones I've tried so far.

> library(tidyverse)
> library(microbenchmark)
> library(digest)
> stored_hash <- digest(read_csv)
> read_csv <- function(...){cat(file);readr::read_csv(...)}
> times <- microbenchmark(
+ direct = identical(read_csv, readr::read_csv),
+ hashes = {new_hash <- digest(read_csv); new_hash == stored_hash}
+ )
> times %>% autoplot()
> times
Unit: microseconds
   expr    min      lq      mean median      uq       max neval
 direct  6.565  9.2310  11.24539 11.488 12.7190    19.282   100
 hashes 43.898 47.3845 310.99931 67.282 68.5135 24725.754   100

wlandau-lilly commented 6 years ago

Interesting. I wonder how long a function would have to be before hashing becomes the faster approach.

The complete solution using package versions is already implemented and documented in development drake, but I am willing to change direction if you can solve this problem. I could not myself, even though I tried to remove srcrefs. (In fact, the srcrefs were not even there to begin with.)

wlandau-lilly commented 6 years ago

Another note: I just tried utils::removeSource(), but it failed in the same way. See this SO post.

wlandau-lilly commented 6 years ago

See #103. After thinking it over, I believe tracking packages is outside the scope of drake, and I have already reverted my work on this. Packrat is a much better solution. It extends shelf lives and preserves projects. Drake, on the other hand, would shorten shelf lives and break projects if it tried to accomplish its own version of the same thing. Packrat is the ounce of prevention to drake's pound of cure. I explicitly recommend packrat and Docker in the README and vignettes.