ropensci / drake

An R-focused pipeline toolkit for reproducibility and high-performance computing
https://docs.ropensci.org/drake
GNU General Public License v3.0

Is there a way to stop dynamic file targets from rerunning across different machines? #1350

shirdekel closed 3 years ago

shirdekel commented 3 years ago

Question

I work on my drake plan across two computers, and dynamic file targets rerun every time I switch computers (even without any changes to the plan itself). I'm assuming this is because drake detects that the file path has changed, since the beginning of the path differs between computers.

Is there a way of avoiding this?

Reproducible example

In other words, can the below be done without having to rerun the directory target?

library(drake)

file_machine_1 <- file.path(tempdir(), "x")
file.create(file_machine_1)
#> [1] TRUE
file_machine_2 <- file.path(tempdir(), "x")
file.create(file_machine_2)
#> [1] TRUE

plan <-
    drake_plan(
        directory = target(
            file_machine_1,
            format = "file"
        )
    )

make(plan)
#> ▶ target directory

plan <-
    drake_plan(
        directory = target(
            file_machine_2,
            format = "file"
        )
    )

make(plan)
#> ▶ target directory

Created on 2020-12-10 by the reprex package (v0.3.0)

Session info

``` r
devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.0.2 (2020-06-22)
#>  os       macOS Mojave 10.14.6
#>  system   x86_64, darwin17.0
#>  ui       X11
#>  language (EN)
#>  collate  en_AU.UTF-8
#>  ctype    en_AU.UTF-8
#>  tz       Australia/Sydney
#>  date     2020-12-10
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version     date       lib source
#>  assertthat    0.2.1       2019-03-21 [1] CRAN (R 4.0.0)
#>  backports     1.1.10      2020-09-15 [1] CRAN (R 4.0.2)
#>  base64url     1.4         2018-05-14 [1] CRAN (R 4.0.0)
#>  callr         3.4.4       2020-09-07 [1] CRAN (R 4.0.2)
#>  cli           2.2.0       2020-11-20 [1] CRAN (R 4.0.2)
#>  crayon        1.3.4       2017-09-16 [1] CRAN (R 4.0.0)
#>  desc          1.2.0       2018-05-01 [1] CRAN (R 4.0.0)
#>  devtools      2.3.0       2020-04-10 [1] CRAN (R 4.0.0)
#>  digest        0.6.27      2020-10-24 [1] CRAN (R 4.0.2)
#>  drake       * 7.12.6.9000 2020-10-22 [1] Github (ropensci/drake@cf85aa9)
#>  ellipsis      0.3.1       2020-05-15 [1] CRAN (R 4.0.0)
#>  evaluate      0.14        2019-05-28 [1] CRAN (R 4.0.0)
#>  fansi         0.4.1       2020-01-08 [1] CRAN (R 4.0.0)
#>  filelock      1.0.2       2018-10-05 [1] CRAN (R 4.0.0)
#>  fs            1.5.0       2020-07-31 [1] CRAN (R 4.0.2)
#>  glue          1.4.2       2020-08-27 [1] CRAN (R 4.0.2)
#>  highr         0.8         2019-03-20 [1] CRAN (R 4.0.0)
#>  hms           0.5.3       2020-01-08 [1] CRAN (R 4.0.0)
#>  htmltools     0.5.0       2020-06-16 [1] CRAN (R 4.0.1)
#>  igraph        1.2.6       2020-10-06 [1] CRAN (R 4.0.2)
#>  knitr         1.30        2020-09-22 [1] CRAN (R 4.0.2)
#>  lifecycle     0.2.0       2020-03-06 [1] CRAN (R 4.0.0)
#>  magrittr      2.0.1       2020-11-17 [1] CRAN (R 4.0.2)
#>  memoise       1.1.0       2017-04-21 [1] CRAN (R 4.0.0)
#>  pillar        1.4.7       2020-11-20 [1] CRAN (R 4.0.2)
#>  pkgbuild      1.1.0       2020-07-13 [1] CRAN (R 4.0.2)
#>  pkgconfig     2.0.3       2019-09-22 [1] CRAN (R 4.0.0)
#>  pkgload       1.1.0       2020-05-29 [1] CRAN (R 4.0.0)
#>  prettyunits   1.1.1       2020-01-24 [1] CRAN (R 4.0.0)
#>  processx      3.4.4       2020-09-03 [1] CRAN (R 4.0.2)
#>  progress      1.2.2       2019-05-16 [1] CRAN (R 4.0.0)
#>  ps            1.3.4       2020-08-11 [1] CRAN (R 4.0.2)
#>  purrr         0.3.4       2020-04-17 [1] CRAN (R 4.0.0)
#>  R6            2.5.0       2020-10-28 [1] CRAN (R 4.0.2)
#>  remotes       2.2.0       2020-07-21 [1] CRAN (R 4.0.2)
#>  rlang         0.4.9       2020-11-26 [1] CRAN (R 4.0.2)
#>  rmarkdown     2.5         2020-10-21 [1] CRAN (R 4.0.2)
#>  rprojroot     2.0.2       2020-11-15 [1] CRAN (R 4.0.2)
#>  sessioninfo   1.1.1       2018-11-05 [1] CRAN (R 4.0.0)
#>  storr         1.2.4       2020-10-12 [1] CRAN (R 4.0.2)
#>  stringi       1.5.3       2020-09-09 [1] CRAN (R 4.0.2)
#>  stringr       1.4.0       2019-02-10 [1] CRAN (R 4.0.0)
#>  testthat      2.3.2       2020-03-02 [1] CRAN (R 4.0.0)
#>  tibble        3.0.4       2020-10-12 [1] CRAN (R 4.0.2)
#>  tidyselect    1.1.0       2020-05-11 [1] CRAN (R 4.0.0)
#>  txtq          0.2.3       2020-06-23 [1] CRAN (R 4.0.2)
#>  usethis       1.6.1       2020-04-29 [1] CRAN (R 4.0.0)
#>  vctrs         0.3.5       2020-11-17 [1] CRAN (R 4.0.2)
#>  withr         2.3.0       2020-09-22 [1] CRAN (R 4.0.2)
#>  xfun          0.19        2020-10-30 [1] CRAN (R 4.0.2)
#>  yaml          2.2.1       2020-02-01 [1] CRAN (R 4.0.0)
#>
#> [1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library
```

wlandau commented 3 years ago

tempdir() is not only different for different machines but also for different R sessions.

# on Linux
$ Rscript -e 'tempdir()'
[1] "/tmp/RtmpOofH4k"
$ Rscript -e 'tempdir()'
[1] "/tmp/RtmpW6PZzE"

Different machines also have different directory structures.

# on Mac OS
tempdir()
#> [1] "/var/folders/k3/q1f45fsn4_13jbn0742d4zj40000gn/T//RtmpSJ4JdL"

The project should stay up to date if you use relative file paths (relative to the project root).

library(drake)
library(fs)
dir_create("x")
writeLines("lines", "x/y")
plan <- drake_plan(x = target("x/y", format = "file"))
make(plan)
shirdekel commented 3 years ago

You're right about the reprex; tempdir() didn't illustrate my question properly.

I think the issue was that I was using here::here() to pass paths to the dynamic file targets. I thought of these paths as relative because that's how I typed them, but here::here() returns an absolute path. I changed the file paths in my targets to use file.path() instead, so now it should be fine.
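
A minimal base R sketch of the difference described above. The two project roots are hypothetical stand-ins for the same project on two machines; here::here() would return something like the absolute forms, while file.path() with relative components gives the same string everywhere.

```r
# Hypothetical project roots standing in for two machines (illustration only).
root_machine_1 <- "/Users/alice/project"
root_machine_2 <- "/home/alice/project"

# A path written relative to the project root is the same string on both machines.
relative_path <- file.path("data", "input")

# An absolute path, like the output of here::here("data", "input"), embeds the root.
absolute_1 <- file.path(root_machine_1, relative_path)
absolute_2 <- file.path(root_machine_2, relative_path)

identical(absolute_1, absolute_2)
#> [1] FALSE
```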

shirdekel commented 3 years ago

This is actually still a problem for me, even after stripping my plan down to a single target with a single character path to one of the directories that keep invalidating from machine to machine. I struggled to create a reprex, though, because I'm using r_make(). I wrote the one below, which doesn't reproduce the problem because, for some reason, the directories I create aren't found, but I think it sketches the problem a bit better than my previous reprex. Any tips on making a better reprex here? Any ideas on how to debug this in general?

library(drake)
isolate_example("reprex", {
    library(fs)
    dir_create("x/z/dir")
    dir_create("y/z/dir")
    writeLines(
      c(
        "library(drake)",
        "my_plan = drake_plan(dir= target(\"dir\", format = \"file\"))",
        "drake_config(my_plan)"
      ),
      "x/z/_drake.R"
    )
    writeLines(
      c(
        "library(drake)",
        "my_plan = drake_plan(dir= target(\"dir\", format = \"file\"))",
        "drake_config(my_plan)"
      ),
      "y/z/_drake.R"
    )
    options(drake_source = "x/z/_drake.R")
    r_make()
    options(drake_source = "y/z/_drake.R")
    r_make()
})
#> ▶ target dir
#> Warning message:
#> missing dynamic files for target dir:
#>   dir 
#> ▶ target dir
#> Warning message:
#> missing dynamic files for target dir:
#>   dir

Created on 2020-12-24 by the reprex package (v0.3.0)

wlandau commented 3 years ago

I would stick with a simpler example that captures the problem you are facing. (Your example above looks like it is trying to get at something different.) It won't be a "reprex" exactly because it requires running the files on a different machine, but it can still get the point across.

library(drake)
writeLines("contents", "x.txt")
plan <- drake_plan(out = target("x.txt", format = "file"))
make(plan)
# Transfer everything to a different machine
library(drake)
plan <- drake_plan(out = target("x.txt", format = "file"))
make(plan)

If files are invalidating from machine to machine, it has to mean the hash of the file is changing somehow. So whatever the equivalent of x/z/dir is in your real use case, one thing to check is digest::digest(file = TRUE) on each of the files inside.
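
One way to run that check is sketched below. It uses tools::md5sum(), which ships with R, as a stand-in for digest::digest(file = TRUE); the checksums will not match the hashes drake stores, but they should be enough to compare two machines. hash_directory() is a hypothetical helper, not part of drake, and all.files = TRUE is set so that hidden files are listed as well.

```r
# Hypothetical helper: one row per file under `dir`, hidden files included.
hash_directory <- function(dir) {
  files <- list.files(dir, all.files = TRUE, recursive = TRUE, no.. = TRUE)
  data.frame(
    file = files,
    md5 = unname(tools::md5sum(file.path(dir, files))),
    stringsAsFactors = FALSE
  )
}

# Run this on each machine and compare the two tables.
# "x/z/dir" is the example directory from earlier in the thread.
# hash_directory("x/z/dir")
```

A row that appears on only one machine, or a checksum that differs, points to the file that is invalidating the target.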

But this may not be caused by files at all. One thing that can throw people off in general is different packages on different machines if you use namespaced function calls. If your plan has fs::dir_copy() and you have different versions of fs on different machines, then fs::dir_copy() may be different, which could invalidate the target. Sometimes these functions deparse and hash differently even if the package version is the same. So you can try either avoiding namespaced calls or using renv to set up a reproducible package library for your project.
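
A rough way to compare a namespaced function across machines, assuming only base R: record the package version together with a checksum of the deparsed function. This only approximates whatever drake does internally. function_fingerprint() is a hypothetical helper, and tools::file_ext is an arbitrary stand-in for a function such as fs::dir_copy.

```r
# Hypothetical helper: package version plus a checksum of the deparsed function body.
function_fingerprint <- function(fun, package) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeLines(deparse(fun), tmp)
  list(
    version = as.character(utils::packageVersion(package)),
    md5 = unname(tools::md5sum(tmp))
  )
}

# Run on each machine and compare the results.
function_fingerprint(tools::file_ext, "tools")
```

If the versions match but the checksums differ, the function deparses differently on the two machines, which is the situation described above.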

shirdekel commented 3 years ago

I finally figured this out! Both your suggestions were very useful, but unfortunately not applicable for me because the problem directory was empty, and I am already using renv.

Turns out that this directory was empty on only one machine. For some reason, one machine had a .DS_Store file in the directory, while the other didn't! This seems to be because the directory is synced by Dropbox, which apparently ignores .DS_Store files. So the target was invalidating because drake thought that I was adding or removing a file each time.
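
A quick way to spot this kind of stray file, sketched in base R: compare a directory listing with and without hidden files. hidden_files() is a hypothetical helper, and the demo directory is created only for illustration.

```r
# Hypothetical helper: entries that only show up when hidden files are listed.
hidden_files <- function(dir) {
  all_entries <- list.files(dir, all.files = TRUE, no.. = TRUE)
  visible <- list.files(dir)
  setdiff(all_entries, visible)
}

# Demo: a directory that looks empty but contains a .DS_Store file.
demo_dir <- file.path(tempdir(), "looks_empty")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, ".DS_Store")))

list.files(demo_dir)
#> character(0)
hidden_files(demo_dir)
#> [1] ".DS_Store"
```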