mllg / batchtools

Tools for computation on batch systems
https://mllg.github.io/batchtools/
GNU Lesser General Public License v3.0

rmarkdown/knitr workflow with batchtools and caching #166

Closed cfhammill closed 6 years ago

cfhammill commented 6 years ago

Hi,

I've been using batchtools as the backend for the neuroimaging library that I develop, and I also use it for scripting and analysis. I recently converted my tools away from BatchJobs. So far the only issue is that the new system does not play well with rmarkdown/knitr caching. There is a simple fix for one of the two problems, but it's not obvious what is happening at first.

The two issues I've run into are:

  1. Invisible mutation of the registry (i.e. by `batchMap`) fools caching. Take for example the following document:

---
title: A Registry Caching Bug
---

```{r}
library(batchtools)
```

```{r init_registry, cache = TRUE}
reg <- makeRegistry("rcb")
```

```{r experiment, cache = TRUE, dependson = "init_registry"}
jobs <- batchMap(identity, 1:5)
submitJobs(jobs)
waitForJobs()
```

```{r analysis, cache = TRUE, dependson = "experiment"}
##
reduceResultsList(reg = reg)
```

Render it with `rmarkdown::render` and you get the output you'd expect. Now suppose you decide that the unused comment above `reduceResultsList` is unnecessary. Delete it and re-render. What should happen is that knitr just re-reduces the results. What actually happens is that it picks up the version of the registry cached by 'init_registry', which has no jobs, throws a warning, and produces empty output.

Unsurprisingly, there is a simple fix. Because the registry is an environment and is mutated by reference, knitr never sees `reg` change in the 'experiment' chunk; if you add `reg <- reg` after `waitForJobs()`, rmarkdown/knitr will notice that `reg` has changed in that chunk and cache the updated registry.
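A minimal sketch of the fixed chunk (same names as in the document above):

```{r experiment, cache = TRUE, dependson = "init_registry"}
jobs <- batchMap(identity, 1:5)
submitJobs(jobs)
waitForJobs()
reg <- reg  # explicit reassignment so knitr records the updated registry in the cache
```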

2. Suppose some small fraction of your jobs fail. You want to go in and finish the run without invalidating your knitr/rmarkdown cache. So you open up an R session, find the failed jobs, and re-run them, and hooray, everything finishes. Now you go back to your rmarkdown document and start adding post-processing chunks. But none of these work, because batchtools rightly notices that the registry has changed behind the scenes. Now the cached version of your registry is irretrievably invalidated, and as far as I can tell there is no way to unlock it after this. **Addendum: this may not be a common issue; it possibly only occurs if the two R sessions are open simultaneously.**

So my questions are:

1. Are there general purpose guidelines for using batchtools with caching? If not, should there be? Or should users be discouraged from using the two together?
2. Is there a way to fix a locked registry after invalidation? 

Thanks!
wlandau-lilly commented 6 years ago

@cfhammill, this may require a change of direction, but have you considered drake? (I am the package maintainer.) It is a pipeline toolkit that chains together intermediate steps and only builds things that are out of date. In the basic example (`load_basic_example(); make(my_plan)`), all the data crunching happens first, and then the knitr report is rendered at the very end. You have the option to leverage batchtools using `parallelism = "future_lapply"`:

```r
library(drake)
library(future.batchtools)
# Choose other backends:
# https://github.com/HenrikBengtsson/future.batchtools/blob/master/README.md#choosing-batchtools-backend
future::plan(batchtools_local)
load_basic_example()
make(my_plan, parallelism = "future_lapply")
```
wlandau-lilly commented 6 years ago

By the way, I recommend the development version of drake at https://github.com/wlandau-lilly/drake. The CRAN version is far behind.

cfhammill commented 6 years ago

This looks cool, thanks @wlandau-lilly. I hadn't seen drake before, but I have been looking for something like snakemake for R to replace my make/rmarkdown workflows. Still curious about a pure rmarkdown + batchtools solution, however.

mllg commented 6 years ago

> Are there general purpose guidelines for using batchtools with caching? If not, should there be? Or should users be discouraged from using the two together?

The caching mechanism of knitr/rmarkdown cannot detect if something has changed behind the scenes (e.g., more jobs completed), so it is of limited use here.

However, you could do the following:

  1. Load the registry with `loadRegistry(..., writeable = TRUE)` before post-processing and explicitly disable the cache for this step.
  2. Add a knitr chunk option that invalidates the cache whenever the registry has changed (see the sketch after this list):

     `cache.extra = reg$mtime`

     Note that the modification time might not be the best indicator of whether something has changed in the background. E.g., if you copy the file dir to a different directory, the mtime might change, depending on the operating system or file system mount options. I will add a unique checksum to the registry to test this more reliably.
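A minimal sketch of option 2, applied to the 'analysis' chunk from the example document:

```{r analysis, cache = TRUE, dependson = "experiment", cache.extra = reg$mtime}
reduceResultsList(reg = reg)
```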

> Is there a way to fix a locked registry after invalidation?

There is a simple workaround:

```r
reg = getDefaultRegistry()
reg$writeable = TRUE
setDefaultRegistry(reg)
```

But note that this is a very dangerous operation. I'd rather re-load the registry with `loadRegistry(..., writeable = TRUE)`; otherwise you risk a corrupt database.
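I.e., roughly (a sketch, assuming `reg$file.dir` still points at the registry on disk):

```r
# safer: re-read the registry from disk in writeable mode
reg = loadRegistry(reg$file.dir, writeable = TRUE)
```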

mllg commented 6 years ago

It is now possible to detect changes via `reg$hash` (172880c75a0e45767a46d7a157e0afaa6dc542ba).

cfhammill commented 6 years ago

Thanks @mllg.

The advantage of caching in this case is for large experiments: each step in my current use case takes minutes to hours, and caching reduces the number of times steps need to be run. It's not really about re-running as results come in; I just want the ability to do registry surgery once and then keep going.

I can see a couple of issues with solution one. First, if the registry is large, reloading from disk can be quite slow, and I'd rather not reload unnecessarily. Second, if the user has modified the registry away from its config state, reloading loses those changes. Not to make this about my setup, but this is catastrophic in my case: after switching from BatchJobs I've gone config-free. This defaults to the interactive cluster functions, which I immediately change to Torque (roughly as sketched below). If I reload from disk, the registry goes back to interactive and reloads my exported packages (one of which still depends on BatchJobs), which masks several batchtools functions and causes downstream chaos.
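The config-free override looks something like this (a sketch; the registry name and template file are site-specific placeholders):

```r
reg <- makeRegistry("my_experiment")
# swap the default interactive cluster functions for Torque;
# "torque.tmpl" stands in for a site-specific template file
reg$cluster.functions <- makeClusterFunctionsTORQUE("torque.tmpl")
```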

As for the solution to Q2, I thought I had tried essentially this approach and failed, but I may have made a mistake and not set the default registry (I usually still pass the registry around explicitly). I got into a weird state where, no matter what I did, the registry would load in as unwriteable, but maybe being explicit there would fix this.

Thanks again for the help and the great package!

mllg commented 6 years ago

To avoid loading the registry, does this knitr option work for you?

```r
cache.extra = file.mtime(file.path(reg$file.dir, "registry.rds"))
```

As soon as the registry is altered in the background, the cache for the respective chunk should be invalidated; the check only stats the file, so the registry itself does not have to be loaded to detect the change.

For the second problem, that temporary changes to the in-memory registry are lost: not sure how to solve this. However, I'm also not sure that I understood your setup. In order to avoid creating a configuration file, you instead opt to stick to the defaults and overwrite them immediately after loading the registry in your rmarkdown?

cfhammill commented 6 years ago

I'm pretty sure that approach will work, and I think it essentially solves the cache missing background changes to the registry. That solves a different problem from the `reg <- reg` solution I mentioned above, though. Depending on the change time (or the hash, in the next release) just seems like good practice; to ensure that all changes to the registry in memory are preserved across chunks, explicit reassignment appears necessary. Combining the two looks roughly like the sketch below.
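A sketch of the combined pattern (chunk names match the example document above):

```{r experiment, cache = TRUE, dependson = "init_registry", cache.extra = file.mtime(file.path(reg$file.dir, "registry.rds"))}
jobs <- batchMap(identity, 1:5)
submitJobs(jobs)
waitForJobs()
reg <- reg  # keep the in-memory registry cached across chunks
```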

For the second issue, maybe the registry could save the current cluster functions and default resources, but have these get wiped by `loadRegistry` unless the user requests to keep them. Only a small change would be necessary at https://github.com/mllg/batchtools/blob/master/R/saveRegistry.R#L24 and somewhere in `loadRegistry`. And yes, I haven't set a config file for my system, primarily because the `R_BATCHTOOLS_SEARCH_PATH` variable didn't exist when I first converted and I had trouble setting a global config file with our module system. So I opted to set things explicitly in each script.

This problem is probably niche enough to ignore; now that I know about the environment variable I'll set a global config (along the lines of the sketch below). Although I can imagine situations where preserving cluster functions and default resources across registry loads could come in handy.
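A sketch of such a global config file (the template path and resource values are placeholders):

```r
# batchtools.conf.R, found via R_BATCHTOOLS_SEARCH_PATH; values are site-specific
cluster.functions = makeClusterFunctionsTORQUE("torque.tmpl")
default.resources = list(walltime = 3600, memory = 4096)
```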

cfhammill commented 6 years ago

Closing because this isn't a precise question or issue at this point, and there is no clear direction forward.