quarto-dev / quarto-cli

Open-source scientific and technical publishing system built on Pandoc.
https://quarto.org

Provide examples of chunk cache invalidation #1872

Open torfason opened 2 years ago

torfason commented 2 years ago

I've been trying to get code chunk cache invalidation to work based on a global variable (which will eventually be a string with a digest hash of several of my data sets). For now, I would be content with the following.

In code chunk 1, I set my_variable:

my_variable <- "a"

Then in code chunk 2, I do a plot that gets cached unless I change the value of my_variable in code chunk 1:

#| cache: true
#| cache-vars: my_variable
#| cache-globals: my_variable
#| autodep: true

ggplot(...)

I've been trying various combinations of the chunk settings using the new Quarto syntax, but none of them seems to trigger a refresh of the cached chunk.

An immediate solution would be to have an example here, but even better would be if the documentation (for example at https://quarto.org/docs/reference/cells/cells-knitr.html#cache ) could be a bit clearer on how to achieve this.

torfason commented 2 years ago

Update: I did find that the old style can be used in this case:

```{r, cache.vars=my_variable}
ggplot(...)
```

So this is kind of resolved, although I think it would be nicer to be able to use the new yaml-style syntax for this.

cderv commented 2 years ago

@torfason we don't have a specific example for Quarto, but the knitr-related documentation should apply as is, since nothing about caching is specific to Quarto.

See our examples about caching in the R Markdown Cookbook (https://bookdown.org/yihui/rmarkdown-cookbook/cache.html#cache)

> I've been trying to get code chunk cache invalidation to work based on a global variable (which will eventually be a string with a digest hash of several of my data sets)

The correct way to do that with the knitr cache is to use cache.extra, as documented in the link above. It is more a convention than a real chunk option: it can take any value and is not used for anything, but because changing any chunk option invalidates the cache, changing its value does too (more on knitr cache invalidation: https://yihui.org/en/2018/06/cache-invalidation/).

So I would do it like this, using rlang::hash() for an R object, or rlang::hash_file() if you have files:

````markdown
---
title: "Using cache with knitr"
format: html
---

```{r}
my_variable <- "b"
```

```{r, cache.extra = rlang::hash(my_variable)}
#| cache: true
library(ggplot2)
Sys.sleep(5) # added to check cache is used
ggplot(mpg, aes(displ, hwy, colour = class)) +
  geom_point()
```
````

`cache.vars` and `cache.globals` are for more advanced usage, and tweaking `cache.vars` could have the side effect of modifying the default behavior (which in the above example is to also invalidate the cache if `mpg` changes).

That said, you are trying to use `autodep`, so using `cache.globals` is justified.

> So this is kind of resolved, although I think it would be nicer to be able to use the new yaml-style syntax for this.

Using the YAML form works, but you need to remember that it is YAML code, not R code as with the usual chunk options passed in ```` ```{r, key=val} ````.

This means that in

````yaml
#| cache: true
#| cache-vars: my_variable
#| cache-globals: my_variable
#| autodep: true
````

`my_variable` is a string and not the R variable you want. You need to use the specific `key: !expr val` syntax for that. This is mentioned in our chunk option documentation for knitr: https://quarto.org/docs/computations/r.html#chunk-options

It is one of the differences between the two syntaxes. The previous form will still work, though.

For the chunk in the example above this would mean

```{r}
#| cache: true
#| cache-extra: !expr rlang::hash(my_variable)
library(ggplot2)
Sys.sleep(5) # added to check cache is used
ggplot(mpg, aes(displ, hwy, colour = class)) +
  geom_point()
```

Hope this will help you understand. 
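
To make the mechanism concrete, here is a minimal sketch (not taken from the knitr documentation) of why a hash is a convenient `cache.extra` value: it stays the same while the object or file is unchanged, and it changes as soon as the content does. The temporary file and the use of the built-in `mtcars` data are illustrative assumptions, not part of the original example.

```r
library(rlang)

# Hashing the same value twice gives the same string, so the chunk
# options stay identical and knitr keeps reusing the cache.
hash_a_first  <- hash("a")
hash_a_second <- hash("a")

# A different value gives a different string, so the chunk option
# changes and the cached chunk is re-run.
hash_b <- hash("b")

# hash_file() does the same for file contents (illustrative temp file).
data_file <- tempfile(fileext = ".csv")
write.csv(head(mtcars, 10), data_file)
file_hash_before <- hash_file(data_file)

write.csv(head(mtcars, 5), data_file)  # simulate the data set changing
file_hash_after <- hash_file(data_file)

same_value_same_hash      <- identical(hash_a_first, hash_a_second)
new_value_new_hash        <- !identical(hash_a_first, hash_b)
changed_file_changed_hash <- !identical(file_hash_before, file_hash_after)
```

All three flags at the end should be `TRUE`. In a document, the hash call would go directly into the `cache-extra` option as in the chunk above.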

I'll try to add an example to the set of examples we are building. Thank you for asking.

torfason commented 2 years ago

Thank you for the detailed response.

> Using the YAML form works, but you need to remember that it is YAML code, not R code as with the usual chunk options passed in ```` ```{r, key=val} ````.

I think this is key. I was not able to remember this being YAML code (and thus not evaluated) because I had not seen this form of YAML before :-). But I was deducing that something like this was probably happening, and now I know.

I had not caught the description of this in the chunk options documentation. If it's new, great, if it was there all along, sorry I missed it, in particular the one about:

#| fig-cap: !expr paste("Air", "Quality")

which would probably have been enough for me to put this together for my caching scenario.

It is, however, getting to be quite a bit of syntax with the #| and then the !expr. I wonder if this !expr is Quarto specific or something from YAML in general, and if there was a way to have a shorter form of this sequence. But this is clearly the way to go, since as the documentation says, the comment style chunk options are more readable than the old style.

cderv commented 2 years ago

> It is, however, getting to be quite a bit of syntax with the #| and then the !expr. I wonder if this !expr is Quarto specific or something from YAML in general, and if there was a way to have a shorter form of this sequence. But this is clearly the way to go, since as the documentation says, the comment style chunk options are more readable than the old style.

!expr comes from the YAML parser in R (the yaml package), which provides a specific YAML handler for R code. See the man page: https://rdrr.io/cran/yaml/man/yaml.load.html

> There is a built-in handler that will evaluate expressions that are tagged with the ‘!expr’ tag. Currently this handler is disabled by default for security reasons. If a ‘!expr’ tag exists and this is set to FALSE a warning will occur. Alternately, you can set the option named ‘yaml.eval.expr’ via the options function to turn on evaluation.
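
As a small illustration of the difference between a plain YAML value and a value tagged with `!expr`, the following sketch calls the yaml package directly. It only shows the behaviour of `yaml.load()` itself; how knitr and Quarto configure the parser internally is not shown here and may differ.

```r
library(yaml)

# Without a tag, the value is just the string "my_variable";
# nothing is looked up in the R session.
plain <- yaml.load("cache-vars: my_variable")

# With the !expr tag and evaluation switched on, the R code is run
# and its result becomes the value.
evaluated <- yaml.load(
  'fig-cap: !expr paste("Air", "Quality")',
  eval.expr = TRUE
)

print(plain[["cache-vars"]])   # "my_variable"
print(evaluated[["fig-cap"]])  # "Air Quality"
```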

We could come up with a shorter handler (e.g. !r instead of !expr). However, I am not sure it would be easier.

This feature is only available with the knitr computation engine. Other computation engines have no such feature.

I guess that when you need to pass some values to an R code chunk, also using the older knitr form would work fine. But it mixes the two styles, which is not ideal.

Anyway, hopefully the doc is enough, and I'll add an example of this as well. Thanks for asking!

torfason commented 2 years ago

Thanks, it is helpful to understand how these things hang together. Feel free to close the issue, as the solution (how to invalidate the chunk cache using comment-style settings) has been explained and the documentation for the approach has been pointed out.

I have a final thought, because allowing arbitrary R code, as with !expr, will by definition always be limited to a single platform.

The thought – and this would of course be a completely separate issue – is whether going forward, it would make sense to have a minimal cross-platform templating language embedded in Quarto yaml processing. The absolute-minimum functionality would be to just allow the retrieval of a single variable in the underlying language, whatever that language might be. Perhaps:

#| cache-extra: "{{my_hash}}"

And then I would just have to make sure to have run my_hash <- rlang::hash(my_tibble) in an initialization chunk. And if I had instead been using python, I would use equivalent code in python to create a variable.

The choice of templating language/syntax would be a design choice, but I went for {{my_hash}} because it would make the templating language be a subset of mustache, and as long as it is in quotes it seems to be legal YAML.

Mustache seems to be a (if not the) top templating engine out there, and has both Python and R implementations (for R it is whisker, for Python it is chevron). So in fact, one way to implement this could be to just run any yaml chunk parameters through these packages depending on what would be the underlying engine.
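
A rough sketch of what this could look like on the R side, assuming the whisker, yaml and rlang packages are installed. This is only an illustration of the idea; Quarto does not process chunk options this way, and the variable names are made up for the example.

```r
library(whisker)
library(yaml)

# Value computed in an earlier chunk (illustrative data set).
my_hash <- rlang::hash(mtcars)

# Chunk option line as it would be written with the proposed syntax.
option_template <- 'cache-extra: "{{my_hash}}"'

# Render the template against the computed value, then parse the
# result as ordinary YAML.
rendered <- whisker.render(option_template, data = list(my_hash = my_hash))
chunk_options <- yaml.load(rendered)
```

After this, `chunk_options[["cache-extra"]]` holds the same string as `my_hash`, which is the value the cache would be keyed on.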

cderv commented 2 years ago

This is an interesting idea. I mean setting a chunk option using shortcode form like we have for variables

This would work well for string object like a hash. However this would mean:

I don't have an idea of the feasibility of this. Currently I don't think we have a way for a computation chunk to have an impact on the rendering process.

@dragonstyle, just pinging you here in case you have some thoughts. I don't know if this is something we could think of in the future, and I don't know if this is a topic we plan to think about: how to make use of computed variable content in other chunk options in a cross-language way.

torfason commented 2 years ago

And still I'm learning! I did not know about the Quarto shortcodes. That would of course be the way to go (if one were going at all). And I agree that this would involve two independent (but also potentially independently valuable) steps, both evaluating shortcodes in YAML (which would even now allow chunk options to be set from environment variables, it seems) and separately to use values from computational chunks in shortcodes (which could be useful for shortcodes both in YAML and other places).

To make values available to shortcodes, it seems quite acceptable not to try to pick up all the variables in the R environment, but to require an explicit command to set each one, such as something like (in the R engine):

quarto::set_shortcode_var("my_hash", rlang::hash(my_tibble))
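
To illustrate the explicit-registration idea, here is a purely hypothetical sketch in plain R. Neither `set_shortcode_var()` nor `resolve_shortcodes()` exists in Quarto or in the quarto R package; both are defined here only for the example, and the value `"abc123"` stands in for a real hash.

```r
# Hypothetical registry of values exposed to shortcodes.
shortcode_vars <- new.env(parent = emptyenv())

# Hypothetical setter: store a single named value as a string.
set_shortcode_var <- function(name, value) {
  stopifnot(is.character(name), length(name) == 1)
  assign(name, as.character(value), envir = shortcode_vars)
  invisible(value)
}

# Hypothetical resolver: replace each {{name}} placeholder in a line
# of text with the registered value.
resolve_shortcodes <- function(text) {
  for (name in ls(shortcode_vars)) {
    placeholder <- paste0("{{", name, "}}")
    value <- get(name, envir = shortcode_vars)
    text <- gsub(placeholder, value, text, fixed = TRUE)
  }
  text
}

set_shortcode_var("my_hash", "abc123")
resolved <- resolve_shortcodes('cache-extra: "{{my_hash}}"')
print(resolved)  # cache-extra: "abc123"
```

Only values that were registered explicitly are substituted, so nothing else from the R environment leaks into the chunk options.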

Thanks for engaging with these ideas, I'll keep my eyes open for what happens in the future, but am happy to get a better understanding for my own use-cases in the meantime.