Open jwhendy opened 2 years ago
Thanks for the report.
Does this mean that even with standalone documents, if one is using cache=T, one should always expect the linked _files directory?
Yes. When `cache = TRUE`, a plot chunk is not re-evaluated and the plot is not re-saved to file. Instead, the file where the plot was saved the first time is reused. Since those files are written to the figure dir, the path to the figure is what gets saved.
What is saved from the plot chunk is in fact all of the generated HTML, like this:
"<img src=\"test_files/figure-html/unnamed-chunk-2-1.png\" width=\"672\" />"
This is because knitr options allow tweaking some HTML attributes, so the generated markup needs to be saved.
This is why we need to keep the `_files` folder. `self_contained = TRUE` has an effect only after the knitr step, when Pandoc converts to HTML. This is where the files are encoded so that the HTML depends only on itself. The Pandoc processing is not part of the cache, so we don't save the encoded plot, only the file written in the knitr step.
I hope this helps understand the behavior. I can understand the confusion though.
Note: for more advanced users, the cache setting can be controlled so that only the plot result in R is cached, not the reference to the file, so that the plot is rewritten to file on each render. However, we don't recommend that. More about cache: https://yihui.org/knitr/demo/cache/, https://bookdown.org/yihui/rmarkdown-cookbook/cache.html
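To make the note above concrete: knitr documents numeric values for the `cache` chunk option, where `cache = 1` loads the results of the computation from the cache but still runs the output hooks and re-saves recorded plots to files. A minimal sketch of a plot chunk using it (the chunk label `cached-plot` is only an illustrative name, and this is not the recommended setup):

```{r cached-plot, cache=1}
# cache = 1: the code is not evaluated again, but the recorded plot
# is written to the figure directory again on each render
plot(cars)
```

This only changes how the figure file is produced; it is an assumption, not something confirmed in this thread, that it has any effect on whether the `_files` directory is cleaned up afterwards.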
Idea @yihui if we want to adapt this:
knitr could also save the file in the `_cache` folder and copy it to the figure dir during the knit when the cache is used. This way, if `self_contained = TRUE`, the figure dir would be removed and only the `_cache` directory would need to be saved.
Currently, if someone wants to save the cache (on CI for example), they need to save the `*_cache` dir AND the `_files` dir. Not that complex, but it could be confusing.
Is this something you tried in the past?
I haven't tried that idea before, but it sounds like a good idea. Users can do it by themselves, though, by setting `knitr::opts_chunk$set(fig.path = knitr::opts_chunk$get('cache.path'))` inside the Rmd document.
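A minimal sketch of that suggestion as a setup chunk near the top of the Rmd document. It assumes `cache.path` already holds its final value when this chunk runs; the chunk label `setup` and `include=FALSE` are conventional choices here, not requirements:

```{r setup, include=FALSE}
# Write figure files under the cache path instead of the *_files directory
knitr::opts_chunk$set(fig.path = knitr::opts_chunk$get("cache.path"))
```

The intent is that cached plot files then live next to the other cache files, so only the `*_cache` directory has to be kept between renders.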
You have explained the problem clearly, and I don't have anything to add. In this case, the `*_files` directory is indeed not needed anymore for `self_contained = TRUE`, but only if the `*.html` output file doesn't need to be regenerated in the future. If we delete the `*_files` directory and regenerate `*.html`, we will run into an error (plot files not found).
I hope this helps understand the behavior. I can understand the confusion though.
I tried... though I admit the nuances between what happens with `cache=T` and standalone are not so clear. That said, I don't necessarily need to understand :)
The practical issue I found bothersome was the whole "this file is owned by this directory." I ran into this as I wanted to upload my output .html to my team site and Windows brought the directory along with it. If we could break that link, the need to copy files is covered, and I could just delete the `_files` directories when I happened to see them (if I wanted).
If there's some additional improvement, I'm of course also all for it.
Another nuance to this I wanted to bring up. I was using this trick so that I could maintain a `notebooks` directory separate from my generated output. In my header, I have:
knit: (function(input, ...) {rmarkdown::render(input, output_dir = "../output")})
Even with no caching, this results in a persisting `_files` directory. Not sure it's trivial to pass a different output directory around to the internals for cleanup, but I wanted to mention it. My current idea for a workflow is not working due to this, as my hope was to keep my notebooks directory clean, but either (a) it gets littered with the output files and I have to move them manually, or (b) my output directory has these htmls with inherited ownership that I can't share easily, as they want to bring the directory with them.
@cderv, sorry for picking up this old thread, but I ran into the same issue and I am still confused after your explanation.
I understand that if the output of a cached chunk contains a reference to an external file, that file needs to be preserved. I would prefer it if it were preserved in the `_cache` directory, but okay.
However, my case is that I have two subsequent R chunks in a qmd file. The first runs a simulation and is cached. The second creates a plot based on data generated in the simulation. The cached chunk does not create a plot, and the chunk creating the plot is not cached. However, I still get a `_files` directory despite `embed-resources: true`.
I don't see the need to preserve this image file, since it is created by code which runs always (based on data which is either generated or read from the cache).
@allefeld this could be a Quarto issue and not an R Markdown issue. Can you share an example with Quarto? Or did you also reproduce it without Quarto, with a .Rmd file only?
I am mentioning this, because Quarto is also handling intermediary folder deletion, and can take over R Markdown / Knitr in some cases. We would need to check this.
I encountered this problem with Quarto, then searched for a solution and found this issue. Example, file `cacheplot.qmd`:
---
embed-resources: true
---

```{r}
#| cache: true
df <- data.frame(
  x = rnorm(10),
  y = rnorm(10)
)
```

```{r}
library(ggplot2)
ggplot(df) +
  aes(x = x, y = y) +
  geom_point()
```

Processing and result:
$ quarto render cacheplot.qmd
processing file: cacheplot.qmd
1/5
2/5 [unnamed-chunk-1]
3/5
4/5 [unnamed-chunk-2]
5/5
output file: cacheplot.knit.md
pandoc
  to: html
  output-file: cacheplot.html
  standalone: true
  embed-resources: true
  section-divs: true
  html-math-method: mathjax
  wrap: none
  default-image-extension: png

metadata
  document-css: false
  link-citations: true
  date-format: long
  lang: en

Output created: cacheplot.html
$ tree
.
├── cacheplot_cache
│   └── html
│       ├── __packages
│       ├── unnamed-chunk-1_1f3cf5ef4d9a16d2aeb67486d75e2a0d.RData
│       ├── unnamed-chunk-1_1f3cf5ef4d9a16d2aeb67486d75e2a0d.rdb
│       └── unnamed-chunk-1_1f3cf5ef4d9a16d2aeb67486d75e2a0d.rdx
├── cacheplot_files
│   └── figure-html
│       └── unnamed-chunk-2-1.png
├── cacheplot.html
└── cacheplot.qmd

5 directories, 7 files
But the same happens with R Markdown; file `cacheplot.Rmd` with the same contents as `cacheplot.qmd`. Knitting from within RStudio prints to the Render tab:
|.......................... | 50% [unnamed-chunk-1]
processing file: cacheplot.Rmd
output file: cacheplot.knit.md
/usr/local/bin/pandoc +RTS -K512m -RTS cacheplot.knit.md --to html4 --from markdown+autolink_bare_uris+tex_math_single_backslash --output cacheplot.html --lua-filter /home/ca/R/x86_64-pc-linux-gnu-library/4.4/rmarkdown/rmarkdown/lua/pagebreak.lua --lua-filter /home/ca/R/x86_64-pc-linux-gnu-library/4.4/rmarkdown/rmarkdown/lua/latex-div.lua --embed-resources --standalone --variable bs3=TRUE --section-divs --template /home/ca/R/x86_64-pc-linux-gnu-library/4.4/rmarkdown/rmd/h/default.html --no-highlight --variable highlightjs=1 --variable theme=bootstrap --mathjax --variable 'mathjax-url=https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' --include-in-header /tmp/RtmpFxCGmA/rmarkdown-str25fe3af0008c.html
[WARNING] This document format requires a nonempty
Output created: cacheplot.html
Files left:
$ tree
.
├── cacheplot_cache
│   └── html
│       ├── __packages
│       ├── unnamed-chunk-1_b5b0e826442eccad0bb796bab4405671.RData
│       ├── unnamed-chunk-1_b5b0e826442eccad0bb796bab4405671.rdb
│       └── unnamed-chunk-1_b5b0e826442eccad0bb796bab4405671.rdx
├── cacheplot_files
│   └── figure-html
│       └── unnamed-chunk-2-1.png
├── cacheplot.html
└── cacheplot.Rmd

5 directories, 7 files
I'd be happy to create an issue on `quarto-cli` if that is more useful.
Debian 12.5, Quarto 1.5.37, R Studio 2023.12.0 Build 369, R 4.4.1, knitr 1.47, rmarkdown 2.27
But the same happens with R Markdown; file cacheplot.Rmd with the same contents as cacheplot.qmd.
So this is definitely related. Looking at the code, we do not clean the figure dir when a cache directory exists. https://github.com/rstudio/rmarkdown/blob/4b2f3426c1f46a8d207b0d661fc51c2508ecdad7/R/render.R#L824-L838
So this would be the part to improve, probably by using some knitr information to decide whether the supporting directory can be deleted.
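As a rough illustration of the behavior described above, here is a small sketch in plain R. The helper `can_clean_files_dir()` is hypothetical and is not rmarkdown's actual code; it only mirrors the rule as stated in this thread, that the supporting directory is kept whenever a cache directory exists:

```r
# Hypothetical helper, NOT rmarkdown's implementation: decide whether the
# supporting *_files directory may be removed after rendering.
can_clean_files_dir <- function(self_contained, cache_dir) {
  # Rule as described in this thread: any existing cache directory blocks
  # the cleanup, even if no cached chunk produced a plot.
  isTRUE(self_contained) && !dir.exists(cache_dir)
}

cache_dir <- tempfile("cacheplot_cache")
before <- can_clean_files_dir(TRUE, cache_dir)  # TRUE: no cache directory yet
dir.create(cache_dir)
after <- can_clean_files_dir(TRUE, cache_dir)   # FALSE: cache directory exists
unlink(cache_dir, recursive = TRUE)
```

An improved rule would need extra information from knitr (for example, whether any cached chunk actually references a figure file) in place of the bare `dir.exists()` check; what that information would look like is left open here.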
I was trying to figure out why I ended up with a `fname_files` directory, despite using `self_contained: yes` in my document. That led me to this issue which suggested this should be fixed, but I was still experiencing this, and I traced it to `cache=T`.

Here's a test Rmd.

With the ggplot chunk as-is, all is well. If I add `cache=T` to the options, I get a `test_files` directory which is not removed after rendering. The file really is self-contained. I can move it (after renaming and bypassing the annoying "this file will no longer be owned by the directory test_files" message) and open it fine, and the page source shows the png image embedded.

Apologies if this is known/expected; I'm pretty new to RStudio/knitr and am not very familiar with caching behavior.
Update pre-submit: aaannnd like most things, after I write everything up, I realized I missed something. I'm going to submit anyway; if nothing else it might resolve someone else's confusion/curiosity down the road.
In this comment, I noted the condition `... & !dir_exists(cache_dir)`. Does this mean that even with standalone documents, if one is using `cache=T`, one should always expect the linked `_files` directory?