rstudio / rmarkdown

Dynamic Documents for R
https://rmarkdown.rstudio.com
GNU General Public License v3.0

_files directory not removed when cache is active for knitr #2408

Open jwhendy opened 2 years ago

jwhendy commented 2 years ago

I was trying to figure out why I ended up with a fname_files directory, despite using self_contained: yes in my document. That led me to this issue which suggested this should be fixed, but I was still experiencing this, and I traced it to cache=T.

Here's a test Rmd.

With the ggplot chunk as-is, all is well. If I add cache=T to the options, I get a test_files directory which is not removed after rendering. The file really is self-contained. I can move it (after renaming and bypassing the annoying "this file will no longer be owned by the directory test_files" message) and open it fine, and the page source shows the png image embedded.

Apologies if this is known/expected; I'm pretty new to Rstudio/knitr and am not very familiar with caching behavior.


Update pre-submit: aaannnd like most things, after I write everything up, I realized I missed something. I'm going to submit anyway; if nothing else it might resolve someone else's confusion/curiosity down the road.

In this comment, I noted the condition ... & !dir_exists(cache_dir). Does this mean that even with standalone documents, if one is using cache=T, one should always expect the linked _files directory?


R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042), RStudio 2022.7.1.554

Locale:
  LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8   
  LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                          
  LC_TIME=English_United States.utf8    

Package version:
  base64enc_0.1.3 bslib_0.4.0     cachem_1.0.6    digest_0.6.29   evaluate_0.16   fastmap_1.1.0   fs_1.5.2       
  glue_1.6.2      graphics_4.2.1  grDevices_4.2.1 highr_0.9       htmltools_0.5.3 jquerylib_0.1.4 jsonlite_1.8.0 
  knitr_1.40      magrittr_2.0.3  memoise_2.0.1   methods_4.2.1   R6_2.5.1        rappdirs_0.3.3  rlang_1.0.5    
  rmarkdown_2.16  sass_0.4.2      stats_4.2.1     stringi_1.7.8   stringr_1.4.1   tinytex_0.41    tools_4.2.1    
  utils_4.2.1     xfun_0.32       yaml_2.3.5     

Pandoc version: 2.18

cderv commented 2 years ago

Thanks for the report.

> Does this mean that even with standalone documents, if one is using cache=T, one should always expect the linked _files directory?

Yes. With cache = TRUE, the plot chunk is not re-evaluated, so the plot is not saved to file again; the file written on the first run is reused. Because those files are written to the figure dir, what gets cached is the path to the figure.

What is saved from the plot chunk is in fact the whole generated HTML, like this:

"<img src=\"test_files/figure-html/unnamed-chunk-2-1.png\" width=\"672\" />"

This is because knitr options allow tweaking some HTML attributes, so the generated markup needs to be saved.

This is why we need to keep the _files folder. self_contained = TRUE only takes effect after the knitr step, when Pandoc converts to HTML. This is where the files are encoded so that the HTML depends only on itself. The Pandoc processing is not part of the cache, so we don't save the encoded plot, only the file written in the knitr step.
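A minimal base-R sketch of why the cached output depends on the figure file. The string is the cached HTML quoted above; `img_src()` is a hypothetical helper written for this illustration, not a knitr function.

```r
# The HTML that knitr stores for the cached plot chunk (copied from above).
cached_output <- "<img src=\"test_files/figure-html/unnamed-chunk-2-1.png\" width=\"672\" />"

# Hypothetical helper: pull the src attribute out of an <img> tag.
img_src <- function(html) {
  m <- regmatches(html, regexec('src="([^"]+)"', html))
  vapply(m, function(x) x[2], character(1))
}

src <- img_src(cached_output)
src               # the path the cached HTML points to
file.exists(src)  # Pandoc can only embed the image if this file is still there
```

If the file is missing when the chunk is restored from cache, Pandoc would presumably have no image to embed, since the chunk is not re-run to recreate it.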

I hope this helps understand the behavior. I can understand the confusion though.

Note: more advanced users can configure the cache so that the plot file is not kept and only the plot result in R is cached, in which case the plot is rewritten to file on each render. However, we don't recommend that. More about cache: https://yihui.org/knitr/demo/cache/, https://bookdown.org/yihui/rmarkdown-cookbook/cache.html


Idea @yihui if we want to adapt this:

knitr could also save the file in the _cache folder and copy it to the figure dir during the knit when the cache is used. This way, if self_contained = TRUE, the figure dir would be removed and only the _cache directory would need to be saved.

Currently, if someone wants to save the cache files (on CI, for example), they need to save the *_cache dir AND the _files dir. Not that complex, but it could be confusing.

Is this something you tried in the past?


Full example to reproduce with above file

````r
dir.create(tmp_dir <- tempfile())
owd <- setwd(tmp_dir)
url <- "https://gist.githubusercontent.com/jwhendy/f2f8023f6a520dec938c544b828aa440/raw/bd1369bc17c1096c958ab9b640f814dfca96f049/test.Rmd"
xfun::download_file(url)
#> [1] 0
rmd <- basename(url)
content <- xfun::read_utf8(rmd)
content[15] <- gsub("\\}$", " , cache = TRUE}", content[15])
xfun::write_utf8(content, rmd)
xfun::file_string(rmd)
#> ---
#> title: "test"
#> output:
#>   html_document:
#>     self_contained: yes
#> editor_options:
#>   chunk_output_type: console
#> ---
#>
#> ```{r, echo=F, message=F}
#> library(dplyr)
#> library(ggplot2)
#> ```
#>
#> ```{r, echo=F, message=F , cache = TRUE}
#> ggplot(mtcars, aes(x = mpg)) +
#>   geom_histogram(binwidth = 5, color="white")
#> ```
rmarkdown::render(rmd, quiet = TRUE)
#>
#> Attachement du package : 'dplyr'
#> Les objets suivants sont masqués depuis 'package:stats':
#>
#>     filter, lag
#> Les objets suivants sont masqués depuis 'package:base':
#>
#>     intersect, setdiff, setequal, union
fs::dir_tree()
#> .
#> ├── test.html
#> ├── test.Rmd
#> ├── test_cache
#> │   └── html
#> │       ├── unnamed-chunk-4_8a7807240fa030296e9595005d997eb3.RData
#> │       ├── unnamed-chunk-4_8a7807240fa030296e9595005d997eb3.rdb
#> │       ├── unnamed-chunk-4_8a7807240fa030296e9595005d997eb3.rdx
#> │       └── __packages
#> └── test_files
#>     └── figure-html
#>         └── unnamed-chunk-4-1.png
setwd(owd)
unlink(tmp_dir, recursive = TRUE)
````

yihui commented 2 years ago

I haven't tried that idea before, but it sounds like a good idea. Users can do it by themselves, though, by setting knitr::opts_chunk$set(fig.path = knitr::opts_chunk$get('cache.path')) inside the Rmd document.

You have explained the problem clearly, and I don't have anything to add. In this case, the *_files directory is indeed not needed anymore for self_contained = TRUE, but only if the *.html output file doesn't need to be regenerated in the future. If we delete the *_files directory and regenerate *.html, we will run into an error (plot files not found).
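The suggestion above could look like this in a setup chunk near the top of the Rmd. This is only a sketch: it assumes `cache.path` has already been set by the time the chunk runs (`rmarkdown::render()` is assumed to do this), and it has not been checked against every output format.

```r
# Setup chunk sketch: write figures into the cache directory, so that cached
# chunks and the plot files they reference live in the same place.
# Assumption: cache.path is already set when this line is evaluated.
knitr::opts_chunk$set(fig.path = knitr::opts_chunk$get("cache.path"))
```

With this in place, preserving the cache (on CI, for example) should only require keeping the *_cache directory, though the caveat above still applies: deleting the figure files and re-rendering from cache will fail.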

jwhendy commented 2 years ago

> I hope this helps understand the behavior. I can understand the confusion though.

I tried... though I admit the nuances between what happens with cache=T and standalone are not so clear. That said, I don't necessarily need to understand :)

The practical issue I found bothersome was the whole "this file is owned by this directory." I ran into this as I wanted to upload my output .html to my team site and Windows brought the directory along with it. If we could break that link, the need to copy files is covered, and I could just delete the _files directories when I happened to see them (if I wanted).

If there's some additional improvement, I'm of course also all for it.

jwhendy commented 2 years ago

Another nuance to this I wanted to bring up. I was using this trick so that I could maintain a notebooks directory separate from my generated output. In my header, I have:

knit: (function(input, ...) {rmarkdown::render(input, output_dir = "../output")})

Even with no caching, this results in a persisting _files directory. Not sure it's trivial to pass a different output directory around to the internals for cleanup, but wanted to mention it. My current idea for a workflow is not working due to this, as my hope was to keep my notebooks directory clean, but either (a) it gets littered with the output files and I have to move them manually or (b) my output directory has these htmls with inherited ownership I can't share easily as they want to bring the directory with them.

allefeld commented 4 months ago

@cderv, sorry for picking up this old thread, but I ran into the same issue and I am still confused after your explanation.

I understand that if the output of a cached chunk contains a reference to an external file, that file needs to be preserved. I would prefer it if it were preserved in the _cache directory, but okay.

However, I have the case that I have two subsequent R chunks in a qmd file. The first runs a simulation and is cached. The second creates a plot based on data generated in the simulation. The cached chunk does not create a plot, and the chunk creating the plot is not cached. However, I still get a _files directory despite embed-resources: true.

I don't see the need to preserve this image file, since it is created by code which runs always (based on data which is either generated or read from the cache).

cderv commented 4 months ago

@allefeld this could be a Quarto issue rather than an R Markdown issue. Can you share an example with Quarto? Or did you also reproduce it without Quarto, with an .Rmd file only?

I am mentioning this because Quarto also handles intermediate folder deletion, and can take over from R Markdown / knitr in some cases. We would need to check this.

allefeld commented 4 months ago

I encountered this problem with Quarto, then searched for a solution and found this issue. Example, file cacheplot.qmd:

---
embed-resources: true
---

```{r}
#| cache: true
df <- data.frame(
    x = rnorm(10),
    y = rnorm(10)
)
```

```{r}
library(ggplot2)
ggplot(df) +
  aes(x = x, y = y) +
  geom_point()
```

Processing and result:

$ quarto render cacheplot.qmd

processing file: cacheplot.qmd
1/5
2/5 [unnamed-chunk-1]
3/5
4/5 [unnamed-chunk-2]
5/5
output file: cacheplot.knit.md

pandoc
  to: html
  output-file: cacheplot.html
  standalone: true
  embed-resources: true
  section-divs: true
  html-math-method: mathjax
  wrap: none
  default-image-extension: png

metadata
  document-css: false
  link-citations: true
  date-format: long
  lang: en

Output created: cacheplot.html

$ tree
.
├── cacheplot_cache
│   └── html
│       ├── __packages
│       ├── unnamed-chunk-1_1f3cf5ef4d9a16d2aeb67486d75e2a0d.RData
│       ├── unnamed-chunk-1_1f3cf5ef4d9a16d2aeb67486d75e2a0d.rdb
│       └── unnamed-chunk-1_1f3cf5ef4d9a16d2aeb67486d75e2a0d.rdx
├── cacheplot_files
│   └── figure-html
│       └── unnamed-chunk-2-1.png
├── cacheplot.html
└── cacheplot.qmd

5 directories, 7 files


But the same happens with R Markdown; file `cacheplot.Rmd` with the same contents as `cacheplot.qmd`. Knitting from within RStudio prints to the Render tab:

|.......................... | 50% [unnamed-chunk-1]

processing file: cacheplot.Rmd

output file: cacheplot.knit.md

/usr/local/bin/pandoc +RTS -K512m -RTS cacheplot.knit.md --to html4 --from markdown+autolink_bare_uris+tex_math_single_backslash --output cacheplot.html --lua-filter /home/ca/R/x86_64-pc-linux-gnu-library/4.4/rmarkdown/rmarkdown/lua/pagebreak.lua --lua-filter /home/ca/R/x86_64-pc-linux-gnu-library/4.4/rmarkdown/rmarkdown/lua/latex-div.lua --embed-resources --standalone --variable bs3=TRUE --section-divs --template /home/ca/R/x86_64-pc-linux-gnu-library/4.4/rmarkdown/rmd/h/default.html --no-highlight --variable highlightjs=1 --variable theme=bootstrap --mathjax --variable 'mathjax-url=https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' --include-in-header /tmp/RtmpFxCGmA/rmarkdown-str25fe3af0008c.html
[WARNING] This document format requires a nonempty <title> element.
  Defaulting to 'cacheplot.knit' as the title.
  To specify a title, use 'title' in metadata or --metadata title="...".

Output created: cacheplot.html

Files left:

$ tree
.
├── cacheplot_cache
│   └── html
│       ├── __packages
│       ├── unnamed-chunk-1_b5b0e826442eccad0bb796bab4405671.RData
│       ├── unnamed-chunk-1_b5b0e826442eccad0bb796bab4405671.rdb
│       └── unnamed-chunk-1_b5b0e826442eccad0bb796bab4405671.rdx
├── cacheplot_files
│   └── figure-html
│       └── unnamed-chunk-2-1.png
├── cacheplot.html
└── cacheplot.Rmd

5 directories, 7 files

I'd be happy to create an issue on `quarto-cli` if that is more useful.

Debian 12.5, Quarto 1.5.37, R Studio 2023.12.0 Build 369, R 4.4.1, knitr 1.47, rmarkdown 2.27

cderv commented 4 months ago

> But the same happens with R Markdown; file cacheplot.Rmd with the same contents as cacheplot.qmd.

So this is definitely related. Looking at the code, we do not clean the figure dir when a cache directory exists. https://github.com/rstudio/rmarkdown/blob/4b2f3426c1f46a8d207b0d661fc51c2508ecdad7/R/render.R#L824-L838

So this would be the part to improve, probably by using some knitr information to decide whether the supporting directory can be deleted.
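The cleanup condition discussed in this thread can be illustrated with a small base-R sketch. `can_clean_files_dir()` is a hypothetical name, and the logic is a simplification of what the thread describes (no cleanup once a cache directory exists), not the actual code in render.R.

```r
# Hypothetical helper, not rmarkdown's code: the supporting *_files directory
# is only considered removable when the output is self-contained AND no cache
# directory exists next to the document.
can_clean_files_dir <- function(self_contained, cache_dir) {
  isTRUE(self_contained) && !dir.exists(cache_dir)
}

cache_dir <- file.path(tempdir(), "cacheplot_cache")
unlink(cache_dir, recursive = TRUE)             # start from a clean state

before <- can_clean_files_dir(TRUE, cache_dir)  # no cache directory yet
dir.create(cache_dir)
after <- can_clean_files_dir(TRUE, cache_dir)   # cache directory now exists

unlink(cache_dir, recursive = TRUE)             # tidy up
```

Under this rule, any cached chunk in the document keeps the whole *_files directory alive, even when the figures come from uncached chunks, which matches the behavior reported above.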