rstudio / rmarkdown

Dynamic Documents for R
https://rmarkdown.rstudio.com
GNU General Public License v3.0
Application data file gets deleted by clean=TRUE on render() #1095

Open Pablo-Leon opened 7 years ago

Pablo-Leon commented 7 years ago

The intermediates cleaning mechanism is deleting a data file read by the program.

The Rmd: Test_LostFile.Rmd.txt

Rmd Code

````markdown
---
title: "Test_LostFile"
author: "plr"
date: "5 de julio de 2017"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(readr)
```

```{r }
file <- "../temp/LostFile.csv"
dfX <- data.frame( l=letters, n=1:length(letters))
write.csv(dfX, file, row.names = FALSE)

dfY <- read_delim(
  file
  ,delim=","
  ,col_names=TRUE
  ,col_type= cols(
    l = col_character()
    ,n = col_double()
  ))
```
````

With this command line:

export R_LIBS="C:/Users/Me/Documents/R/win-library/3.4"; export LANGUAGE="en_US.utf8"; \
        time Rscript -e "rmarkdown::render('src/Test_LostFile.Rmd', output_file='Test_LostFile.html', clean = TRUE, run_pandoc=FALSE, output_dir='reps', intermediates_dir='tmp.xxx')"

This was run under Cygwin on Windows 7.

First run

On the first run the program works fine and produces this output (because run_pandoc=FALSE):

...
output file: C:/BitSync/INE/Ensayo2016/Adapt2Censo/tmp.xxx/Test_LostFile.knit.md

[1] "C:/BitSync/INE/Ensayo2016/Adapt2Censo/tmp.xxx/Test_LostFile.knit.md"
attr(,"knit_meta")
list()
attr(,"intermediates")
[1] "C:/BitSync/INE/Ensayo2016/Adapt2Censo/tmp.xxx/Test_LostFile.knit.md"
[2] "C:/BitSync/INE/Ensayo2016/Adapt2Censo/tmp.xxx/Test_LostFile.utf8.md"
[3] "C:\\BitSync\\INE\\Ensayo2016\\Adapt2Censo\\reps\\Test_LostFile_files"

$ ls -ltr temp/LostFile.csv
-rwx------+ 1 Me Domain Users 208 Jul  5 10:51 temp/LostFile.csv

The second time:

output file: C:/BitSync/INE/Ensayo2016/Adapt2Censo/tmp.xxx/Test_LostFile.knit.md

[1] "C:/BitSync/INE/Ensayo2016/Adapt2Censo/tmp.xxx/Test_LostFile.knit.md"
attr(,"knit_meta")
list()
attr(,"intermediates")
[1] "C:/BitSync/INE/Ensayo2016/Adapt2Censo/tmp.xxx/Test_LostFile.knit.md"
[2] "C:/BitSync/INE/Ensayo2016/Adapt2Censo/tmp.xxx/Test_LostFile.utf8.md"
**[3] "C:/BitSync/INE/Ensayo2016/Adapt2Censo/tmp.xxx/../temp/LostFile.csv"**
[4] "C:\\BitSync\\INE\\Ensayo2016\\Adapt2Censo\\reps\\Test_LostFile_files"

And the file gets lost:

$ ls -ltr temp/LostFile.csv
ls: cannot access 'temp/LostFile.csv': No such file or directory

The problem seems to be the combination of the ways:

The workaround was to use the rprojroot package to build an absolute path from the project base. This avoids the misidentification of the file.
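A minimal sketch of that kind of workaround, for illustration only. The helper name `project_file()` is hypothetical, and using the `src` directory as the marker for the project root is an assumption about this layout; any other rprojroot criterion would do.

```r
# Hypothetical helper: build an absolute path from the project root.
# Assumption: the project root is the first directory, walking up from
# `start`, that contains a "src" subdirectory.
project_file <- function(..., start = ".") {
  root <- rprojroot::find_root(rprojroot::has_dir("src"), path = start)
  file.path(root, ...)
}

# In the Rmd chunk, instead of file <- "../temp/LostFile.csv":
# file <- project_file("temp", "LostFile.csv")
```

With this, the chunk no longer contains the quoted relative path `"../temp/LostFile.csv"`, which is the string that ends up listed among the intermediates in the second run above.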

regards

rich-iannone commented 6 years ago

@Pablo-Leon I've just tried to reproduce this. My testing environment was macOS, not Cygwin on Windows. I saw exactly the same output as you for the first and second runs. However, LostFile.csv is still present in the temp directory.

Could you perhaps upgrade your rmarkdown/rstudio to the latest and try this again?

pgg1309 commented 6 years ago

Hi @rich-iannone, I have rmarkdown 1.10 and the same problem happens to me. When the .rmd document reads a file from my computer and the option clean = TRUE is used, rmarkdown::render() ends up deleting the file I read the data from.

Let me know if there is any diagnostic I can send you to help fix this bug. Thanks.

cderv commented 2 years ago

This odd behavior still occurs with the latest versions of the tools. I used this code to test:

dir.create(tmp_dir <- tempfile())
owd <- setwd(tmp_dir)
dir.create("src")
xfun::in_dir(
  "src",
  xfun::download_file(
    "https://github.com/rstudio/rmarkdown/files/1125122/Test_LostFile.Rmd.txt", 
    "test.Rmd")  # same case as in the render() calls below, so this also works on case-sensitive file systems
)
dir.create("temp")
fs::dir_tree()

# FIRST RUN
rmarkdown::render('src/test.Rmd', 
                  output_file='Test_LostFile.html', 
                  clean = TRUE, 
                  run_pandoc=FALSE, 
                  output_dir='reps', 
                  intermediates_dir='tmp.xxx')

# File is there
fs::dir_tree(recurse = TRUE)

# SECOND RUN 
rmarkdown::render('src/test.Rmd', 
                  output_file='Test_LostFile.html', 
                  clean = TRUE, 
                  run_pandoc=FALSE, 
                  output_dir='reps', 
                  intermediates_dir='tmp.xxx')
# File is deleted
fs::dir_tree(recurse = TRUE)
list.files(recursive = TRUE, include.dirs = TRUE)

setwd(owd)
unlink(tmp_dir, recursive = TRUE)

I am not quite sure why this happens only on the second run. However, the file is removed because it is listed among the intermediate files when this code runs: https://github.com/rstudio/rmarkdown/blob/8e2ea3ce0626bc9aa20a009d1e1c288da15af78a/R/render.R#L511-L518

This is triggered only when an intermediates directory is set. It seems that resources outside that directory are found, and the handling may then not behave as expected. More details on the behavior follow.

The html_document_base intermediates generator picks up the CSV file at some point because find_external_resources() finds it. I tested this with find_external_resources("src/test.Rmd") after the first run.

What happens is that the Rmd file is purled to detect external resources (https://github.com/rstudio/rmarkdown/blob/0af6b3556adf6e393b2da23c66c695724ea7bd2d/R/html_resources.R#L362-L365), using a static analysis of quoted strings to check whether they could be relative file paths (https://github.com/rstudio/rmarkdown/blob/0af6b3556adf6e393b2da23c66c695724ea7bd2d/R/html_resources.R#L386-L389).

On the first pass, the CSV file does not exist before knitting, so it is not found (https://github.com/rstudio/rmarkdown/blob/0af6b3556adf6e393b2da23c66c695724ea7bd2d/R/html_resources.R#L75-L79). On the second pass it exists, so it is found and added to the intermediates.
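To illustrate the first-pass versus second-pass difference, here is a simplified sketch. It is not rmarkdown's actual code: `find_quoted_files()` is a hypothetical stand-in that only mimics the idea of scanning quoted strings and keeping those that name an existing file relative to the Rmd directory.

```r
# Hypothetical, simplified stand-in for the discovery step (not rmarkdown's code).
find_quoted_files <- function(code, base_dir) {
  # Extract every double-quoted string literal from the code text.
  m <- regmatches(code, gregexpr('"[^"]*"', code))[[1]]
  candidates <- gsub('"', "", m, fixed = TRUE)
  candidates <- candidates[nzchar(candidates)]
  # Keep only strings that name an existing, non-directory file under base_dir.
  paths <- file.path(base_dir, candidates)
  candidates[file.exists(paths) & !dir.exists(paths)]
}

# Throwaway project with the same layout as the example: src/ and temp/.
proj <- tempfile("proj")
dir.create(file.path(proj, "src"), recursive = TRUE)
dir.create(file.path(proj, "temp"))
code <- 'file <- "../temp/LostFile.csv"; write.csv(data.frame(l = letters), file)'

# First pass: the CSV has not been written yet, so nothing is detected.
first_pass <- find_quoted_files(code, file.path(proj, "src"))

# Simulate knitting, which writes the CSV.
write.csv(data.frame(l = letters),
          file.path(proj, "src", "..", "temp", "LostFile.csv"))

# Second pass: the same quoted string now names an existing file.
second_pass <- find_quoted_files(code, file.path(proj, "src"))
```

Under these assumptions, the detection result depends only on whether a previous run already left the file on disk, which would explain why the first run is harmless.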

I believe the issue lies in the fact that the found resource should be copied but is not: https://github.com/rstudio/rmarkdown/blob/0af6b3556adf6e393b2da23c66c695724ea7bd2d/R/html_resources.R#L404-L414

copy_file_with_dir() will run file.copy() like this:

file.copy(
    "C:/Users/chris/AppData/Local/Temp/RtmpiO8b6u/file49002e53caf/src/../temp/LostFile.csv",
    "C:/Users/chris/AppData/Local/Temp/RtmpiO8b6u/file49002e53caf/tmp.xxx/../temp/LostFile.csv"
)

Both arguments resolve to the same file, given the folder tree in the example. This destination file is added to the intermediates and then removed when clean = TRUE.
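A small base R check of that claim, run in a throwaway directory with the same folder names as the example (the directory itself is created only for this illustration):

```r
# Recreate the folder tree from the example in a temporary location.
proj <- tempfile("proj")
for (d in c("src", "temp", "tmp.xxx")) {
  dir.create(file.path(proj, d), recursive = TRUE)
}

# The "source" and "destination" paths, built the same way as in the file.copy() call.
from <- file.path(proj, "src", "..", "temp", "LostFile.csv")
to   <- file.path(proj, "tmp.xxx", "..", "temp", "LostFile.csv")
writeLines("l,n", from)

# Different strings, but they normalize to one and the same file.
same_file <- normalizePath(from) == normalizePath(to)

# Cleaning up the "intermediate" path therefore deletes the original data file.
unlink(to)
still_there <- file.exists(from)
```

Here `same_file` is `TRUE` and `still_there` is `FALSE`, which matches the deletion seen when clean = TRUE.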

It seems like a weird bug in how paths are handled, combined with the folder structure of the example. If I put the CSV file at the same level as the Rmd file, this does not happen.

That is, using this in the Rmd:

file <- "LostFile.csv"

The file is then correctly found and copied to the intermediates directory. The destination file is the following copy inside the intermediates directory, and that copy is the one that gets removed:

C:/Users/chris/AppData/Local/Temp/RtmpiO8b6u/file49002e53caf/tmp.xxx/LostFile.csv

I believe this is an issue with relative file paths using `..`, where the path generated for the copy is not the right one.
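One possible guard, sketched under assumptions: `drop_self_copies()` is a hypothetical helper, not part of rmarkdown, and comparing normalized paths is only one way to detect that a "copy" destination is really the source file.

```r
# Hypothetical helper (not rmarkdown's API): given source paths and the
# destinations computed for them, keep only destinations that are genuinely
# different files, so that the originals are never registered as intermediates.
drop_self_copies <- function(from, to) {
  same <- normalizePath(from, mustWork = FALSE) ==
    normalizePath(to, mustWork = FALSE)
  to[!same]
}

# Illustration in a throwaway directory mirroring the example layout.
proj <- tempfile("proj")
for (d in c("src", "temp", "tmp.xxx")) {
  dir.create(file.path(proj, d), recursive = TRUE)
}
src_csv <- file.path(proj, "src", "..", "temp", "LostFile.csv")
writeLines("l,n", src_csv)

dest_same  <- file.path(proj, "tmp.xxx", "..", "temp", "LostFile.csv")  # the original file
dest_other <- file.path(proj, "tmp.xxx", "LostFile.csv")                # a real copy target

safe_to_clean <- drop_self_copies(c(src_csv, src_csv), c(dest_same, dest_other))
```

In this sketch only `dest_other` remains in `safe_to_clean`, so a later cleanup would leave the data file in `temp/` untouched.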

And... another paths issue in the mix.