rstudio / plumber

Turn your R code into a web API.
https://www.rplumber.io
Other
1.39k stars 255 forks source link

in parseUTF8() on Windows with enc "unknown", src is ignored / / overwritten #916

Closed ArcadeAntics closed 10 months ago

ArcadeAntics commented 1 year ago
> sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.1 (2023-06-16 ucrt)
 os       Windows 10 x64 (build 19045)
 system   x86_64, mingw32
 ui       RStudio
 language (EN)
 collate  English_Canada.utf8
 ctype    English_Canada.utf8
 tz       America/Toronto
 date     2023-09-07
 rstudio  2023.06.1+524 Mountain Hydrangea (desktop)
 pandoc   2.12 @ C:\\Users\\iris\\anaconda3\\Scripts\\pandoc.exe

─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 package     * version  date (UTC) lib source
 cli           3.6.1    2023-03-23 [1] CRAN (R 4.3.0)
 crayon        1.5.2    2022-09-29 [1] CRAN (R 4.3.0)
 essentials  * 0.3.0.13 2023-08-01 [1] local
 jsonlite      1.8.5    2023-06-05 [1] CRAN (R 4.3.1)
 later         1.3.1    2023-05-02 [1] CRAN (R 4.3.0)
 lifecycle     1.0.3    2022-10-07 [1] CRAN (R 4.3.0)
 magrittr      2.0.3    2022-03-30 [1] CRAN (R 4.3.0)
 plumber       1.2.1    2022-09-06 [1] CRAN (R 4.3.1)
 promises      1.2.0.1  2021-02-11 [1] CRAN (R 4.3.0)
 R6            2.5.1    2021-08-19 [1] CRAN (R 4.3.0)
 Rcpp          1.0.11   2023-07-06 [1] CRAN (R 4.3.1)
 rlang         1.1.1    2023-04-28 [1] CRAN (R 4.3.1)
 rstudioapi    0.15.0   2023-07-07 [1] CRAN (R 4.3.1)
 sessioninfo   1.2.2    2021-12-06 [1] CRAN (R 4.3.1)
 stringi       1.7.12   2023-01-11 [1] CRAN (R 4.3.0)
 swagger       3.33.1   2020-10-02 [1] CRAN (R 4.3.1)
 this.path   * 2.0.0.7  2023-09-07 [1] local
 webutils      1.1      2020-04-28 [1] CRAN (R 4.3.1)

 [1] C:/Users/iris/AppData/Local/R/win-library/4.3
 [2] C:/Program Files/R/R-4.3.1/library

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
>

Example application or steps to reproduce the problem

if (.Platform$OS.type != "windows") {
    stop("windows only bug")
} else local({
    FILE.R <- tempfile(fileext = ".R")
    on.exit(unlink(FILE.R))
    writeLines("5 + 6", FILE.R)
    exprs <- plumber:::parseUTF8(FILE.R)
    cat("original filename: ", FILE.R, "\n", sep = "")
    cat("srcfile  filename: "); print(attr(exprs, "srcfile"))
})

Describe the problem in detail

In parseUTF8() the comments claim that on Windows with encoding "unknown" the file must be re-encoded to native. I would agree with this, though I would think this is no longer necessary since the C runtime on Windows is now ucrt and the native encoding is now UTF-8. Regardless, it writes the lines into a new file in the native encoding.

It then claims that despite parsing a different file, the source reference is pointed to the original file. This is incorrect since parse() overwrites its argument srcfile when file is a character string and keep.source = TRUE.

To avoid srcfile being overwritten when passed to parse(), you should use parse(keep.source = FALSE) (unintuitively).

Here is some code showing the fix working as intended:

my_parseUTF8 <- function (file)
{
    lines <- plumber:::readUTF8(file)
    src <- srcfilecopy(file, lines, isFile = TRUE)
    file <- tempfile()
    on.exit(unlink(file))
    writeLines(lines, file)
    parse(file, keep.source = FALSE, srcfile = src)
}

if (.Platform$OS.type != "windows") {
    stop("windows only bug")
} else local({
    FILE.R <- tempfile(fileext = ".R")
    on.exit(unlink(FILE.R))
    writeLines("5 + 6", FILE.R)
    exprs <- my_parseUTF8(FILE.R)
    cat("original filename: ", FILE.R, "\n", sep = "")
    cat("srcfile  filename: "); print(attr(exprs, "srcfile"))
})
meztez commented 10 months ago

@ArcadeAntics how would this issue affect current plumber code? What would be the impact of not fixing it? I ask because I do not see any reference to the srcfile attribute being used or private$parsed being exposed.

The fix is simple enough. Just trying to gauge the ramification.

ArcadeAntics commented 10 months ago

@meztez it does not affect the package:plumber code itself, but it ruins my code. I want to know where each function is written, and the easiest way to do that is source references, but this bug messes that up. The impact of not fixing it is my code continues to not work and I have to implement some hackish solution to get it to behave the way I want. Fixing this will not negatively impact this package's code or anyone else's, only improve it.

meztez commented 10 months ago

@ArcadeAntics could you share a ruined piece of code? Maybe an example plumber file with the encoding that causes the issue?

meztez commented 10 months ago

Fixing this will not negatively impact this package's code or anyone else's, only improve it.

That may be true, or not, I'm not sure. What would be the impact of not executing this code block.

                if (keep.source) {
                  text <- readLines(file, warn = FALSE, encoding = encoding)
                  if (!length(text)) 
                    text <- ""
                  close(file)
                  file <- stdin()
                  srcfile <- srcfilecopy(filename, text, file.mtime(filename), 
                    isFile = TRUE)
                }

The input to .Internal(parse would not be the same.

meztez commented 10 months ago

Seems you are right, no meaningful impact.

aa <- function (file)
{
  lines <- plumber:::readUTF8(file)
  enc <- if (any(Encoding(lines) == "UTF-8"))
    "UTF-8"
  else "unknown"
  src <- srcfilecopy(file, lines, isFile = TRUE)
  exprs <- try(parse(file, keep.source = TRUE, srcfile = src,
                     encoding = enc))
  if (inherits(exprs, "try-error")) {
    stop("Error sourcing ", file)
  }
  exprs
}

bb <- function (file)
{
  lines <- plumber:::readUTF8(file)
  enc <- if (any(Encoding(lines) == "UTF-8"))
    "UTF-8"
  else "unknown"
  src <- srcfilecopy(file, lines, isFile = TRUE)

  file <- tempfile()
  on.exit(unlink(file), add = TRUE)
  writeLines(lines, file)

  exprs <- try(parse(file, keep.source = TRUE, srcfile = src,
                     encoding = enc))
  if (inherits(exprs, "try-error")) {
    stop("Error sourcing ", file)
  }
  exprs
}
cc <- function (file)
{
  lines <- plumber:::readUTF8(file)
  enc <- if (any(Encoding(lines) == "UTF-8"))
    "UTF-8"
  else "unknown"
  src <- srcfilecopy(file, lines, isFile = TRUE)

  file <- tempfile()
  on.exit(unlink(file), add = TRUE)
  writeLines(lines, file)

  exprs <- try(parse(file, keep.source = FALSE, srcfile = src,
                     encoding = enc))
  if (inherits(exprs, "try-error")) {
    stop("Error sourcing ", file)
  }
  exprs
}

aa1 <- aa("R/sample.R")
bb1 <- bb("R/sample.R")
cc1 <- cc("R/sample.R")

attributes(aa1)
attributes(bb1)
attributes(cc1)
> attributes(aa1)
$srcref
$srcref[[1]]
aaa <- "eric la chaperone"

$srcfile
R/sample.R 

$wholeSrcref
aaa <- "eric la chaperone"

> attributes(bb1)
$srcref
$srcref[[1]]
aaa <- "eric la chaperone"

$srcfile
/tmp/RtmpBvrTql/file260ca737a299a 

$wholeSrcref
aaa <- "eric la chaperone"

> attributes(cc1)
$srcref
$srcref[[1]]
aaa <- "eric la chaperone"

$srcfile
R/sample.R 

$wholeSrcref
aaa <- "eric la chaperone"

from an ISO-8852-1 sample.R file

aaa <- "eric la chaperone"
ArcadeAntics commented 10 months ago

@meztez even an ASCII encoded file will cause this issue:

function ()
NULL

I put this in a file ~/test-plumber.R and ran attr(plumber:::parseUTF8("~/test-plumber.R"), "srcfile") and it gave me C:\Users\iris\AppData\Local\Temp\Rtmpshydhs\file1a7c39064634. This also fails for scripts written in latin1 which used to be the native encoding before R 4.2.0. In my specific example, I had something like:

function ()
utils::getSrcFilename(sys.function(), full.names = TRUE)

which of course fails because the source reference of the function points to the wrong file.

Yes, the input to .Internal(parse()) is different, but it returns the same result. If you need convincing of that, look at the code of do_parse and R_ParseConn and R_ParseVector and do_readLines. You can see that they both do the same thing. And you might say "well look, readLines does character translation" but in this case, since encoding is unknown, no translation is done. But then you might say "look at do_parse, it sets known_to_be_latin1 and known_to_be_utf8 for character inputs" but again, because the encodings of the strings are not marked, it does not set either of these. The output from .Internal(parse()) will be completely identical.

meztez commented 10 months ago

@ArcadeAntics I came to the same conclusion. See #930.