Escaping of non UTF8 character in comments fails

cderv commented 3 years ago

This was first reported in https://github.com/rstudio/bookdown/issues/1260 by a user who had Chinese characters in code chunks.

I am opening this issue to track it in the right place and to help solve it. A PR already attempts a fix: #112.

I can also reproduce it on a French Windows computer (also non-UTF-8 by default) when using an accented character:

downlit::highlight("# é\n1:5")
#> Error in gsub("<U+2029>", "\033", text[is_comment], fixed = TRUE): input string 1 is invalid UTF-8

I believe the issue is that the text passed to token_escape() is # <e9> and no longer the original # é, so somewhere the string is being marked with an incorrect encoding. This could be because any code that will be parsed is assumed to be UTF-8: https://github.com/r-lib/downlit/blob/e22f072cfeaf91fbe7c38aeabd351aaa184d36fd/R/utils.R#L49

encoding = "UTF-8" here means that the text passed to parse() is assumed to be UTF-8; no conversion is done. In my case the text is latin-1, the default on my system. Forcing UTF-8 in downlit (https://github.com/r-lib/downlit/commit/9a0d670c1b317fbd77ee62d4b5589beee034fcc3) may therefore require converting the text to UTF-8 first. Doing so fixes the problem, at least on my side.

# String is latin-1
text <- "# é"
Encoding(text)
#> [1] "latin1"

# parsing keeps the encoding by default
native_parse <- utils::getParseData(parse(text = text, keep.source = TRUE))
native_parse
#>   line1 col1 line2 col2 id parent   token terminal text
#> 1     1    1     1    3  1      0 COMMENT     TRUE  # é
Encoding(utils::getParseText(native_parse, 1))
#> [1] "latin1"

# Assuming UTF-8 wrongly marks the result as UTF-8 without converting it
utf8_parse <- utils::getParseData(parse(text = text, keep.source = TRUE, encoding = "UTF-8"))
utf8_parse
#>   line1 col1 line2 col2 id parent   token terminal   text
#> 1     1    1     1    3  1      0 COMMENT     TRUE # <e9>
Encoding(utils::getParseText(utf8_parse, 1))
#> [1] "UTF-8"

# I think we need to convert to UTF-8 before parsing
text_enc <- enc2utf8(text)
Encoding(text_enc)
#> [1] "UTF-8"
utf8_parse <- utils::getParseData(parse(text = text_enc, keep.source = TRUE, encoding = "UTF-8"))
utf8_parse
#>   line1 col1 line2 col2 id parent   token terminal text
#> 1     1    1     1    3  1      0 COMMENT     TRUE  # é
Encoding(utils::getParseText(utf8_parse, 1))
#> [1] "UTF-8"

# String is latin-1
text <- "# é"
text_enc <- enc2utf8(text)

downlit::highlight(text)
#> Error in gsub("<U+2029>", "\033", text[is_comment], fixed = TRUE): input string 1 is invalid UTF-8
downlit::highlight(text_enc)
#> [1] "<span class='c'># é</span>"
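The difference between declaring a string to be UTF-8 and actually converting it can also be seen at the byte level. The following is a minimal base R sketch, not downlit's code: the variable names are made up for illustration, and the latin-1 string is built from raw bytes so the result should not depend on the session's locale.

```r
# Build "# é" from raw latin-1 bytes (0xe9 is é in latin-1), so the example
# does not depend on the encoding of this file or of the session.
latin1 <- rawToChar(as.raw(c(0x23, 0x20, 0xe9)))
Encoding(latin1) <- "latin1"

# Declaring: only the encoding mark changes, the bytes stay latin-1.
declared <- latin1
Encoding(declared) <- "UTF-8"

# Converting: the bytes are re-encoded, é becomes the two bytes c3 a9.
converted <- enc2utf8(latin1)

charToRaw(declared)
#> [1] 23 20 e9
charToRaw(converted)
#> [1] 23 20 c3 a9

# Only the converted string is a valid UTF-8 byte sequence.
validUTF8(declared)
#> [1] FALSE
validUTF8(converted)
#> [1] TRUE
```

The declared string has the same shape as what gsub() rejects in the error above: it is marked UTF-8 but its bytes are not valid UTF-8.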

I believe the above holds when downlit is called directly with non-UTF-8 content. In the context of R Markdown, I don't really understand why the error would happen in bookdown: R Markdown assumes the file is UTF-8 and works in UTF-8, so content passed to downlit should already be UTF-8.

I also did not look specifically at the Windows test that #112 tries to fix; it may be a different problem from this one. If I can help as a Windows user, tell me.

dmurdoch commented 3 years ago

Have you tried running the UCRT build of R, described here: https://svn.r-project.org/R-dev-web/trunk/WindowsBuilds/winutf8/ucrt3/howto.html ? The general problem on Windows has been that some R functions convert strings to the native encoding; if chars aren't representable there (or the function thinks they aren't) you get the escapes instead. The new build is an attempt to make UTF-8 the native encoding, so this problem will go away.
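To check whether a given session already uses UTF-8 as its native encoding, something like the following base R sketch could be used. It relies only on l10n_info(); the variable names are illustrative, and the code page element is assumed to be reported only on Windows.

```r
# Ask R how it sees the current locale.
info <- l10n_info()

# TRUE when the native encoding of this session is UTF-8.
native_is_utf8 <- isTRUE(info[["UTF-8"]])
native_is_utf8

# On Windows, l10n_info() also reports the active code page.
if (.Platform$OS.type == "windows") {
  info[["codepage"]]
}
```

When native_is_utf8 is FALSE, strings that pass through a conversion to the native encoding can lose characters or come back as escapes such as <e9>.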

hadley commented 3 years ago

@cderv can you check that #112 fixes the problem? I think it should; I just need to work out the right hack to get this working on 3.6 and lower.

hadley commented 3 years ago

@dmurdoch fwiw, my expectation is that we can get this working on non-UCRT builds on Windows, just with a little more work.

cderv commented 3 years ago

@hadley After installing from #112, I no longer get any error.

downlit::highlight("# é\n1:5")
#> [1] "<span class='c'># é</span>\n<span class='m'>1</span><span class='o'>:</span><span class='m'>5</span>"
packageVersion("downlit")
#> [1] '0.2.9000.9001'

I'll see if this is fixed for bookdown too, or if that is something else.

r-lib / downlit

Escaping of non UTF8 character in comments fails #113