Escape unicode placeholders in HTML output

bwiernik commented 3 years ago

Handles all of the characters that come to mind and does not interfere with real HTML output, as far as I can tell:

Simple escape example

``` r
rd <- "a\\frac{1}{3}τρσ😀发短信"
rd
#> [1] "a\\frac{1}{3}Ï„Ï\201ÏƒðŸ\230\200å\217‘çŸä¿¡ð\235„ž"
enc2native(rd)
#> [1] "a\\frac{1}{3}Ï„Ï\201ÏƒðŸ\230\200å\217‘çŸä¿¡ð\235„ž"
gsub(
  pattern = "<(U\\+[0-9A-Fa-f]{4,8})>",
  replacement = "<\\1>",
  x = enc2native(rd)
)
#> [1] "a\\frac{1}{3}Ï„Ï\201ÏƒðŸ\230\200å\217‘çŸä¿¡ð\235„ž"
```

^{Created on 2021-07-15 by the [reprex package](https://reprex.tidyverse.org) (v2.0.0)}
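For readers following along, here is a minimal sketch of the escaping step on its own. It assumes the intended replacement is the HTML entity form `&lt;\\1&gt;` (as rendered on this page, the replacement appears as `<\\1>`, which would be a no-op), and it uses a hypothetical helper name, `escape_unicode_placeholders()`, that is not part of katex. The input is a hand-written string containing literal `<U+XXXX>` placeholders next to a real HTML tag.

``` r
# Hypothetical helper (not part of katex): escape the angle brackets of
# <U+XXXX> placeholders so a browser shows them as text, not as unknown tags.
# Assumption: the replacement is the entity form "&lt;\\1&gt;".
escape_unicode_placeholders <- function(x) {
  gsub(
    pattern = "<(U\\+[0-9A-Fa-f]{4,8})>",  # "<U+" followed by 4 to 8 hex digits
    replacement = "&lt;\\1&gt;",           # keep the code point, escape < and >
    x = x
  )
}

# Literal placeholders mixed with a genuine HTML tag.
input <- "<span>a<U+03C4><U+0001F600></span>"
result <- escape_unicode_placeholders(input)
print(result)
```

The `<span>` and `</span>` tags do not match the pattern because they do not start with `U+` followed by hex digits, so only the placeholders are rewritten.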

katex example

``` r
rd <- katex::math_to_rd(katex::example_math())
rd
#> [1] "\\if{html}{\\out{\n\nf(x)= {\\frac{1}{\\sigma\\sqrt{2\\pi}}}e^{- {\\frac {1}{2}} (\\frac {x-\\mu}{\\sigma})^2}\n}}\n\\if{latex,text}{\n\\deqn{\nf(x)= {\\frac{1}{\\sigma\\sqrt{2\\pi}}}e^{- {\\frac {1}{2}} (\\frac {x-\\mu}{\\sigma})^2}\n}{\nf(x)= {\\frac{1}{\\sigma\\sqrt{2\\pi}}}e^{- {\\frac {1}{2}} (\\frac {x-\\mu}{\\sigma})^2}\n}}"
#> attr(,"class")
#> [1] "Rdtext"
gsub(
  pattern = "<(U\\+[0-9A-Fa-f]{4,8})>",
  replacement = "<\\1>",
  x = enc2native(rd)
)
#> [1] "\\if{html}{\\out{\n\nf(x)= {\\frac{1}{\\sigma\\sqrt{2\\pi}}}e^{- {\\frac {1}{2}} (\\frac {x-\\mu}{\\sigma})^2}\n}}\n\\if{latex,text}{\n\\deqn{\nf(x)= {\\frac{1}{\\sigma\\sqrt{2\\pi}}}e^{- {\\frac {1}{2}} (\\frac {x-\\mu}{\\sigma})^2}\n}{\nf(x)= {\\frac{1}{\\sigma\\sqrt{2\\pi}}}e^{- {\\frac {1}{2}} (\\frac {x-\\mu}{\\sigma})^2}\n}}"
#> attr(,"class")
#> [1] "Rdtext"
```

^{Created on 2021-07-15 by the [reprex package](https://reprex.tidyverse.org) (v2.0.0)}

Does not handle this bit from the R documentation:

> Note UTF-16 surrogate pairs in \unnnn\uoooo form will be converted to a single Unicode point, so for example \uD834\uDD1E gives the single character \U1D11E. However, unpaired values in the surrogate range such as in the string "abc\uD834de" will be converted to a non-standard-conformant UTF-8 string (as is done by most other software): this may change in future.

Those escapes do not have a distinctive pattern and would generally indicate a mistake in the string anyway.
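As a quick check of the surrogate-pair conversion described in the quoted documentation, the arithmetic can be worked through by hand. This is only an illustrative sketch of the standard UTF-16 decoding formula; the variable names are made up for the example and nothing here is part of the pull request.

``` r
# Standard UTF-16 decoding: combine a high and a low surrogate into one
# code point. Values taken from the example in the quoted R documentation.
hi <- 0xD834  # high surrogate, must lie in 0xD800..0xDBFF
lo <- 0xDD1E  # low surrogate, must lie in 0xDC00..0xDFFF

# Each surrogate contributes 10 bits; the result is offset by 0x10000.
code_point <- 0x10000 + (hi - 0xD800) * 0x400 + (lo - 0xDC00)

label <- sprintf("U+%X", code_point)
print(label)
```

This gives `U+1D11E`, matching the `\U1D11E` in the quote. An unpaired value such as `\uD834` has no low surrogate to combine with, which is why it cannot be mapped to a valid code point.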

Non-handled malformed UTF-16 characters

``` r
enc2native("abcde")
#> [1] "abcde"
```

^{Created on 2021-07-15 by the [reprex package](https://reprex.tidyverse.org) (v2.0.0)}

Closes #2

jeroen commented 3 years ago

Thanks, I like this solution. Perhaps we don't need enc2native() at all if we can substitute the non-ASCII characters with escape sequences, so that we won't get the approximate ASCII analogue character?
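One way the substitution idea above might look, purely as a sketch: replace every non-ASCII code point with an HTML numeric character reference, so the string becomes plain ASCII before any native-encoding conversion is needed. The choice of `&#x...;` references is an assumption (the comment does not say which escape sequences are meant), and `to_html_entities()` is a hypothetical helper, not a katex function.

``` r
# Hypothetical helper (not part of katex): turn non-ASCII characters into
# HTML numeric character references such as "&#x3C4;".
# Assumption: hexadecimal references are an acceptable "escape sequence".
to_html_entities <- function(x) {
  stopifnot(is.character(x), length(x) == 1L)
  cps <- utf8ToInt(enc2utf8(x))                # one integer per code point
  pieces <- ifelse(
    cps > 127L,
    sprintf("&#x%X;", cps),                    # non-ASCII: numeric reference
    intToUtf8(cps, multiple = TRUE)            # ASCII: keep the character
  )
  paste(pieces, collapse = "")
}

escaped <- to_html_entities("a\u03c4\U0001F600")
print(escaped)
```

Because the result contains only ASCII characters, it should survive conversion to any native encoding unchanged. Whether numeric references are appropriate everywhere the Rd text ends up would still need checking.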

jeroen commented 3 years ago

Maybe a safer route is to do the conversion in javascript, e.g. using https://github.com/mathiasbynens/he

ropensci / katex

Escape unicode placeholders in HTML output #4