ropensci / katex

Server side math to html rendering in R
https://docs.ropensci.org/katex/reference/katex.html

Escape unicode placeholders in HTML output #4

Closed bwiernik closed 3 years ago

bwiernik commented 3 years ago

Handles all of the characters that come to mind and does not interfere with real HTML output, as far as I can tell:

Simple escape example

``` r
rd <- "a\\frac{1}{3}τρσ😀发短信"
rd
#> [1] "a\\frac{1}{3}Ï„Ï\201σðŸ\230\200å\217‘短信ð\235„ž"
enc2native(rd)
#> [1] "a\\frac{1}{3}Ï„Ï\201σðŸ\230\200å\217‘短信ð\235„ž"
gsub(
  pattern = "<(U\\+[0-9A-Fa-f]{4,8})>",
  replacement = "<\\1>",
  x = enc2native(rd)
)
#> [1] "a\\frac{1}{3}Ï„Ï\201σðŸ\230\200å\217‘短信ð\235„ž"
```

Created on 2021-07-15 by the [reprex package](https://reprex.tidyverse.org) (v2.0.0)

katex example

``` r
rd <- katex::math_to_rd(katex::example_math())
rd
#> [1] "\\if{html}{\\out{\n\nf(x)=1s2pe-12(x-µs)2f(x)= {\\frac{1}{\\sigma\\sqrt{2\\pi}}}e^{- {\\frac {1}{2}} (\\frac {x-\\mu}{\\sigma})^2}f(x)=s2p1e-21(sx-µ)2\n}}\n\\if{latex,text}{\n\\deqn{\nf(x)= {\\frac{1}{\\sigma\\sqrt{2\\pi}}}e^{- {\\frac {1}{2}} (\\frac {x-\\mu}{\\sigma})^2}\n}{\nf(x)= {\\frac{1}{\\sigma\\sqrt{2\\pi}}}e^{- {\\frac {1}{2}} (\\frac {x-\\mu}{\\sigma})^2}\n}}"
#> attr(,"class")
#> [1] "Rdtext"
gsub(
  pattern = "<(U\\+[0-9A-Fa-f]{4,8})>",
  replacement = "<\\1>",
  x = enc2native(rd)
)
#> [1] "\\if{html}{\\out{\n\nf(x)=1s2pe-12(x-µs)2f(x)= {\\frac{1}{\\sigma\\sqrt{2\\pi}}}e^{- {\\frac {1}{2}} (\\frac {x-\\mu}{\\sigma})^2}f(x)=s2p<U+200B>1<U+200B>e-21<U+200B>(sx-µ<U+200B>)2\n}}\n\\if{latex,text}{\n\\deqn{\nf(x)= {\\frac{1}{\\sigma\\sqrt{2\\pi}}}e^{- {\\frac {1}{2}} (\\frac {x-\\mu}{\\sigma})^2}\n}{\nf(x)= {\\frac{1}{\\sigma\\sqrt{2\\pi}}}e^{- {\\frac {1}{2}} (\\frac {x-\\mu}{\\sigma})^2}\n}}"
#> attr(,"class")
#> [1] "Rdtext"
```

Created on 2021-07-15 by the [reprex package](https://reprex.tidyverse.org) (v2.0.0)
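
As displayed on this page, the replacement string in the `gsub()` call is identical to the text it matches, most likely because the page rendering decoded HTML entities in the code. Given the title of this pull request, the intended replacement was presumably the entity-escaped form. The sketch below works under that assumption and runs on a literal string, so it does not depend on the locale-specific behaviour of `enc2native()`. The function name `escape_unicode_placeholders` is hypothetical and not part of katex.

``` r
# Hypothetical helper (not part of katex): escape the angle brackets around
# <U+XXXX> placeholders so that HTML renders them as text, not as tags.
# ASSUMPTION: the intended replacement is "&lt;\\1&gt;".
escape_unicode_placeholders <- function(x) {
  gsub(
    pattern = "<(U\\+[0-9A-Fa-f]{4,8})>",
    replacement = "&lt;\\1&gt;",
    x = x
  )
}

# Real tags are left alone; only the placeholder is escaped.
escaped <- escape_unicode_placeholders("<span>x<U+200B>y</span>")
escaped
#> [1] "<span>x&lt;U+200B&gt;y</span>"
```

Because the pattern requires a literal `U+` followed by 4 to 8 hex digits inside the brackets, ordinary tags such as `<span>` do not match.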

Does not handle this bit from the R documentation:

> Note UTF-16 surrogate pairs in \unnnn\uoooo form will be converted to a single Unicode point, so for example \uD834\uDD1E gives the single character \U1D11E. However, unpaired values in the surrogate range such as in the string "abc\uD834de" will be converted to a non-standard-conformant UTF-8 string (as is done by most other software): this may change in future.

Those escapes do not have a distinctive pattern and would generally indicate a mistake in the string anyway.

Non-handled malformed UTF-16 characters

``` r
enc2native("abcde")
#> [1] "abcde"
```

Created on 2021-07-15 by the [reprex package](https://reprex.tidyverse.org) (v2.0.0)

Closes #2

jeroen commented 3 years ago

Thanks, I like this solution. Perhaps we don't need enc2native() at all if we can substitute the non-ASCII characters with escape sequences, so that we don't get the approximate ASCII analogue character?
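
One way to read this suggestion, sketched under assumptions: skip `enc2native()` and replace each non-ASCII code point directly with an HTML numeric character reference. The helper name `escape_non_ascii` is hypothetical and not part of katex. Whether the Rd HTML output accepts such references in every context has not been verified here.

``` r
# Hypothetical helper (not part of katex): replace every non-ASCII code point
# with an HTML numeric character reference such as &#x200B;.
# NA inputs are not handled in this sketch.
escape_non_ascii <- function(x) {
  vapply(enc2utf8(x), function(s) {
    cp <- utf8ToInt(s)  # integer code points of one string
    out <- ifelse(
      cp > 127L,
      sprintf("&#x%X;", cp),             # non-ASCII: numeric reference
      intToUtf8(cp, multiple = TRUE)     # ASCII: keep the character
    )
    paste(out, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

escape_non_ascii("a\u03c4\u200bb")
#> [1] "a&#x3C4;&#x200B;b"
```

Since the result is pure ASCII, it no longer depends on the native encoding, so no lossy transliteration can occur.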

jeroen commented 3 years ago

Maybe a safer route is to do the conversion in javascript, e.g. using https://github.com/mathiasbynens/he