parmsam-pfizer opened 7 months ago
The encoding and decoding require the lzstring library, but as far as I can tell, there's no lzstring implementation for R.
Here are some possibilities, although any of them will take some effort:
I put together a wrapper based on your PR in an R package here. This is my first time using {cpp11}, so I might've missed a few things. Some of the unit tests I added might be helpful. Note that there was one unit test that I was unable to port from the C++ library over to the R package. I would be curious to hear your thoughts about handling that case.
Great! Are you planning on submitting the package to CRAN and maintaining it in the future?
Some thoughts: some of the C++ classes used here were deprecated or removed in newer C++ standards, so you may need `#ifdef`s to do different things on different C++ versions.

Thanks! Fixed the unit test and added a condition to the exported functions to ensure UTF-8 encoding in the string being passed through.
I see what you mean. Looks like the `std::wstring_convert` class was deprecated in C++17. It hasn't been removed yet, though. Similar situation for `std::codecvt_utf8_utf16`. Are there other classes that were removed? Would you recommend switching over to Rcpp instead, since it looks like cpp11 only supports C++11?
Instead of throwing if the string is not UTF-8, I think you should use `enc2utf8()`, so that it just works for users. We do something similar in Shiny: https://github.com/rstudio/shiny/blob/420a2c0/R/utils.R#L1399-L1402
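A minimal sketch of that suggestion; `compress_safe()` is a hypothetical wrapper name, not a function from the package:

```r
# Hypothetical wrapper: re-encode the input to UTF-8 instead of throwing,
# so strings in the native encoding "just work" for users.
compress_safe <- function(x) {
  x <- enc2utf8(x)  # converts from the native encoding when necessary
  # ...then pass the UTF-8 string on to the C++ compression routine
  x
}
```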
I don't know the details of what exactly was removed in C++20. As of R 4.0, the minimum C++ version that is supported is C++11, so this will have to work on C++11 through C++20. I think the existing code is OK on C++11 and 14, and you'll need a different strategy for C++17 and 20. And of course you'll have to do something for Windows. (You should be able to use Github Actions to test on Windows.)
I definitely think you should not switch to Rcpp. Rcpp has been around longer, but cpp11 supports newer C++ features than Rcpp. The minimum C++ version for cpp11 is C++11, but it will work with newer versions as well. cpp11 is also a lighter-weight dependency. See https://github.com/r-lib/cpp11/?tab=readme-ov-file#motivations
You're right. Unfortunately, it looks like `std::wstring_convert` is missing for C++17 on Ubuntu in the R-CMD-check workflow. For example, the Ubuntu latest release job shows the following error:

```
using C++ compiler: ‘g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0’
g++ -std=gnu++17 -I"/opt/R/4.3.3/lib/R/include" -DNDEBUG -I'/home/runner/work/_temp/Library/cpp11/include' -I/usr/local/include -fpic -g -O2 -c code.cpp -o code.o
code.cpp: In function ‘std::string compressToEncodedURIComponent_(std::string)’:
code.cpp:8:8: error: ‘wstring_convert’ is not a member of ‘std’
```

I haven't identified a drop-in replacement for UTF-8 to UTF-16 conversion and vice versa in C++ yet. There doesn't seem to be an easy fix for it.
I think you might be able to use R's `iconv()` function to convert to UTF-16. It looks like there's `UTF-16`, which has a BOM, `UTF-16LE`, and `UTF-16BE` (I don't know if there are others).
```r
iconv("abcdefg", from="UTF-8", to="UTF-16", toRaw=TRUE)
#> [[1]]
#> [1] fe ff 00 61 00 62 00 63 00 64 00 65 00 66 00 67
iconv("abcdefg", from="UTF-8", to="UTF-16LE", toRaw=TRUE)
#> [[1]]
#> [1] 61 00 62 00 63 00 64 00 65 00 66 00 67 00
iconv("abcdefg", from="UTF-8", to="UTF-16BE", toRaw=TRUE)
#> [[1]]
#> [1] 00 61 00 62 00 63 00 64 00 65 00 66 00 67
```
Maybe you can encode it as UTF-16 with BOM in R, then send it in to the function as a raw vector, then in C++ convert that to a UTF-16 string. I asked Claude AI to come up with some code for the C++ side and posted it here: https://gist.github.com/wch/90a2d7446e00c9a5aef09d0f2fe01c73
Great news! The checks for nearly all of the operating systems in the standard GH check workflow are passing now (except macOS-latest). I have merged to my main branch and updated my R-CMD-check to point to the same one used in this repo: https://github.com/parmsam/lzstring-r/pull/4
Good progress! A few things I noticed when I took a quick look:
Most of the code and tests assume that the native encoding of the system is UTF-8. (For example, much of the code assumes that literal strings with extended characters will be encoded as UTF-8.) That is not necessarily true, especially on Windows with R < 4.1. If you don't have access to a Windows machine, you can still try setting the locale of a session to something that's not UTF-8, as described here: https://gist.github.com/wch/3a629cfe575846a14207
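As one way to exercise that, here is a sketch of temporarily switching a session to a non-UTF-8 locale; the locale name is platform-dependent and may not be installed everywhere:

```r
# Sketch: switch LC_CTYPE to a non-UTF-8 locale to surface encoding
# assumptions, then restore. Sys.setlocale() returns "" if the locale
# is unavailable on this machine.
old <- Sys.getlocale("LC_CTYPE")
Sys.setlocale("LC_CTYPE", "en_US.ISO8859-1")
l10n_info()                      # the `UTF-8` field shows whether it took effect
Sys.setlocale("LC_CTYPE", old)   # restore the original locale
```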
`safe_compress` calls `rawToChar(as.raw(result))`, but `safe_decompress` calls `intToUtf8(result)`. Just curious, why are those different? I believe the C++ code returns a `vector<u16string>` in both cases. When you use `rawToChar()`, I believe you also have to explicitly set the encoding of the result to UTF-8, as is done here: https://github.com/rstudio/htmltools/blob/038ef7be/tests/testthat/test-template.R#L38-L39 Otherwise, the resulting string will be assumed to be using the native encoding. All of this will be more clear if you test on a platform where the native encoding is not UTF-8.
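A minimal sketch of the pattern described above, with an illustrative two-byte example:

```r
# The bytes are valid UTF-8, but rawToChar() does not mark the result,
# so declare the encoding explicitly.
bytes <- as.raw(c(0xc3, 0xa9))   # the UTF-8 bytes for "é"
s <- rawToChar(bytes)
Encoding(s) <- "UTF-8"           # otherwise the native encoding is assumed
```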
If you create a string with `\Uxxxx`, the string will definitely be encoded as UTF-8.
The files https://github.com/rstudio/htmltools/blob/038ef7be/tests/testthat/test-tags.r and https://github.com/rstudio/htmltools/blob/038ef7be/tests/testthat/test-template.R have some examples of creating strings from raw vectors and explicitly setting the encoding. This will ensure that test strings have a particular encoding across platforms.
I would not suggest using the Shiny workflows for this -- probably better to use something standard from https://github.com/r-lib/actions.
First of all, I really appreciate the collaboration that's occurring here in this thread! I think there's a lot of value in bringing the lz-string algorithm to R via C++.
I don't want to get in the way or forestall the progress being made on that front. That said, I had done some work on this already using the original lz-string.js implementation called via V8. With this approach, performing the lzstring compression/decompression is quite easy and is performant, at least in my testing so far.
I've put the work into the feat/encode-decode-url branch, along with a proof-of-concept `url_encode_dir()` function. (Initial changes using V8.)
We could easily start with this approach and replace the encoding/decoding step with functions from lzstringr when it's ready and published. It's also worth noting that the string encoding step is a small part of the overall work needed for encoding shinylive URLs. About 80% of the work lies in designing and implementing the R API for specifying the app bundle (plus documentation, testing, etc.).
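For context, the V8 route described above might look roughly like the following sketch; the file path is an assumption, and this is not the actual code from the branch:

```r
# Hypothetical sketch: load the upstream lz-string.js into a V8 context
# and call it from R. Assumes a local copy of lz-string.js.
library(V8)
ctx <- v8()
ctx$source("lz-string.js")  # path is an assumption
ctx$call("LZString.compressToEncodedURIComponent", "hello world")
```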
Thanks for sharing that, @gadenbuie. That's awesome that the JS version works via V8. Expanded the unit tests and switched over to the standard GH workflow. I had problems converting the UTF-16 integer vector, which had null bytes due to spaces, into a raw vector. Here's an example after I compressed using `compressToBase64()`, which worked fine:
```r
> library(lzstring)
> c_string <- compressToBase64("Žluťoučký kůň úpěl ďábelské ódy!")
> c_string
[1] "r6ABsK6KaAD2aLCADWBfgBPQ9oCAlAZAvgDobEARlB4QAEOAjAUxAGd4BL5AZ4BMBPAQiA=="
> string <- enc2utf8(c_string)
> string <- iconv(string, from="UTF-8", to="UTF-16", toRaw=TRUE)[[1]]
> result <- decompressFromBase64_(string) # using exported C++ function
> result
 [1] 381 108 117 357 111 117 269 107 253
[10]  32 107 367 328  32 250 112 283 108
[19]  32 271 225  98 101 108 115 107 233
[28]  32 243 100 121  33
> as.raw(result)
 [1] 00 6c 75 00 6f 75 00 6b fd 20 6b 00
[13] 00 20 fa 70 00 6c 20 00 e1 62 65 6c
[25] 73 6b e9 20 f3 64 79 21
Warning message:
out-of-range values treated as 0 in coercion to raw
> rawToChar(as.raw(result))
Error in rawToChar(as.raw(result)) :
  embedded nul in string: '\0lu\0ou\0k\xfd k\0\0 \xfap\0l \0\xe1belsk\xe9 \xf3dy!'
In addition: Warning message:
In rawToChar(as.raw(result)) :
  out-of-range values treated as 0 in coercion to raw
```
`intToUtf8()` seemed to resolve that problem; however, it wasn't able to handle some edge cases involving emoji, like "𐐷𐑌 – Mathematical symbols: ∑ ∫, Emoji: 😊, Arabic: العربية, Hebrew: עברית". After some searching, it looks like null bytes (`\0`) in the data are interpreted as end-of-string markers by the `rawToChar()` function. I got a solution with some help from ChatGPT that seems to work. It uses UTF-16 surrogate pairs to do the character conversion: https://github.com/parmsam/lzstring-r/blob/main/R/lzstringr-package.R#L6-L49
On the topic of different operating systems, I'm not sure what to do about a `'uchar.h' file not found` error I get when trying to run R-CMD-check on macOS latest: https://github.com/parmsam/lzstring-r/actions/runs/8665978631/job/23765811813 It seems that Unicode support header is missing. It might require a rewrite to avoid using it. Update: all the operating systems now pass after I removed the inclusion of `uchar.h` in the C++ implementation.
@gadenbuie / @wch is there a reason for needing a custom R implementation? Why not directly embed the lz-string JS assets alongside the app? This would allow modifications to the cell to be included when sending over to the Shiny REPL and save some bandwidth.
> @gadenbuie / @wch is there a reason for needing a custom R implementation?
We don't necessarily need a custom lzstring implementation (although it'd be nice to have one) but we do need to be able to call lzstring from R. The goal of this feature request is to provide a method that (from R) composes a shinylive.io URL for an app comprised of local files.
It's confusing because we use "shinylive" for several different contexts. The context here is local files → R → shinylive.io
@wch / @gadenbuie lzstring is on CRAN now: https://cran.r-project.org/web/packages/lzstring/index.html
Similar to the Python feature that was added in 0.2.0, please consider adding shinylive URL encode and shinylive URL decode R functions to encode local apps into a shinylive.io URL or decode a shinylive.io URL into local files.
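To make the request concrete, here is a purely illustrative sketch of what an encoder might look like. The function name, JSON shape, and URL format are guesses based on this thread, not a confirmed shinylive or lzstring API:

```r
# Hypothetical sketch: bundle local app files as JSON, compress with the
# CRAN lzstring package, and compose a shinylive.io-style URL. All names
# and the URL layout here are assumptions for illustration only.
library(lzstring)
library(jsonlite)

encode_shinylive_url <- function(dir) {
  files <- lapply(list.files(dir), function(f) {
    list(name = f,
         content = paste(readLines(file.path(dir, f)), collapse = "\n"))
  })
  json <- toJSON(files, auto_unbox = TRUE)
  paste0("https://shinylive.io/r/editor/#code=",
         compressToEncodedURIComponent(as.character(json)))
}
```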