posit-dev / r-shinylive

https://posit-dev.github.io/r-shinylive/
Other
147 stars 15 forks source link

Add functions to encode and decode shinylive URLs #70

Open parmsam-pfizer opened 3 months ago

parmsam-pfizer commented 3 months ago

Similar to the Python feature that was added in 0.2.0, please consider adding shinylive URL encode and shinylive URL decode R functions to encode local apps into a shinylive.io URL or decode a shinylive.io URL into local files.

wch commented 3 months ago

The encoding and decoding requires the lzstring library, but as far as I can tell, there's not an lzstring implementation for R.

Here are some possibilities, although any of them will take some effort:

parmsam commented 3 months ago

Put together a wrapper based on your PR in an R package here. This is my first time using {cpp11} so I might've missed a few things. Some of the unit tests I added might be helpful. Note that there was one unit test that I was unable to port over from the C++ library over to the R package here. Would be curious to hear your thoughts about handling that case.

wch commented 3 months ago

Great! Are you planning on submitting the package to CRAN and maintain it in the future?

Some thoughts:

parmsam commented 3 months ago

Thanks! Fixed the unit test and added a condition into the exported functions to ensure UTF-8 encoding in the string being passed through.

I see what you mean. Looks like the std::wstring_convert class was deprecated in C++17. It hasn't been removed yet though. Similar situation for std::codecvt_utf8_utf16. Are there other classes that were removed? Would you recommend switching over to Rcpp instead, since it looks like cpp11 only supports C++11?

wch commented 3 months ago

Instead of throwing if the string is not UTF-8, I think you should use enc2utf8(), so that it just works for users. We do something similar in Shiny: https://github.com/rstudio/shiny/blob/420a2c0/R/utils.R#L1399-L1402

I don't know the details of what exactly was removed in C++20. As of R 4.0, the minimum C++ version that is supported is C++11, so this will have to work on C++11 through C++20. I think the existing code is OK on C++11 and 14, and you'll need a different strategy for C++17 and 20. And of course you'll have to do something for Windows. (You should be able to use Github Actions to test on Windows.)

I definitely think you should not switch to Rcpp. Rcpp has been around for longer and cpp11 supports newer C++ features than Rcpp. The minimum C++ version for cpp11 is C++11, but it will work with newer versions as well. cpp11 is also a lighter-weight dependency. See https://github.com/r-lib/cpp11/?tab=readme-ov-file#motivations

parmsam commented 3 months ago

You're right. Unfortunately, it looks like std::wstring_convert is missing for C++17 in Ubuntu OS for the R-CMD-check workflow. For example, for the Ubuntu latest release job, it shows the following error:

using C++ compiler: ‘g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0’
g++ -std=gnu++17 -I"/opt/R/4.3.3/lib/R/include" -DNDEBUG  -I'/home/runner/work/_temp/Library/cpp11/include' -I/usr/local/include    -fpic  -g -O2  -c code.cpp -o code.o
code.cpp: In function ‘std::string compressToEncodedURIComponent_(std::string)’:
code.cpp:8:8: error: ‘wstring_convert’ is not a member of ‘std’

I haven't identified a drop in replacement for UTF-8 to UTF-16 conversion and vis-versa in C++ yet. There doesn't seem to be an easy fix for it.

wch commented 3 months ago

I think you might be able to use R's iconv() function to convert to UTF-16. It looks like there's UTF-16, which has a BOM, UTF-16LE, and UTF-16BE (I don't know if there are others).

iconv("abcdefg", from="UTF-8", to="UTF-16", toRaw=TRUE)
#> [[1]]
#>  [1] fe ff 00 61 00 62 00 63 00 64 00 65 00 66 00 67

iconv("abcdefg", from="UTF-8", to="UTF-16LE", toRaw=TRUE)
#> [[1]]
#>  [1] 61 00 62 00 63 00 64 00 65 00 66 00 67 00

iconv("abcdefg", from="UTF-8", to="UTF-16BE", toRaw=TRUE)
#> [[1]]
#>  [1] 00 61 00 62 00 63 00 64 00 65 00 66 00 67

Maybe you can encode it as UTF-16 with BOM in R, then send it in to the function as a raw vector, then in C++ convert that to a UTF-16 string. I asked Claude AI to come up with some code for the C++ side and posted it here: https://gist.github.com/wch/90a2d7446e00c9a5aef09d0f2fe01c73

parmsam commented 3 months ago

Great news! The checks for nearly all of the operating systems in the GH check standard workflow are passing now (except macOS-latest). have merged to my main branch and updated my R-CMD-check to point to the same one used in this repo: https://github.com/parmsam/lzstring-r/pull/4

wch commented 3 months ago

Good progress! A few thing I noticed when I took a quick look:

gadenbuie commented 3 months ago

First of all, I really appreciate the collaboration that's occurring here in this thread! I think there's a lot of value in bringing the lz-string algorithm to R via c++. It's refreshing to see the collaboration happening!

I don't want to get in the way or forestall the progress being made on that front. That said, I had done some work on this already using the original lz-string.js implementation called via V8. With this approach, performing the lzstring compression/decompression is quite easy and is performant, at least in my testing so far.

I've put the work into the feat/encode-decode-url branch, along with a proof-of-concept url_encode_dir() function. (Initial changes using v8.)

We could easily start with this approach and replace the encoding/decoding step with functions from lzstringr when its ready and published. It's also worth noting that the string encoding step is a small part of the overall work needed for encoding shinylive URLs. About 80% of the work lies in designing and implementing the R API for specifying the app bundle (plus documentation and testing, etc.).

parmsam commented 3 months ago

Thanks for sharing that @gadenbuie. That's awesome that the JS version works via V8. Expanded the unit tests and switched over to the standard GH workflow. I had problems converting the UTF-16 integer vector that had null bytes due to spaces into a raw vector. Here's an example after I compressed using compressToBase64() which worked fine:

> library(lzstring)
> c_string <- compressToBase64("Žluťoučký kůň úpěl ďábelské ódy!")
> c_string 
[1] "r6ABsK6KaAD2aLCADWBfgBPQ9oCAlAZAvgDobEARlB4QAEOAjAUxAGd4BL5AZ4BMBPAQiA=="
> string <- enc2utf8(c_string)
> string <- iconv(string, from="UTF-8", to="UTF-16", toRaw=TRUE)[[1]]
> result <- decompressFromBase64_(string) #using exported C++ function
> result
 [1] 381 108 117 357 111 117 269 107 253
[10]  32 107 367 328  32 250 112 283 108
[19]  32 271 225  98 101 108 115 107 233
[28]  32 243 100 121  33
> as.raw(result)
 [1] 00 6c 75 00 6f 75 00 6b fd 20 6b 00
[13] 00 20 fa 70 00 6c 20 00 e1 62 65 6c
[25] 73 6b e9 20 f3 64 79 21
Warning message:
out-of-range values treated as 0 in coercion to raw 
> rawToChar(as.raw(result))
Error in rawToChar(as.raw(result)) : 
  embedded nul in string: '\0lu\0ou\0k\xfd k\0\0 \xfap\0l \0\xe1belsk\xe9 \xf3dy!'
In addition: Warning message:
In rawToChar(as.raw(result)) :
  out-of-range values treated as 0 in coercion to raw

intToUTF8() seemed to resolve that problem, however it wasn't able to handle some edge cases that involved emojis like "𐐷𐑌 – Mathematical symbols: ∑ ∫, Emoji: 😊, Arabic: العربية, Hebrew: עברית". After some searching, it looks like null bytes (\0) in the data are interpreted as end-of-string markers in the rawToChar() function. Got a solution with some help from ChatGPT that seems to work. It uses UTF-16 surrogate pairs to do the character conversion: https://github.com/parmsam/lzstring-r/blob/main/R/lzstringr-package.R#L6-L49

On the topic of different operating systems, not sure what to do about a 'uchar.h' file not found error I get when trying to run the R-CMD-CHECK on Mac OS latest: https://github.com/parmsam/lzstring-r/actions/runs/8665978631/job/23765811813 It seems that unicode support file is missing. It might require a rewrite to avoid using it. Update: All the operating systems now pass after I removed the inclusion of "uchar.h" in the C++ implementation.

coatless commented 3 months ago

@gadenbuie / @wch is there a reason for needing a custom R implementation? Why not directly embed lz-string JS assets alongside the app? This would allow modifications to cell to be included when sending over to the shiny REPL and save some bandwidth.

gadenbuie commented 3 months ago

gadenbuie / @wch is there a reason for needing a custom R implementation?

We don't necessarily need a custom lzstring implementation (although it'd be nice to have one) but we do need to be able to call lzstring from R. The goal of this feature request is to provide a method that (from R) composes a shinylive.io URL for an app comprised of local files.

It's confusing because we use "shinylive" for several different contexts. The context here is local files → R → shinylive.io

parmsam commented 2 months ago

@wch / @gadenbuie lzstring is on CRAN now: https://cran.r-project.org/web/packages/lzstring/index.html