agricolamz opened 4 months ago
I ran into a similar problem:
diffmatchpatch::diff_make('€', '$')$text
#> [1] "€" "$"
The text seems to be returned as UTF-8 by the underlying library, but interpreted by diff_make as Windows-1252 and turned into three characters. A simple recode returns the proper characters in this case:
diffmatchpatch::diff_make('\u{20ac}', '$')$text |> stringi::stri_encode("UTF-8", "Windows 1252")
#> [1] "€" "$"
I found the solution after having read this
I tried your Cyrillic text, but could not find an encoding that works in place of 'Windows 1252'.
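For anyone who wants to see the mechanism without the package, here is a minimal base R sketch of what the recode above appears to undo. It does not call diffmatchpatch at all; it only assumes that UTF-8 bytes get read as Windows-1252 somewhere along the way. The variable names are made up for illustration, and it uses base `iconv()` rather than stringi.

```r
# Euro sign as a UTF-8 string; its bytes are e2 82 ac.
euro <- "\u20ac"
bytes <- charToRaw(enc2utf8(euro))

# Simulate the suspected bug: read the UTF-8 bytes as if they were
# Windows-1252, then convert that misreading to UTF-8.
mangled <- iconv(rawToChar(bytes), from = "WINDOWS-1252", to = "UTF-8")
nchar(mangled)  # one character has become three

# Undo it: convert back to Windows-1252 bytes and declare them UTF-8.
restored <- iconv(mangled, from = "UTF-8", to = "WINDOWS-1252")
Encoding(restored) <- "UTF-8"
identical(charToRaw(restored), charToRaw(euro))
```

This round trip only works when every byte of the UTF-8 text has a mapping in the single-byte encoding, which may be part of why it is harder to find a working encoding for the Cyrillic example.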
Thanks for trying! I knew there was an encoding problem, but I hadn't thought of recoding the result. That is a great idea.
Oh no! Now I can't reproduce my problem -- the function crashes with the following message:
Error in gsub(st$close, st$open, txt, fixed = TRUE) :
input string 1 is invalid in this locale
Linux Mint
R 4.4.1
diffmatchpatch v. 0.1.0
Sys.getlocale()
#> [1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=ru_RU.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"
And your code runs with no problem:
> diffmatchpatch::diff_make('€', '$')$text
[1] "€" "$"
My default R LC_CTYPE locale is 'English_United States.1252' (not set in .Rprofile), whereas reprex defaults to 'English_United States.utf8'.
With LC_CTYPE set to 'English_United States.1252', diffmatchpatch mangles anything that is not strictly Latin-1. Changing LC_CTYPE to 'en_US.UTF-8' extends the set of recognized characters, but still does not cover your example (it works fine on https://neil.fraser.name/software/diff_match_patch/demos/diff.html, but that probably uses the JavaScript implementation).
The error you are getting is in the print method for the returned diff_df object.
Sys.setlocale(category = 'LC_CTYPE', 'English_United States.1252')
#> Warning in Sys.setlocale(category = "LC_CTYPE", "English_United States.1252"):
#> using locale code page other than 65001 ("UTF-8") may cause problems
#> [1] "English_United States.1252"
diffmatchpatch::diff_make('€', '$')$text
#> [1] "€" "$"
diffmatchpatch::diff_make('БицIидимиди швушв гъахирна гвачIнимиди уьл гъипIур швумал даршул', 'БицIидимиди швушв гъахирна гвачIнимиди уьл гъипIур швумал дашул')$text
#> [1] "БицIидимиди Ñ\210вуÑ\210в гъахиÑ\200на гвачIнимиди уьл гъипIуÑ\200 Ñ\210вумал даÑ"
#> [2] "\200Ñ"
#> [3] "\210ул"
Sys.setlocale(category = 'LC_CTYPE', 'en_US.UTF-8')
#> [1] "en_US.UTF-8"
diffmatchpatch::diff_make('€', '$')$text
#> [1] "€" "$"
diffmatchpatch::diff_make('БицIидимиди швушв гъахирна гвачIнимиди уьл гъипIур швумал даршул', 'БицIидимиди швушв гъахирна гвачIнимиди уьл гъипIур швумал дашул')$text
#> [1] "БицIидимиди швушв гъахирна гвачIнимиди уьл гъипIур швумал да\xd1"
#> [2] "\x80\xd1"
#> [3] "\x88ул"
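The split points in that last output (`\xd1`, then `\x80\xd1`, then `\x88`) look like what a byte-level diff would produce. In UTF-8, "р" is `d1 80` and "ш" is `d1 88`, so deleting "р" in front of "ш" leaves a common prefix that ends in the middle of a character. The sketch below only illustrates that byte arithmetic in base R. It is an assumption about the cause, not a reading of the package's C++ code, and the variable names are made up.

```r
# "рш" (as in "даршул") versus "ш" (as in "дашул"), as raw UTF-8 bytes.
a <- charToRaw(enc2utf8("\u0440\u0448"))  # d1 80 d1 88
b <- charToRaw(enc2utf8("\u0448"))        # d1 88

# Length of the common prefix when comparing byte by byte.
n <- min(length(a), length(b))
prefix_len <- sum(cumprod(a[seq_len(n)] == b[seq_len(n)]))
prefix_len

# The shared prefix is the lone lead byte d1, which is not valid UTF-8
# on its own. That matches the trailing "\xd1" in the first chunk.
shared <- rawToChar(a[seq_len(prefix_len)])
validUTF8(shared)
```

If that is indeed what happens, a diff over characters (code points) instead of bytes would not split "р" and "ш" this way.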
Hi, thank you for the package. I tried to use it with Cyrillic script (here is a Tabasaran example); however, I've got the following problem...
The result can be assigned to a variable, and the contents of the table clearly show the encoding problem:
I've never experienced any encoding problems on my Linux machine, and I didn't find any encoding calls in your Rcpp code (however, I don't know Rcpp). Here are some more details: