rundel / diffmatchpatch

R wrapper for Google's diff-match-patch library

encoding problem #1

Open agricolamz opened 1 month ago

agricolamz commented 1 month ago

Hi, thank you for the package. I tried to use it with Cyrillic script (here is a Tabasaran example); however, I've got the following problem...

diff_make("БицIидимиди швушв гъахирна гвачIнимиди уьл гъипIур швумал даршул",
          "БицIидимиди швушв гъахирна гвачIнимиди уьл гъипIур швумал дашул")

#> Error in gsub(st$close, st$open, txt, fixed = TRUE) : 
#>   input string 1 is invalid in this locale

The result can still be assigned to a variable, and the contents of the resulting table clearly show the encoding problem:


I've never experienced any encoding problems on my Linux machine, and I didn't find any encoding calls in your Rcpp code (however, I don't know Rcpp). Here are some more details:

Linux Mint
R 4.4.1
diffmatchpatch v. 0.1.0

alofting commented 1 month ago

I ran into a similar problem:

diffmatchpatch::diff_make('€', '$')$text
#> [1] "€" "$"

The text seems to be returned as UTF-8 by the underlying library, but interpreted by diff_make as Windows-1252 and turned into three characters. A simple recode returns the proper characters in this case:

diffmatchpatch::diff_make('\u{20ac}', '$')$text |> stringi::stri_encode("UTF-8", "Windows 1252")
#> [1] "€" "$"
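
The misinterpretation described above can be reproduced in base R without the package. This is only a sketch of the general mechanism (UTF-8 bytes read as Windows-1252), not a statement about what diff_make does internally; the variable names are made up for illustration.

```r
# "\u20ac" is the euro sign; R stores this literal as UTF-8.
euro <- "\u20ac"
charToRaw(euro)  # e2 82 ac

# Read those three UTF-8 bytes as if they were Windows-1252 text.
mangled <- iconv(euro, from = "CP1252", to = "UTF-8")
nchar(mangled)   # three characters instead of one

# Undo the mistake: write the characters back out as Windows-1252 bytes,
# then declare that those bytes are UTF-8.
restored <- iconv(mangled, from = "UTF-8", to = "CP1252")
Encoding(restored) <- "UTF-8"
identical(restored, euro)
```

The round trip only works when every byte maps to a Windows-1252 character, which may be why the same recode does not recover arbitrary Cyrillic text.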

I found the solution after having read this

I tried your Cyrillic text, but could not find an encoding that works in place of 'Windows 1252'.

agricolamz commented 1 month ago

Thanks for trying! I knew there was an encoding problem, but I hadn't thought of recoding the result. That is a great idea.

agricolamz commented 1 month ago

Oh no! Now I can't reproduce my problem -- the function crashes with the following message:

Error in gsub(st$close, st$open, txt, fixed = TRUE) : 
  input string 1 is invalid in this locale

Linux Mint
R 4.4.1
diffmatchpatch v. 0.1.0

And your code runs with no problem:

> diffmatchpatch::diff_make('€', '$')$text
[1] "€" "$"

alofting commented 3 weeks ago

My default R LC_CTYPE locale is 'English_United States.1252' (not set in .Rprofile), whereas reprex has 'English_United States.utf8' as its default.
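
For anyone comparing setups, the locale of a session can be checked with base R alone. A minimal sketch; the printed values depend on the machine, so none are shown here.

```r
# Current character-type locale of this R session.
ctype <- Sys.getlocale("LC_CTYPE")

# l10n_info() reports, among other things, whether the session is UTF-8.
info <- l10n_info()
is_utf8 <- isTRUE(info[["UTF-8"]])

message("LC_CTYPE: ", ctype, "; UTF-8 session: ", is_utf8)
```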

With LC_CTYPE set to 'English_United States.1252', diffmatchpatch mangles anything that is not strictly Latin-1. Changing LC_CTYPE to 'en_US.UTF-8' extends the set of recognized characters, but not to your example (it does work fine on, but that probably uses the JavaScript implementation).

The error you are getting is in the print method for the returned diff_df object.
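
Until that is fixed, one way to see which pieces would trip up printing is to test them with validUTF8() and substitute the offending bytes. This is a base R sketch on a hypothetical stand-in vector, not on a real diff_df object, and it only makes the strings printable; it does not repair the diff.

```r
# Hypothetical stand-in for a $text column: the second element ends in a
# lone UTF-8 lead byte (0xd1), similar to the split pieces in this thread.
pieces <- c("ok", rawToChar(as.raw(c(0x64, 0x61, 0xd1))))

# Flag elements that are not valid UTF-8.
valid <- validUTF8(pieces)

# Replace invalid bytes with a visible <xx> hex marker so the strings
# can be printed or passed to functions that reject invalid input.
safe <- iconv(pieces, from = "UTF-8", to = "UTF-8", sub = "byte")
safe
```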

Sys.setlocale(category = 'LC_CTYPE', 'English_United States.1252')
#> Warning in Sys.setlocale(category = "LC_CTYPE", "English_United States.1252"):
#> using locale code page other than 65001 ("UTF-8") may cause problems
#> [1] "English_United States.1252"

diffmatchpatch::diff_make('€', '$')$text
#> [1] "€" "$"

diffmatchpatch::diff_make('БицIидимиди швушв гъахирна гвачIнимиди уьл гъипIур швумал даршул', 'БицIидимиди швушв гъахирна гвачIнимиди уьл гъипIур швумал дашул')$text
#> [1] "БицIидимиди Ñ\210вуÑ\210в гъахиÑ\200на гвачIнимиди уьл гъипIуÑ\200 Ñ\210вумал даÑ"
#> [2] "\200Ñ"                                                                                                             
#> [3] "\210ул"

Sys.setlocale(category = 'LC_CTYPE', 'en_US.UTF-8')
#> [1] "en_US.UTF-8"

diffmatchpatch::diff_make('€', '$')$text
#> [1] "€" "$"

diffmatchpatch::diff_make('БицIидимиди швушв гъахирна гвачIнимиди уьл гъипIур швумал даршул', 'БицIидимиди швушв гъахирна гвачIнимиди уьл гъипIур швумал дашул')$text
#> [1] "БицIидимиди швушв гъахирна гвачIнимиди уьл гъипIур швумал да\xd1"
#> [2] "\x80\xd1"                                                        
#> [3] "\x88ул"
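
The three pieces in the UTF-8 run look like the result of comparing the inputs byte by byte rather than character by character: "р" and "ш" share the same first byte in UTF-8, so a byte-level common prefix ends in the middle of a character. The sketch below illustrates that reading in base R only; it is an assumption about the cause, not a confirmed description of the package internals, and the variable names are made up.

```r
# UTF-8 bytes of the two characters involved in "даршул" vs "дашул".
r_bytes  <- charToRaw("\u0440")  # Cyrillic "р": d1 80
sh_bytes <- charToRaw("\u0448")  # Cyrillic "ш": d1 88

# "рш" in the first string lines up against "ш" in the second.
old <- c(r_bytes, sh_bytes)  # d1 80 d1 88
new <- sh_bytes              # d1 88

# Length of the common prefix when comparing byte by byte.
n <- min(length(old), length(new))
mismatch <- which(old[seq_len(n)] != new[seq_len(n)])
prefix_len <- if (length(mismatch) > 0) mismatch[1] - 1L else n

# The shared prefix is a lone lead byte, not a complete character.
prefix <- rawToChar(old[seq_len(prefix_len)])
validUTF8(prefix)
```

Under that reading, the trailing "\xd1" of the first piece, the "\x80\xd1" of the second, and the leading "\x88" of the third are the halves of "р" and "ш".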