rundel / diffmatchpatch

R wrapper for Google's diff-match-patch library
https://rundel.github.io/diffmatchpatch/

encoding problem #1

Open agricolamz opened 1 month ago

agricolamz commented 1 month ago

Hi, thank you for the package. I tried to use it with Cyrillic script (here is a Tabasaran example); however, I've got the following problem...

library(diffmatchpatch)
diff_make("БицIидимиди швушв гъахирна гвачIнимиди уьл гъипIур швумал даршул",
          "БицIидимиди швушв гъахирна гвачIнимиди уьл гъипIур швумал дашул")

#> Error in gsub(st$close, st$open, txt, fixed = TRUE) : 
#>   input string 1 is invalid in this locale

The result can still be assigned to a variable, and the contents of the table clearly show the encoding problem:

image

I've never experienced any encoding problems on my Linux machine, and I didn't find any encoding calls in your Rcpp code (however, I don't know Rcpp). Here are some more details:

Linux Mint
R 4.4.1
diffmatchpatch v. 0.1.0
Sys.getlocale()
#> [1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=ru_RU.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"

alofting commented 1 month ago

I ran into a similar problem:

diffmatchpatch::diff_make('€', '$')$text
#> [1] "â‚¬" "$"

The text seems to be returned as UTF-8 by the underlying library, but interpreted by diff_make as Windows 1252, so the single character is turned into three. A simple recode returns the proper characters in this case:

diffmatchpatch::diff_make('\u{20ac}', '$')$text |> stringi::stri_encode("UTF-8", "Windows 1252")
#> [1] "€" "$"

I found the solution after having read this

I tried your Cyrillic text, but could not find an encoding that works in place of 'Windows 1252'.
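
To make the mechanism above concrete, here is a minimal base R sketch of the same round trip. It does not call diffmatchpatch at all; it only assumes that the mangling is "UTF-8 bytes read as Windows 1252", as described in the comment, and that the platform's iconv knows the name "CP1252".

```r
# Sketch only: reproduces the suspected mis-decoding with base R.
euro <- enc2utf8("\u20ac")          # the euro sign, stored as UTF-8
charToRaw(euro)                     # three bytes: e2 82 ac

# Misread the three UTF-8 bytes as three Windows-1252 characters.
mangled <- iconv(euro, from = "CP1252", to = "UTF-8")
nchar(mangled)                      # 3 characters instead of 1

# Undo it: write the three characters back out as Windows-1252 bytes,
# then declare that those bytes are really UTF-8.
fixed <- iconv(mangled, from = "UTF-8", to = "CP1252")
Encoding(fixed) <- "UTF-8"
identical(charToRaw(fixed), charToRaw(euro))
```

The second iconv() call plays the same role as the stri_encode("UTF-8", "Windows 1252") call above. It can only succeed when every mangled character exists in Windows 1252.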

agricolamz commented 1 month ago

Thanks for trying! I knew there was an encoding problem, but I hadn't thought to re-encode the result. That is a great idea.

agricolamz commented 1 month ago

Oh no! Now I can't reproduce my problem -- the function crashes with the following message:

Error in gsub(st$close, st$open, txt, fixed = TRUE) : 
  input string 1 is invalid in this locale

Linux Mint
R 4.4.1
diffmatchpatch v. 0.1.0
Sys.getlocale()
#> [1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=ru_RU.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"

And your code runs with no problem:

> diffmatchpatch::diff_make('€', '$')$text
[1] "€" "$"

alofting commented 3 weeks ago

My default R LC_CTYPE locale is 'English_United States.1252' (not set in .Rprofile), whereas reprex uses 'English_United States.utf8' by default.

With LC_CTYPE set to 'English_United States.1252', diffmatchpatch mangles anything that is not strictly 'Latin1'. Changing LC_CTYPE to 'en_US.UTF-8' extends the set of recognized characters, but not far enough to cover your example (it does work fine on https://neil.fraser.name/software/diff_match_patch/demos/diff.html, but that probably uses the JavaScript implementation).

The error you are getting is in the print method for the returned diff_df object.

Sys.setlocale(category = 'LC_CTYPE', 'English_United States.1252')
#> Warning in Sys.setlocale(category = "LC_CTYPE", "English_United States.1252"):
#> using locale code page other than 65001 ("UTF-8") may cause problems
#> [1] "English_United States.1252"

diffmatchpatch::diff_make('€', '$')$text
#> [1] "â‚¬" "$"

diffmatchpatch::diff_make('БицIидимиди швушв гъахирна гвачIнимиди уьл гъипIур швумал даршул', 'БицIидимиди швушв гъахирна гвачIнимиди уьл гъипIур швумал дашул')$text
#> [1] "БицIидимиди Ñ\210вуÑ\210в гъахиÑ\200на гвачIнимиди уьл гъипIуÑ\200 Ñ\210вумал даÑ"
#> [2] "\200Ñ"                                                                                                             
#> [3] "\210ул"

Sys.setlocale(category = 'LC_CTYPE', 'en_US.UTF-8')
#> [1] "en_US.UTF-8"

diffmatchpatch::diff_make('€', '$')$text
#> [1] "€" "$"

diffmatchpatch::diff_make('БицIидимиди швушв гъахирна гвачIнимиди уьл гъипIур швумал даршул', 'БицIидимиди швушв гъахирна гвачIнимиди уьл гъипIур швумал дашул')$text
#> [1] "БицIидимиди швушв гъахирна гвачIнимиди уьл гъипIур швумал да\xd1"
#> [2] "\x80\xd1"                                                        
#> [3] "\x88ул"
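
One possible reading of the UTF-8 output above is that the diff boundaries fall on bytes rather than on characters. The base R sketch below only compares the UTF-8 bytes of the last word in each string; it does not call diffmatchpatch, and the byte-wise comparison is an assumption about what happens inside the library, not something confirmed by its source.

```r
# Sketch only: byte-level common prefix of the two differing words.
old <- enc2utf8("\u0434\u0430\u0440\u0448\u0443\u043b")  # "даршул"
new <- enc2utf8("\u0434\u0430\u0448\u0443\u043b")        # "дашул"

old_bytes <- charToRaw(old)
new_bytes <- charToRaw(new)

# Find the first position where the byte sequences differ.
n <- min(length(old_bytes), length(new_bytes))
mismatch <- which(old_bytes[seq_len(n)] != new_bytes[seq_len(n)])
prefix_len <- if (length(mismatch)) mismatch[1] - 1L else n

# The shared prefix is "да" plus a lone lead byte 0xd1, because the
# two-byte sequences for the third character (d1 80 in old, d1 88 in new)
# share their first byte.
prefix <- rawToChar(old_bytes[seq_len(prefix_len)])
old_bytes[seq_len(prefix_len)]
validUTF8(prefix)   # FALSE: the piece ends in the middle of a character
```

If this reading is right, each of the three pieces in the output is an invalid UTF-8 string on its own, which would also explain why a function such as gsub() rejects them in a UTF-8 locale.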