north-road / qgis-processing-r

QGIS Processing R Provider Plugin
https://north-road.github.io/qgis-processing-r/
GNU General Public License v3.0

UnicodeDecodeError #64

Open tohka opened 4 years ago

tohka commented 4 years ago

Hi,

An error occurs when the selected layer has Japanese characters in its file path. The likely cause is that the R script is generated in UTF-8, but R tries to interpret it as Shift_JIS / CP932 (the Japanese Windows encoding). OS: Windows 10 (locale: Japanese).

Sample code

##Test=group
##Layer structure=name
##Layer=vector
str(Layer)

and log

QGIS version: 3.12.0-București
QGIS code revision: cd141490ec
Qt version: 5.11.2
GDAL version: 3.0.4
GEOS version: 3.8.0-CAPI-1.13.1 
PROJ version: Rel. 6.3.1, February 10th, 2020
R version: 2.0.0
Processing algorithm…
Algorithm 'Layer structure' starting…
Input parameters:
{ 'Layer' : 'C:/Users/username/Documents/サンプル.gpkg|layername=サンプル' }

R execution commands
options("repos"="http://cran.at.r-project.org/")
.libPaths("C:/Users/username/AppData/Roaming/QGIS/QGIS3/profiles/default/processing/rlibs")
tryCatch(find.package("sf"), error = function(e) install.packages("sf", dependencies=TRUE))
library("sf")
tryCatch(find.package("raster"), error = function(e) install.packages("raster", dependencies=TRUE))
library("raster")
Layer <- st_read("C:/Users/username/Documents/サンプル.gpkg", layer = "サンプル", quiet = TRUE, stringsAsFactors = FALSE)
str(Layer)

R execution console output
[1] "C:/Users/username/bin/R-Portable/App/R-Portable/library/sf"
Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
[1] "C:/Users/username/AppData/Roaming/QGIS/QGIS3/profiles/default/processing/rlibs/raster"
要求されたパッケージ sp をロード中です
警告メッセージ:
パッケージ 'raster' はバージョン 3.6.3 の R の下で造られました
Traceback (most recent call last):
File "C:/Users/username/AppData/Roaming/QGIS/QGIS3\profiles\default/python/plugins\processing_r\processing\algorithm.py", line 317, in processAlgorithm
output = RUtils.execute_r_algorithm(self, parameters, context, feedback)
File "C:/Users/username/AppData/Roaming/QGIS/QGIS3\profiles\default/python/plugins\processing_r\processing\utils.py", line 258, in execute_r_algorithm
for line in iter(proc.stdout.readline, ''):
UnicodeDecodeError: 'cp932' codec can't decode byte 0xef in position 3: illegal multibyte sequence

Execution failed after 1.39 seconds

Loading resulting layers
Algorithm 'Layer structure' finished
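
The traceback points at the Python side: the loop that reads R's console output decodes it with the locale codec (cp932 here) and stops at the first byte that codec cannot decode. A minimal sketch of a more tolerant decode, assuming the output is available as raw bytes. `decode_console_line` and its fallback order are hypothetical and are not taken from the plugin:

```python
import locale


def decode_console_line(raw: bytes, preferred: str = "") -> str:
    """Decode one line of subprocess output without raising.

    Hypothetical helper: tries UTF-8 first, then the given (or locale)
    codec, and finally replaces undecodable bytes instead of failing.
    """
    fallback = preferred or locale.getpreferredencoding(False)
    for codec in ("utf-8", fallback):
        try:
            return raw.decode(codec)
        except UnicodeDecodeError:
            continue
    # Last resort: keep the readable part and mark the bad bytes.
    return raw.decode(fallback, errors="replace")


if __name__ == "__main__":
    # Console messages may arrive in the locale codec while data printed
    # by the script may be UTF-8, so both are tried in turn.
    print(repr(decode_console_line("サンプル".encode("utf-8"), "cp932")))
    print(repr(decode_console_line("要求".encode("cp932"), "cp932")))
```

Trying UTF-8 first is a design guess: cp932 text is rarely valid UTF-8 by accident, but the order could be wrong for other locales.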

The error no longer occurs if utils.py is changed as follows:

--- utils.py.orig   Tue Apr 14 18:55:18 2020
+++ utils.py    Wed Apr 15 09:17:59 2020
@@ -238,6 +238,9 @@

         script_filename = RUtils.create_r_script_from_commands(script_lines)

+        script_lines = ['options(encoding = "UTF-8")', 'source("%s", encoding = "UTF-8")' % script_filename]
+        script_filename = RUtils.create_r_script_from_commands(script_lines)
+
         # run commands
         command = [
             RUtils.path_to_r_executable(script_executable=True),

JanCaha commented 4 years ago

Encodings in R are a genuinely problematic area. The solution you suggest is probably not a good one, because setting options(encoding = "UTF-8") can have unexpected side effects.

Could you check whether either of these scripts works? The first tests whether it is enough to pass just the path; the second explicitly converts the path to your native encoding.

##Test=group
##Layer structure=name
##pass_filenames
##Layer=vector
Layer = st_read(Layer, quiet = TRUE, stringsAsFactors = FALSE)
str(Layer)

##Test=group
##Layer structure=name
##pass_filenames
##Layer=vector
Encoding(Layer) <- "UTF-8"
Layer <- enc2native(Layer)
Layer = st_read(Layer, quiet = TRUE, stringsAsFactors = FALSE)
str(Layer)

tohka commented 4 years ago

Both of the presented scripts give an error.

Traceback (most recent call last):
File "C:/Users/username/AppData/Roaming/QGIS/QGIS3\profiles\default/python/plugins\processing_r\processing\algorithm.py", line 340, in processAlgorithm
output = RUtils.execute_r_algorithm(self, parameters, context, feedback)
File "C:/Users/username/AppData/Roaming/QGIS/QGIS3\profiles\default/python/plugins\processing_r\processing\utils.py", line 281, in execute_r_algorithm
for line in iter(proc.stdout.readline, ''):
UnicodeDecodeError: 'cp932' codec can't decode byte 0xef in position 3: illegal multibyte sequence

Execution failed after 1.38 seconds

Loading resulting layers
Algorithm 'Layer structure' finished

I saved the following script once as Shift_JIS and once as UTF-8 and ran each file with Rscript.

x <- "日本語"
print("end of script")

$ rscript test.sjis.R
[1] "end of script"

$ rscript test.utf8.R
Error: invalid multibyte character in parser at line 1
Execution halted

I only assigned a multibyte string and never evaluated it, yet the UTF-8 version failed. This suggests the encoding mismatch is hit while the R interpreter parses the script, which is why I thought an encoding option is needed when the script file is loaded.
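
For anyone who wants to repeat this experiment, the two test files can be produced from one source string. This is only a sketch: the file names follow the ones used above, and `write_script_variants` is a hypothetical helper, not part of the plugin.

```python
import tempfile
from pathlib import Path

R_SOURCE = 'x <- "日本語"\nprint("end of script")\n'


def write_script_variants(directory: Path) -> dict:
    """Write the same R source as Shift_JIS (cp932) and as UTF-8."""
    paths = {}
    for label, codec in (("sjis", "cp932"), ("utf8", "utf-8")):
        path = directory / "test.{}.R".format(label)
        # write_bytes avoids any newline or locale translation.
        path.write_bytes(R_SOURCE.encode(codec))
        paths[label] = path
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for label, path in write_script_variants(Path(tmp)).items():
            # Show the first bytes so the two encodings can be compared.
            print(label, path.name, path.read_bytes()[:12].hex())
```

Each file can then be passed to Rscript on the machine whose locale is being tested.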

JanCaha commented 4 years ago

It's all rather strange. I tested the solution you proposed, and while it worked fine with non-ASCII characters from my language (Czech), it caused an error when I used Japanese characters.

Could you please test your solution (the changes to utils.py) with a layer named ěščř.gpkg? It is just a couple of Czech-specific characters that worked for me.

I think that there might be some R setting causing the problems, most likely the locale.

tohka commented 4 years ago

Encoded as ISO 8859-2, "ěščř" is 0xEC 0xB9 0xE8 0xF8. Strictly interpreted as UTF-8, these bytes do not correspond to any characters, but as far as I can tell the sequence does not include the bytes that are unavailable in UTF-8. It does not raise an error, but I'm not sure it works correctly.

Encoded as Shift_JIS (cp932), "日本語" is 0x93 0xFA 0x96 0x7B 0x8C 0xEA. In UTF-8, bytes of the form 0x8X and 0x9X are not allowed as the leading byte of a character, so I guess that is why it fails.
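
The byte values quoted above can be checked with a few lines of Python. This is only a sketch for verification; `hex_bytes` and `is_valid_utf8` are hypothetical helper names. Note that Python's strict UTF-8 decoder rejects both sequences, so R's handling of the Czech bytes is presumably more lenient than a strict decode.

```python
def hex_bytes(text: str, codec: str) -> str:
    """Return the encoded bytes of text as '0xNN 0xNN ...'."""
    return " ".join("0x{:02X}".format(b) for b in text.encode(codec))


def is_valid_utf8(raw: bytes) -> bool:
    """True if raw decodes under Python's strict UTF-8 decoder."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    for text, codec in (("ěščř", "iso8859_2"), ("日本語", "cp932")):
        raw = text.encode(codec)
        # Print the bytes and whether they happen to be valid UTF-8.
        print(codec, hex_bytes(text, codec), is_valid_utf8(raw))
```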

JanCaha commented 4 years ago

I would guess that it is related to the issue described here: https://stackoverflow.com/questions/46946483/czech-encoding-in-r. Setting it in the .Rprofile file would make it permanent for the system. You can select one of the available code pages from here; unfortunately, the UTF encodings are not among them.

I don't see a way to solve it reasonably.

JanCaha commented 4 years ago

Looks like I found a solution that might work while not breaking anything.

Could you try replacing r_templates.py in the plugins directory with this version: https://github.com/JanCaha/qgis-processing-r/blob/bug_utf-8/processing_r/processing/r_templates.py?

It works only for sf layers for now and would need a lot of polishing before it could be used, but it seems to work. It passes the paths encoded from Python and interprets them as UTF-8 in R. On my computer, set to Czech, it works even for Japanese characters.
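
One way to pass a path from Python so that R reads it as Unicode regardless of locale is to write non-ASCII characters as R's `\u` / `\U` string escapes, which keeps the generated script pure ASCII. This is a sketch of that swapped-in technique and is not confirmed to be what the linked r_templates.py does. `r_string_literal` and `st_read_command` are hypothetical names, and control characters are not handled specially.

```python
def r_string_literal(text: str) -> str:
    """Return an ASCII-only, double-quoted R string literal for text."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if ch in ('\\', '"'):
            parts.append('\\' + ch)                  # escape R metacharacters
        elif 0x20 <= cp < 0x7F:
            parts.append(ch)                         # printable ASCII as-is
        elif cp <= 0xFFFF:
            parts.append('\\u{:04x}'.format(cp))     # BMP code point
        else:
            parts.append('\\U{:08x}'.format(cp))     # beyond the BMP
    return '"' + ''.join(parts) + '"'


def st_read_command(variable: str, path: str, layer: str) -> str:
    """Build an st_read() line shaped like the one in the log above."""
    return ('{} <- st_read({}, layer = {}, quiet = TRUE, '
            'stringsAsFactors = FALSE)').format(
                variable, r_string_literal(path), r_string_literal(layer))


if __name__ == "__main__":
    print(st_read_command(
        "Layer", "C:/Users/username/Documents/サンプル.gpkg", "サンプル"))
```

Because the resulting line contains only ASCII, the encoding R uses to parse the script file should no longer matter for the path itself. Whether sf then opens the file correctly on a given Windows locale would still need testing.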