Encoding problems

d-01 commented 5 years ago

I have encountered a problem with Cyrillic text encoding. From Windows Explorer:

[screenshot: files-list]

From the PowerShell console:

PS> ls |% name

cyrillic_7_chars=русский.txt
text-1251.txt
text-utf8.txt

PS> gc text-1251.txt

русский

PS> gc text-utf8.txt

С?С?С?С?РєРёР№

From Jupyter Notebook:

PS> ls |% name

cyrillic_7_chars=■■■■txt
text-1251.txt
text-utf8.txt

PS> gc text-1251.txt

■■■■

PS> gc text-utf8.txt

русский

I have found a workaround, but I'm not sure how to apply it to fix the problem:

PS> [Text.Encoding]::Default.GetString([Text.Encoding]::UTF8.GetBytes((ls |% name) -join "`n"))

cyrillic_7_chars=русский.txt
text-1251.txt
text-utf8.txt
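
One possible reading of why this workaround helps, sketched in Python rather than PowerShell. It assumes that PowerShell writes its output in the system ANSI code page (cp1251 here) and that the kernel decodes those bytes as UTF-8. Neither is confirmed at this point in the thread, and the function names are made up for illustration.

```python
def powershell_workaround(name: str, ansi: str = "cp1251") -> str:
    """Mimic [Text.Encoding]::Default.GetString([Text.Encoding]::UTF8.GetBytes(name))."""
    return name.encode("utf-8").decode(ansi)


def kernel_sees(text: str, ansi: str = "cp1251") -> str:
    """Assumed pipeline: PowerShell writes `text` in the ANSI code page,
    and the kernel decodes the resulting bytes as UTF-8."""
    return text.encode(ansi).decode("utf-8", errors="replace")


name = "cyrillic_7_chars=русский.txt"

# Without the workaround the cp1251 bytes are not valid UTF-8.
broken = kernel_sees(name)

# With the workaround the bytes on the wire are the UTF-8 form of the name.
fixed = kernel_sees(powershell_workaround(name))

print(broken)
print(fixed)
```

Under these assumptions the workaround only pre-compensates for the mismatch on the PowerShell side; the mismatch itself stays in place.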

Environment information:

PS> [System.Text.Encoding]::Default

IsSingleByte      : True
BodyName          : koi8-r
EncodingName      : Cyrillic (Windows)
HeaderName        : windows-1251
...

PS> $psversiontable

Name                           Value                                           
----                           -----                                           
PSVersion                      5.1.14409.1005                                  
PSEdition                      Desktop                                         
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0...}                         
BuildVersion                   10.0.14409.1005                                 
CLRVersion                     4.0.30319.42000                                 
WSManStackVersion              3.0                                             
PSRemotingProtocolVersion      2.3                                             
SerializationVersion           1.1.0.1

The version of the notebook server is: 5.6.0
The server is running on this version of Python: Python 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)]

Kernel info:

Name: powershell-kernel
Version: 0.0.8
Home-page: https://github.com/vors/jupyter-powershell
Author: Sergei Vorobev
Author-email: xvorsx@gmail.com

What else I've tried so far:

Changing $OutputEncoding global variable
Changing [console]::OutputEncoding
Changing [console]::InputEncoding
chcp 866 – has no effect on the output of cmd /c dir or Get-ChildItem / ls
chcp 65001 – fixes the output of cmd /c dir, but not of Get-ChildItem / ls
Different browsers: Firefox, Chrome, IE11

The standard kernel (IPython 6.5.0) works fine:

In:

import os
os.listdir()

Out:

['cyrillic_7_chars=русский.txt', 'text-1251.txt', 'text-utf8.txt']

From the PowerShell console:

PS> [text.encoding]::Default.getbytes('русский') | format-hex

00000000   F0 F3 F1 F1 EA E8 E9                             ðóññêèé

PS> [text.encoding]::utf8.getbytes('русский') | format-hex

00000000   D1 80 D1 83 D1 81 D1 81 D0 BA D0 B8 D0 B9        ÑÑÑÑÐºÐ¸Ð¹

From Jupyter Notebook:

PS> [text.encoding]::Default.getbytes('русский') | format-hex

00000000   D1 80 D1 83 D1 81 D1 81 D0 BA D0 B8 D0 B9        N?N?N?N???????  

PS> [text.encoding]::utf8.getbytes('русский') | format-hex

00000000   D0 A1 D0 82 D0 A1 D1 93 D0 A1 D0 83 D0 A1 D0 83  ??????N?????????
00000010   D0 A0 D1 94 D0 A0 D1 91 D0 A0 E2 84 96           ?■N??■N??■a??
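
The byte values in these dumps are consistent with the Jupyter session sending the cell source as UTF-8 and PowerShell reading it as windows-1251. This is an interpretation of the dumps, not something established in the thread. A short Python check, where `hexdump` is a small helper written for this sketch:

```python
def hexdump(data: bytes) -> str:
    """Format bytes the way the Format-Hex output above shows them."""
    return " ".join(f"{b:02X}" for b in data)


text = "русский"

# PowerShell console: the string literal arrives intact.
console_default = text.encode("cp1251")  # F0 F3 F1 F1 EA E8 E9
console_utf8 = text.encode("utf-8")      # D1 80 D1 83 D1 81 D1 81 D0 BA D0 B8 D0 B9

# Jupyter (assumed): the UTF-8 bytes of the cell are misread as cp1251,
# so PowerShell ends up holding a 14-character mojibake string.
misread = console_utf8.decode("cp1251")
jupyter_default = misread.encode("cp1251")
jupyter_utf8 = misread.encode("utf-8")

print(hexdump(console_default))
print(hexdump(jupyter_default))
print(hexdump(jupyter_utf8))
```

Encoding the misread string with the default encoding gives back the original UTF-8 bytes, and encoding it as UTF-8 gives the 29-byte sequence shown in the last dump.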

vors commented 5 years ago

Thank you for the detailed report! My uneducated guess is that our Python repl_process abstraction expects UTF-8 while PowerShell uses UTF-16 by default, or perhaps that we do incremental decoding incorrectly in the kernel. The kernel itself is relatively small, so I think you should have no trouble debugging it by changing the kernel code. I won't have time to do it myself any time soon, but I'm happy to help you navigate the code and to review any changes.
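
The two guesses can be told apart with a small experiment. The sketch below uses only the standard `codecs` module. It shows that an incremental UTF-8 decoder copes with a character split across two reads, but fails when it is fed bytes in another encoding such as cp1251. How the kernel itself handles decode errors is not shown in this thread, so the strict error mode here is an assumption.

```python
from codecs import getincrementaldecoder

text = "русский"

# Case 1: valid UTF-8 arriving in two chunks, split inside the first character.
utf8_bytes = text.encode("utf-8")
decoder = getincrementaldecoder("utf8")()
chunks = [utf8_bytes[:1], utf8_bytes[1:]]
decoded = "".join(decoder.decode(chunk) for chunk in chunks)

# Case 2: cp1251 bytes fed to a UTF-8 decoder (encoding mismatch).
ansi_bytes = text.encode("cp1251")
try:
    getincrementaldecoder("utf8")().decode(ansi_bytes, final=True)
    mismatch_detected = False
except UnicodeDecodeError:
    mismatch_detected = True

print(decoded, mismatch_detected)
```

If case 1 works and case 2 fails, the incremental decoding is fine and the encoding assumption is the more likely culprit.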

d-01 commented 5 years ago

Problem solved:

--- a/subprocess_repl.py.orig
+++ b/subprocess_repl.py
@@ -9,10 +9,16 @@ import os
 import sys
 import re
 import signal
+import locale
 from subprocess import Popen
 from codecs import getencoder, getincrementaldecoder

 PY3 = sys.version_info[0] == 3
+# On Windows the encoding is expected to be something like 'cp1252' (en) or 'cp1251' (ru)
+# depending on system-wide "System locale" setting.
+# Path to setting: Region and Language -> Administrative (tab) ->
+# -> Language for non-Unicode programs -> Change system locale...
+ENCODING = locale.getpreferredencoding()

 if os.name == 'posix':
     POSIX = True
@@ -23,8 +29,8 @@ else:

 class SubprocessRepl(object):
     def __init__(self, cmd):
-        self.encoder = getencoder('utf8')
-        self.decoder = getincrementaldecoder('utf8')()
+        self.encoder = getencoder(ENCODING)
+        self.decoder = getincrementaldecoder(ENCODING)()
         self.popen = Popen(cmd, bufsize=1,
             stderr=subprocess.STDOUT, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
         if POSIX:
@@ -60,7 +66,7 @@ class SubprocessRepl(object):
         si.flush()

     def reset_decoder(self):
-        self.decoder = getincrementaldecoder('utf8')()
+        self.decoder = getincrementaldecoder(ENCODING)()

     def read(self):
         """Reads at least one decoded char of output"""

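A standalone sketch of what the patch changes, with the encoding passed as a parameter so the cp1251 case can be tried on any machine. `make_codec` is a hypothetical helper and is not part of subprocess_repl.py; only `locale.getpreferredencoding`, `getencoder` and `getincrementaldecoder` come from the patch.

```python
import locale
from codecs import getencoder, getincrementaldecoder


def make_codec(encoding=None):
    """Return an (encoder, incremental decoder) pair.

    Falls back to the locale's preferred encoding, as the patch does.
    """
    encoding = encoding or locale.getpreferredencoding()
    return getencoder(encoding), getincrementaldecoder(encoding)()


# Simulate a Russian-locale Windows machine explicitly.
encoder, decoder = make_codec("cp1251")

# Input direction: the command is encoded before it is written to stdin.
command_bytes, _ = encoder("gc text-1251.txt\n")

# Output direction: bytes read from stdout are decoded with the same codec.
output = decoder.decode("русский".encode("cp1251"))

print(command_bytes, output)
```

This relies on the PowerShell child process reading and writing in the ANSI code page reported by the locale. That matched the reporter's machine, but the thread does not establish that it holds for every configuration.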
vors commented 5 years ago

Nice! @d-01 would you mind sending a pull request?

vors / jupyter-powershell

Encoding problems #12