Character Encoding - Githubissues

GoogleCodeExporter commented 9 years ago

What is the reasoning to use ISO8859-1 by default to encode strings in the 
protocol? I can't find any reference in the RFC.

Original issue reported on code.google.com by dkocher@sudo.ch on 28 May 2011 at 9:39

GoogleCodeExporter commented 9 years ago

1.) Let us assume that the client is not specifying a character encoding (i.e., 
an application using the SCP client).

2.) The software must then always behave identically, no matter where it is 
being run. As a consequence, we cannot use the platform's default encoding 
(think of, e.g., your application suddenly being run on an EBCDIC machine).

3.) We must therefore pick a default encoding that a.) is present in every JVM 
implementation and b.) works for the "typical" use case (e.g., scp to a remote 
filesystem with umlauts in filenames).

ISO8859_1 is such a candidate. It is backwards compatible with US-ASCII and 
supports most Western European languages (however, it lacks the euro sign).
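
The three properties claimed above can be checked with a few lines of Java. This is only a sketch: the class and method names are made up for illustration, and it uses `java.nio.charset.StandardCharsets`, which exists in Java 7 and later (older code would use `Charset.forName("ISO-8859-1")` instead).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, not part of the library discussed in this thread.
final class Latin1Properties {

    /** Returns true if the given charset can represent the character. */
    static boolean canEncode(Charset cs, char c) {
        return cs.newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        // ISO-8859-1 is one of the charsets every Java platform must support.
        Charset latin1 = StandardCharsets.ISO_8859_1;

        // Pure ASCII text encodes to identical bytes in US-ASCII and ISO-8859-1.
        byte[] ascii = "scp".getBytes(StandardCharsets.US_ASCII);
        byte[] latin = "scp".getBytes(latin1);
        System.out.println(Arrays.equals(ascii, latin));   // true

        // U+00E4 (a with diaeresis) is representable, U+20AC (euro sign) is not.
        System.out.println(canEncode(latin1, '\u00e4'));   // true
        System.out.println(canEncode(latin1, '\u20ac'));   // false
    }
}
```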

I am not sure, but I think that every possible character encoding in ISO8859_1 
has the same encoding in UTF-8 as well. Before switching to UTF-8 (or any other 
encoding) we should be careful, as this may break existing applications that 
use the library.

Original comment by cleondris on 31 May 2011 at 7:59

GoogleCodeExporter commented 9 years ago

Doesn't the SSH specification say something about a default character encoding?

Original comment by dkocher@sudo.ch on 31 May 2011 at 8:06

GoogleCodeExporter commented 9 years ago

I see three important places where encoding matters a lot:

1.) Filenames in SCP. Unfortunately, SCP is not documented as part of the SSH 
RFCs. The only reference is the OpenSSH source code (and the rcp source code).

2.) Filenames in SFTP. The latest SFTP draft says "The preferred encoding for 
filenames is UTF-8". The draft also defines a mechanism to disable the 
translation of native filenames as a fallback.

3.) Starting a shell or a command ("SSH_MSG_CHANNEL_REQUEST", RFC 4254 section 
6.5). The RFC says that the "command" argument is stored using the SSH "string" 
datatype, but says *nothing* about the encoding used (in other places, the 
encoding is explicitly specified, see, e.g., section 5.1 
SSH_MSG_CHANNEL_OPEN_FAILURE: the description is a string in ISO-10646 UTF-8 
encoding).

Let's see what the specs say in general:

RFC 4251 (in section 4.5) has a general remark about data passed over the SSH 
protocol that is shown to the user (e.g., questions in interactive 
authentication, banners, etc.): "In most places, ISO-10646 UTF-8 encoding is 
used".

When defining the SSH "string" datatype (section 5), the RFC says: "Strings are 
also used to store text. In that case, US-ASCII is used for internal names, and 
ISO-10646 UTF-8 for text that might be displayed to the user."

I think we should switch from ISO8859-1 to UTF-8 and give the user (where 
applicable) the chance to override the default encoding. Also, we need to 
clearly document this in the release notes.
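
To make the proposal concrete, here is a rough sketch of how an SSH "string" (a big-endian uint32 byte count followed by the raw bytes, per RFC 4251 section 5) could be written with a caller-supplied charset. The class `SshStringWriter` and the method `encodeString` are hypothetical names chosen for this example; they are not the library's actual API.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the library's real packet writer.
final class SshStringWriter {

    /** Encodes text as an SSH "string": uint32 length (big-endian) + bytes. */
    static byte[] encodeString(String text, Charset cs) {
        byte[] payload = text.getBytes(cs);
        int n = payload.length;
        ByteArrayOutputStream out = new ByteArrayOutputStream(4 + n);
        // The length field counts bytes, not characters.
        out.write((n >>> 24) & 0xFF);
        out.write((n >>> 16) & 0xFF);
        out.write((n >>> 8) & 0xFF);
        out.write(n & 0xFF);
        out.write(payload, 0, n);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String command = "ls \u00e4"; // "ls" followed by a filename with an umlaut
        // Same command, different wire size depending on the chosen charset.
        System.out.println(encodeString(command, StandardCharsets.UTF_8).length);      // 9
        System.out.println(encodeString(command, StandardCharsets.ISO_8859_1).length); // 8
    }
}
```

Because the length prefix depends on the encoded byte count, the charset has to be fixed before the packet is built; an override would therefore need to be a setting that is read at encoding time rather than applied afterwards.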

Original comment by cleondris on 2 Jun 2011 at 8:33

GoogleCodeExporter commented 9 years ago

Great extensive analysis. It matches my experience in practice that UTF-8 is 
the best match for SSH, as I never had any issues with it. Please review r40.

Original comment by dkocher@sudo.ch on 2 Jun 2011 at 10:19

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

Sorry, need to revise my own statement about encoding:

"I am not sure, but I think that every possible character encoding in ISO8859_1 
has the same encoding in UTF-8 as well."

The above is *wrong*.

While every character in ISO 8859-1 represented by the *byte* XY has the same 
*codepoint* XY in Unicode, this does not mean that the byte XY is a valid UTF-8 
encoding for the codepoint.

Example: 0xE4 represents the German umlaut "ä" in both ISO 8859-1 and Unicode. 
However, to encode the codepoint 0xE4 in UTF-8 one needs the two-byte sequence 
"0xC3 0xA4".

In other words, the *codepoints* 0x00-0xFF have the same meaning in ISO 8859-1 
and Unicode. However, the *encoding* of those codepoints is not the same in 
UTF-8 and ISO 8859-1.
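
The byte values in the example above are easy to confirm in Java. A minimal sketch follows (the class name and the `hex` helper are invented for illustration; `StandardCharsets` requires Java 7 or later):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for the codepoint-vs-encoding distinction.
final class UmlautBytes {

    /** Formats bytes as space-separated hex, e.g. "0xC3 0xA4". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String umlaut = "\u00e4"; // codepoint U+00E4 in both ISO 8859-1 and Unicode
        System.out.println(hex(umlaut.getBytes(StandardCharsets.ISO_8859_1))); // 0xE4
        System.out.println(hex(umlaut.getBytes(StandardCharsets.UTF_8)));      // 0xC3 0xA4
    }
}
```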

(...and let's ignore for the moment the difference between "ISO-8859-1" and 
"ISO 8859-1" aka. ISO/IEC 8859-1 (see the difference?)).

What is the consequence? Well, if we switch the default encoding from Java's 
"ISO8859_1" to "UTF-8", then we will break many applications that rely on that 
default behavior.

One could demonstrate the problem by connecting to an SFTP server which encodes 
filenames in Windows-1252 (cp1252) and has filenames that contain German 
umlauts.
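
The same failure can be simulated without a server. The sketch below assumes the JVM ships the "windows-1252" charset (common, but not one of the charsets guaranteed on every platform); the class name, the `decodeStrict` helper, and the sample filename are made up for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical simulation of a client decoding filenames sent by a cp1252 server.
final class FilenameDecoding {

    /** Decodes strictly; returns null instead of silently substituting U+FFFD. */
    static String decodeStrict(byte[] raw, Charset cs) {
        try {
            return cs.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Assumption: this charset is available in the running JVM.
        Charset cp1252 = Charset.forName("windows-1252");
        // Bytes as the server would send them; the umlaut becomes the single byte 0xFC.
        byte[] onTheWire = "B\u00fccher.txt".getBytes(cp1252);

        // A lone 0xFC is not valid UTF-8, so a UTF-8 client cannot decode the name.
        System.out.println(decodeStrict(onTheWire, StandardCharsets.UTF_8) == null);
        // ISO-8859-1 agrees with cp1252 for this byte, so the old default still works.
        System.out.println("B\u00fccher.txt".equals(
                decodeStrict(onTheWire, StandardCharsets.ISO_8859_1)));
    }
}
```

With the lenient `new String(bytes, charset)` constructor the UTF-8 case would not fail outright; the invalid byte would be replaced by U+FFFD, which is arguably worse because the filename is silently corrupted.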

Original comment by cleondris on 13 Jun 2011 at 3:11

GoogleCodeExporter commented 9 years ago

Is it possible to programmatically change the encoding somehow?

Original comment by stures.s...@gmail.com on 25 Aug 2011 at 1:26

hogehoge-co-jp / ganymed-ssh-2

Character Encoding #7