ksh93 / ksh

ksh 93u+m: KornShell lives! | Latest release: https://github.com/ksh93/ksh/releases
Eclipse Public License 2.0
172 stars 30 forks source link

ksh93's printf(1) builtin somehow turns "\x20" (space character in UTF-8) into garbage #762

Open gisburn opened 2 weeks ago

gisburn commented 2 weeks ago

ksh93's printf(1) builtin somehow turns "\x20" (space character in UTF-8) into garbage:

Testcase:

$ (LC_ALL=en_US.UTF-8 ksh93 -c "printf 'foo\x20bar\n'")
foo₺r
$ (LC_ALL=en_US.UTF-8 bash -c "printf 'foo\x20bar\n'")
foo bar

This happens with ksh93 1.0.8 and 1.0.9 on both Debian Linux 10+Cygwin 3.5.3

McDutchie commented 2 weeks ago

Also reproducible on macOS, on ksh 93u+m/1.0.9 as well as ksh 93u+ 2012-08-01. Thanks for the report.

McDutchie commented 2 weeks ago

Actually this may not be exactly a bug – b and a are hexadecimal digits, so it is taking 20ba as a number, which happens to be out of range. In an ANSI C string, this is an error, so your reproducer is not actually valid.

It works fine with protective square brackets (which are a ksh extension):

$ printf 'foo\x[20]bar\n'
foo bar

However, I noticed that the 93v- beta changed the behaviour, and your reproducer started working 'correctly' between version 2013-06-28 and 2013-07-19. So we can consider isolating and backporting that change.

gisburn commented 2 weeks ago

AFAIK the POSIX standard somewhere says that \x takes exactly TWO hexadecimal digits, and that'as why this was fixed for ksh93v- (POSIX test suite failure).

McDutchie commented 2 weeks ago

Relevant historic entry in src/lib/libast/RELEASE:

13-07-16 string/chresc.c: limit \xH \xHH to 2 hex digits (\u and \U handle wide chars now)
McDutchie commented 2 weeks ago

\x is not in POSIX at all (see https://pubs.opengroup.org/onlinepubs/9699919799/utilities/printf.html, Extended Description).

McDutchie commented 2 weeks ago

This patch backports the relevant 93v- 2013-07-19 changes. All existing regression tests pass with the patch, and your reproducer is fixed.

Now we will need to decide whether to change the behaviour for the next 93u+m/1.0.10 bugfix release or backport the change only for the future 93u+m/1.1 major update.

I think it needs to be the latter, because the change is not backward compatible. For instance:

$ echo $'\x1f600'
😀
$ echo $'\u[1f600]'
😀

The patch breaks the \x case (although \x[1f600] continues to work).

diff --git a/src/lib/libast/string/chresc.c b/src/lib/libast/string/chresc.c
index 31e8f5865..2bdc589f4 100644
--- a/src/lib/libast/string/chresc.c
+++ b/src/lib/libast/string/chresc.c
@@ -1,8 +1,8 @@
 /***********************************************************************
 *                                                                      *
 *               This software is part of the ast package               *
-*          Copyright (c) 1985-2012 AT&T Intellectual Property          *
-*          Copyright (c) 2020-2023 Contributors to ksh 93u+m           *
+*          Copyright (c) 1985-2013 AT&T Intellectual Property          *
+*          Copyright (c) 2020-2024 Contributors to ksh 93u+m           *
 *                      and is licensed under the                       *
 *                 Eclipse Public License, Version 2.0                  *
 *                                                                      *
@@ -147,14 +147,18 @@ chrexp(const char* s, char** p, int* m, int flags)
                c = CC_vt;
                break;
            case 'u':
+               q = s + 4;
+               goto wex;
            case 'U':
+               q = s + 8;
+           wex:
+               if (!(flags & FMT_EXP_WIDE))
+                   goto noexpand;
+               w = 1;
+               goto hex;
            case 'x':
-               if (q = c == 'u' ? (s + 4) : c == 'U' ? (s + 8) : NULL)
-               {
-                   if (!(flags & FMT_EXP_WIDE))
-                       goto noexpand;
-                   w = 1;
-               }
+               q = s + 2;
+           hex:
                b = e = s;
                n = 0;
                c = 0;
avih commented 2 weeks ago

\x is not in POSIX at all

Correct for the shell.

So, unless documented otherwise, and ignoring backward compatibility for now, I'd think the optimal behavior would be to follow the C string specification (which builds on a single character constant spec).

To summarize the C behavior of the escape sequence \x... in C99 "6.4.4.4 Character constants":

To quote from C99 in 6.4.4.4:

14
EXAMPLE 3
Even if eight bits are used for objects that have type char, the
construction '\x123' specifies an integer character constant
containing only one character, since a hexadecimal escape sequence
is terminated only by a non-hexadecimal character. To specify
an integer character constant containing the two characters whose
values are '\x12' and '3', the construction '\0223' may be used,
since an octal escape sequence is terminated after three octal
digits. (The value of this two-character integer character constant
is implementation-defined.)

I don't know whether a shell string should be considered wide or normal, but I'd think that at least if the encoding is UTF-8, then it should be considered normal, at which case a \x... represents one char/byte, and useful value for such sequence would be, IMHO, the unsigned value of the whole sequence (as many hex digits which it might have) modulu 256 (assuming 8 bits char).

gisburn commented 2 weeks ago
  1. I would prefer that for single-bytes we get \x<hexchar><hexchar> and \x[<hexchar><hexchar>] working, but only for single bytes, otherwise we have an endian issue.

  2. Wide-char charcters should be covered by\u[<hexvalue>] and \w[<hexvalue>] - I and the SUN i18n team made a patch for ksh93v- for that. \u[<hexvalue>] takes in a Unicode codepoint value, and uses |iconv()| internally to convert it to a multibyte sequence in the current locale (and evaluates to empty string if the character cannot be represented) \w[<hexvalue>] is basically the same, but just converts a |wchar_t| codepoint to the matching multibyte sequence, but without |iconv()| from UTC32 to current locale. This was added to satisfy PRC requirements for GB18030, so you can take a GB18030 codepoint and get the current character. I think I can dig out the patch for\u[<hexvalue>] and\w[<hexvalue>] later this month and give it to you.

avih commented 2 weeks ago
  1. I would prefer that for single-bytes we get \x<hexchar><hexchar> and \x[<hexchar><hexchar>] working, but only for single bytes, otherwise we have an endian issue.

As far as I can tell that's also how other shells which support \x behave:

$ shcmp -c 'printf "X\x4040Y"'
...

= bash, bash_posix, loksh, oksh, mksh, mksh_posix, pdksh, bb_ash, nbsh:
X@40Y(exit:0)

= ksh93, ksh93u_m:
X䁀Y(exit:0)

The rest (all dash versions, freebsd sh, Schilly's bosh, yash) don't support \x as a hex sequenced.

(comparison output using shcmp)

McDutchie commented 36 minutes ago

@avih wrote:

[…] I'd think the optimal behavior would be to follow the C string specification (which builds on a single character constant spec).

To summarize the C behavior of the escape sequence \x... in C99 "6.4.4.4 Character constants":

  • The sequence includes all chars until non-hex-digit (or end of the string).

[…]

So, AFAICT, that would mean changing nothing and keeping the existing 93u+m behaviour.

On the other hand, your shell test results show that 93u+(m) is very much the odd one out at the moment and all the other shells limit \x to two hexadecimal digits.

I favour backporting the 93v- change for 93u+m/1.1 to fall in line with the rest of them (including, notably, the pdksh variants). As the libast RELEASE entry said, \u and \U should be used for wide/multibyte characters.