ksh93 / ksh

ksh 93u+m: KornShell lives! | Latest release: https://github.com/ksh93/ksh/releases
Eclipse Public License 2.0
186 stars 31 forks source link

Anchored empty string should match in replacement, e.g. `"${@/#/replacement}"` #558

Closed ormaaj closed 1 year ago

ormaaj commented 2 years ago

This is useful for compatibility and I think it's supposed to work anyway.

Working in ksh93 v- and u+m:

Non-working:

McDutchie commented 2 years ago

Not sure whether to tag this 'bug' or 'enhancement'.

Hmmm…

$ ksh -c 'x=a#c; echo ${x/#/b}'
a#c

The output is not abc are you might expect, so maybe it's a bug.

phidebian commented 2 years ago

Just curious, does the # character have special meaning in RE writing (RE grammar), I stick with ERE, I am perl/boost iliterate, so to my knowledge # has no meaning (stand by for itself), then I admit we should match it. Yet bash is strange too.

       BASH                      |                   KSH
x=a#c; echo ${x/#/b}  ==> ba#c   |  x=a#c; echo ${x/#/b}  ==> a#c   <== both bogus
x=a#c; echo ${x/\#/b} ==> abc    |  x=a#c; echo ${x/\#/b} ==> abc
x=a#c; echo ${x//#/b} ==> abc    |  x=a#c; echo ${x//#/b} ==> abc

So the question

I think I red somewhere that 'may be' ksh has a perl-like mode for RE, but htis is beyond my understanding.

EDIT: Just discovered that indeed # and % are special for ksh (not bash). The man sez.

  ${parameter/pattern/string}
  ${parameter//pattern/string}
  ${parameter/#pattern/string}
  ${parameter/%pattern/string}

So ksh is right, in order to match a # the start of a /pattern/ it got to be quoted (as well as %) then ksh expansion is correct, actually only bash is bogus there.

This should be closed as 'pilot error' albeit a ledgit one :-), I did the same.

ormaaj commented 1 year ago

Just to clarify, the bash output is the expected result there. x=a#c; echo ${x/#/b} ==> ba#c. The # and % characters are part of the parameter expansion syntax, as noted by @phidebian . Not sure I understand...

So ksh is right

No.

EDIT: Just discovered that indeed # and % are special for ksh (not bash). The man sez.

They are special in both, those are part of the PE syntax that modify the meaning of the pattern.

in order to match a # the start of a /pattern/ it got to be quoted

Correct. Quoted or escaped.

(as well as %)

That would be at the end of the pattern.

actually only bash is bogus there.

Not as far as I'm aware.

If it helps, it should be equivalent to this:

 $ /dotfiles/bin/ksh93-dev -c 'x=abc; print -r -- "${x/~(K-r)/x}"'
xabc

I am not sure why the reverse doesn't match

 $ /dotfiles/bin/ksh93-dev -c 'x=abc; print -r -- "${x/~(K-l)/x}"'
abc

That should be equal to

 $ /dotfiles/bin/ksh93-dev -c 'x=abc; print -r -- "${x/~(K-l)?/\0x}"'
abcx

I would think. (that is a rarely used feature. as in I've never seen it used.)

phidebian commented 1 year ago

@ormaaj

Thanx for clarifying :)

I re-re-read the bash/ksh man's and I see why I got confused by both

Bash

${parameter/pattern/string}
          Pattern substitution.  The pattern is expanded to produce a pat-
          tern  just  as  in pathname expansion, Parameter is expanded and
          ...

It don't mention ${parameter/#pattern/string} so i skept over, only reading in details see

 If pattern begins with #, it must  match  at  the  beginning..

Showing that /# indeed is special for bash :)

Ksh

   ${parameter//pattern/string}
   ${parameter/#pattern/string}
   ${parameter/%pattern/string}
          Expands parameter and replaces the longest match of pattern with
          the  given  string.  Each occurrence of \n in string is replaced....

Making me thinking /# is special, yet the text below don't reference # or %, it simply say

The third form restricts...

Making the /# hard to spot in the text, then forcing detail reading, and mental work to match "third for" to "/#", I guess I am not the only one reading the doccos 1 word every 10 words :)

You start by saying "This is useful for compatibility and I think it's supposed to work anyway." not saying compatibilty with 'what', I admit I introduced the bash in the post, but to me bash and ksh are not compatible at all (bash don't mention any back ref \1 \2 \nth) so it cross out bash, so did you meant compatibility with other ksh (ancient?) or other languages?

ormaaj commented 1 year ago

You start by saying "This is useful for compatibility and I think it's supposed to work anyway." not saying compatibilty with 'what

  $ ( echo; for sh in bash zsh mksh; do echo "$sh:"; "$sh" /dev/fd/9 9<<\EOF9; echo; done )
[[ -v ZSH_VERSION ]] && emulate ksh
x=abc
printf %s\\n "${x/#/x}" "${x/%/x}"
EOF9

bash:
xabc
abcx

zsh:
xabc
abcx

mksh:
xabc
abcx

I admit I introduced the bash in the post, but to me bash and ksh are not compatible at all (bash don't mention any back ref \1 \2 \nth) so it cross out bash, so did you meant compatibility with other ksh (ancient?) or other languages?

Bash, zsh, mksh (but not oksh) support a subset of the substitution expansion features. None of them support re-expanding the replacement on each word of a multi-word expansion, backrefs, grouping, or .sh.match, however all of them support these anchored pattern variants. bash 5.2 supports & in the replacement, which is equivalent to \0

Zsh as usual provides a zillion alternative features with special syntax to do the same thing.

phidebian commented 1 year ago

Ok I see now that "bash zsh mksh" are the one to check against. :)

Jut curious is there a wish to converge with say bash or zsh, via emulate mecanism, though I guest with zsh we would get kind of a loop ksh emulate zsh emultate ksh :)

I am asking because for this #558, may be there are 2 fixes, one compatible with bash and al. and one specific to ksh (if desirable)..,

ormaaj commented 1 year ago

Ok I see now that "bash zsh mksh" are the one to check against. :)

Sorta. They're useful data points. They all used ksh93 as the reference a lot of the time so just because everyone agrees doesn't mean it's correct. :/

phidebian commented 1 year ago

Ok I'll keep that in mind, so far I was more distracted about architecture specifics, on qemu linux I got an s390x up and running so I can check the build/test is not boroken in this arch. The other things I look is about other OS'es, I didn't knew haiku-os, but see it was referenced a couple of times si I installed an instance on my vbox, don't really see a market use, looks like a mono user os, dunno what is the use case of it, but the exercise is interesting. I come from the HP-UX world, but have no access to it nowaday, and making ksh93 build and run on HPUX would have been fun too. I have access to a mac min i M1(well in 2 weeks time) will see if I can use that too as yet another os/arch to test on....

ormaaj commented 1 year ago

Jut curious is there a wish to converge with say bash

...missed this. Yes the "core" language features should be "converged" so to speak, which is why it's important to study this family tree of languages and think carefully about their mechanics. This requires a great deal of knowledge and experience, being good at reasoning about the language idioms given the shell's random assortment of half-broken tools is essential.

There is also POSIX. POSIX is reactive, not proactive, in that it specifies existing bugs practice, so it should be referenced but taken with a grain of salt like anything else.

or zsh, via emulate mecanism, though I guest with zsh we would get kind of a loop ksh emulate zsh emultate ksh

zsh is "special" because it's wildly incompatible by default. It is best to always run zsh with the emulate ksh mode, which to my knowledge can't be toggled during invocation via options, so some boilerplate is required in scripts to make it sane.

I should really write out a "nice" boilerplate for testers to use. It's difficult for reasons. (I could really go to town on a project like that but don't want to waste a ton of time on it.) @McDutchie 's modernish is a nice thing to leverage for that, particularly the "capabilities" code https://github.com/modernish/modernish/tree/master/lib/modernish/cap . I am happy that he wrote this so that I didn't have to!

phidebian commented 1 year ago

Thanx @ormaaj for all thoses details, I have a lot of digest now :-), will review all this.

McDutchie commented 1 year ago

The current ksh93 behaviour, though unique, is logical. An empty glob pattern matches nothing, e.g.:

$ ls ""
ls: : No such file or directory

whereas an empty BRE or ERE matches everything, e.g.:

$ ls | grep ""
(all your files)

So, your reproducers:

  • ${x/#/y}

  • ${x/#~(K)/y}

both tell ksh to match an empty glob pattern at #, the beginning of the string (the ~(K) signifying a glob pattern, which is also the default). Empty glob patterns never match, so no substitution is done.

Having said that, mksh, bash, yash, and zsh have all special-cased this. I think we should do the same, because the strictly logical behaviour is not meeting user expectations and the special-casing is useful.

The empty shell pattern pattern is passed to strngrpmatch (for %, via substring) here: https://github.com/ksh93/ksh/blob/3dbb0889fcc0470383aefd8de2b5de24ce1fff21/src/cmd/ksh93/sh/macro.c#L1885-L1888

The behaviour for the first reproducer ${x/#/y}, as well as ${x/%/y}, can be fixed by the following workaround, which simply replaces an empty pattern with an ERE end or begin anchor.

diff --git a/src/cmd/ksh93/sh/macro.c b/src/cmd/ksh93/sh/macro.c
index e8071f978..4513be322 100644
--- a/src/cmd/ksh93/sh/macro.c
+++ b/src/cmd/ksh93/sh/macro.c
@@ -1883,9 +1883,16 @@ retry2:
                    oldv = v;
                    nmatch_prev = nmatch;
                    if(c=='%')
-                       nmatch=substring(v,tsize,pattern,match,flag&STR_MAXIMAL);
+                       nmatch = substring(v, tsize,
+                           *pattern ? pattern : "~(E)$",
+                           match,
+                           flag & STR_MAXIMAL);
                    else
-                       nmatch=strngrpmatch(v,vsize,pattern,(ssize_t*)match,elementsof(match)/2,flag|STR_INT);
+                       nmatch = strngrpmatch(v, vsize,
+                           *pattern ? pattern : "~(E)^",
+                           (ssize_t*)match,
+                           elementsof(match) / 2,
+                           flag | STR_INT);
                    if(nmatch && repstr && !mp->macsub)
                        sh_setmatch(v,vsize,nmatch,match,index++);
                    if(nmatch)

The second reproducer ${x/#~(K)/y} still doesn't work with the patch, but I'm okay with that; you're explicitly circumventing the special-casing and basically saying "yes, I absolutely want to match the string against an empty glob pattern". So that case is not going to break user expectations.

McDutchie commented 1 year ago

Now, there's one question left to resolve. Since the current behaviour is not strictly a bug, the fix is an enhancement. Which means that, by policy, it shouldn't get backported to the 1.0 branch for the upcoming 1.0.5 release.

On the other hand, the fix provides compatibility with every other ksh-like shell supporting this operation (bash, mksh, yash, and zsh), and clearly matches user expectations. It seems that, from most users' perspectives, the current ksh93 behaviour is a bug.

So, would backporting this to 1.0 fix more existing scripts than it would break…? I think that might well be the case, so I'm inclined to backport it, but I'm open to opinions.

McDutchie commented 1 year ago

Also note that the fix only special-cases ${x/#/y} and ${x/%/y}, not ${x//#/y} and ${x//%/y}. But the latter two don't work on bash, yash or mksh either (though they do work on zsh).

hyenias commented 1 year ago

I would error on the side of caution by not back porting it to the 1.0 production branch as it could directly cause adverse behavior if that script was programmed against tested behavior. The world is a big place and some computer out their might be running it. Please save this enhancement change to 1.1 version. I agree that it would probably fix more scripts than break ones. Additionally, I do not see any harm in advancing the production to version 1.1 when you feel the time is right.

McDutchie commented 1 year ago

I have a hard time imagining how this would break any script, as the operation is currently a no-op, so would not be deliberately used except if the programmer expected it to work. But that doesn't mean it's impossible. People do unimaginable things all the time. :) So yeah, fair enough, it's probably better to err on the side of caution.

(As for releasing 1.1, we're nowhere near ready... deparse.c needs a lot of fixes to make the new 'typeset -f funcname' output reliable, for one.)

McDutchie commented 1 year ago

In a sleep deprivation moment, I wrote:

An empty glob pattern matches nothing, e.g.:

$ ls ""
ls: : No such file or directory

That was, of course, a silly example: the empty is quoted, so it's not a pattern. Unquoted empties are not a thing in the shell; they disappear, so we can't test them. But glob pattern matching engines really will not match the empty pattern. For example, the following test with fnmatch(3) prints "no match":

#include <stdio.h>
#include <fnmatch.h>
int main(void)
{
    printf("%s\n",fnmatch("","foo",0)==0 ? "match" : "no match");
    return 0;
}
McDutchie commented 1 year ago

With the 1.0.5 release around the corner, I am reconsidering the decision not to have this fix on the 1.0 branch.

On 1.0 and legacy ksh, ${x/#/y} is a no-op; it's equivalent to $x. There is no use case for that, so there is no good reason why any legacy script would use it – except, of course, if the author expected it to work. In which case, fixing this in ksh means fixing that script. Fixing it will therefore likely do more good than harm.

The existing behaviour may have been technically correct (an empty glob pattern matches nothing even if anchored), but that is a very obscure kind of technically correct. Fixing it will align ksh 93u+m/1.0.5 with reasonable user expectations (particularly since this has been working as expected for many years on mksh, zsh and bash).

Between that, and the fact that there is no use case for the original behaviour, fixing it seems like the right thing to do – particularly as the 1.0 branch has already seen fixes for clear design flaws (such as the octal leading zero mess, or matching '.' and '..' in globbing) that are actually more radical than this one.

ormaaj commented 1 year ago

Yeah I wasn't aware this was an exceptional case. I guess I'm more swayed by the reasoning that it was unlikely to be used deliberately considering it would have never worked, and unlikely to impact anyone. I have nothing that depends on the stable releases and generally run the dev branch so it doesn't affect me.