Crash when running `set` in a subshell/comsub with discipline function defined

McDutchie commented 1 year ago

set outputs the values of all the variables. This crashes in the following reproducer:

function GIT_BRANCH.get
{
        command -v git >/dev/null || return
        .sh.value=${ git branch 2>/dev/null; }
        case $'\n'${.sh.value} in
        *$'\n* '*)
                .sh.value=$'\n'${.sh.value}
                .sh.value=${.sh.value#*$'\n* '}
                .sh.value=${.sh.value%%$'\n'*}
                ;;
        *)      .sh.value=''
                ;;
        esac
}
v=$(set)

McDutchie commented 1 year ago

ASan stack trace (not that I can make head or tail of what the heck is going on):

$ arch/darwin.arm64-64/bin/ksh
$ . ./issue616.sh                                                                                                                  
=================================================================
==56612==ERROR: AddressSanitizer: heap-use-after-free on address 0x000106f03880 at pc 0x00010304eda0 bp 0x00016cf8ccd0 sp 0x00016cf8ccc8
READ of size 1 at 0x000106f03880 thread T0
    #0 0x10304ed9c in sfvprintf sfvprintf.c:717
    #1 0x103034af0 in sfprintf sfprintf.c:31
    #2 0x102f26390 in nv_getval name.c:2663
    #3 0x102e95b18 in print_namval typeset.c:1528
    #4 0x102e9408c in print_scan typeset.c:1631
    #5 0x102e9447c in b_set typeset.c:1267
    #6 0x102f5e620 in sh_exec xec.c:1261
    #7 0x102f51844 in sh_subshell subshell.c:651
    #8 0x102f0bac0 in comsubst macro.c:2238
    #9 0x102f0d624 in varsub macro.c:1203
    #10 0x102f06f48 in copyto macro.c:623
    #11 0x102f0591c in sh_mactrim macro.c:171
    #12 0x102f1913c in nv_setlist name.c:277
    #13 0x102f5ca58 in sh_exec xec.c:1059
    #14 0x102f58830 in sh_exec xec.c:1944
    #15 0x102f577bc in sh_eval xec.c:666
    #16 0x102e7aaf4 in b_dot_cmd misc.c:317
    #17 0x102f5e620 in sh_exec xec.c:1261
    #18 0x102e9c410 in exfile main.c:608
    #19 0x102e9a0a8 in sh_main main.c:371
    #20 0x102e6f83c in main pmain.c:42
    #21 0x10362d088 in start+0x204 (dyld:arm64e+0x5088)

0x000106f03880 is located 0 bytes inside of 869-byte region [0x000106f03880,0x000106f03be5)
freed by thread T0 here:
    #0 0x103a4ade4 in wrap_free+0x98 (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x3ede4)
    #1 0x102fd3990 in stkcopy stk.c:479
    #2 0x102ea32dc in lookup nvdisc.c:426
    #3 0x102e9ffb0 in lookups nvdisc.c:457
    #4 0x102e9d970 in nv_getv nvdisc.c:59
    #5 0x102f26764 in nv_getval name.c:2673
    #6 0x102e95b18 in print_namval typeset.c:1528
    #7 0x102e9408c in print_scan typeset.c:1631
    #8 0x102e9447c in b_set typeset.c:1267
    #9 0x102f5e620 in sh_exec xec.c:1261
    #10 0x102f51844 in sh_subshell subshell.c:651
    #11 0x102f0bac0 in comsubst macro.c:2238
    #12 0x102f0d624 in varsub macro.c:1203
    #13 0x102f06f48 in copyto macro.c:623
    #14 0x102f0591c in sh_mactrim macro.c:171
    #15 0x102f1913c in nv_setlist name.c:277
    #16 0x102f5ca58 in sh_exec xec.c:1059
    #17 0x102f58830 in sh_exec xec.c:1944
    #18 0x102f577bc in sh_eval xec.c:666
    #19 0x102e7aaf4 in b_dot_cmd misc.c:317
    #20 0x102f5e620 in sh_exec xec.c:1261
    #21 0x102e9c410 in exfile main.c:608
    #22 0x102e9a0a8 in sh_main main.c:371
    #23 0x102e6f83c in main pmain.c:42
    #24 0x10362d088 in start+0x204 (dyld:arm64e+0x5088)

previously allocated by thread T0 here:
    #0 0x103a4aca8 in wrap_malloc+0x94 (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x3eca8)
    #1 0x102fd37ec in stkcopy stk.c:452
    #2 0x102ea32dc in lookup nvdisc.c:426
    #3 0x102e9ffb0 in lookups nvdisc.c:457
    #4 0x102e9d970 in nv_getv nvdisc.c:59
    #5 0x102f26764 in nv_getval name.c:2673
    #6 0x102e95b18 in print_namval typeset.c:1528
    #7 0x102e9408c in print_scan typeset.c:1631
    #8 0x102e9447c in b_set typeset.c:1267
    #9 0x102f5e620 in sh_exec xec.c:1261
    #10 0x102f51844 in sh_subshell subshell.c:651
    #11 0x102f0bac0 in comsubst macro.c:2238
    #12 0x102f0d624 in varsub macro.c:1203
    #13 0x102f06f48 in copyto macro.c:623
    #14 0x102f0591c in sh_mactrim macro.c:171
    #15 0x102f1913c in nv_setlist name.c:277
    #16 0x102f5ca58 in sh_exec xec.c:1059
    #17 0x102f58830 in sh_exec xec.c:1944
    #18 0x102f577bc in sh_eval xec.c:666
    #19 0x102e7aaf4 in b_dot_cmd misc.c:317
    #20 0x102f5e620 in sh_exec xec.c:1261
    #21 0x102e9c410 in exfile main.c:608
    #22 0x102e9a0a8 in sh_main main.c:371
    #23 0x102e6f83c in main pmain.c:42
    #24 0x10362d088 in start+0x204 (dyld:arm64e+0x5088)

SUMMARY: AddressSanitizer: heap-use-after-free sfvprintf.c:717 in sfvprintf
Shadow bytes around the buggy address:
  0x007020e006c0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x007020e006d0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x007020e006e0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x007020e006f0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x007020e00700: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x007020e00710:[fd]fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x007020e00720: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x007020e00730: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x007020e00740: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x007020e00750: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x007020e00760: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==56612==ABORTING
Abort

posguy99 commented 1 year ago

That's... weird.

I put it in a .sh, run it, the subshell doesn't crash, v gets set, all is happy

I add GIT_BRANCH.get to my shell startup, and v=$(set) crashes the shell.

I have two other .get disciplines and one .set discipline in my normal environment, and v=$(set) doesn't crash the shell. Just this .get discipline crashes it.

dev@754234d7 here.

posguy99 commented 1 year ago

If I run it ksh ./git_branch.ksh, the shell doesn't crash. If I source it . ./git_branch.ksh, the shell crashes.

My head hurts.

I have this in my normal environment

function _git_status.get                                                                                        
 {                                                                                                               
      typeset branch commit return                                                                                
      branch=$(git branch 2> /dev/null | sed -e '/^[^*]/d' -e 's/* \(.*\)/[\1]/')                                 
      [[ -n "$branch" ]] && commit=$(git rev-parse --short HEAD 2> /dev/null) && return="${branch%]}@${commit}]"  
      .sh.value=${return}                                                                                         
 }

and v=$(set) doesn't crash.

McDutchie commented 1 year ago

The problem seems to be the use of the stack (the stkcopy() call in nvdisc.c:426). That may not be compatible with virtual subshells.

If we replace cp = stkcopy(stkstd,cp); on that line by cp = sh_strdup(cp); to use the heap instead, ~~the crash disappears~~, but of course a memory leak is left in its place.

edit: No, it doesn't. It just becomes intermittent, at least on my system.

McDutchie commented 1 year ago

My head hurts.

These crashes are due to "undefined behaviour" which is notoriously unpredictable.

posguy99 commented 1 year ago

These crashes are due to "undefined behaviour" which is notoriously unpredictable.

But this discipline function crashes it and the two I normally have in there don't?

posguy99 commented 1 year ago

In my testing here so far, I only get the fault if git branch 2>/dev/null actually returns something, i.e. you're actually in a repository.

Something weird I noticed. I replaced my _git_status.get with your git_branch.get (renaming the variable to _git_status), and the presentation in the environment is different. If it's null, then you see something named .sh._git_status in the environment, and if it's non-null, then you see _git_status in the environment. With my original _git_status.get, the variable in the environment is always named _git_status, whether it's null or not.

phidebian commented 1 year ago

Sure my head hurt too.

To me it all comes from macro.c::comsubst()

            if(type==2 && sh.subshell && !sh.subshare)
                sh_subfork();   /* subshares within virtual subshells are broken, so fork first */
            sp = sh_subshell(t,sh_isstate(SH_ERREXIT),type);

Looks like comsubst() is a trouble zone and in trouble zone, trouble are found (happy silicon valley).

To fix this I need explanations from guru's out there.

To me all this jazz can be streamlined to

function Z.get { .sh.value=${ /bin/echo yo ; } ; } 
echo "$(set)"

First head punch macro.c::varsub()

/*
* This routine handles $param, ${param}, and ${param op word}
* The input stream is assumed to be a string
*/

Apparently we enter here with $(...) too, may be this is just a comment glitch, may be archelogist can figure out that $(...) may have been added late...

Second head punch macro.c::comsubst()

/*
* This routine handles command substitution
* <type> is 0 for older `...` version
* 1 for $(...) or 2 for ${ subshare; }
*/
static void comsubst(Mac_t *mp,Shnode_t* t, int type)

Jeez we enter here with type=3 and no explanation anywere what 3 could mean.

but sure enough at line 2236 or so the infamous

            if(type==2 && sh.subshell && !sh.subshare)
                sh_subfork();   /* subshares within virtual subshells are broken, so fork first */
            sp = sh_subshell(t,sh_isstate(SH_ERREXIT),type);

Since we enter with 3 we go the sh_subshell way on the first comsubst() i.e $(set) this any leads to print_scan() leads to print_namval() this one find the kind of $Z with a ${...} in there i.e '${ /bin/echo yo ; } ` this is yet another comsubst() that seem not reentrant regarding subshare, then kaboom.

I made a 'patch' that is not a fix but a demonstrator of all this explanation.

above the

                sh_subfork();   /* subshares within virtual subshells are broken, so fork first */

I add this

if(t->tre.tretyp==0)
{ char *p=save.fcbuff;
  while(*p&&*p!='$')p++;
  if(*p && p[1]=='(')p+=2;
  while(*p&&(*p==' ')||(*p=='\t'))p++;
  if(*p&&p[0]=='s'&&p[1]=='e'&&p[2]=='t')p+=3;
  while(*p&&(*p==' ')||(*p=='\t'))p++;
  if(*p==')')
  { type=2;
  }
}

I know looks ugly, don't punch me on this, this is just a demonstrator kludge.

if(t->tre.tretyp==0) it is a COM, then scan it for a ...=$(...set...[^-]) if so got thetype=2way i.e fork() for good. If we got $(set -...) we got thetype=3way as before as options are not goingprintf_scan()`.

We got to do it on 1st comsubst() occurence, not the second one where it is too late, we are hosed already.

The patch https://github.com/phidebian/ksh/tree/bug-616

Enlight me about this type=3 thing

Cheers,

McDutchie commented 1 year ago

I think the type==3 thing is a red herring.

It was introduced by backporting a couple of Red Hat patches, here: 970069a6feb71424f4a98b9f3005181eeaa1c448

I radically simplified the amazing mess of hacks that were the multiple ksh command substitutions in 42becab63cae760addae6e88008665be352bdbbd (note: the "type" referred to there is not the same as the type parameter passed to comsubst()), 7c3868b8bdba9e276ddadf8e9df81f5872631f8b and bbcadf7020bc0246f5288b78fea270b30df4bda5.

Right now my head is hurting as well and I'm not clear that all the current comments about what the types are, are correct.

But I'm pretty sure that type==3 passed to comsubst() is now treated exactly the same as type==1 and I simply missed that in my efforts to clean that mess up. The following patch gets rid of it and does not cause any regression test failures on my end. (It doesn't fix the present crashing bug either, though.)

diff --git a/src/cmd/ksh93/sh/macro.c b/src/cmd/ksh93/sh/macro.c
index 4513be322..11c37dd37 100644
--- a/src/cmd/ksh93/sh/macro.c
+++ b/src/cmd/ksh93/sh/macro.c
@@ -377,7 +377,7 @@ void sh_machere(Sfio_t *infile, Sfio_t *outfile, char *string)
                break;
                }
                case S_PAR:
-               comsubst(mp,NULL,3);
+               comsubst(mp,NULL,1);
                break;
                case S_EOF:
                if((c=fcfill()) > 0)
@@ -1200,7 +1200,7 @@ retry1:
        case S_PAR:
        if(type)
            goto nosub;
-       comsubst(mp,NULL,3);
+       comsubst(mp,NULL,1);
        return 1;
        case S_DIG:
        var = 0;
@@ -2229,7 +2229,6 @@ static void comsubst(Mac_t *mp,Shnode_t* t, int type)
                char *cp = (char*)sh_malloc(IOBSIZE+1);
                sp = sfnew(NULL,cp,IOBSIZE,fd,SF_READ|SF_MALLOC);
            }
-           type = 3;
        }
        else
        {

McDutchie commented 1 year ago

Your demonstrator patch is a workaround that causes sh_subfork() to be called if the command in $(...) is set. This workaround can be done in shell as well. If we add a ulimit command to the command substitution in the last line in the reproducer, for example

v=$(ulimit -c 0; set)

then the bug goes away as any ulimit invocation will force a virtual subshell to fork.

Perhaps we're forking at the wrong point in the code path for subshares within virtual subshells.

McDutchie commented 1 year ago

Actually, the forking of virtual subshells containing subshares is a red herring, too. If we remove it:

demonstration patch (breaks things, do not commit)

```diff diff --git a/src/cmd/ksh93/sh/macro.c b/src/cmd/ksh93/sh/macro.c index 4513be322..15b834612 100644 --- a/src/cmd/ksh93/sh/macro.c +++ b/src/cmd/ksh93/sh/macro.c @@ -2232,11 +2232,7 @@ static void comsubst(Mac_t *mp,Shnode_t* t, int type) type = 3; } else - { - if(type==2 && sh.subshell && !sh.subshare) - sh_subfork(); /* subshares within virtual subshells are broken, so fork first */ sp = sh_subshell(t,sh_isstate(SH_ERREXIT),type); - } fcrestore(&save); } else ```

…then it crashes just the same.

phidebian commented 1 year ago

It effectivly strange to do things like $(set) where set will list (printf_scan) and access variables variables (print_nameval) that are in the process to be modified (discipline), specially when printf_scan() reach _ where it get its value, then _ is setup for discipline function, it is a be disorientating :) dunno if this important or not.

You right this kinda patch, simply avoid doing set in subshell, its a kludge to at least avoid this core dump.

I think the test about fork vs subshell is a problem, theorically speaking on the 1st occurence we don't know yet (beside the kludge sneaking in the fcbuff) if going subshell will be fatal, and on the second occurence (of comsubst) it is too late, so that's why I made this horrible thing as an headsup, to focus on this part.

I didn't knew this RH thing, interestig I like your historicall analysis :-)

McDutchie commented 1 year ago

I've found the following patch to be an effective workaround for this bug:

diff --git a/src/cmd/ksh93/bltins/typeset.c b/src/cmd/ksh93/bltins/typeset.c
index b5b0330d5..544d46110 100644
--- a/src/cmd/ksh93/bltins/typeset.c
+++ b/src/cmd/ksh93/bltins/typeset.c
@@ -1432,6 +1432,15 @@ static int print_namval(Sfio_t *file,Namval_t *np,int flag, struct tdata *tp)
    int indent=tp->indent, outname=0, isfun;
    char    tempexport=0;
    sh_sigcheck();
+   /*
+    * Printing a name-value pair can cause a discipline shell function to be executed
+    * as the value is obtained via nv_getval(). This can cause messy interactions with
+    * further levels of virtual subshell or subshare that may cause the shell to crash
+    * in certain corner cases. To be safe, fork a virtual subshell early.
+    * https://github.com/ksh93/ksh/issues/616
+    */
+   if(sh.subshell && !sh.subshare)
+       sh_subfork();
    if(flag)
        flag = '\n';
    if(tp->noref && nv_isref(np))

phidebian commented 1 year ago

I'm away from my laptop right now but looks appealing will test it tonight.:-)

McDutchie commented 1 year ago

I think we can make the forking workaround more fine-grained than that. It is only necessary if we need to call a shell discipline function to obtain the value. It is not needed for internal C discipline functions for special variables, like IFS.

There is currently no way to check if a variable has a shell discipline function to get the value. What is needed is to check if the getval or getnum discipline function pointers are set to lookups or lookupn. Those are the C functions that handle the shell .get and .getn disciplines. But they are static functions local to nvdisc.c, so we cannot currently access those names in typeset.c to check for them.

The patch below renames these for legibility (to sh_disc_getstring() and sh_disc_getnum(), respectively) and turns them into externs. For consistency, assign() is similarly renamed to sh_disc_assign(); we might want to check for this somewhere in future.

We can then loop through each variable's linked list of disciplines (algorithm copied from nv_hasdisc() in nvdisc.c) and check if a shell .get or .getn discipline was set. In that case, we fork as a workaround.

More fine-grained patch

```diff diff --git a/src/cmd/ksh93/bltins/typeset.c b/src/cmd/ksh93/bltins/typeset.c index b5b0330d5..097cdc750 100644 --- a/src/cmd/ksh93/bltins/typeset.c +++ b/src/cmd/ksh93/bltins/typeset.c @@ -1432,6 +1432,25 @@ static int print_namval(Sfio_t *file,Namval_t *np,int flag, struct tdata *tp) int indent=tp->indent, outname=0, isfun; char tempexport=0; sh_sigcheck(); + if(sh.subshell && !sh.subshare) + { + /* + * If obtaining the value (using nv_getval() below) will trigger the execution of a .get or .getn shell + * discipline function, and we're already in a virtual subshell, then corner case interactions with any + * nested virtual subshells may cause a heap-use-after-free in lookup() in nvdisc.c. Avoid by forking the + * subshell. TODO: somehow fix the heap-use-after-free instead. https://github.com/ksh93/ksh/issues/616 + */ + Namfun_t *fp; + for(fp = np->nvfun; fp; fp = fp->next) + { + if(fp->disc && (fp->disc->getval==sh_disc_getstring || fp->disc->getnum==sh_disc_getnum)) + { +error(ERROR_warn(0),"[DEBUG] forking at %s",nv_name(np)); + sh_subfork(); + break; + } + } + } if(flag) flag = '\n'; if(tp->noref && nv_isref(np)) diff --git a/src/cmd/ksh93/include/variables.h b/src/cmd/ksh93/include/variables.h index 842ddb2d6..d3a83cea8 100644 --- a/src/cmd/ksh93/include/variables.h +++ b/src/cmd/ksh93/include/variables.h @@ -43,6 +43,11 @@ extern void sh_save_rand_seed(struct rand *, int); 1 \ ) +/* for discipline shell functions (nvdisc.c) */ +extern char* sh_disc_getstring(Namval_t*,Namfun_t*); +extern Sfdouble_t sh_disc_getnum(Namval_t*,Namfun_t*); +extern void sh_disc_assign(Namval_t*,const char*,int,Namfun_t*); + /* The following defines must be kept synchronous with shtab_variables[] in data/variables.c */ #define PATHNOD (sh.bltin_nodes) diff --git a/src/cmd/ksh93/sh/nvdisc.c b/src/cmd/ksh93/sh/nvdisc.c index 4fc45380f..7745468c8 100644 --- a/src/cmd/ksh93/sh/nvdisc.c +++ b/src/cmd/ksh93/sh/nvdisc.c @@ -29,8 +29,6 @@ #include "io.h" #include "shlex.h" -static void assign(Namval_t*,const char*,int,Namfun_t*); - int nv_compare(Dt_t* dict, void *sp, void *dp, Dtdisc_t *disc) { if(sp==dp) @@ -138,7 +136,7 @@ void nv_putv(Namval_t *np, const char *value, int flags, Namfun_t *nfp) if(!nv_isattr(np,NV_NODISC) || fp==(Namfun_t*)nv_arrayptr(np)) break; } - if(!value && (flags&NV_TYPE) && fp && fp->disc->putval==assign) + if(!value && (flags&NV_TYPE) && fp && fp->disc->putval==sh_disc_assign) fp = 0; if(fp && fp->disc->putval) (*fp->disc->putval)(np,value, flags, fp); @@ -236,7 +234,7 @@ static void chktfree(Namval_t *np, struct vardisc *vp) /* * This function performs an assignment disc on the given node */ -static void assign(Namval_t *np,const char* val,int flags,Namfun_t *handle) +void sh_disc_assign(Namval_t *np,const char* val,int flags,Namfun_t *handle) { int type = (flags&NV_APPEND)?APPEND:ASSIGN; struct vardisc *vp = (struct vardisc*)handle; @@ -452,12 +450,12 @@ static char* lookup(Namval_t *np, int type, Sfdouble_t *dp,Namfun_t *handle) return cp; } -static char* lookups(Namval_t *np, Namfun_t *handle) +char* sh_disc_getstring(Namval_t *np, Namfun_t *handle) { return lookup(np,LOOKUPS,NULL,handle); } -static Sfdouble_t lookupn(Namval_t *np, Namfun_t *handle) +Sfdouble_t sh_disc_getnum(Namval_t *np, Namfun_t *handle) { Sfdouble_t d; lookup(np,LOOKUPN, &d ,handle); @@ -479,7 +477,7 @@ char *nv_setdisc(Namval_t* np,const char *event,Namval_t *action,Namfun_t *fp) char *empty = ""; while(vp) { - if(vp->fun.disc && (vp->fun.disc->setdisc || vp->fun.disc->putval == assign)) + if(vp->fun.disc && (vp->fun.disc->setdisc || vp->fun.disc->putval == sh_disc_assign)) break; vp = (struct vardisc*)vp->fun.next; } @@ -536,7 +534,7 @@ char *nv_setdisc(Namval_t* np,const char *event,Namval_t *action,Namfun_t *fp) /* Handle GET/SET/APPEND/UNSET disc */ if(np==SH_VALNOD || np==SH_LEVELNOD) return NULL; - if(vp && vp->fun.disc->putval!=assign) + if(vp && vp->fun.disc->putval!=sh_disc_assign) vp = 0; if(!vp) { @@ -548,7 +546,7 @@ char *nv_setdisc(Namval_t* np,const char *event,Namval_t *action,Namfun_t *fp) vp->fun.disc = dp; memset(dp,0,sizeof(*dp)); dp->dsize = sizeof(struct vardisc); - dp->putval = assign; + dp->putval = sh_disc_assign; if(nv_isarray(np) && !nv_arrayptr(np)) nv_putsub(np,NULL, 1); nv_stack(np, (Namfun_t*)vp); @@ -562,9 +560,9 @@ char *nv_setdisc(Namval_t* np,const char *event,Namval_t *action,Namfun_t *fp) { Namdisc_t *dp = (Namdisc_t*)vp->fun.disc; if(type==LOOKUPS) - dp->getval = lookups; + dp->getval = sh_disc_getstring; else if(type==LOOKUPN) - dp->getnum = lookupn; + dp->getnum = sh_disc_getnum; vp->disc[type] = action; } else ```

Of course, it would be even better if a way could be found to fix the damn heap-use-after-free...

McDutchie commented 1 year ago

Trying to write a good regression test. This crash is such a heisenbug, I have not found a way to reproduce it in a regression test yet. Seems like this crash depends on something in my .kshrc.

Minimal reproducer:

GIT_BRANCH.get()
{
    .sh.value=${ echo foo; }
}
v=$(set)

On my end (with unpatched ksh):

$ ksh --rc issue616.sh
Memory fault
$ ksh --norc issue616.sh
(no crash)

McDutchie commented 1 year ago

OK, so the trigger seems to be that my .kshrc defines other discipline functions as well. Though systematic elimination I've found that we need at least two discipline functions to trigger the crash. New minimal reproducer:

RELATIVE_PWD.get()
{
        .sh.value=foo;
}
GIT_BRANCH.get()
{
        .sh.value=${ echo foo; }
}
v=$(set)

This one crashes (on my end) by simply running ksh issue616.sh.

McDutchie commented 1 year ago

As expected, it also crashes with v=$(typeset -p) instead of v=$(set).

phidebian commented 1 year ago

I like your patch, will double checkit. The idea of 'mine' was trying to keep as much as the existing behavior as possible, i.e limit the early fork for now only on set because I thought that may be this fork avoidance was a kinda optimisation, so I didn't want to penalize all the other 'normal' case.

OK, so the trigger seems to be that my .kshrc defines other discipline functions as well. Though systematic elimination I've found that we need at least two discipline functions to trigger the crash. New minimal reproducer:
RELATIVE_PWD.get()
{
        .sh.value=foo;
}
GIT_BRANCH.get()
{
        .sh.value=${ echo foo; }
}
v=$(set)
This one crashes (on my end) by simply running ksh issue616.sh.

I think you will be puzzled by this one :-) still on prod ksh, --rc vs --norc seems non relevant (with my rc that may be is not as complicated as yours :-) so I keep --norc here for getting comparable results.

Note that my examples I use /bin/echo, this was to simulate a real fork/exec like the git initial example, but on the minimal reproducer this is not necessary.

$ ksh --norc -c 'function Z.get { .sh.value=${ echo yo ; } ; } ; b=$(set)'
Bus error
$ ksh --norc -c 'function z.get { .sh.value=${ echo yo ; } ; } ; b=$(set)'

The later don't crash :-) Well no quite, it crash on second instance :-)

$ ksh --norc -c 'function z.get { .sh.value=${ echo yo ; } ; } ; b=$(set);b=$(set)' # <-- Double call here
Bus error

To me this has to do with the variable name Z vs z, one is before _ the other after, I didn't investigate that part yet, but a fear 'intuitively' that something bad may happen around $_ variable. Got no proof of that, so may be this is not related at all.

But in all case

As expected, it also crashes with v=$(typeset -p) instead of v=$(set).

Yes make sense, any caller to print_scan() in a virtual subshell will :)

posguy99 commented 1 year ago

Using the patch along with my normal startup, I get this now:

[846] mbp13 $ arch/darwin.arm64-64/bin/ksh                                                                             

[847] mbp13 $ v=$(set)                                                                                                 
arch/darwin.arm64-64/bin/ksh: set: warning: [DEBUG] forking at DATE

[848] mbp13 $

But I don't get the memory fault.

DATE is defined in my startup as

function DATE.get { .sh.value=$(date +%D) ; }

McDutchie commented 1 year ago

The [DEBUG] line is temporary, it confirms it's doing what it's supposed to. I'll delete it before committing the fix to git. To get rid of it, delete the corresponding line from typeset.c and rebuild.

phidebian commented 1 year ago

Well, dunno why I didn't thought earlier, when I discovered that the name of the variable did matter, i.e Z is above _ and z is below, I feared that interaction with discipline messing with _ and print_nameval accessing _ could be problematic.

I decided to make a brutal kludge, i.e skipping over _ in print_scan (src/cmd/ksh93/bltins/typeset.c)

        if((np=nv_search(*argv++,root,0)) && np!=onp && (!nv_isnull(np) || np->nvfun || nv_isattr(np,~NV_NOFREE)))
        {
            onp = np;
                       /* kludge start */
                        if( (np->nvname[0]=='_')&&(np->nvname[1]==0) )  
                        { continue;
                        }
                        /* kludge end */

And to my grand surprise, no more core dump, and no regression in QA, even though now b=$(set) miss the _ variable, IMHO this is not a big lost and ledgitmate to withdraw, as we withdraw .sh etc... the value of _ at set time will be overloaded right after b=$(set) as _=$(set)

Anyway this means that may be all the things with fork() is still good, only the _ access in print_scan is not good, I noticed that when not doing the kludge, i.e that actual prod path, then effectivly np pointer got corrupted data leading to crashdump.

Is that a trail to follow ?

EDIT: Well now I got no core dump, but later echo $_ do core dump, so _ is corrupted anyway, simply not being printed in the print_scan() differ it. So may be fork() is back :)

McDutchie commented 1 year ago

Very interesting trail that, I will pursue it. I do disagree that it is legitimate to skip _ in the output of set.

McDutchie commented 1 year ago

BTW, for future reference, something like

                        if( (np->nvname[0]=='_')&&(np->nvname[1]==0) )

should be written as

                        if(np==L_ARGNOD)

See src/cmd/ksh93/include/variables.h and src/cmd/ksh93/data/variables.c.

phidebian commented 1 year ago

Cool learned something :)

McDutchie commented 1 year ago

So, printing out name-value pairs with set or typeset -p doesn't really have anything to do with this bug -- it was just a trigger. As you've found, corruption of $_ a.k.a. L_ARGNOD is the real bug. That corruption occurs in the main shell environment when these two discipline functions are executed in a virtual subshell, in this order:

foo.get()
{
        .sh.value=foo;
}
bar.get()
{
        .sh.value=${ echo foo; }
}
(
        : $bar
        : $foo
)
echo $_

McDutchie commented 1 year ago

lldb backtrace with the above reproducer:

$ lldb -o run -- arch/darwin.arm64-64/bin/ksh issue616.sh
(lldb) target create "arch/darwin.arm64-64/bin/ksh"
Current executable set to '/usr/local/src/ksh93/ksh/arch/darwin.arm64-64/bin/ksh' (arm64).
(lldb) settings set -- target.run-args  "issue616.sh"
(lldb) run
Process 89854 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x30)
    frame #0: 0x000000010006de20 ksh`nv_create(name="_", root=0x854a80010006cf30, flags=1441792, dp=0x000000016fdfe1c8) at name.c:906:17
   903                      if(n)
   904                      {
   905                          ap = nv_arrayptr(np);
-> 906                          ap->nelem = dim;
   907                          nv_putsub(np,NULL,n);
   908                      }
   909                      else
Target 0: (ksh) stopped.
Process 89854 launched: '/usr/local/src/ksh93/ksh/arch/darwin.arm64-64/bin/ksh' (arm64)
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x30)
  * frame #0: 0x000000010006de20 ksh`nv_create(name="_", root=0x854a80010006cf30, flags=1441792, dp=0x000000016fdfe1c8) at name.c:906:17
    frame #1: 0x000000010006bab4 ksh`nv_open(name="_", root=0x0000600002102990, flags=1441792) at name.c:1401:7
    frame #2: 0x0000000100062dc0 ksh`varsub(mp=0x00006000017001c0) at macro.c:1359:9
    frame #3: 0x000000010005ed64 ksh`copyto(mp=0x00006000017001c0, endch=0, newquote=0) at macro.c:623:21
    frame #4: 0x0000000100060220 ksh`sh_macexpand(argp=0x000000010080f078, arghead=0x000000016fdfea30, flag=0) at macro.c:235:2
    frame #5: 0x000000010002f9d4 ksh`arg_expand(argp=0x000000010080f078, argchain=0x000000016fdfea30, flag=0) at args.c:818:11
    frame #6: 0x000000010002f688 ksh`sh_argbuild(nargs=0x000000016fdff0ac, comptr=0x000000010080f038, flag=0) at args.c:658:9
    frame #7: 0x0000000100090e34 ksh`sh_exec(t=0x000000010080f038, flags=5) at xec.c:928:10
    frame #8: 0x0000000100020480 ksh`exfile(iop=0x0000600003500370, fno=10) at main.c:608:4
    frame #9: 0x000000010001f58c ksh`sh_main(ac=2, av=0x000000016fdff6f0, userinit=0x0000000000000000) at main.c:371:2
    frame #10: 0x000000010000465c ksh`main(argc=2, argv=0x000000016fdff6f0) at pmain.c:42:9
    frame #11: 0x00000001001ed08c dyld`start + 520

McDutchie commented 1 year ago

Relevant information from sh.1:

              _      Initially, the value of _ is an absolute pathname of the
                     shell or script being executed as passed in the
                     environment.  Subsequently it is assigned the last argument
                     of the previous command.  This parameter is not set for
                     commands which are asynchronous.  This parameter is also
                     used to hold the name of the matching MAIL file when
                     checking for mail.  While defining a compound variable or a
                     type, _ is initialized as a reference to the compound
                     variable or type.  When a discipline function is invoked, _
                     is initialized as a reference to the variable associated
                     with the call to this function.  Finally when _ is used as
                     the name of the first variable of a type definition, the
                     new type is derived from the type of the first variable.
                     (See Type Variables  below.)

McDutchie commented 1 year ago

"When a discipline function is invoked, _ is initialized as a reference to the variable associated with the call to this function." This happens in set_instance and unset_instance in xec.c.

Here is a test patch that disables this functionality. It makes the crash go away on my end.

```diff diff --git a/src/cmd/ksh93/sh/xec.c b/src/cmd/ksh93/sh/xec.c index da9245d47..f6abf4437 100644 --- a/src/cmd/ksh93/sh/xec.c +++ b/src/cmd/ksh93/sh/xec.c @@ -750,11 +750,11 @@ static int set_instance(Namval_t *nq, Namval_t *node, struct Namref *nr) #endif /* SHOPT_NAMESPACE */ } nv_putval(SH_NAMENOD, cp, NV_NOFREE); - memcpy(node,L_ARGNOD,sizeof(*node)); - L_ARGNOD->nvalue.nrp = nr; - L_ARGNOD->nvflag = NV_REF|NV_NOFREE; - L_ARGNOD->nvfun = 0; - L_ARGNOD->nvenv = 0; +// memcpy(node,L_ARGNOD,sizeof(*node)); +// L_ARGNOD->nvalue.nrp = nr; +// L_ARGNOD->nvflag = NV_REF|NV_NOFREE; +// L_ARGNOD->nvfun = 0; +// L_ARGNOD->nvenv = 0; if(sp) { nv_putval(SH_SUBSCRNOD,nr->sub=sp,NV_NOFREE); @@ -765,9 +765,9 @@ static int set_instance(Namval_t *nq, Namval_t *node, struct Namref *nr) static void unset_instance(Namval_t *nq, Namval_t *node, struct Namref *nr,long mode) { - L_ARGNOD->nvalue.nrp = node->nvalue.nrp; - L_ARGNOD->nvflag = node->nvflag; - L_ARGNOD->nvfun = node->nvfun; +// L_ARGNOD->nvalue.nrp = node->nvalue.nrp; +// L_ARGNOD->nvflag = node->nvflag; +// L_ARGNOD->nvfun = node->nvfun; if(nr->sub) { nv_putsub(nr->np, nr->sub, mode); ```

McDutchie commented 1 year ago

I updated and tested my stash of ksh 93u+m commit builds.

Commit 88a1f3d661f35282a1552abe74ea70ee24cd4aa9 broke it: as of that, $_ outputs an empty value in the reproducer.

Commit 430e47813cacf7366ea4ce03d170b254447c331c introduced the crash.

phidebian commented 1 year ago

Ha you beat me on this one, I learned how to monitor L_ARGNOD, thank you for that, and I discovered this.

print_scan()  loop until np->nvname=="Z" at this point L_ARGON is OK
   print_namval(np,...)
      nv_getval(np)
          nv_getv(np)
              lookups(np)
                   lookup(np...)

                     ====  Here may be the begining of problems... ====
  >407      block(bp,type);                                                  
   408      block(bp, UNASSIGN);   /* make sure nv_setdisc doesn't invalidate
   409      sh_pushcontext(&checkpoint, 1);                                  
   410      jmpval = sigsetjmp(checkpoint.buff, 0);       <==== setjmp No 1 here                   
   411      if(!jmpval)                                                      
   412         sh_fun(nq,np,NULL);                                      <==== will invoke Z.get()      
   413      sh_popcontext(&checkpoint);        

                   lookup(np) 
                        sh_fun(np)
                        ==== Second part of the problem 'may be' =====

  >3240   if(nq)                                                             
   3241     mode = set_instance(nq,&node, &nr);       <==== Save L_ARGNOD into &node                        
   3242   if(is_abuiltin(np))                                                
   3243   {                                                                  
 ......
   3266   }                                                                  
   3267   else                                                               
   3268     sh_funct(np,n,argv,NULL,sh_isstate(SH_ERREXIT));    <=== Here problem with sh_function do longjmp             
   3269   if(nq)                                                +=== to setjmp No1 so we never restore L_ARGNOD
   3270     unset_instance(nq, &node, &nr, mode);       <==== restore L_ARGNOD    NEVER reached if the above longjmp           
   3271   fcrestore(&save);

I am not fluent enough with the setjmp/longjmp pushcontext/popcontext, I dunno how they are linked together etc... but I think if you manage set a setjmp/longjmp point so on error in sh_funct() we longjmp back to the restore point for L_ARGNOD, you would be safe and no need update set_instance() unset_instance() unless theire are bugged.

McDutchie commented 1 year ago

Progress: the following patch fixes the crash introduced in 430e47813cacf7366ea4ce03d170b254447c331c on my end, leaving only the earlier breakage introduced earlier in 88a1f3d661f35282a1552abe74ea70ee24cd4aa9.

Learning that the set_instance and unset_instance functions were the direct cause of the crash made me look into how and where they are called. And I found an obvious problem in sh_fun, the main shell function execution function. It calls set_instance, but may siglongjmp before calling unset_instance. There is other state-restoring code that may fail to run as well, so who knows what other corner-case bugs that may have caused. Clearly, sh_fun should not siglongjmp before fully restoring state. This problem has been present all along; we didn't introduce it.

But that still didn't fix the crash for me. The crash turns out to be caused by an omission in our own robustification of discipline functions (see 430e47813cacf7366ea4ce03d170b254447c331c, 2322f939429ae002f92ea333a3bb6b149aad1431). We added a sigsetjmp to the functions responsible for executing shell discipline functions, assign and lookup, but failed to make them actually siglongjmp when necessary – and this is absolutely necessary when the sh_subfork call added in added in 430e47813cacf7366ea4ce03d170b254447c331c causes the parent to siglongjmp! So, this patch adds the missing siglongjmp calls. Since sh_subfork siglongjmps with SH_JMPSUB, check for jmpval >= SH_JMPSUB. And, as with the sh_fun fix, we make sure to restore all the state first.

Again, after this patch, the reproducer still leaves $_ broken after running the minimal reproducer, but the crash is fixed.

Patch with half the fix

```diff diff --git a/src/cmd/ksh93/sh/nvdisc.c b/src/cmd/ksh93/sh/nvdisc.c index 4fc45380f..c38eecd0e 100644 --- a/src/cmd/ksh93/sh/nvdisc.c +++ b/src/cmd/ksh93/sh/nvdisc.c @@ -245,6 +245,7 @@ static void assign(Namval_t *np,const char* val,int flags,Namfun_t *handle) Namval_t node; union Value *up = np->nvalue.up; Namval_t *tp, *nr; /* for 'typeset -T' types */ + int jmpval = 0; if(val && (tp=nv_type(np)) && (nr=nv_open(val,sh.var_tree,NV_VARNAME|NV_ARRAY|NV_NOADD|NV_NOFAIL)) && tp==nv_type(nr)) { char *sub = nv_getsub(np); @@ -280,7 +281,6 @@ static void assign(Namval_t *np,const char* val,int flags,Namfun_t *handle) if(nq && !isblocked(bp,type)) { struct checkpt checkpoint; - int jmpval; int savexit = sh.savexit; Lex_t *lexp = (Lex_t*)sh.lex_context, savelex; int bflag; @@ -370,6 +370,8 @@ done: nq->nvalue.rp->running=0; _nv_unset(nq,0); } + if(jmpval >= SH_JMPSUB) + siglongjmp(*sh.jmplist,jmpval); } /* @@ -384,10 +386,10 @@ static char* lookup(Namval_t *np, int type, Sfdouble_t *dp,Namfun_t *handle) char *cp=0; Namval_t node; union Value *up = np->nvalue.up; + int jmpval = 0; if(nq && !isblocked(bp,type)) { struct checkpt checkpoint; - int jmpval; int savexit = sh.savexit; Lex_t *lexp = (Lex_t*)sh.lex_context, savelex; /* disciplines like PS2 may run at parse time; save, reinit and restore the lexer state */ @@ -449,6 +451,8 @@ static char* lookup(Namval_t *np, int type, Sfdouble_t *dp,Namfun_t *handle) nq->nvalue.rp->running=0; _nv_unset(nq,0); } + if(jmpval >= SH_JMPSUB) + siglongjmp(*sh.jmplist,jmpval); return cp; } diff --git a/src/cmd/ksh93/sh/xec.c b/src/cmd/ksh93/sh/xec.c index da9245d47..8bae45040 100644 --- a/src/cmd/ksh93/sh/xec.c +++ b/src/cmd/ksh93/sh/xec.c @@ -3224,6 +3224,7 @@ int sh_fun(Namval_t *np, Namval_t *nq, char *argv[]) char *prefix = sh.prefix; int n=0; char *av[3]; + int jmpval = 0; Fcin_t save; fcsave(&save); if((offset=staktell())>0) @@ -3241,7 +3242,6 @@ int sh_fun(Namval_t *np, Namval_t *nq, char *argv[]) mode = set_instance(nq,&node, &nr); if(is_abuiltin(np)) { - int jmpval; struct checkpt *buffp = (struct checkpt*)stkalloc(sh.stk,sizeof(struct checkpt)); Shbltin_t *bp = &sh.bltindata; sh_pushcontext(buffp,SH_JMPCMD); @@ -3258,8 +3258,6 @@ int sh_fun(Namval_t *np, Namval_t *nq, char *argv[]) sh.exitval = (funptr(np))(n,argv,bp); } sh_popcontext(buffp); - if(jmpval>SH_JMPCMD) - siglongjmp(*sh.jmplist,jmpval); } else sh_funct(np,n,argv,NULL,sh_isstate(SH_ERREXIT)); @@ -3269,6 +3267,8 @@ int sh_fun(Namval_t *np, Namval_t *nq, char *argv[]) if(offset>0) stakset(base,offset); sh.prefix = prefix; + if(jmpval > SH_JMPCMD) + siglongjmp(*sh.jmplist,jmpval); return sh.exitval; } ```

McDutchie commented 1 year ago

......

   3266   }                                                                  
   3267   else                                                               
   3268     sh_funct(np,n,argv,NULL,sh_isstate(SH_ERREXIT));    <=== Here problem with sh_function do longjmp             
   3269   if(nq)                                                +=== to setjmp No1 so we never restore L_ARGNOD
   3270     unset_instance(nq, &node, &nr, mode);       <==== restore L_ARGNOD    NEVER reached if the above longjmp           
   3271   fcrestore(&save);

Great analysis there, thank you! You've made me realise my patch above still did not fix all of the problem. The sigsetjmp/siglongjmp logic in sh_fun() is still wrong. That sh_funct() call should be fully included in it.

This new version of the patch does that... and... looks like the bug is FIXED!! :-)

```diff diff --git a/src/cmd/ksh93/sh/nvdisc.c b/src/cmd/ksh93/sh/nvdisc.c index 4fc45380f..c38eecd0e 100644 --- a/src/cmd/ksh93/sh/nvdisc.c +++ b/src/cmd/ksh93/sh/nvdisc.c @@ -245,6 +245,7 @@ static void assign(Namval_t *np,const char* val,int flags,Namfun_t *handle) Namval_t node; union Value *up = np->nvalue.up; Namval_t *tp, *nr; /* for 'typeset -T' types */ + int jmpval = 0; if(val && (tp=nv_type(np)) && (nr=nv_open(val,sh.var_tree,NV_VARNAME|NV_ARRAY|NV_NOADD|NV_NOFAIL)) && tp==nv_type(nr)) { char *sub = nv_getsub(np); @@ -280,7 +281,6 @@ static void assign(Namval_t *np,const char* val,int flags,Namfun_t *handle) if(nq && !isblocked(bp,type)) { struct checkpt checkpoint; - int jmpval; int savexit = sh.savexit; Lex_t *lexp = (Lex_t*)sh.lex_context, savelex; int bflag; @@ -370,6 +370,8 @@ done: nq->nvalue.rp->running=0; _nv_unset(nq,0); } + if(jmpval >= SH_JMPSUB) + siglongjmp(*sh.jmplist,jmpval); } /* @@ -384,10 +386,10 @@ static char* lookup(Namval_t *np, int type, Sfdouble_t *dp,Namfun_t *handle) char *cp=0; Namval_t node; union Value *up = np->nvalue.up; + int jmpval = 0; if(nq && !isblocked(bp,type)) { struct checkpt checkpoint; - int jmpval; int savexit = sh.savexit; Lex_t *lexp = (Lex_t*)sh.lex_context, savelex; /* disciplines like PS2 may run at parse time; save, reinit and restore the lexer state */ @@ -449,6 +451,8 @@ static char* lookup(Namval_t *np, int type, Sfdouble_t *dp,Namfun_t *handle) nq->nvalue.rp->running=0; _nv_unset(nq,0); } + if(jmpval >= SH_JMPSUB) + siglongjmp(*sh.jmplist,jmpval); return cp; } diff --git a/src/cmd/ksh93/sh/xec.c b/src/cmd/ksh93/sh/xec.c index da9245d47..4024dff60 100644 --- a/src/cmd/ksh93/sh/xec.c +++ b/src/cmd/ksh93/sh/xec.c @@ -3224,6 +3224,8 @@ int sh_fun(Namval_t *np, Namval_t *nq, char *argv[]) char *prefix = sh.prefix; int n=0; char *av[3]; + int jmpval = 0; + struct checkpt *buffp = (struct checkpt*)stkalloc(sh.stk,sizeof(struct checkpt)); Fcin_t save; fcsave(&save); if((offset=staktell())>0) @@ -3239,15 +3241,13 @@ int sh_fun(Namval_t *np, Namval_t *nq, char *argv[]) n++; if(nq) mode = set_instance(nq,&node, &nr); - if(is_abuiltin(np)) + sh_pushcontext(buffp,SH_JMPCMD); + jmpval = sigsetjmp(buffp->buff,1); + if(jmpval == 0) { - int jmpval; - struct checkpt *buffp = (struct checkpt*)stkalloc(sh.stk,sizeof(struct checkpt)); - Shbltin_t *bp = &sh.bltindata; - sh_pushcontext(buffp,SH_JMPCMD); - jmpval = sigsetjmp(buffp->buff,1); - if(jmpval == 0) + if(is_abuiltin(np)) { + Shbltin_t *bp = &sh.bltindata; bp->bnode = np; bp->ptr = nv_context(np); errorpush(&buffp->err,0); @@ -3257,18 +3257,18 @@ int sh_fun(Namval_t *np, Namval_t *nq, char *argv[]) sh.exitval = 0; sh.exitval = (funptr(np))(n,argv,bp); } - sh_popcontext(buffp); - if(jmpval>SH_JMPCMD) - siglongjmp(*sh.jmplist,jmpval); + else + sh_funct(np,n,argv,NULL,sh_isstate(SH_ERREXIT)); } - else - sh_funct(np,n,argv,NULL,sh_isstate(SH_ERREXIT)); + sh_popcontext(buffp); if(nq) unset_instance(nq, &node, &nr, mode); fcrestore(&save); if(offset>0) stakset(base,offset); sh.prefix = prefix; + if(jmpval > SH_JMPCMD) + siglongjmp(*sh.jmplist,jmpval); return sh.exitval; } ```

McDutchie commented 1 year ago

Well, it did look like it was fixed properly, but there are a few regression test failures with the latest version of the patch. I hope we can find a way to fix those, because the patch is definitely progress.

test comvar begins at 2023-04-03+00:03:23
    comvar.sh[243]: FAIL: should be b.x=456
test comvar failed at 2023-04-03+00:03:23 with exit code 1 [ 103 tests 1 error ]
test functions begins at 2023-04-03+00:03:28
    functions.sh[1480]: FAIL: funcname.ksh crash (direct run) (expected $'f\ng-f\nf-g-f\ng-f-g-f\nf-g-f-g-f\ng-f-g-f-g-f\nf-g-f-g-f-g-f\ng-f-g-f-g-f-g-f\nf-g-f-g-f-g-f-g-f\ng-f-g-f-g-f-g-f-g-f\nf-g-f-g-f-g-f-g-f-g-f\ng-f-g-f-g-f-g-f-g-f-\nf-g-f-g-f-g-f-g-f--\ng-f-g-f-g-f-g-f---\nf-g-f-g-f-g-f----\ng-f-g-f-g-f-----\nf-g-f-g-f------\ng-f-g-f-------\nf-g-f--------\ng-f---------\nf----------', got $'f\ni`,\x01\ne`,\x01\ni`,\x01\ne`,\x01\ni`,\x01\ne`,\x01\ni`,\x01\ne`,\x01\ni`,\x01\ng`,\x01\nk`,\x01\ng`,\x01\nk`,\x01\ng`,\x01\nk`,\x01\ng`,\x01\nk`,\x01\ng`,\x01\nk`,\x01\ng`,\x01')
    functions.sh[1505]: FAIL: funcname.ksh crash (dot) (expected $'f-\ng-f-\nf-g-f-\ng-f-g-f-\nf-g-f-g-f-\ng-f-g-f-g-f-\nf-g-f-g-f-g-f-\ng-f-g-f-g-f-g-f-\nf-g-f-g-f-g-f-g-f-\ng-f-g-f-g-f-g-f-g-f-\nf-g-f-g-f-g-f-g-f-g-f-\ng-f-g-f-g-f-g-f-g-f--\nf-g-f-g-f-g-f-g-f---\ng-f-g-f-g-f-g-f----\nf-g-f-g-f-g-f-----\ng-f-g-f-g-f------\nf-g-f-g-f-------\ng-f-g-f--------\nf-g-f---------\ng-f----------\nf-----------', got $'c`I\x01\ng`I\x01\nc`I\x01\ng`I\x01\nc`I\x01\ng`I\x01\nc`I\x01\ng`I\x01\nc`I\x01\ng`I\x01\nd`I\x01\nh`I\x01\nd`I\x01\nh`I\x01\nd`I\x01\nh`I\x01\nd`I\x01\nh`I\x01\nd`I\x01\nh`I\x01\nd`I\x01')
test functions failed at 2023-04-03+00:03:30 with exit code 2 [ 132 tests 2 errors ]

McDutchie commented 1 year ago

I found the cause of the regressions with the latest patch. It doesn't like it when the checkpoint buffer is allocated onto the AST stack with stkalloc when calling sh_funct -- some functions use code paths that also use the stack, evidently in an interfering way. When we allocate that buffer on the heap instead, those regressions disappear, and everything seems to be fixed this time! :-)

Three times is a charm?

```diff diff --git a/src/cmd/ksh93/sh/nvdisc.c b/src/cmd/ksh93/sh/nvdisc.c index 4fc45380f..c38eecd0e 100644 --- a/src/cmd/ksh93/sh/nvdisc.c +++ b/src/cmd/ksh93/sh/nvdisc.c @@ -245,6 +245,7 @@ static void assign(Namval_t *np,const char* val,int flags,Namfun_t *handle) Namval_t node; union Value *up = np->nvalue.up; Namval_t *tp, *nr; /* for 'typeset -T' types */ + int jmpval = 0; if(val && (tp=nv_type(np)) && (nr=nv_open(val,sh.var_tree,NV_VARNAME|NV_ARRAY|NV_NOADD|NV_NOFAIL)) && tp==nv_type(nr)) { char *sub = nv_getsub(np); @@ -280,7 +281,6 @@ static void assign(Namval_t *np,const char* val,int flags,Namfun_t *handle) if(nq && !isblocked(bp,type)) { struct checkpt checkpoint; - int jmpval; int savexit = sh.savexit; Lex_t *lexp = (Lex_t*)sh.lex_context, savelex; int bflag; @@ -370,6 +370,8 @@ done: nq->nvalue.rp->running=0; _nv_unset(nq,0); } + if(jmpval >= SH_JMPSUB) + siglongjmp(*sh.jmplist,jmpval); } /* @@ -384,10 +386,10 @@ static char* lookup(Namval_t *np, int type, Sfdouble_t *dp,Namfun_t *handle) char *cp=0; Namval_t node; union Value *up = np->nvalue.up; + int jmpval = 0; if(nq && !isblocked(bp,type)) { struct checkpt checkpoint; - int jmpval; int savexit = sh.savexit; Lex_t *lexp = (Lex_t*)sh.lex_context, savelex; /* disciplines like PS2 may run at parse time; save, reinit and restore the lexer state */ @@ -449,6 +451,8 @@ static char* lookup(Namval_t *np, int type, Sfdouble_t *dp,Namfun_t *handle) nq->nvalue.rp->running=0; _nv_unset(nq,0); } + if(jmpval >= SH_JMPSUB) + siglongjmp(*sh.jmplist,jmpval); return cp; } diff --git a/src/cmd/ksh93/sh/xec.c b/src/cmd/ksh93/sh/xec.c index da9245d47..e7fda8ff5 100644 --- a/src/cmd/ksh93/sh/xec.c +++ b/src/cmd/ksh93/sh/xec.c @@ -3224,6 +3224,10 @@ int sh_fun(Namval_t *np, Namval_t *nq, char *argv[]) char *prefix = sh.prefix; int n=0; char *av[3]; + int jmpval = 0; + /* allocate checkpoint on heap, not AST stack, to avoid + * possible conflicting use of the stack via sh_funct() */ + struct checkpt *buffp = sh_malloc(sizeof(struct checkpt)); Fcin_t save; fcsave(&save); if((offset=staktell())>0) @@ -3239,15 +3243,13 @@ int sh_fun(Namval_t *np, Namval_t *nq, char *argv[]) n++; if(nq) mode = set_instance(nq,&node, &nr); - if(is_abuiltin(np)) + sh_pushcontext(buffp,SH_JMPCMD); + jmpval = sigsetjmp(buffp->buff,1); + if(jmpval == 0) { - int jmpval; - struct checkpt *buffp = (struct checkpt*)stkalloc(sh.stk,sizeof(struct checkpt)); - Shbltin_t *bp = &sh.bltindata; - sh_pushcontext(buffp,SH_JMPCMD); - jmpval = sigsetjmp(buffp->buff,1); - if(jmpval == 0) + if(is_abuiltin(np)) { + Shbltin_t *bp = &sh.bltindata; bp->bnode = np; bp->ptr = nv_context(np); errorpush(&buffp->err,0); @@ -3257,18 +3259,19 @@ int sh_fun(Namval_t *np, Namval_t *nq, char *argv[]) sh.exitval = 0; sh.exitval = (funptr(np))(n,argv,bp); } - sh_popcontext(buffp); - if(jmpval>SH_JMPCMD) - siglongjmp(*sh.jmplist,jmpval); + else + sh_funct(np,n,argv,NULL,sh_isstate(SH_ERREXIT)); } - else - sh_funct(np,n,argv,NULL,sh_isstate(SH_ERREXIT)); + sh_popcontext(buffp); + free(buffp); if(nq) unset_instance(nq, &node, &nr, mode); fcrestore(&save); if(offset>0) stakset(base,offset); sh.prefix = prefix; + if(jmpval > SH_JMPCMD) + siglongjmp(*sh.jmplist,jmpval); return sh.exitval; } ```

McDutchie commented 1 year ago

So, please let me know if you can break the latest patch…

McDutchie commented 1 year ago

I found the cause of why my second-to-last patch didn't play well with the stack in sh_fun(). The stack state was being saved (base=stakfreeze(0);) after allocating the checkpoint buffer on the stack. It should be saved before, otherwise it is going to be restored wrong. My mistake.

The previous patch appears to work perfectly, but using the heap is less efficient than using the stack (as the stack allocates memory in larger chunks and is periodically auto-freed), so it's worth making the stack version work correctly.

Patch version four (edit: broken; removed)

McDutchie commented 1 year ago

Patch version four suddenly started crashing for me. Version three is still okay. Urgh. I'm going to sleep on it before debugging further.

McDutchie commented 1 year ago

Oh, found the cause already. I was being too smart. Patch version four contains

        sh_pushcontext(checkpoint = stakalloc(sizeof(struct checkpt)), SH_JMPCMD);

and that looks like it should work, but sh_pushcontext is a macro and not a function, and the first argument gets evaluated multiple times so that the checkpoint buffer gets allocated multiple times. Oops.

Patch version five

```diff diff --git a/src/cmd/ksh93/sh/nvdisc.c b/src/cmd/ksh93/sh/nvdisc.c index 4fc45380f..c38eecd0e 100644 --- a/src/cmd/ksh93/sh/nvdisc.c +++ b/src/cmd/ksh93/sh/nvdisc.c @@ -245,6 +245,7 @@ static void assign(Namval_t *np,const char* val,int flags,Namfun_t *handle) Namval_t node; union Value *up = np->nvalue.up; Namval_t *tp, *nr; /* for 'typeset -T' types */ + int jmpval = 0; if(val && (tp=nv_type(np)) && (nr=nv_open(val,sh.var_tree,NV_VARNAME|NV_ARRAY|NV_NOADD|NV_NOFAIL)) && tp==nv_type(nr)) { char *sub = nv_getsub(np); @@ -280,7 +281,6 @@ static void assign(Namval_t *np,const char* val,int flags,Namfun_t *handle) if(nq && !isblocked(bp,type)) { struct checkpt checkpoint; - int jmpval; int savexit = sh.savexit; Lex_t *lexp = (Lex_t*)sh.lex_context, savelex; int bflag; @@ -370,6 +370,8 @@ done: nq->nvalue.rp->running=0; _nv_unset(nq,0); } + if(jmpval >= SH_JMPSUB) + siglongjmp(*sh.jmplist,jmpval); } /* @@ -384,10 +386,10 @@ static char* lookup(Namval_t *np, int type, Sfdouble_t *dp,Namfun_t *handle) char *cp=0; Namval_t node; union Value *up = np->nvalue.up; + int jmpval = 0; if(nq && !isblocked(bp,type)) { struct checkpt checkpoint; - int jmpval; int savexit = sh.savexit; Lex_t *lexp = (Lex_t*)sh.lex_context, savelex; /* disciplines like PS2 may run at parse time; save, reinit and restore the lexer state */ @@ -449,6 +451,8 @@ static char* lookup(Namval_t *np, int type, Sfdouble_t *dp,Namfun_t *handle) nq->nvalue.rp->running=0; _nv_unset(nq,0); } + if(jmpval >= SH_JMPSUB) + siglongjmp(*sh.jmplist,jmpval); return cp; } diff --git a/src/cmd/ksh93/sh/xec.c b/src/cmd/ksh93/sh/xec.c index da9245d47..b93a4f1b6 100644 --- a/src/cmd/ksh93/sh/xec.c +++ b/src/cmd/ksh93/sh/xec.c @@ -3210,21 +3210,23 @@ static void sh_funct(Namval_t *np,int argn, char *argv[],struct argnod *envlist, } /* - * external interface to execute a function without arguments + * external interface to execute a function * is the function node * If is not-null, then sh.name and sh.subscript will be set */ int sh_fun(Namval_t *np, Namval_t *nq, char *argv[]) { - int offset = 0; - char *base; - Namval_t node; + struct checkpt *checkpoint; + int jmpval = 0; + int offset = 0; + char *base; + Namval_t node; struct Namref nr; long mode = 0; char *prefix = sh.prefix; - int n=0; - char *av[3]; - Fcin_t save; + int n=0; + char *av[3]; + Fcin_t save; fcsave(&save); if((offset=staktell())>0) base=stakfreeze(0); @@ -3239,36 +3241,35 @@ int sh_fun(Namval_t *np, Namval_t *nq, char *argv[]) n++; if(nq) mode = set_instance(nq,&node, &nr); - if(is_abuiltin(np)) + checkpoint = stakalloc(sizeof(struct checkpt)); + sh_pushcontext(checkpoint, SH_JMPCMD); + jmpval = sigsetjmp(checkpoint->buff,1); + if(jmpval == 0) { - int jmpval; - struct checkpt *buffp = (struct checkpt*)stkalloc(sh.stk,sizeof(struct checkpt)); - Shbltin_t *bp = &sh.bltindata; - sh_pushcontext(buffp,SH_JMPCMD); - jmpval = sigsetjmp(buffp->buff,1); - if(jmpval == 0) + if(is_abuiltin(np)) { + Shbltin_t *bp = &sh.bltindata; bp->bnode = np; bp->ptr = nv_context(np); - errorpush(&buffp->err,0); + errorpush(&checkpoint->err,0); error_info.id = argv[0]; opt_info.index = opt_info.offset = 0; opt_info.disc = 0; sh.exitval = 0; sh.exitval = (funptr(np))(n,argv,bp); } - sh_popcontext(buffp); - if(jmpval>SH_JMPCMD) - siglongjmp(*sh.jmplist,jmpval); + else + sh_funct(np,n,argv,NULL,sh_isstate(SH_ERREXIT)); } - else - sh_funct(np,n,argv,NULL,sh_isstate(SH_ERREXIT)); + sh_popcontext(checkpoint); if(nq) unset_instance(nq, &node, &nr, mode); fcrestore(&save); if(offset>0) stakset(base,offset); sh.prefix = prefix; + if(jmpval >= SH_JMPFUN) + siglongjmp(*sh.jmplist,jmpval); return sh.exitval; } ```

phidebian commented 1 year ago

Jeez you did all this while I was sleeping :) lemme review it, it may take a while...

Ho BTW how do you insert the git diff with highlighted text in a reply ? Thx in advance.

phidebian commented 1 year ago

Ha ok I ramp up my knowledge on the struct checkpt usage. So sh_fun()-->sh_fuct()->..->sh_funscope()->..->sh_fork() etc... can do a longjmp anytime (I cite the call path from memory so the ...) then no function in the path should do code like this

save something
call inner function  // longjmp into whatever callers may have setup
restore something // escaped restore due to longjmp

The inner function may escape the restore except if the restore things is stuffed into a 'struct chkptanpushcontext()is done, and the restore done by thepopcontext()/errorpop()`. But then it make code review hard to follow.

May be a fix could be to add a layer of save(np)/pushcontext/setjmp in sh_fun() and then catch longjmp/resore(np)/forward longjmp to prev layer of context... dunno if that make sense though :-) OR add a save(np)/restore(np) in any caller to sh_fun() there are not that much and remove the save/restore in sh_func()

Now that I look the fault.h I see the macro's sh_pushcontext() and sh_popcontext() it raise me a general question, why thoses things are not static inline void sh_pushcontext(...{...} is that because ksh can be built with 'anciant' compilers unable to inline ? even so, the API (a void return and 2 regs args) make the static function call pretty cheap, even on the perf path. To me thoses macro's are drama waiting to happen :-)

EDIT: I meant, not doing a complete refactoring, well may be one is underway for C99, C23 anyway, but I meant more fixing macro to static inline when make sense as we go fixing.

phidebian commented 1 year ago

patch-five runs like a champ :) you are a star!

McDutchie commented 1 year ago

Ho BTW how do you insert the git diff with highlighted text in a reply ?

Start with ```diff, end with ```. You can also start with ```sh, ```c, etc.

May be a fix could be to add a layer of save(np)/pushcontext/setjmp in sh_fun() and then catch longjmp/resore(np)/forward longjmp to prev layer of context... dunno if that make sense though :-)

Yeah, that is the effect of my patch for the offending sh_funct() call.

Now that I look the fault.h I see the macro's sh_pushcontext() and sh_popcontext() it raise me a general question, why thoses things are not static inline void sh_pushcontext(...{...}

The code base we forked was still trying to be compatible with both K&R C and C++ (but badly failing at it due to bit rot). We've abandoned that but still use C89/C90 as a lowest common denominator (see a34e83192b62493eb0e08874c855c386c1be6615, 1064933eec11591a6a7057a19b094209ed1f7859, 427ea547c72f668722fb30d3d89e7c155d6929c7, and other commits referencing those). So we can still build with gcc 2.95.3, for example. I don't want to modernise beyond that yet, as I like to occasionally test ksh on ancient and obscure systems; they sometimes expose bugs others just quietly tolerate...

Looks like the inline keyword was introduced in C99, so we cannot use it except subject to a feature test. We already use a couple of modern optimisation features (noreturn and __builtin_unreachable()) as we added iffe feature tests for them; see src/lib/libast/features/common. If inline is defined as an empty macro on compilers that don't have it, any functions declared as inline become regular functions.

Or I could just tweak the macros to avoid the multiple evaluations. I'm pretty sure this particular pair of macros is only ever called as if they were void functions, so we should be able to use { } instead of ( ) and declare a local variable to which to assign the first macro parameter, evaluating it just once.

To me thoses macro's are drama waiting to happen :-)

Welcome to this codebase, you'll find many other things like that (and there were many other such things you won't find as we've cleaned them up already) :P

Like, the multibyte character getter macro expansion mbchar(cp) will increase the char pointer cp by the byte length of the multibyte character found, though you can't tell from the call itself. That really threw me as I was getting to know the code.

Honestly though, it's not too bad as long as you make sure you stay aware of this issue.

I've learned to make intensive use of Exuberant ctags to immediately jump to the declarations of things and take a quick look at them before doing anything even a little smart. Highly recommended for getting to know the code. You do need an editor that supports ctags, but the major ones (vi, emacs) do and so does my own favourite, joe. I spent the first couple of years awkwardly grepping the code to find things (becoming quite adept at regexing this code…) and I wish I'd discovered ctags much sooner. To create/update the tags file, I use: ctags -R --if0 arch bin src (the --if0 takes advantage of the #if 0 dummy declarations in src/cmd/ksh93/bltins that are commented as being "for the dictionary generator").

patch-five runs like a champ :) you are a star!

Neat. Thanks for testing and analysing! You really helped, I'm not sure I'd have found the sh_fun() breakage without your analysis.

phidebian commented 1 year ago

Ha OK got it, the good old system testing, I used to do just that on my old projects, testing BE and perf on oldies, I had so slow HW at the time, I was tuning the perf like crazy, for the benefit of the latest/greatest HW.

I'll keep the gcc 2.95.3 in mind then.

I personally use cscope(1), an old habit, even though I am using emacs as editor... my .emacs is as old as my .kshrc (ksh88) :-)

Thanx for the ```lang MD thingy will play with that a bit :)

phidebian commented 1 year ago

Forgot to say, patch-five run like a champ on ubuntu 22.04 s390x as well.

hyenias commented 1 year ago

Tested successful as described in issue with patch 5. I also checked my memory fault notes and did not find anything else to add to the mix.

JohnoKing commented 1 year ago

FWIW, I have also tested patch 5 and haven't found any major breakage (i.e., no crashes), although I did encounter a compiler warning:

/home/johno/GitRepos/KornShell/ksh/src/cmd/ksh93/sh/xec.c: In function 'sh_fun':
/home/johno/GitRepos/KornShell/ksh/src/cmd/ksh93/sh/xec.c:3244:20: warning: assignment to 'struct checkpt *' from incompatible pointer type 'char *' [-Wincompatible-pointer-types]
 3244 |         checkpoint = stakalloc(sizeof(struct checkpt));
      |                    ^

This is fairly easy to fix (apply on top of patch 5):

--- a/src/cmd/ksh93/sh/xec.c
+++ b/src/cmd/ksh93/sh/xec.c
@@ -3241,7 +3241,7 @@ int sh_fun(Namval_t *np, Namval_t *nq, char *argv[])
        n++;
    if(nq)
        mode = set_instance(nq,&node, &nr);
-   checkpoint = stakalloc(sizeof(struct checkpt));
+   checkpoint = (struct checkpt*)stakalloc(sizeof(struct checkpt));
    sh_pushcontext(checkpoint, SH_JMPCMD);
    jmpval = sigsetjmp(checkpoint->buff,1);
    if(jmpval == 0)

McDutchie commented 1 year ago

Thanks, good point. I should really fix the stak/stk interface to use void pointers.

ksh93 / ksh

Crash when running `set` in a subshell/comsub with discipline function defined #616