Closed McDutchie closed 1 year ago
ASan stack trace (not that I can make head or tail of what the heck is going on):
$ arch/darwin.arm64-64/bin/ksh
$ . ./issue616.sh
=================================================================
==56612==ERROR: AddressSanitizer: heap-use-after-free on address 0x000106f03880 at pc 0x00010304eda0 bp 0x00016cf8ccd0 sp 0x00016cf8ccc8
READ of size 1 at 0x000106f03880 thread T0
#0 0x10304ed9c in sfvprintf sfvprintf.c:717
#1 0x103034af0 in sfprintf sfprintf.c:31
#2 0x102f26390 in nv_getval name.c:2663
#3 0x102e95b18 in print_namval typeset.c:1528
#4 0x102e9408c in print_scan typeset.c:1631
#5 0x102e9447c in b_set typeset.c:1267
#6 0x102f5e620 in sh_exec xec.c:1261
#7 0x102f51844 in sh_subshell subshell.c:651
#8 0x102f0bac0 in comsubst macro.c:2238
#9 0x102f0d624 in varsub macro.c:1203
#10 0x102f06f48 in copyto macro.c:623
#11 0x102f0591c in sh_mactrim macro.c:171
#12 0x102f1913c in nv_setlist name.c:277
#13 0x102f5ca58 in sh_exec xec.c:1059
#14 0x102f58830 in sh_exec xec.c:1944
#15 0x102f577bc in sh_eval xec.c:666
#16 0x102e7aaf4 in b_dot_cmd misc.c:317
#17 0x102f5e620 in sh_exec xec.c:1261
#18 0x102e9c410 in exfile main.c:608
#19 0x102e9a0a8 in sh_main main.c:371
#20 0x102e6f83c in main pmain.c:42
#21 0x10362d088 in start+0x204 (dyld:arm64e+0x5088)
0x000106f03880 is located 0 bytes inside of 869-byte region [0x000106f03880,0x000106f03be5)
freed by thread T0 here:
#0 0x103a4ade4 in wrap_free+0x98 (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x3ede4)
#1 0x102fd3990 in stkcopy stk.c:479
#2 0x102ea32dc in lookup nvdisc.c:426
#3 0x102e9ffb0 in lookups nvdisc.c:457
#4 0x102e9d970 in nv_getv nvdisc.c:59
#5 0x102f26764 in nv_getval name.c:2673
#6 0x102e95b18 in print_namval typeset.c:1528
#7 0x102e9408c in print_scan typeset.c:1631
#8 0x102e9447c in b_set typeset.c:1267
#9 0x102f5e620 in sh_exec xec.c:1261
#10 0x102f51844 in sh_subshell subshell.c:651
#11 0x102f0bac0 in comsubst macro.c:2238
#12 0x102f0d624 in varsub macro.c:1203
#13 0x102f06f48 in copyto macro.c:623
#14 0x102f0591c in sh_mactrim macro.c:171
#15 0x102f1913c in nv_setlist name.c:277
#16 0x102f5ca58 in sh_exec xec.c:1059
#17 0x102f58830 in sh_exec xec.c:1944
#18 0x102f577bc in sh_eval xec.c:666
#19 0x102e7aaf4 in b_dot_cmd misc.c:317
#20 0x102f5e620 in sh_exec xec.c:1261
#21 0x102e9c410 in exfile main.c:608
#22 0x102e9a0a8 in sh_main main.c:371
#23 0x102e6f83c in main pmain.c:42
#24 0x10362d088 in start+0x204 (dyld:arm64e+0x5088)
previously allocated by thread T0 here:
#0 0x103a4aca8 in wrap_malloc+0x94 (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x3eca8)
#1 0x102fd37ec in stkcopy stk.c:452
#2 0x102ea32dc in lookup nvdisc.c:426
#3 0x102e9ffb0 in lookups nvdisc.c:457
#4 0x102e9d970 in nv_getv nvdisc.c:59
#5 0x102f26764 in nv_getval name.c:2673
#6 0x102e95b18 in print_namval typeset.c:1528
#7 0x102e9408c in print_scan typeset.c:1631
#8 0x102e9447c in b_set typeset.c:1267
#9 0x102f5e620 in sh_exec xec.c:1261
#10 0x102f51844 in sh_subshell subshell.c:651
#11 0x102f0bac0 in comsubst macro.c:2238
#12 0x102f0d624 in varsub macro.c:1203
#13 0x102f06f48 in copyto macro.c:623
#14 0x102f0591c in sh_mactrim macro.c:171
#15 0x102f1913c in nv_setlist name.c:277
#16 0x102f5ca58 in sh_exec xec.c:1059
#17 0x102f58830 in sh_exec xec.c:1944
#18 0x102f577bc in sh_eval xec.c:666
#19 0x102e7aaf4 in b_dot_cmd misc.c:317
#20 0x102f5e620 in sh_exec xec.c:1261
#21 0x102e9c410 in exfile main.c:608
#22 0x102e9a0a8 in sh_main main.c:371
#23 0x102e6f83c in main pmain.c:42
#24 0x10362d088 in start+0x204 (dyld:arm64e+0x5088)
SUMMARY: AddressSanitizer: heap-use-after-free sfvprintf.c:717 in sfvprintf
Shadow bytes around the buggy address:
0x007020e006c0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x007020e006d0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x007020e006e0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x007020e006f0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x007020e00700: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x007020e00710:[fd]fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
0x007020e00720: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
0x007020e00730: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
0x007020e00740: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
0x007020e00750: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
0x007020e00760: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
==56612==ABORTING
Abort
That's... weird.
I put it in a .sh, run it, the subshell doesn't crash, v
gets set, all is happy
I add GIT_BRANCH.get to my shell startup, and v=$(set)
crashes the shell.
I have two other .get disciplines and one .set discipline in my normal environment, and v=$(set)
doesn't crash the shell. Just this .get discipline crashes it.
dev@754234d7 here.
If I run it ksh ./git_branch.ksh
, the shell doesn't crash. If I source it . ./git_branch.ksh
, the shell crashes.
My head hurts.
I have this in my normal environment
function _git_status.get
{
typeset branch commit return
branch=$(git branch 2> /dev/null | sed -e '/^[^*]/d' -e 's/* \(.*\)/[\1]/')
[[ -n "$branch" ]] && commit=$(git rev-parse --short HEAD 2> /dev/null) && return="${branch%]}@${commit}]"
.sh.value=${return}
}
and v=$(set)
doesn't crash.
The problem seems to be the use of the stack (the stkcopy() call in nvdisc.c:426). That may not be compatible with virtual subshells.
If we replace cp = stkcopy(stkstd,cp);
on that line by cp = sh_strdup(cp);
to use the heap instead, the crash disappears, but of course a memory leak is left in its place.
edit: No, it doesn't. It just becomes intermittent, at least on my system.
My head hurts.
These crashes are due to "undefined behaviour" which is notoriously unpredictable.
These crashes are due to "undefined behaviour" which is notoriously unpredictable.
But this discipline function crashes it and the two I normally have in there don't?
In my testing here so far, I only get the fault if git branch 2>/dev/null
actually returns something, i.e. you're actually in a repository.
Something weird I noticed. I replaced my _git_status.get
with your git_branch.get
(renaming the variable to _git_status
), and the presentation in the environment is different. If it's null, then you see something named .sh._git_status
in the environment, and if it's non-null, then you see _git_status
in the environment. With my original _git_status.get
, the variable in the environment is always named _git_status
, whether it's null or not.
Sure my head hurt too.
To me it all comes from macro.c::comsubst()
if(type==2 && sh.subshell && !sh.subshare)
sh_subfork(); /* subshares within virtual subshells are broken, so fork first */
sp = sh_subshell(t,sh_isstate(SH_ERREXIT),type);
Looks like comsubst() is a trouble zone, and in a trouble zone, trouble is found (happy Silicon Valley).
To fix this I need explanations from the gurus out there.
To me all this jazz can be streamlined to
function Z.get { .sh.value=${ /bin/echo yo ; } ; }
echo "$(set)"
macro.c::varsub()
/*
* This routine handles $param, ${param}, and ${param op word}
* The input stream is assumed to be a string
*/
Apparently we enter here with $(...)
too; maybe this is just a comment glitch, or maybe an archaeologist can figure out that $(...)
may have been added late...
macro.c::comsubst()
/*
* This routine handles command substitution
* <type> is 0 for older `...` version
* 1 for $(...) or 2 for ${ subshare; }
*/
static void comsubst(Mac_t *mp,Shnode_t* t, int type)
Jeez, we enter here with type=3 and no explanation anywhere of what 3 could mean.
But sure enough, at line 2236
or so, there is the infamous
if(type==2 && sh.subshell && !sh.subshare)
sh_subfork(); /* subshares within virtual subshells are broken, so fork first */
sp = sh_subshell(t,sh_isstate(SH_ERREXIT),type);
Since we enter with 3 we go the sh_subshell
way on the first comsubst()
i.e. $(set)
this in turn leads to print_scan()
which leads to print_namval()
which finds the value of $Z
with a ${...}
in it, i.e. ${ /bin/echo yo ; }. This is yet another comsubst(), which does not seem to be reentrant with regard to subshares, and then kaboom.
I made a 'patch' that is not a fix but a demonstrator of all this explanation.
above the
sh_subfork(); /* subshares within virtual subshells are broken, so fork first */
I add this
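if(t->tre.tretyp==0) /* simple command */
{ char *p=save.fcbuff;
/* find the start of the command substitution */
while(*p&&*p!='$')p++;
if(*p && p[1]=='(')p+=2;
/* skip blanks, then the word "set", then blanks again */
while(*p==' '||*p=='\t')p++;
if(p[0]=='s'&&p[1]=='e'&&p[2]=='t')p+=3;
while(*p==' '||*p=='\t')p++;
/* a bare $(set) without options: take the forking path */
if(*p==')')
{ type=2;
}
}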
I know it looks ugly, don't punch me on this, this is just a demonstrator kludge.
if(t->tre.tretyp==0)
means it is a COM; we then scan it for a
...=$(...set...[^-]) and if found we go the
type=2
way, i.e. fork() for good. If we got $(set -...) we go the
type=3
way as before, as options do not go through
print_scan().
We have to do it on the first comsubst() occurrence, not the second one, where it is too late: we are hosed already.
The patch https://github.com/phidebian/ksh/tree/bug-616
Enlighten me about this type=3
thing
Cheers,
I think the type==3 thing is a red herring.
It was introduced by backporting a couple of Red Hat patches, here: 970069a6feb71424f4a98b9f3005181eeaa1c448
I radically simplified the amazing mess of hacks that were the multiple ksh command substitutions in 42becab63cae760addae6e88008665be352bdbbd (note: the "type" referred to there is not the same as the type parameter passed to comsubst()
), 7c3868b8bdba9e276ddadf8e9df81f5872631f8b and bbcadf7020bc0246f5288b78fea270b30df4bda5.
Right now my head is hurting as well and I'm not clear that all the current comments about what the types are, are correct.
But I'm pretty sure that type==3
passed to comsubst()
is now treated exactly the same as type==1
and I simply missed that in my efforts to clean that mess up. The following patch gets rid of it and does not cause any regression test failures on my end. (It doesn't fix the present crashing bug either, though.)
diff --git a/src/cmd/ksh93/sh/macro.c b/src/cmd/ksh93/sh/macro.c
index 4513be322..11c37dd37 100644
--- a/src/cmd/ksh93/sh/macro.c
+++ b/src/cmd/ksh93/sh/macro.c
@@ -377,7 +377,7 @@ void sh_machere(Sfio_t *infile, Sfio_t *outfile, char *string)
break;
}
case S_PAR:
- comsubst(mp,NULL,3);
+ comsubst(mp,NULL,1);
break;
case S_EOF:
if((c=fcfill()) > 0)
@@ -1200,7 +1200,7 @@ retry1:
case S_PAR:
if(type)
goto nosub;
- comsubst(mp,NULL,3);
+ comsubst(mp,NULL,1);
return 1;
case S_DIG:
var = 0;
@@ -2229,7 +2229,6 @@ static void comsubst(Mac_t *mp,Shnode_t* t, int type)
char *cp = (char*)sh_malloc(IOBSIZE+1);
sp = sfnew(NULL,cp,IOBSIZE,fd,SF_READ|SF_MALLOC);
}
- type = 3;
}
else
{
Your demonstrator patch is a workaround that causes sh_subfork() to be called if the command in $(...) is set
. This workaround can be done in shell as well. If we add a ulimit
command to the command substitution in the last line in the reproducer, for example
v=$(ulimit -c 0; set)
then the bug goes away as any ulimit
invocation will force a virtual subshell to fork.
Perhaps we're forking at the wrong point in the code path for subshares within virtual subshells.
Actually, the forking of virtual subshells containing subshares is a red herring, too. If we remove it:
…then it crashes just the same.
It is indeed strange to do things like $(set)
where set
will list (print_scan) and access (print_namval) variables that are in the process of being modified (discipline), especially when print_scan()
reaches _
where it gets its value, and then _
is set up for the discipline function; it is a bit disorienting :) dunno if this is important or not.
You're right, this kind of patch simply avoids doing set in a subshell; it's a kludge to at least avoid this core dump.
I think the test about fork
vs subshell
is a problem. Theoretically speaking, on the first occurrence we don't know yet (besides the kludge sneaking into the fcbuff) whether going the subshell way will be fatal, and on the second occurrence (of comsubst) it is too late, so that's why I made this horrible thing as a heads-up, to focus on this part.
I didn't know about this RH thing, interesting. I like your historical analysis :-)
I've found the following patch to be an effective workaround for this bug:
diff --git a/src/cmd/ksh93/bltins/typeset.c b/src/cmd/ksh93/bltins/typeset.c
index b5b0330d5..544d46110 100644
--- a/src/cmd/ksh93/bltins/typeset.c
+++ b/src/cmd/ksh93/bltins/typeset.c
@@ -1432,6 +1432,15 @@ static int print_namval(Sfio_t *file,Namval_t *np,int flag, struct tdata *tp)
int indent=tp->indent, outname=0, isfun;
char tempexport=0;
sh_sigcheck();
+ /*
+ * Printing a name-value pair can cause a discipline shell function to be executed
+ * as the value is obtained via nv_getval(). This can cause messy interactions with
+ * further levels of virtual subshell or subshare that may cause the shell to crash
+ * in certain corner cases. To be safe, fork a virtual subshell early.
+ * https://github.com/ksh93/ksh/issues/616
+ */
+ if(sh.subshell && !sh.subshare)
+ sh_subfork();
if(flag)
flag = '\n';
if(tp->noref && nv_isref(np))
I'm away from my laptop right now, but it looks appealing; I will test it tonight. :-)
I think we can make the forking workaround more fine-grained than that. It is only necessary if we need to call a shell discipline function to obtain the value. It is not needed for internal C discipline functions for special variables, like IFS.
There is currently no way to check if a variable has a shell discipline function to get the value. What is needed is to check if the getval
or getnum
discipline function pointers are set to lookups
or lookupn
. Those are the C functions that handle the shell .get
and .getn
disciplines. But they are static functions local to nvdisc.c, so we cannot currently access those names in typeset.c to check for them.
The patch below renames these for legibility (to sh_disc_getstring() and sh_disc_getnum(), respectively) and turns them into externs. For consistency, assign() is similarly renamed to sh_disc_assign(); we might want to check for this somewhere in future.
We can then loop through each variable's linked list of disciplines (algorithm copied from nv_hasdisc() in nvdisc.c) and check if a shell .get
or .getn
discipline was set. In that case, we fork as a workaround.
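To make the list walk concrete, here is a minimal sketch in plain C. Everything in it is a hypothetical miniature: the struct layouts, toy_shell_get and toy_builtin_get are invented for illustration and are not the real ksh types or the renamed functions from the patch. It only shows the idea of comparing each discipline's getter pointer against the one handler that runs shell code.

```c
#include <stddef.h>

/* Hypothetical miniature types; the real ksh structures have more fields. */
struct var;
typedef const char *(*getval_fn)(struct var *);
struct disc { getval_fn getval; };
struct fun { const struct disc *disp; struct fun *next; };
struct var { const char *name; struct fun *funs; };

/* Stand-in for the handler that would run a shell .get function. */
const char *toy_shell_get(struct var *v) { (void)v; return "from shell"; }
/* Stand-in for an internal C discipline, as used by special variables. */
const char *toy_builtin_get(struct var *v) { (void)v; return "from C"; }

/* Walk the variable's discipline list and look for the shell handler. */
int has_shell_get(const struct var *v)
{
    const struct fun *fp;

    for (fp = v->funs; fp; fp = fp->next)
        if (fp->disp && fp->disp->getval == toy_shell_get)
            return 1;
    return 0;
}

/* A variable with a C discipline followed by a shell discipline. */
int demo_with_shell_get(void)
{
    static const struct disc builtin = { toy_builtin_get };
    static const struct disc shell = { toy_shell_get };
    static struct fun second = { &shell, NULL };
    static struct fun first = { &builtin, &second };
    struct var v;

    v.name = "GIT_BRANCH";
    v.funs = &first;
    return has_shell_get(&v);
}

/* A variable with only a C discipline: no early fork would be needed. */
int demo_without_shell_get(void)
{
    static const struct disc builtin = { toy_builtin_get };
    static struct fun only = { &builtin, NULL };
    struct var v;

    v.name = "IFS";
    v.funs = &only;
    return has_shell_get(&v);
}
```

In this sketch the check needs the handler's name to be visible at the call site, which is the reason given above for turning the static functions into externs.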
Of course, it would be even better if a way could be found to fix the damn heap-use-after-free...
Trying to write a good regression test. This crash is such a heisenbug, I have not found a way to reproduce it in a regression test yet. Seems like this crash depends on something in my .kshrc.
Minimal reproducer:
GIT_BRANCH.get()
{
.sh.value=${ echo foo; }
}
v=$(set)
On my end (with unpatched ksh):
$ ksh --rc issue616.sh
Memory fault
$ ksh --norc issue616.sh
(no crash)
OK, so the trigger seems to be that my .kshrc defines other discipline functions as well. Through systematic elimination I've found that we need at least two discipline functions to trigger the crash. New minimal reproducer:
RELATIVE_PWD.get()
{
.sh.value=foo;
}
GIT_BRANCH.get()
{
.sh.value=${ echo foo; }
}
v=$(set)
This one crashes (on my end) by simply running ksh issue616.sh
.
As expected, it also crashes with v=$(typeset -p)
instead of v=$(set)
.
I like your patch, will double-check it. The idea of 'mine' was to keep as much of the existing behavior as possible, i.e. limit the early fork
for now only to set
because I thought that maybe this fork avoidance was a kind of optimisation, so I didn't want to penalize all the other 'normal' cases.
OK, so the trigger seems to be that my .kshrc defines other discipline functions as well. Through systematic elimination I've found that we need at least two discipline functions to trigger the crash. New minimal reproducer:
RELATIVE_PWD.get()
{
.sh.value=foo;
}
GIT_BRANCH.get()
{
.sh.value=${ echo foo; }
}
v=$(set)
This one crashes (on my end) by simply running
ksh issue616.sh
.
I think you will be puzzled by this one :-) Still on prod ksh, --rc vs --norc seems not relevant (with my rc, which maybe is not as complicated as yours :-) so I keep --norc here to get comparable results.
Note that in my examples I use /bin/echo
; this was to simulate a real fork/exec like the git
initial example, but in the minimal reproducer this is not necessary.
$ ksh --norc -c 'function Z.get { .sh.value=${ echo yo ; } ; } ; b=$(set)'
Bus error
$ ksh --norc -c 'function z.get { .sh.value=${ echo yo ; } ; } ; b=$(set)'
The latter doesn't crash :-) Well, not quite: it crashes on the second instance :-)
$ ksh --norc -c 'function z.get { .sh.value=${ echo yo ; } ; } ; b=$(set);b=$(set)' # <-- Double call here
Bus error
To me this has to do with the variable name Z
vs z
: one sorts before _
and the other after. I didn't investigate that part yet, but I fear 'intuitively' that something bad may happen around the $_
variable. I've got no proof of that, so maybe this is not related at all.
But in any case
As expected, it also crashes with
v=$(typeset -p)
instead of v=$(set)
.
Yes, makes sense: any caller of print_scan()
in a virtual subshell will :)
Using the patch along with my normal startup, I get this now:
[846] mbp13 $ arch/darwin.arm64-64/bin/ksh
[847] mbp13 $ v=$(set)
arch/darwin.arm64-64/bin/ksh: set: warning: [DEBUG] forking at DATE
[848] mbp13 $
But I don't get the memory fault.
DATE
is defined in my startup as
function DATE.get { .sh.value=$(date +%D) ; }
The [DEBUG]
line is temporary, it confirms it's doing what it's supposed to. I'll delete it before committing the fix to git. To get rid of it, delete the corresponding line from typeset.c and rebuild.
Well, dunno why I didn't think of this earlier. When I discovered that the name of the variable did matter, i.e. Z
sorts above _
and z
below, I feared that the interaction of disciplines messing with _
and print_namval
accessing _
could be problematic.
I decided to make a brutal kludge, i.e. skipping over _
in print_scan
(src/cmd/ksh93/bltins/typeset.c)
if((np=nv_search(*argv++,root,0)) && np!=onp && (!nv_isnull(np) || np->nvfun || nv_isattr(np,~NV_NOFREE)))
{
onp = np;
/* kludge start */
if( (np->nvname[0]=='_')&&(np->nvname[1]==0) )
{ continue;
}
/* kludge end */
And to my great surprise, no more core dump, and no regression in QA, even though now b=$(set)
misses the _
variable. IMHO this is not a big loss and it is legitimate to leave it out, as we leave out .sh
etc... the value of _
at set
time will be overwritten right after b=$(set)
as _=$(set)
Anyway this means that maybe all the things with fork() are still good and only the _
access in print_scan
is not good. I noticed that when not doing the kludge, i.e. in the actual prod path, the np
pointer indeed got corrupted data, leading to the crash dump.
Is that a trail to follow?
EDIT:
Well, now I get no core dump, but a later echo $_
does core dump, so _
is corrupted anyway; not printing it in print_scan() simply defers the crash. So maybe fork() is back :)
Very interesting trail that, I will pursue it. I do disagree that it is legitimate to skip _
in the output of set
.
BTW, for future reference, something like
if( (np->nvname[0]=='_')&&(np->nvname[1]==0) )
should be written as
if(np==L_ARGNOD)
See src/cmd/ksh93/include/variables.h and src/cmd/ksh93/data/variables.c.
Cool learned something :)
So, printing out name-value pairs with set
or typeset -p
doesn't really have anything to do with this bug -- it was just a trigger. As you've found, corruption of $_
a.k.a. L_ARGNOD
is the real bug. That corruption occurs in the main shell environment when these two discipline functions are executed in a virtual subshell, in this order:
foo.get()
{
.sh.value=foo;
}
bar.get()
{
.sh.value=${ echo foo; }
}
(
: $bar
: $foo
)
echo $_
lldb backtrace with the above reproducer:
$ lldb -o run -- arch/darwin.arm64-64/bin/ksh issue616.sh
(lldb) target create "arch/darwin.arm64-64/bin/ksh"
Current executable set to '/usr/local/src/ksh93/ksh/arch/darwin.arm64-64/bin/ksh' (arm64).
(lldb) settings set -- target.run-args "issue616.sh"
(lldb) run
Process 89854 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x30)
frame #0: 0x000000010006de20 ksh`nv_create(name="_", root=0x854a80010006cf30, flags=1441792, dp=0x000000016fdfe1c8) at name.c:906:17
903 if(n)
904 {
905 ap = nv_arrayptr(np);
-> 906 ap->nelem = dim;
907 nv_putsub(np,NULL,n);
908 }
909 else
Target 0: (ksh) stopped.
Process 89854 launched: '/usr/local/src/ksh93/ksh/arch/darwin.arm64-64/bin/ksh' (arm64)
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x30)
* frame #0: 0x000000010006de20 ksh`nv_create(name="_", root=0x854a80010006cf30, flags=1441792, dp=0x000000016fdfe1c8) at name.c:906:17
frame #1: 0x000000010006bab4 ksh`nv_open(name="_", root=0x0000600002102990, flags=1441792) at name.c:1401:7
frame #2: 0x0000000100062dc0 ksh`varsub(mp=0x00006000017001c0) at macro.c:1359:9
frame #3: 0x000000010005ed64 ksh`copyto(mp=0x00006000017001c0, endch=0, newquote=0) at macro.c:623:21
frame #4: 0x0000000100060220 ksh`sh_macexpand(argp=0x000000010080f078, arghead=0x000000016fdfea30, flag=0) at macro.c:235:2
frame #5: 0x000000010002f9d4 ksh`arg_expand(argp=0x000000010080f078, argchain=0x000000016fdfea30, flag=0) at args.c:818:11
frame #6: 0x000000010002f688 ksh`sh_argbuild(nargs=0x000000016fdff0ac, comptr=0x000000010080f038, flag=0) at args.c:658:9
frame #7: 0x0000000100090e34 ksh`sh_exec(t=0x000000010080f038, flags=5) at xec.c:928:10
frame #8: 0x0000000100020480 ksh`exfile(iop=0x0000600003500370, fno=10) at main.c:608:4
frame #9: 0x000000010001f58c ksh`sh_main(ac=2, av=0x000000016fdff6f0, userinit=0x0000000000000000) at main.c:371:2
frame #10: 0x000000010000465c ksh`main(argc=2, argv=0x000000016fdff6f0) at pmain.c:42:9
frame #11: 0x00000001001ed08c dyld`start + 520
Relevant information from sh.1:
_ Initially, the value of _ is an absolute pathname of the
shell or script being executed as passed in the
environment. Subsequently it is assigned the last argument
of the previous command. This parameter is not set for
commands which are asynchronous. This parameter is also
used to hold the name of the matching MAIL file when
checking for mail. While defining a compound variable or a
type, _ is initialized as a reference to the compound
variable or type. When a discipline function is invoked, _
is initialized as a reference to the variable associated
with the call to this function. Finally when _ is used as
the name of the first variable of a type definition, the
new type is derived from the type of the first variable.
(See Type Variables below.)
"When a discipline function is invoked, _
is initialized as a reference to the variable associated with the call to this function." This happens in set_instance
and unset_instance
in xec.c.
I updated and tested my stash of ksh 93u+m commit builds.
Commit 88a1f3d661f35282a1552abe74ea70ee24cd4aa9 broke it: as of that, $_
outputs an empty value in the reproducer.
Commit 430e47813cacf7366ea4ce03d170b254447c331c introduced the crash.
Ha, you beat me on this one. I learned how to monitor L_ARGNOD, thank you for that, and I discovered this.
print_scan() loops until np->nvname=="Z"; at this point L_ARGNOD is OK
print_namval(np,...)
nv_getval(np)
nv_getv(np)
lookups(np)
lookup(np...)
==== Here may be the beginning of problems... ====
>407 block(bp,type);
408 block(bp, UNASSIGN); /* make sure nv_setdisc doesn't invalidate
409 sh_pushcontext(&checkpoint, 1);
410 jmpval = sigsetjmp(checkpoint.buff, 0); <==== setjmp No 1 here
411 if(!jmpval)
412 sh_fun(nq,np,NULL); <==== will invoke Z.get()
413 sh_popcontext(&checkpoint);
lookup(np)
sh_fun(np)
==== Second part of the problem 'may be' =====
>3240 if(nq)
3241 mode = set_instance(nq,&node, &nr); <==== Save L_ARGNOD into &node
3242 if(is_abuiltin(np))
3243 {
......
3266 }
3267 else
3268 sh_funct(np,n,argv,NULL,sh_isstate(SH_ERREXIT)); <=== Here problem with sh_function do longjmp
3269 if(nq) +=== to setjmp No1 so we never restore L_ARGNOD
3270 unset_instance(nq, &node, &nr, mode); <==== restore L_ARGNOD NEVER reached if the above longjmp
3271 fcrestore(&save);
I am not fluent enough with the setjmp/longjmp and pushcontext/popcontext machinery, I dunno how they are linked together etc... but I think that if you manage to set a setjmp/longjmp point so that on error in sh_funct() we longjmp back to the restore point for L_ARGNOD, you would be safe, with no need to update set_instance() and unset_instance() unless they are bugged.
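For readers following along, the skipped restore and the suggested "catch, restore, forward" shape can be sketched in standalone C. This is only an illustration under assumptions: current_ref plays the role of $_, run_body plays the role of the function body, and none of the names below are real ksh symbols. Plain setjmp/longjmp is used instead of the sig variants to keep it short.

```c
#include <setjmp.h>
#include <string.h>

/* Hypothetical stand-ins; none of these are real ksh symbols. */
static const char *current_ref = "main"; /* plays the role of $_ */
static jmp_buf *active;                  /* innermost jump target */

/* Plays the role of the function body: it may bail out with a longjmp. */
static void run_body(int bail)
{
    if (bail)
        longjmp(*active, 1);
}

/* Fragile shape: save, call, restore. The restore is skipped on a jump. */
void call_unsafe(int bail)
{
    const char *saved = current_ref;

    current_ref = "discipline";
    run_body(bail);
    current_ref = saved; /* not reached when run_body() jumps */
}

/* Sturdier shape: catch the jump here, restore, then pass it on. */
void call_safe(int bail)
{
    const char *saved = current_ref;
    jmp_buf *outer = active;
    jmp_buf here;
    volatile int jumped = 0;

    current_ref = "discipline";
    active = &here;
    if (setjmp(here) == 0)
        run_body(bail);
    else
        jumped = 1;
    active = outer;         /* restore everything first... */
    current_ref = saved;
    if (jumped)
        longjmp(*outer, 1); /* ...and only then forward the jump */
}

/* Returns 1 if current_ref is back to "main" after a bailing call. */
int restored_after(void (*call)(int))
{
    jmp_buf top;

    current_ref = "main";
    active = &top;
    if (setjmp(top) == 0)
        call(1);
    return strcmp(current_ref, "main") == 0;
}
```

With call_unsafe the outer layer regains control while current_ref still points at the discipline's variable; with call_safe it is back to its old value before the outer layer sees the jump.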
Progress: the following patch fixes the crash introduced in 430e47813cacf7366ea4ce03d170b254447c331c on my end, leaving only the earlier breakage introduced in 88a1f3d661f35282a1552abe74ea70ee24cd4aa9.
Learning that the set_instance
and unset_instance
functions were the direct cause of the crash made me look into how and where they are called. And I found an obvious problem in sh_fun
, the main shell function execution function. It calls set_instance
, but may siglongjmp
before calling unset_instance
. There is other state-restoring code that may fail to run as well, so who knows what other corner-case bugs that may have caused. Clearly, sh_fun
should not siglongjmp
before fully restoring state. This problem has been present all along; we didn't introduce it.
But that still didn't fix the crash for me. The crash turns out to be caused by an omission in our own robustification of discipline functions (see 430e47813cacf7366ea4ce03d170b254447c331c, 2322f939429ae002f92ea333a3bb6b149aad1431). We added a sigsetjmp
to the functions responsible for executing shell discipline functions, assign
and lookup
, but failed to make them actually siglongjmp
when necessary – and this is absolutely necessary when the sh_subfork
call added in 430e47813cacf7366ea4ce03d170b254447c331c causes the parent to siglongjmp
! So, this patch adds the missing siglongjmp
calls. Since sh_subfork
siglongjmp
s with SH_JMPSUB
, check for jmpval >= SH_JMPSUB
. And, as with the sh_fun
fix, we make sure to restore all the state first.
Again, after this patch, $_
is still broken after running the minimal reproducer, but the crash is fixed.
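The "swallowed jump" problem described in this comment can be shown with a small standalone C sketch. The jump codes and their values are invented (JMP_SUB merely stands in for SH_JMPSUB), and middle() is a hypothetical layer, not the real lookup() or assign(). It has its own jump target and either keeps every jump to itself or forwards the ones at or above the threshold.

```c
#include <setjmp.h>

/* Invented jump codes; JMP_SUB merely stands in for SH_JMPSUB. */
enum { JMP_ERR = 1, JMP_SUB = 2 };

static jmp_buf *active;  /* innermost jump target */
static int pending_code; /* code recorded by whoever jumps */

static void raise_jump(int code)
{
    pending_code = code;
    longjmp(*active, 1);
}

/*
 * Middle layer with its own jump target. If 'forward' is set, codes at
 * or above JMP_SUB are passed on to the outer target once local state
 * has been restored. Returns the code it handled locally (0 if none).
 */
int middle(int code, int forward)
{
    jmp_buf here;
    jmp_buf *outer = active;
    volatile int caught = 0;

    active = &here;
    if (setjmp(here) == 0)
    {
        if (code)
            raise_jump(code);
    }
    else
        caught = pending_code;
    active = outer;         /* restore state first */
    if (forward && caught >= JMP_SUB)
        longjmp(*outer, 1); /* the caller has to unwind as well */
    return caught;
}

/* Returns 1 if the outermost layer saw the jump, 0 if it was swallowed. */
int outer_saw_jump(int code, int forward)
{
    jmp_buf top;

    active = &top;
    pending_code = 0;
    if (setjmp(top) == 0)
    {
        (void)middle(code, forward);
        return 0;
    }
    return 1;
}
```

Without forwarding, a JMP_SUB raised below middle() never reaches the outer layer, which then carries on as if nothing happened. With forwarding, ordinary errors below the threshold are still handled locally.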
......
3266 }
3267 else
3268 sh_funct(np,n,argv,NULL,sh_isstate(SH_ERREXIT)); <=== Here problem with sh_function do longjmp
3269 if(nq) +=== to setjmp No1 so we never restore L_ARGNOD
3270 unset_instance(nq, &node, &nr, mode); <==== restore L_ARGNOD NEVER reached if the above longjmp
3271 fcrestore(&save);
Great analysis there, thank you! You've made me realise my patch above still did not fix all of the problem. The sigsetjmp/siglongjmp logic in sh_fun() is still wrong. That sh_funct() call should be fully included in it.
Well, it did look like it was fixed properly, but there are a few regression test failures with the latest version of the patch.
test comvar begins at 2023-04-03+00:03:23
comvar.sh[243]: FAIL: should be b.x=456
test comvar failed at 2023-04-03+00:03:23 with exit code 1 [ 103 tests 1 error ]
test functions begins at 2023-04-03+00:03:28
functions.sh[1480]: FAIL: funcname.ksh crash (direct run) (expected $'f\ng-f\nf-g-f\ng-f-g-f\nf-g-f-g-f\ng-f-g-f-g-f\nf-g-f-g-f-g-f\ng-f-g-f-g-f-g-f\nf-g-f-g-f-g-f-g-f\ng-f-g-f-g-f-g-f-g-f\nf-g-f-g-f-g-f-g-f-g-f\ng-f-g-f-g-f-g-f-g-f-\nf-g-f-g-f-g-f-g-f--\ng-f-g-f-g-f-g-f---\nf-g-f-g-f-g-f----\ng-f-g-f-g-f-----\nf-g-f-g-f------\ng-f-g-f-------\nf-g-f--------\ng-f---------\nf----------', got $'f\ni`,\x01\ne`,\x01\ni`,\x01\ne`,\x01\ni`,\x01\ne`,\x01\ni`,\x01\ne`,\x01\ni`,\x01\ng`,\x01\nk`,\x01\ng`,\x01\nk`,\x01\ng`,\x01\nk`,\x01\ng`,\x01\nk`,\x01\ng`,\x01\nk`,\x01\ng`,\x01')
functions.sh[1505]: FAIL: funcname.ksh crash (dot) (expected $'f-\ng-f-\nf-g-f-\ng-f-g-f-\nf-g-f-g-f-\ng-f-g-f-g-f-\nf-g-f-g-f-g-f-\ng-f-g-f-g-f-g-f-\nf-g-f-g-f-g-f-g-f-\ng-f-g-f-g-f-g-f-g-f-\nf-g-f-g-f-g-f-g-f-g-f-\ng-f-g-f-g-f-g-f-g-f--\nf-g-f-g-f-g-f-g-f---\ng-f-g-f-g-f-g-f----\nf-g-f-g-f-g-f-----\ng-f-g-f-g-f------\nf-g-f-g-f-------\ng-f-g-f--------\nf-g-f---------\ng-f----------\nf-----------', got $'c`I\x01\ng`I\x01\nc`I\x01\ng`I\x01\nc`I\x01\ng`I\x01\nc`I\x01\ng`I\x01\nc`I\x01\ng`I\x01\nd`I\x01\nh`I\x01\nd`I\x01\nh`I\x01\nd`I\x01\nh`I\x01\nd`I\x01\nh`I\x01\nd`I\x01\nh`I\x01\nd`I\x01')
test functions failed at 2023-04-03+00:03:30 with exit code 2 [ 132 tests 2 errors ]
I found the cause of the regressions with the latest patch. It doesn't like it when the checkpoint buffer is allocated onto the AST stack with stkalloc
when calling sh_funct
-- some functions use code paths that also use the stack, evidently in an interfering way. When we allocate that buffer on the heap instead, those regressions disappear, and everything seems to be fixed this time! :-)
So, please let me know if you can break the latest patch…
I found the cause of why my second-to-last patch didn't play well with the stack in sh_fun(). The stack state was being saved (base=stakfreeze(0);
) after allocating the checkpoint buffer on the stack. It should be saved before, otherwise it is going to be restored wrong. My mistake.
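One way to picture why the order matters is a toy bump allocator with a mark/release pair. This is a sketch under heavy assumptions: the arena below is hypothetical and much simpler than the AST stack, whose real semantics differ. It only shows that a mark taken after an allocation cannot cover that allocation when the state is restored.

```c
#include <stddef.h>

/* Toy bump allocator: a hypothetical stand-in, not the AST stack API. */
static char arena[256];
static size_t top;

static size_t arena_mark(void) { return top; }
static void arena_release(size_t mark) { top = mark; }
static void *arena_alloc(size_t n)
{
    void *p;

    if (n > sizeof(arena) - top)
        return NULL;
    p = &arena[top];
    top += n;
    return p;
}

/* Bytes still in use afterwards when the mark is taken too late. */
size_t left_over_late_mark(void)
{
    size_t start = top;
    void *checkpoint = arena_alloc(64); /* buffer allocated first... */
    size_t mark = arena_mark();         /* ...state saved afterwards */

    (void)checkpoint;
    (void)arena_alloc(32);              /* scratch space used by the callee */
    arena_release(mark);                /* rewinds to just after the buffer */
    return top - start;
}

/* Bytes still in use afterwards when the mark is taken first. */
size_t left_over_early_mark(void)
{
    size_t start = top;
    size_t mark = arena_mark();         /* state saved first */
    void *checkpoint = arena_alloc(64);

    (void)checkpoint;
    (void)arena_alloc(32);
    arena_release(mark);                /* rewinds past the buffer as well */
    return top - start;
}
```

In the toy version the late mark leaves the 64-byte buffer stranded on the arena, while the early mark returns the arena to where it started.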
The previous patch appears to work perfectly, but using the heap is less efficient than using the stack (as the stack allocates memory in larger chunks and is periodically auto-freed), so it's worth making the stack version work correctly.
Patch version four (edit: broken; removed)
Patch version four suddenly started crashing for me. Version three is still okay. Urgh. I'm going to sleep on it before debugging further.
Oh, found the cause already. I was being too smart. Patch version four contains
sh_pushcontext(checkpoint = stakalloc(sizeof(struct checkpt)), SH_JMPCMD);
and that looks like it should work, but sh_pushcontext
is a macro and not a function, and the first argument gets evaluated multiple times so that the checkpoint buffer gets allocated multiple times. Oops.
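The multiple-evaluation trap is easy to reproduce outside ksh. In the sketch below, PUSH_CTX, struct ctx and new_ctx are hypothetical miniatures invented for illustration, and the macro is only loosely shaped like a context push. What matters is that it mentions its first parameter more than once.

```c
/* Hypothetical miniature of a context chain; not the real struct checkpt. */
struct ctx { struct ctx *prev; int mode; };

static struct ctx *chain;  /* head of the context chain */
static struct ctx pool[8]; /* backing store for the toy allocator */
static int alloc_count;

/* Toy allocator that counts how often it is called. */
static struct ctx *new_ctx(void)
{
    return &pool[alloc_count++ % 8];
}

/* Expression macro that mentions its first parameter three times. */
#define PUSH_CTX(bp, m) ((bp)->prev = chain, (bp)->mode = (m), chain = (bp))

/* Returns how many buffers one "push" ended up allocating. */
int allocations_for_one_push(void)
{
    struct ctx *cp;

    alloc_count = 0;
    PUSH_CTX(cp = new_ctx(), 1);
    (void)cp;
    return alloc_count;
}
```

Here one push allocates three buffers, and the prev and mode fields are written to different buffers than the one that ends up at the head of the chain.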
Jeez, you did all this while I was sleeping :) lemme review it, it may take a while...
Oh, BTW, how do you insert a git diff with highlighted text in a reply? Thanks in advance.
Ha ok, I ramped up my knowledge of the struct checkpt
usage. So sh_fun()->sh_funct()->..->sh_funscope()->..->sh_fork() etc... can do a longjmp at any time (I cite the call path from memory, hence the ...
), so no function in the path should have code like this
save something
call inner function // longjmp into whatever callers may have setup
restore something // escaped restore due to longjmp
The inner function may escape the restore, unless the thing to restore is stuffed into a struct checkpt,
a pushcontext() is done, and the restore is done by the
popcontext()/errorpop(). But then it makes the code review hard to follow.
Maybe a fix could be to add a layer of save(np)/pushcontext/setjmp in sh_fun() and then catch the longjmp, restore(np), and forward the longjmp to the previous layer of context... dunno if that makes sense though :-) OR add a save(np)/restore(np) in every caller of sh_fun() (there are not that many) and remove the save/restore in sh_fun()
Now that I look at fault.h
I see the macros sh_pushcontext()
and sh_popcontext()
and it raises a general question: why are those things not static inline void sh_pushcontext(...{...}
? Is that because ksh can be built with 'ancient' compilers unable to inline? Even so, the API (a void return and 2 register args) makes a static function call pretty cheap, even on the perf path. To me those macros are drama waiting to happen :-)
EDIT: I mean not doing a complete refactoring (well, maybe one is underway for C99 or C23 anyway), but rather turning macros into static inline functions where it makes sense as we go fixing.
patch-five runs like a champ :) you are a star!
Oh, BTW, how do you insert a git diff with highlighted text in a reply?
Start with ```diff
, end with ```
. You can also start with ```sh
, ```c
, etc.
Maybe a fix could be to add a layer of save(np)/pushcontext/setjmp in sh_fun() and then catch the longjmp, restore(np), and forward the longjmp to the previous layer of context... dunno if that makes sense though :-)
Yeah, that is the effect of my patch for the offending sh_funct() call.
Now that I look at fault.h I see the macros sh_pushcontext() and sh_popcontext(), it raises a general question: why are those things not static inline void sh_pushcontext(...){...}
The code base we forked was still trying to be compatible with both K&R C and C++ (but badly failing at it due to bit rot). We've abandoned that but still use C89/C90 as a lowest common denominator (see a34e83192b62493eb0e08874c855c386c1be6615, 1064933eec11591a6a7057a19b094209ed1f7859, 427ea547c72f668722fb30d3d89e7c155d6929c7, and other commits referencing those). So we can still build with gcc 2.95.3, for example. I don't want to modernise beyond that yet, as I like to occasionally test ksh on ancient and obscure systems; they sometimes expose bugs others just quietly tolerate...
Looks like the inline keyword was introduced in C99, so we cannot use it except subject to a feature test. We already use a couple of modern optimisation features (noreturn and __builtin_unreachable()) as we added iffe feature tests for them; see src/lib/libast/features/common. If inline is defined as an empty macro on compilers that don't have it, any functions declared as inline become regular functions.
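A minimal sketch of that fallback, assuming a feature-test result named HAVE_INLINE_DEMO (a made-up symbol for illustration, not something iffe is known to emit) with the C standard version as a second check:

```c
/* HAVE_INLINE_DEMO is a hypothetical feature-test result.
 * If neither it nor C99 is available, make the keyword vanish. */
#if !defined(HAVE_INLINE_DEMO) && \
    (!defined(__STDC_VERSION__) || __STDC_VERSION__ < 199901L)
#define inline /* empty: functions below become ordinary static functions */
#endif

/* Hypothetical helper; compiles with or without a real inline keyword. */
static inline int add_one_demo(int n)
{
    return n + 1;
}
```

On a compiler without the keyword, add_one_demo() is simply a regular static function, so the code stays valid C89 at the cost of a normal function call.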
Or I could just tweak the macros to avoid the multiple evaluations. I'm pretty sure this particular pair of macros is only ever called as if they were void functions, so we should be able to use { } instead of ( ) and declare a local variable to which to assign the first macro parameter, evaluating it just once.
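To illustrate the multiple-evaluation issue and the { } variant, here is a sketch using a made-up context stack. The struct and macro names are hypothetical and are not the real definitions from fault.h.

```c
/* Hypothetical context record and stack head, for illustration only. */
struct ctx_demo { struct ctx_demo *prev; int mode; };
static struct ctx_demo *top_demo;

/* Expression-style macro: the first argument is evaluated three times. */
#define PUSH_EXPR_DEMO(cp, m) \
    ((cp)->mode = (m), (cp)->prev = top_demo, top_demo = (cp))

/* Block-style macro: the first argument is evaluated once into a local. */
#define PUSH_BLOCK_DEMO(cp, m) \
    { struct ctx_demo *cp_ = (cp); \
      cp_->mode = (m); cp_->prev = top_demo; top_demo = cp_; }

/* An argument with a side effect: hands out a new slot on every call. */
static struct ctx_demo pool_demo[8];
static int calls_demo;
static struct ctx_demo *next_slot_demo(void)
{
    return &pool_demo[calls_demo++];
}

/* Each helper reports how many times the macro evaluated its argument. */
static int count_evals_expr_demo(void)
{
    calls_demo = 0;
    top_demo = 0;
    PUSH_EXPR_DEMO(next_slot_demo(), 1);
    return calls_demo;
}

static int count_evals_block_demo(void)
{
    calls_demo = 0;
    top_demo = 0;
    PUSH_BLOCK_DEMO(next_slot_demo(), 1);
    return calls_demo;
}
```

One caveat with a bare { } body: writing the macro call followed by a semicolon directly before an else breaks the if/else pairing, which is why a do { } while(0) wrapper is a common alternative when the macro is used as a statement.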
To me those macros are drama waiting to happen :-)
Welcome to this codebase, you'll find many other things like that (and there were many other such things you won't find as we've cleaned them up already) :P
Like, the multibyte character getter macro expansion mbchar(cp) will increase the char pointer cp by the byte length of the multibyte character found, though you can't tell from the call itself. That really threw me as I was getting to know the code.
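The kind of hidden side effect described here can be shown with a toy macro. This is not the real mbchar() definition: the names below are hypothetical, and the length logic is a simplified UTF-8 lead-byte check with no validation.

```c
/* Toy: byte length of a UTF-8 sequence judged from its lead byte only. */
static int u8len_demo(unsigned char c)
{
    if (c < 0x80)
        return 1;
    if ((c & 0xE0) == 0xC0)
        return 2;
    if ((c & 0xF0) == 0xE0)
        return 3;
    if ((c & 0xF8) == 0xF0)
        return 4;
    return 1;                       /* invalid lead byte: skip one byte */
}

/* Returns the lead byte and advances the caller's pointer past the character. */
static int getlead_demo(const char **pp)
{
    int lead = (unsigned char)**pp;

    *pp += u8len_demo((unsigned char)lead);
    return lead;
}

/* The macro hides the address-of, so the call looks like pass-by-value. */
#define GETLEAD_DEMO(p) getlead_demo(&(p))

/* Reports how far a single GETLEAD_DEMO(cp) call moved cp. */
static int advance_demo(const char *s)
{
    const char *cp = s;

    (void)GETLEAD_DEMO(cp);         /* nothing at the call site hints that cp changes */
    return (int)(cp - s);
}
```

Reading GETLEAD_DEMO(cp) at a call site gives no hint that cp is modified; only the macro definition reveals it, which is the situation where jumping to the definition with a tags file pays off.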
Honestly though, it's not too bad as long as you make sure you stay aware of this issue.
I've learned to make intensive use of Exuberant ctags to immediately jump to the declarations of things and take a quick look at them before doing anything even a little smart. Highly recommended for getting to know the code. You do need an editor that supports ctags, but the major ones (vi, emacs) do and so does my own favourite, joe. I spent the first couple of years awkwardly grepping the code to find things (becoming quite adept at regexing this code…) and I wish I'd discovered ctags much sooner. To create/update the tags file, I use: ctags -R --if0 arch bin src (the --if0 takes advantage of the #if 0 dummy declarations in src/cmd/ksh93/bltins that are commented as being "for the dictionary generator").
patch-five runs like a champ :) you are a star!
Neat. Thanks for testing and analysing! You really helped, I'm not sure I'd have found the sh_fun() breakage without your analysis.
Ha, OK, got it: good old system testing. I used to do just that on my old projects, testing BE and perf on oldies. I had such slow HW at the time that I was tuning the perf like crazy, for the benefit of the latest/greatest HW.
I'll keep the gcc 2.95.3 in mind then.
I personally use cscope(1), an old habit, even though I am using emacs as editor... my .emacs is as old as my .kshrc (ksh88) :-)
Thanks for the ```lang MD thingy, I'll play with that a bit :)
Forgot to say, patch-five runs like a champ on ubuntu 22.04 s390x as well.
Tested successfully as described in the issue with patch 5. I also checked my memory fault notes and did not find anything else to add to the mix.
FWIW, I have also tested patch 5 and haven't found any major breakage (i.e., no crashes), although I did encounter a compiler warning:
/home/johno/GitRepos/KornShell/ksh/src/cmd/ksh93/sh/xec.c: In function 'sh_fun':
/home/johno/GitRepos/KornShell/ksh/src/cmd/ksh93/sh/xec.c:3244:20: warning: assignment to 'struct checkpt *' from incompatible pointer type 'char *' [-Wincompatible-pointer-types]
3244 | checkpoint = stakalloc(sizeof(struct checkpt));
| ^
This is fairly easy to fix (apply on top of patch 5):
--- a/src/cmd/ksh93/sh/xec.c
+++ b/src/cmd/ksh93/sh/xec.c
@@ -3241,7 +3241,7 @@ int sh_fun(Namval_t *np, Namval_t *nq, char *argv[])
n++;
if(nq)
mode = set_instance(nq,&node, &nr);
- checkpoint = stakalloc(sizeof(struct checkpt));
+ checkpoint = (struct checkpt*)stakalloc(sizeof(struct checkpt));
sh_pushcontext(checkpoint, SH_JMPCMD);
jmpval = sigsetjmp(checkpoint->buff,1);
if(jmpval == 0)
Thanks, good point. I should really fix the stak/stk interface to use void pointers.
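The difference a void pointer interface makes can be sketched with two made-up allocators. These are hypothetical wrappers around malloc(), not the real stak/stk functions, and struct checkpt_demo is a stand-in for the real struct.

```c
#include <stdlib.h>

/* Hypothetical stand-in, not ksh's struct checkpt. */
struct checkpt_demo { int mode; };

/* char * interface: every caller storing another pointer type needs a cast. */
static char *alloc_char_demo(size_t n)
{
    return (char *)malloc(n);
}

/* void * interface: converts implicitly to any object pointer type in C. */
static void *alloc_void_demo(size_t n)
{
    return malloc(n);
}

/* Allocates one record through each interface; returns 1 if both succeeded. */
static int alloc_both_demo(void)
{
    struct checkpt_demo *a = (struct checkpt_demo *)alloc_char_demo(sizeof *a);
    struct checkpt_demo *b = alloc_void_demo(sizeof *b);  /* no cast needed */
    int ok = (a != NULL && b != NULL);

    free(a);
    free(b);
    return ok;
}
```

Dropping the cast from the first assignment reproduces the same class of incompatible-pointer-type warning quoted above, while the second assignment compiles cleanly as C.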
set outputs the values of all the variables. This crashes in the following reproducer: