ksh93 / ksh

ksh 93u+m: KornShell lives! | Latest release: https://github.com/ksh93/ksh/releases
Eclipse Public License 2.0
192 stars 32 forks source link

ksh93 1.0.9 build crashes on Cygwin 3.5.3 #758

Closed gisburn closed 4 months ago

gisburn commented 5 months ago

ksh93 1.0.9 build crashes on Cygwin 3.5.3 like this: ---- snip ----

# src/Mamfile: 13-15: make all
+ mamake -r '*/*' install

# ... making src/cmd/INIT ...
+ mamake -C cmd/INIT install
      1 [main] cc 1788 dofork: child -1 - forked process 676 died unexpectedly, retry 0, exit code 0xC0000142, errno 11
2132015 [main] cc 1788 dofork: child -1 - forked process 8768 died unexpectedly, retry 0, exit code 0xC0000142, errno 11
6309692 [main] cc 1788 dofork: child -1 - forked process 9148 died unexpectedly, retry 0, exit code 0xC0000142, errno 11
14449471 [main] cc 1788 dofork: child -1 - forked process 5832 died unexpectedly, retry 0, exit code 0xC0000142, errno 11
30614050 [main] cc 1788 dofork: child -1 - forked process 6788 died unexpectedly, retry 0, exit code 0xC0000142, errno 11
mamprobe: line 25: cannot fork [Resource temporarily unavailable]
mamake [cmd/INIT]: /home/roland_mainz/work/windows_ksh93/ksh-1.0.9/src/cmd/INIT/Mamfile: 8: /home/roland_mainz/work/windows_ksh93/ksh-1.0.9/arch/cygwin.i386/lib/probe/C/mam/IKPPKKN: cannot generate probe info
mamake: *** exit code 1 making cmd/INIT
mamake: *** exit code 1 making all
package: make failed at Tue Jul  2 12:53:02 CEST 2024 in /home/roland_mainz/work/windows_ksh93/ksh-1.0.9/arch/cygwin.i386

---- snip ----

This was working with ksh93 1.0.8... no clue yet what causes mamake/mamprove to fail like that

McDutchie commented 4 months ago

Yeah, 11 is SIGSEGV, so mamake is somehow crashing. :-(

Unfortunately, for testing Cygwin I only have access to a 32-bit Windows 7 VM with an obsolete version of Cygwin, but it's working fine on that... so I have no real way of tracing the cause here.

A gdb stack trace might help, or perhaps you could do a 'git bisect' between the v1.0.8 and v1.0.9 tags to identify the offending commit?

McDutchie commented 4 months ago

Ping @JohnoKing – if you have the time, are you able to reproduce this?

McDutchie commented 4 months ago

I will also note that I've built ksh with AddressSanitizer (ASan) enabled many times (as shipped with Apple clang version 14.0.0 (clang-1400.0.29.202)), which also makes mamake run with it enabled... and it has not found any issues.

gisburn commented 4 months ago

I will also note that I've built ksh with AddressSanitizer (ASan) enabled many times (as shipped with Apple clang version 14.0.0 (clang-1400.0.29.202)), which also makes mamake run with it enabled... and it has not found any issues.

ASAN does not find reads-from-uninitalised-memory, which can result in undefined behaviour/random crashes. I filed issue #759 (use valgrind for BUILD (not ksh93)) to finally get this mess hunted down...

gisburn commented 4 months ago

BTW: 0xC0000142 is |STATUS_DLL_INIT_FAILED|, so after |fork()| the DLL init failed.

gisburn commented 4 months ago

BTW: 0xC0000142 is |STATUS_DLL_INIT_FAILED|, so after |fork()| the DLL init failed.

Seems to be related to the (DLL) memory layout. Explicitly setting the stack size seems to work around this issue (note the surrounding bash instance is required to avoid another hang when probing cc): bash -c 'ulimit -s 2048 ; CCFLAGS="-g" SHELL=/bin/bash bash ./bin/package make SHELL=/bin/bash CCFLAGS="-g"' 2>&1 | tee buildlog.log

gisburn commented 4 months ago

BTW: 0xC0000142 is |STATUS_DLL_INIT_FAILED|, so after |fork()| the DLL init failed.

Seems to be related to the (DLL) memory layout. Explicitly setting the stack size seems to work around this issue (note the surrounding bash instance is required to avoid another hang when probing cc): bash -c 'ulimit -s 2048 ; CCFLAGS="-g" SHELL=/bin/bash bash ./bin/package make SHELL=/bin/bash CCFLAGS="-g"' 2>&1 | tee buildlog.log

Update this didn't work, it now fails after finishing libast:

+ cp mamake /home/roland_mainz/work/windows_ksh93/ksh-1.0.9/arch/cygwin.i386/bin/mamake

# ... making src/lib/libsum ...
      0 [main] mamake 40713 dofork: child -1 - forked process 1232 died unexpectedly, retry 0, exit code 0xC0000142, errno 11
mamake: fork() failed: Resource temporarily unavailable
mamake: *** exit code 1 making all
[here it HANGS!! until <CTRL-C>]

I tried to vary the stack size up to $ ulimit -s 8192 # ... but not success yet... ;-((

McDutchie commented 4 months ago

Attached is the log of a build on Debian 11 arm64, after applying the MAMAKE_DEBUG_PREFIX patch from https://github.com/ksh93/ksh/issues/759#issuecomment-2203984213, and building with:

MAMAKE_DEBUG_PREFIX='valgrind -s --tool=memcheck --leak-check=full' CCFLAGS='-O0 -g' bin/package make

As you can see, the build succeeds and valgrind reports zero problems in mamake. The memcheck tool is supposed to report use of initialised memory, but finds none. Can you think of other valgrind options for deeper testing?

mamake-valgrind.txt

gisburn commented 4 months ago

I bypassed the |fork()| error with a debug version of the Cygwin 3.5.3 DLL (for now) and now end-up with this issue:

# src/lib/libast/Mamfile: 4980-4983: make ${INSTALLROOT}/lib/file/magic
+ cp -f /cygdrive/l/builds/ksh93_109/ksh-1.0.9/src/lib/libast/misc/magic.tab /cygdrive/l/builds/ksh93_109/ksh-1.0.9/arch/cygwin.i386/lib/file/magic

# src/lib/libast/Mamfile: 4985-5000: make ${INSTALLROOT}/lib/package/gen/.asthdr_tstamp
+ CCt=' '
+ for src in /cygdrive/l/builds/ksh93_109/ksh-1.0.9/src/lib/libast/comp/fmtmsg.h /cygdrive/l/builds/ksh93_109/ksh-1.0.9/src/lib/libast/comp/libgen.h
++ basename /cygdrive/l/builds/ksh93_109/ksh-1.0.9/src/lib/libast/comp/fmtmsg.h .h
+ hdr=fmtmsg
+ grep -q 'define[      ][      ]*_[hl][di][rb]_fmtmsg' ast_lib.h
+ dst=/cygdrive/l/builds/ksh93_109/ksh-1.0.9/arch/cygwin.i386/include/ast/fmtmsg.h
+ cmp -s /cygdrive/l/builds/ksh93_109/ksh-1.0.9/src/lib/libast/comp/fmtmsg.h /cygdrive/l/builds/ksh93_109/ksh-1.0.9/arch/cygwin.i386/include/ast/fmtmsg.h
+ cp -f /cygdrive/l/builds/ksh93_109/ksh-1.0.9/src/lib/libast/comp/fmtmsg.h /cygdrive/l/builds/ksh93_109/ksh-1.0.9/arch/cygwin.i386/include/ast/fmtmsg.h
+ for src in /cygdrive/l/builds/ksh93_109/ksh-1.0.9/src/lib/libast/comp/fmtmsg.h /cygdrive/l/builds/ksh93_109/ksh-1.0.9/src/lib/libast/comp/libgen.h
++ basename /cygdrive/l/builds/ksh93_109/ksh-1.0.9/src/lib/libast/comp/libgen.h .h
+ hdr=libgen
+ grep -q 'define[      ][      ]*_[hl][di][rb]_libgen' ast_lib.h
+ continue
+ touch /cygdrive/l/builds/ksh93_109/ksh-1.0.9/arch/cygwin.i386/lib/package/gen/.asthdr_tstamp

# src/lib/libast/Mamfile: 5008-5014: make mamake.o
+ compile /cygdrive/l/builds/ksh93_109/ksh-1.0.9/src/cmd/INIT/mamake.c -D_PACKAGE_ast
+ s=/cygdrive/l/builds/ksh93_109/ksh-1.0.9/src/cmd/INIT/mamake.c
+ shift
+ cc -D_BLD_ast -Os -g -fno-strict-aliasing -I. -I/cygdrive/l/builds/ksh93_109/ksh-1.0.9/src/lib/libast -Icomp -I/cygdrive/l/builds/ksh93_109/ksh-1.0.9/src/lib/libast/comp -D_PACKAGE_ast -Iinclude -I/cygdrive/l/builds/ksh93_109/ksh-1.0.9/src/lib/libast/include -Istd -I/cygdrive/l/builds/ksh93_109/ksh-1.0.9/src/lib/libast/std -c /cygdrive/l/builds/ksh93_109/ksh-1.0.9/src/cmd/INIT/mamake.c

# src/lib/libast/Mamfile: 5007-5024: make mamake
+ cc -g -fno-strict-aliasing -o mamake mamake.o libast.a -liconv
+ set +f
+ exec rm -rf /cygdrive/l/builds/ksh93_109/ksh-1.0.9/arch/cygwin.i386/bin/mamake /cygdrive/l/builds/ksh93_109/ksh-1.0.9/arch/cygwin.i386/bin/mamake.exe
rm: cannot remove '/cygdrive/l/builds/ksh93_109/ksh-1.0.9/arch/cygwin.i386/bin/mamake': Permission denied
rm: cannot remove '/cygdrive/l/builds/ksh93_109/ksh-1.0.9/arch/cygwin.i386/bin/mamake.exe': Permission denied
+ cp mamake /cygdrive/l/builds/ksh93_109/ksh-1.0.9/arch/cygwin.i386/bin/mamake
cp: cannot create regular file '/cygdrive/l/builds/ksh93_109/ksh-1.0.9/arch/cygwin.i386/bin/mamake.exe': Device or resource busy
mamake [lib/libast]: *** exit code 1 making mamake
mamake: *** exit code 1 making lib/libast
mamake: *** exit code 1 making all
package: make failed at Wed Jul  3 20:35:33 CEST 2024 in /cygdrive/l/builds/ksh93_109/ksh-1.0.9/arch/cygwin.i386

The problem here is that you cannot delete/move a file of a binary which is being executed.

gisburn commented 4 months ago

I bypassed the |fork()| error with a debug version of the Cygwin 3.5.3 DLL (for now) and now end-up with this issue:

# src/lib/libast/Mamfile: 4980-4983: make ${INSTALLROOT}/lib/file/magic
+ cp -f /cygdrive/l/builds/ksh93_109/ksh-1.0.9/src/lib/libast/misc/magic.tab /cygdrive/l/builds/ksh93_109/ksh-1.0.9/arch/cygwin.i386/lib/file/magic

# src/lib/libast/Mamfile: 4985-5000: make ${INSTALLROOT}/lib/package/gen/.asthdr_tstamp
+ CCt=' '
+ for src in /cygdrive/l/builds/ksh93_109/ksh-1.0.9/src/lib/libast/comp/fmtmsg.h /cygdrive/l/builds/ksh93_109/ksh-1.0.9/src/lib/libast/comp/libgen.h
++ basename /cygdrive/l/builds/ksh93_109/ksh-1.0.9/src/lib/libast/comp/fmtmsg.h .h
+ hdr=fmtmsg
+ grep -q 'define[      ][      ]*_[hl][di][rb]_fmtmsg' ast_lib.h
+ dst=/cygdrive/l/builds/ksh93_109/ksh-1.0.9/arch/cygwin.i386/include/ast/fmtmsg.h
+ cmp -s /cygdrive/l/builds/ksh93_109/ksh-1.0.9/src/lib/libast/comp/fmtmsg.h /cygdrive/l/builds/ksh93_109/ksh-1.0.9/arch/cygwin.i386/include/ast/fmtmsg.h
+ cp -f /cygdrive/l/builds/ksh93_109/ksh-1.0.9/src/lib/libast/comp/fmtmsg.h /cygdrive/l/builds/ksh93_109/ksh-1.0.9/arch/cygwin.i386/include/ast/fmtmsg.h
+ for src in /cygdrive/l/builds/ksh93_109/ksh-1.0.9/src/lib/libast/comp/fmtmsg.h /cygdrive/l/builds/ksh93_109/ksh-1.0.9/src/lib/libast/comp/libgen.h
++ basename /cygdrive/l/builds/ksh93_109/ksh-1.0.9/src/lib/libast/comp/libgen.h .h
+ hdr=libgen
+ grep -q 'define[      ][      ]*_[hl][di][rb]_libgen' ast_lib.h
+ continue
+ touch /cygdrive/l/builds/ksh93_109/ksh-1.0.9/arch/cygwin.i386/lib/package/gen/.asthdr_tstamp

# src/lib/libast/Mamfile: 5008-5014: make mamake.o
+ compile /cygdrive/l/builds/ksh93_109/ksh-1.0.9/src/cmd/INIT/mamake.c -D_PACKAGE_ast
+ s=/cygdrive/l/builds/ksh93_109/ksh-1.0.9/src/cmd/INIT/mamake.c
+ shift
+ cc -D_BLD_ast -Os -g -fno-strict-aliasing -I. -I/cygdrive/l/builds/ksh93_109/ksh-1.0.9/src/lib/libast -Icomp -I/cygdrive/l/builds/ksh93_109/ksh-1.0.9/src/lib/libast/comp -D_PACKAGE_ast -Iinclude -I/cygdrive/l/builds/ksh93_109/ksh-1.0.9/src/lib/libast/include -Istd -I/cygdrive/l/builds/ksh93_109/ksh-1.0.9/src/lib/libast/std -c /cygdrive/l/builds/ksh93_109/ksh-1.0.9/src/cmd/INIT/mamake.c

# src/lib/libast/Mamfile: 5007-5024: make mamake
+ cc -g -fno-strict-aliasing -o mamake mamake.o libast.a -liconv
+ set +f
+ exec rm -rf /cygdrive/l/builds/ksh93_109/ksh-1.0.9/arch/cygwin.i386/bin/mamake /cygdrive/l/builds/ksh93_109/ksh-1.0.9/arch/cygwin.i386/bin/mamake.exe
rm: cannot remove '/cygdrive/l/builds/ksh93_109/ksh-1.0.9/arch/cygwin.i386/bin/mamake': Permission denied
rm: cannot remove '/cygdrive/l/builds/ksh93_109/ksh-1.0.9/arch/cygwin.i386/bin/mamake.exe': Permission denied
+ cp mamake /cygdrive/l/builds/ksh93_109/ksh-1.0.9/arch/cygwin.i386/bin/mamake
cp: cannot create regular file '/cygdrive/l/builds/ksh93_109/ksh-1.0.9/arch/cygwin.i386/bin/mamake.exe': Device or resource busy
mamake [lib/libast]: *** exit code 1 making mamake
mamake: *** exit code 1 making lib/libast
mamake: *** exit code 1 making all
package: make failed at Wed Jul  3 20:35:33 CEST 2024 in /cygdrive/l/builds/ksh93_109/ksh-1.0.9/arch/cygwin.i386

The problem here is that you cannot delete/move a file of a binary which is being executed.

Interestingly I can just re-start $ ./bin/package make #, and the build then resumes until I have a finished build. That's good enough for testing, but not enough for a Cygwin package.

One possible fix might worklike this: First mamake.exe is called mamake_bootstap.exe, and a softlink links mamake.exe to mamake_bootstrap.exe, and the 2nd build of mamake.exe first removes the softlink, and then creates the new mamake.exe.

McDutchie commented 4 months ago

I bypassed the |fork()| error with a debug version of the Cygwin 3.5.3 DLL (for now)

Very strange. Why would that make mamake work?

By the way, mamake's use of fork(2) is here. Do you see any problems with that code? I rewrote that function to use fork(2) instead of the deprecated system(3) in eeab3e53268a9b233d22c3dee1a2fecea7ca4502, because system(3) was not working properly on Android.

The problem here is that you cannot delete/move a file of a binary which is being executed.

Good grief… This is working fine on Cygwin 3.3.5 on Windows 7. Why would they break this? It's fundamental to Unix that unlinking and deleting are not actually the same thing.

Anyway, for now you can just delete lines 5021-5043 in src/lib/libast/Mamfile, so you can test the rest of the build process. Linking mamake against libast is non-essential, it's just for completeness' sake (and it is a way of failing early if libast somehow breaks). It will work fine without.

McDutchie commented 4 months ago

One possible fix might worklike this: First mamake.exe is called mamake_bootstap.exe, and a softlink links mamake.exe to mamake_bootstrap.exe, and the 2nd build of mamake.exe first removes the softlink, and then creates the new mamake.exe.

This should do that:

diff --git a/bin/package b/bin/package
index 8d969c8b2..f9ba6ab66 100755
--- a/bin/package
+++ b/bin/package
@@ -2515,7 +2515,8 @@ checkaout()   # cmd ...
                done
            fi
            rm -f "$INSTALLROOT/dyn/bin/$i" "$INSTALLROOT/src/lib/libast/$i" &
-           $exec $CC -O $CCFLAGS $LDFLAGS -o $INSTALLROOT/bin/$i $INITROOT/$i.c || return
+           $exec $CC -O $CCFLAGS $LDFLAGS -o $INSTALLROOT/bin/${i}_bootstrap $INITROOT/$i.c || return
+           $exec ln -s ${i}_bootstrap $INSTALLROOT/bin/$i
            test -f $i.o && $exec rm -f $i.o
            hash -r
            ;;
gisburn commented 4 months ago

I bypassed the |fork()| error with a debug version of the Cygwin 3.5.3 DLL (for now)

Very strange. Why would that make mamake work?

I don't know (yet). I didn't had time for an analysis yet, I'm swamped with day-to-day work+ms-nfs41-client.

By the way, mamake's use of fork(2) is here. Do you see any problems with that code? I rewrote that function to use fork(2) instead of the deprecated system(3) in eeab3e5, because system(3) was not working properly on Android.

The use of |fork()| looks OK, but I would've used |posix_spawn()| (which is no help for Cygwin, as it uses |vfork()| to implement |posix_spawn()|, instead of the Windows |_spawn()| family of functions (see https://learn.microsoft.com/en-us/cpp/c-runtime-library/spawn-wspawn-functions?view=msvc-170).

My suggestion would be to put the |system()|-based codepath back within |#ifdef CYGWIN| for now, and maybe for UWIN later, too.

The problem here is that you cannot delete/move a file of a binary which is being executed.

Good grief… This is working fine on Cygwin 3.3.5 on Windows 7. Why would they break this? It's fundamental to Unix that unlinking and deleting are not actually the same thing.

Oh, it even gets worse, because Win10 supports the POSIX semantics, but only if the filesystem supports it and matching registry entries are set for NTFS, e.g. Google for |FILE_DISPOSITION_FLAG_POSIX_SEMANTICS|.

Anyway, for now you can just delete lines 5021-5043 in src/lib/libast/Mamfile, so you can test the rest of the build process. Linking mamake against libast is non-essential, it's just for completeness' sake (and it is a way of failing early if libast somehow breaks). It will work fine without.

AFAIK Glenn Fowler once said they did that to test libast at build time, so it has a useful purpose.

McDutchie commented 4 months ago

I bypassed the |fork()| error with a debug version of the Cygwin 3.5.3 DLL (for now)

Very strange. Why would that make mamake work?

I don't know (yet). I didn't had time for an analysis yet, I'm swamped with day-to-day work+ms-nfs41-client.

No worries, we all understand that life and $DAYJOB happen.

I'm glad to have someone from the old AST crowd take an interest in this revival project, and I do hope you stick around, as and when you have the time.

By the way, mamake's use of fork(2) is here. Do you see any problems with that code? I rewrote that function to use fork(2) instead of the deprecated system(3) in eeab3e5, because system(3) was not working properly on Android.

The use of |fork()| looks OK, but I would've used |posix_spawn()| (which is no help for Cygwin, as it uses |vfork()| to implement |posix_spawn()|, instead of the Windows |_spawn()| family of functions (see https://learn.microsoft.com/en-us/cpp/c-runtime-library/spawn-wspawn-functions?view=msvc-170).

posix_spawn(2) is too new for us to rely on. I'm testing ksh on multiple older systems that don't have it.

ksh does use posix_spawn where available, based on a feature test.

My suggestion would be to put the |system()|-based codepath back within |#ifdef CYGWIN| for now, and maybe for UWIN later, too.

I'd prefer to have it figured out why fork(2) isn't working and get the cause fixed. If fork is broken, then ksh is going to have problems too. Even if it can use posix_spawn to invoke programs, it must still fork for things like background jobs or ulimit in a subshell.

As for UWIN, since we could not find any way to build it, or even to get a compiler running on the one findable binary distribution, we have never been able to test it. So I've removed the code to support it to avoid further bit rot. See d9a85caf2248a2e3221a60cc91ad03bf567b5f68. It's a shame it's dead, as it seemed a great deal nicer than cygwin.

Anyway, for now you can just delete lines 5021-5043 in src/lib/libast/Mamfile, so you can test the rest of the build process. Linking mamake against libast is non-essential, it's just for completeness' sake (and it is a way of failing early if libast somehow breaks). It will work fine without.

AFAIK Glenn Fowler once said they did that to test libast at build time, so it has a useful purpose.

Well, the libast code (for --man, etc.) was there in mamake, but it was never built, so I added the code to build it myself.

By the way, I've reworked mamake quite a bit (instead of restoring nmake, which proved unworkable). See README-mamake.md for info on the features added, such as automatic variables. I think you'll find the Mamfiles quite maintainable now.

gisburn commented 4 months ago
+         $exec $CC -O $CCFLAGS $LDFLAGS -o $INSTALLROOT/bin/${i}_bootstrap $INITROOT/$i.c || return
+         $exec ln -s ${i}_bootstrap $INSTALLROOT/bin/$i

The patch fixes the problem. Please commit it, and close this bug as "FIXED".

McDutchie commented 4 months ago

Wait, it fixes the fork problem too?

gisburn commented 4 months ago

Wait, it fixes the fork problem too?

Well, the DLL is no longer ETXTBUSY, so yes.

McDutchie commented 4 months ago

But that makes no sense: in your original report, the build was failing right at the start, it didn't even manage to make src/cmd/INIT. So replacing mamake cannot have been the cause of the failure, as mamake isn't getting replaced until after building libast. Something else must have been going wrong.

gisburn commented 4 months ago

But that makes no sense: in your original report, the build was failing right at the start, it didn't even manage to make src/cmd/INIT. So replacing mamake cannot have been the cause of the failure, as mamake isn't getting replaced until after building libast. Something else must have been going wrong.

I don't know what went wrong, best guess by the Cygwin people is that this kind of error happens if you write into a DLL or EXE which is being used.

In any case, the patch fixed it.

McDutchie commented 4 months ago

I am now able to reproduce the build failure on Cygwin at the point where it relinks mamake against libast after building libast. While your original report (which has the failure at a much earlier stage) still doesn't make sense to me, the patch does fix the issue I am able to reproduce, so that will have to do. Thanks for the report and the testing!