Closed vmihalko closed 4 months ago
I could not run the reproducer without freezing my entire system. Changing sleep inf &
to /bin/sleep 120 &
allowed me to see what's going on. (The external sleep program is a great deal smaller than ksh, so it takes less resources than forking ksh for the sleep builtin.)
In my version of the reproducer, ksh "hung" at 32767 jobs as you expected, but after the first sleep
processes exited, it continued where it left off, restarting the job numbers at 1.
So I don't think this is a really a bug. ksh dutifully waits for some jobs to exit and make their job table entries available so it can create new jobs. This never happens in your reproducer, because your background jobs sleep for infinite time.
I am not necessarily opposed to making it possible to have more than 32767 simultaneous jobs, although I do wonder if there is an actual use case for that.
Either way, I want to research the possible implications more first. For instance, if the job table can now be up to 2^31-1 (2147483647) entries long, I imagine that could take up a heck of a lot of memory.
On second thought, it's a bug.
The overflow happens exactly on jobs.c line 1210. The job_alloc function happily returns an integer value > 32767. Assigning that to a short is a signed short integer overflow, which is undefined behaviour in C.
But on most current systems (all that I know of, anyway), it will simply wrap around to -1, which also happens to be the temporary-failure return code, so it waits. Plus, ksh will re-use low job numbers as they become available again. So, the correct thing accidentally happens and ksh recovers as soon as some jobs exit. But that doesn't make it not a bug.
Your PR cleanly fixes it. Thanks for that.
I could not run the reproducer without freezing my entire system. Changing sleep inf & to /bin/sleep 120 & allowed me to see what's going on. (The external sleep program is a great deal smaller than ksh, so it takes less resources than forking ksh for the sleep builtin.)
Actually, I'd say this a bug in ksh
handling the inf
paramater incorrectly. When run under strace
, the process will not enter a sleep state and will be looping with invalid calls to clock_nanosleep
instead (at least in an aarch64 Fedora 40 VM):
...
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=2702722944, tv_nsec=3974490168}, 0xfffff507f7a0) = -1 EINVAL (Invalid argument)
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=2702722944, tv_nsec=3974490168}, 0xfffff507f7a0) = -1 EINVAL (Invalid argument)
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=2702722944, tv_nsec=3974490168}, 0xfffff507f7a0) = -1 EINVAL (Invalid argument)
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=2702722944, tv_nsec=3974490168}, 0xfffff507f7a0) = -1 EINVAL (Invalid argument)
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=2702722944, tv_nsec=3974490168}, 0xfffff507f7a0) = -1 EINVAL (Invalid argument)
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=2702722944, tv_nsec=3974490168}, 0xfffff507f7a0) = -1 EINVAL (Invalid argument)
...
edit: typos
I can't reproduce that on either macOS or Debian (both arm64).
On Debian arm64, I get this strace output for strace ksh -c 'sleep inf'
:
...
rt_sigaction(SIGWINCH, {sa_handler=0xffff86f9cf88, sa_mask=[], sa_flags=SA_INTERRUPT}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGALRM, {sa_handler=0xffff86f9cf88, sa_mask=[], sa_flags=SA_INTERRUPT}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=2147483647, tv_nsec=4294967295}, 0xfffff95fd570) = -1 EINVAL (Invalid argument)
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=2264800440, tv_nsec=65535},
...and it sleeps, for 65535 seconds at least (presumably for 65535 seconds at a time but I'd have to wait 18.2 hours to test that) (edit: that was nonsense: tv_sec is the seconds, tv_nsec additional nanoseconds).
If you'd like to help debug this on Fedora, the relevant sources are at src/lib/libast/tm/tvsleep.c and src/cmd/ksh93/bltins/sleep.c.
Also, possibly related: #750
This PR addresses a problem where the number of jobs exceeds
2^15 - 1
(32767), the variableshort p_job
overflows and ksh hangs (here).How to reproduce?
reproducer.ksh
:and run
ksh -lic '. ./reproducer.ksh'
.This bug was found during a debugging session with @lzaoral and @msekletar - thank you!