shirok / Gauche

Scheme Scripting Engine
https://practical-scheme.net/gauche

Threads won't work on OpenBSD #276

Open zmyrgel opened 7 years ago

zmyrgel commented 7 years ago

Making an issue out of this so it's easier to track progress. This was reported on the mailing list a while back. Gauche 0.9.5 doesn't currently work on OpenBSD: it compiles, but the tests won't pass.

configured as:

CPPFLAGS=-I/usr/local/include LDFLAGS=-L/usr/local/lib ./configure --prefix=$HOME/gosh-new --with-local=/usr/local --enable-threads=pthreads --with-dbm=ndbm --with-libatomic-ops=no --with-slib=/usr/local/share/slib

Gauche compiles, but the tests get stuck at "testing threads" in the "atom" tests part.

OpenBSD facts:

For one thing, Gauche's src/signal.c lists:

SIGDEF(SIGXFSZ, SIGDEF_ERROR)

Considering that OpenBSD uses that signal for thread suspension, shouldn't it be:

SIGDEF(SIGXFSZ, SIGDEF_NOHANDLE)

Also, OpenBSD's thread implementation, rthreads, seems to use SIGTHR, but Gauche doesn't include that in the src/signal.c file at all. Should Gauche include:

#ifdef SIGTHR
SIGDEF(SIGTHR, SIGDEF_NOHANDLE),  /* thread library AST (OpenBSD) */
#endif

With the above changes Gauche compiles, and the tests pass until they lock up on the "system" test. This hangs so badly I can't abort the test without manually killing the gosh process. If I skip that test, I get all the way to the "Testing 'control'" test, which hangs and needs to be interrupted. The thread tests pass most of the time, maybe 9 out of 10 runs, but the run always gets stuck on the "control" test.

The GC tests pass just fine.

shirok commented 7 years ago

Thanks. I don't have an OpenBSD env at hand, but I'd appreciate patches.

One thing - did SIGDEF(SIGXFSZ, SIGDEF_ERROR) have any adverse effect? The signals used for GC thread start/suspend are excluded from the pool of signals Gauche cares about; this is currently done in main.c (sig_setup). Probably I should add a comment in signal.c to explain it.

I can add the SIGTHR thing. Would you care to send a PR?

When I get stuck in threads, one thing I do is attach gdb and look at where each thread is waiting. It may give you a clue (or may not---gdb doesn't show a Scheme-level stack trace).

shirok commented 7 years ago

Added the SIGTHR thingy in https://github.com/shirok/Gauche/commit/272f6618410455b997a8c7f0d660b23f00f1f7bc FreeBSD appears to have it, too, although I don't know if it's used in the same way as on OpenBSD.

Please keep adding stuff here as you find other clues. I'm going to try OpenBSD in a VM whenever I get a chance, but I can't commit to a timeframe.

zmyrgel commented 7 years ago

Yeah, you're correct that SIGXFSZ doesn't need any special casing, as it's handled in sig_setup.

The "system" test is the first issue, so let's focus on it first. Not surprisingly, its failures are in the signal handling tests.

$ ktrace -di -t cs ./gosh -ftest -I../test system.scm
Testing system ...

--------------------------------------------------------------
test sigalrm1, expects 14 ==> ERROR: GOT #< "sigaction failed when setting a sighandler for signal 32">
test sigalrm2, expects 0 ==> ERROR: GOT #< "sigaction failed when setting a sighandler for signal 32">
test sigalrm3, expects # ==> ok
test sigalrm3, expects # ==> ok
test sigalrm3, expects # ==> ok
test sigalrm3, expects # ==> ok
test sigalrm3, expects # ==> ok
test sigalrm3, expects # ==> ok

And it keeps looping the sigalrm3 lines from then on. I'm guessing the problem is with SIGTHR:

tmy@phobos src $ grep SIGTHR /usr/src/sys/sys/signal.h
#define SIGTHR 32 /* thread library AST */

Here's the kdump from running the test: [trace.txt](https://github.com/shirok/Gauche/files/804764/trace.txt)
shirok commented 7 years ago

Could you run the system.scm test with the GAUCHE_TEST_REPORT_ERROR environment variable set? It should dump a Scheme stack trace whenever the test catches an error.

shirok commented 7 years ago

Actually, user programs can't set a signal handler for SIGTHR, so I excluded it from the master sigmask in https://github.com/shirok/Gauche/commit/226da3ebda2cd4aacbdfa6c768ef076ec5da7f6d I think this works around the "sigaction failed when setting a sighandler for signal 32" error.

zmyrgel commented 7 years ago

Ah, with that it's a bit better, with different errors:

tmy@phobos src $ ktrace -di -t cs ./gosh -ftest -I../test system.scm
Testing system ...

--------------------------------------------------------------
test sigalrm1, expects 14 ==> ok
test sigalrm2, expects 0 ==> ok
test sigalrm3, expects # ==> ok
test sigalrm4 (interrupting syscall), expects 14 ==> ok
test sigalrm5 (interrupting syscall - restart), expects (a) ==> ok
test sigalrm6 (interrupting syscall - restart), expects (#t 0) ==> ok
test fork & sigint, expects #t ==> ERROR: GOT #< "unbound variable: nap">
test sigchld, expects 20 ==> ok
test sigmask, expects hup ==> ERROR: GOT #< "unbound variable: nap">
test sigmask during interrupt handler (default), expects #t ==> ok
test sigmask during interrupt handler (explicit), expects #t ==> ok
test sigmask during interrupt handler (multi/default), expects #t ==> ERROR: GOT #< "unbound variable: nap">
test sigmask during interrupt handler (default), expects #t ==> ok
test sigmask during interrupt handler (multi/explicit), expects #t ==> ok
ok test sigmask during interrupt handler (explicit), expects #t ==>
test sigmask during interrupt handler (reentrance), expects boo ==> ok
ok test sigmask during interrupt handler (multi/default), expects #t ==>
test sigmask during interrupt handler (multi/reentrance), expects boo ==> ok
test sigmask during interrupt handler (multi/explicit), expects #t ==> ok
ok test sys-sigwait, expects 1 ==>
test sigmask during interrupt handler (reentrance), expects boo ==> ok
test sigmask during interrupt handler (multi/reentrance), expects boo ==> ok
test sys-sigwait, expects 1 ==> ok
test sys-sigwait / signal handler restoration, expects foo ==> ERROR: GOT #
failed.
discrepancies found. Errors are:
test fork & sigint: expects #t => got #< "unbound variable: nap">
ok test sys-sigwait / signal handler restoration, expects foo ==>
test sigmask: expects hup => got #< "unbound variable: nap">
ERROR: GOT #
test sys-sigwait / signal handler restoration: expects foo => got #

I commented the other tests out and ran just the fork & sigint test; here's the kdump output: [fork.txt](https://github.com/shirok/Gauche/files/805775/fork.txt)
zmyrgel commented 7 years ago

I shouldn't test these before the morning coffee has had time to kick in. I removed a bit too much stuff from system.scm while getting a reduced test case. After restoring the original system.scm, all its tests pass.

Now I'm in the 'expected' state: the thread tests pass most of the time, and the control tests' 'thread-pool' test always gets stuck.

shirok commented 7 years ago

Can you tell which thread-pool test hangs?

zmyrgel commented 7 years ago

tmy@phobos src $ ktrace -di ./gosh -ftest -I../test control.scm
Testing control ...

------------------------------------------------------------------
testing bindings in # ... ok
test make-job, expects #t ==> ok
test job-touch! acknowledge (pre), expects #f ==> ok
test job-touch! acknowledge, expects #t ==> ok
test job-touch! start (pre), expects #f ==> ok
test job-touch! start, expects #t ==> ok
test job-touch! finish (pre), expects #f ==> ok
test job-touch! finish, expects #t ==> ok
test job-run! precondition, expects (#f #f) ==> ok
test job-run! postcondition, expects (done ok #t) ==> ok
test job-run! error condition, expects (error gosh) ==> ok
test job-run! killed, expects (killed test) ==> ok
test job-wait, job-kill, expects (killed foo) ==> ok
test job-wait and error, expects (error "bang") ==> ok
----------------------------------------------------------
testing bindings in # ... ok
<-- hangs at this point -->

Here's slightly stripped kdump output from the test. It seems the threads sleep and never wake up: [dump.txt](https://github.com/shirok/Gauche/files/809693/dump.txt)
zmyrgel commented 7 years ago

Hi,

Getting back to this again. I compiled the latest git version of Gauche on OpenBSD-current. The basic thread tests seem to pass, but it still hangs on the control tests just like above.

I did a little tweaking in the control tests, and I assume there might be some resource leaking in there which causes the hang. There are two test suites in control.scm: control.job and control.thread-pool. On a normal run the test hangs at the start of control.thread-pool. I noticed that if I comment out the control.thread-pool tests, the control.scm tests always pass. If I do the reverse and comment out the control.job tests and run just the control.thread-pool tests, I get a bunch of errors like "got #< "unbound variable: job-result">". But once I add "(use control.job)" to control.thread-pool, the tests pass.

From this I assume that the first suite doesn't leave the control.job internals "clean", and that's what causes the problem in the control.thread-pool tests on OpenBSD. As each suite works when run individually, the problem seems to be in control.job.

zmyrgel commented 7 years ago

And to reduce this even further: if I comment out the "job-wait, job-kill" test, the control.scm tests pass most of the time. It seems the later "thread pool termination, expects terminated" test sometimes hangs.

zmyrgel commented 7 years ago

Hmm, I reversed the control.scm tests and moved the "(use control.job)" into the beginning of the control.thread-pool tests. Now the tests seem to always pass.

I guess there is something lurking in the control.job tests, either in the tests, the job handling, or the threads themselves, that causes the issue on OpenBSD.

shirok commented 7 years ago

Hi @zmyrgel, thanks for the investigation. It does seem the control.job test breaks something, but it's not obvious what.

It's a bit old-fashioned, but could you try this? Insert a debug stub in thread-pool.scm and run test/control.scm with the original settings (i.e. the one that hangs). What output do you get right before the hang?

diff --git a/lib/control/thread-pool.scm b/lib/control/thread-pool.scm
index 330a1bd..39d4486 100644
--- a/lib/control/thread-pool.scm
+++ b/lib/control/thread-pool.scm
@@ -77,7 +77,7 @@
   (set! (~ pool'pool)
         (list-tabulate (~ pool'size)
                        (lambda (_)
-                         (thread-start! (make-thread (cut worker pool)))))))
+                         #?=(thread-start! #?=(make-thread (cut worker pool)))))))

 (define (thread-pool-results pool)    (~ pool'result-queue))
 (define (thread-pool-shut-down? pool) (~ pool'shut-down))

On my Linux box, it is like this:

<control.thread-pool>----------------------------------------------------------
testing bindings in #<module control.thread-pool> ... ok
#?="../lib/control/thread-pool.scm":80:(thread-start! (debug-print (make-thread (cut worker pool))))
#?="../lib/control/thread-pool.scm":80:(make-thread (cut worker pool))
#?-    #<thread #f (4) new 0x13bb400>
#?-    #<thread #f (4) runnable 0x13bb400>
#?="../lib/control/thread-pool.scm":80:(thread-start! (debug-print (make-thread (cut worker pool))))
#?="../lib/control/thread-pool.scm":80:(make-thread (cut worker pool))
#?-    #<thread #f (5) new 0x14bfc00>
#?-    #<thread #f (5) runnable 0x14bfc00>
#?="../lib/control/thread-pool.scm":80:(thread-start! (debug-print (make-thread (cut worker pool))))
#?="../lib/control/thread-pool.scm":80:(make-thread (cut worker pool))
#?-    #<thread #f (6) new 0x14bf400>
#?-    #<thread #f (6) runnable 0x14bf400>
#?="../lib/control/thread-pool.scm":80:(thread-start! (debug-print (make-thread (cut worker pool))))
#?="../lib/control/thread-pool.scm":80:(make-thread (cut worker pool))
#?-    #<thread #f (7) new 0x151ac00>
#?-    #<thread #f (7) runnable 0x151ac00>
#?="../lib/control/thread-pool.scm":80:(thread-start! (debug-print (make-thread (cut worker pool))))
#?="../lib/control/thread-pool.scm":80:(make-thread (cut worker pool))
#?-    #<thread #f (8) new 0x151a400>
#?-    #<thread #f (8) runnable 0x151a400>
test pool, expects (5 #t 5 #f) ==> ok
zmyrgel commented 7 years ago

I got the following: debug.txt

shirok commented 7 years ago

Oh, so now the test goes up to "test thread pool termination, expects terminated" and then hangs? Hmm, this gets weirder...

zmyrgel commented 7 years ago

Ah, sorry. I've been toying with the control tests, so those weren't at default settings. I reverted all local changes to the control tests and got the following output:

root@phobos src $ ./gosh -ftest -I../test control              
Testing control ...                                              
<control.job>------------------------------------------------------------------
testing bindings in #<module control.job> ... ok
test make-job, expects #t ==> ok
test job-touch! acknowledge (pre), expects #f ==> ok
test job-touch! acknowledge, expects #t ==> ok
test job-touch! start (pre), expects #f ==> ok
test job-touch! start, expects #t ==> ok
test job-touch! finish (pre), expects #f ==> ok
test job-touch! finish, expects #t ==> ok
test job-run! precondition, expects (#f #f) ==> ok
test job-run! postcondition, expects (done ok #t) ==> ok
test job-run! error condition, expects (error gosh) ==> ok
test job-run! killed, expects (killed test) ==> ok
test job-wait, job-kill, expects (killed foo) ==> ok
test job-wait and error, expects (error "bang") ==> ok
zmyrgel commented 7 years ago

With following diff the tests pass.

control_diff.txt

shirok commented 7 years ago

With the reverted source, does the output stop at "test job-wait and error"? Or did the copy & paste miss some portion?

I'm afraid that changing the test order only hides the real problem, so if possible I'd like to nail it. In the HEAD code, your output suggests that (make-thread-pool 5) in line 80 is where the execution gets stuck. The #?= debug output I asked for is to see where in make-thread-pool the hang occurs.

zmyrgel commented 7 years ago

Yeah, I didn't mean that this should be fixed by modifying the test; I just wanted to show exactly what kind of test seems to work. That's the full output I get. I'm attaching the full ktrace output as well: dump.txt

It seems the tests above "job-wait, job-kill" don't affect this. If I comment those out, the test still gets stuck after running the "job-wait and error" test:

tmy@phobos src $ ./gosh -ftest -I./test ../test/control.scm  
Testing control ...                                              
<control.job>------------------------------------------------------------------
testing bindings in #<module control.job> ... ok
test job-wait, job-kill, expects (killed foo) ==> ok
test job-wait and error, expects (error "bang") ==> ok

Hmm, in the above output the "job-wait and error" test seems to succeed, as it shows the ok line, but the next step won't even start, so it hangs outside the tests? Looking through the dump output, it seems to load the thread-control sources, and soon after it suspends a thread which goes to sleep and never wakes up.

If I comment out the "job-wait, job-kill" test, the tests proceed until they hang on "test thread pool termination, expects terminated". Does changing the order of the tests somehow trigger the wakeup call for the background thread so it doesn't go to sleep?

shirok commented 7 years ago

According to your ktrace output, nanosleep is getting called repeatedly, so it's not that all threads are deadlocked. At least some threads are awake, spinning and polling for some condition to be met. I can't put my finger on it yet, but that might be a clue.

shirok commented 7 years ago

The nanosleep loop also increases the wait period by a factor of 2 until it reaches about 16 ms. GC_lock in pthread_support.c does exactly that. (AO_pause in atomic_ops.c also uses nanosleep, but its wait period calculation is a bit different.) So I suspect some thread is hanging while holding the global GC lock.

zmyrgel commented 7 years ago

Hmm, I'm trying to find the smallest code that triggers this.

(use gauche.test)
(use data.queue)
(use gauche.threads)
(use control.job)

(let* ([gate (make-mtqueue :max-length 0)]
       [job (make-job (^[] (enqueue/wait! gate #t) (sys-sleep 10))
                      :waitable #t)]
       [t1 (thread-start! (make-thread (^[] (job-run! job))))]
       [t2 (thread-start! (make-thread (^[] (job-wait job))))])
  (dequeue/wait! gate)
  (job-mark-killed! job 'foo)
  (list (thread-join! t2) (job-result job)))

(use control.thread-pool)

(display "DONE") (newline)

The above hangs on every run I tested. But if I remove the "(use control.thread-pool)" or "(use gauche.test)" parts, it passes, which is pretty odd. Does using control.thread-pool or gauche.test allocate the hanging thread?

shirok commented 7 years ago

Nah, neither gauche.test nor control.thread-pool creates threads just by being loaded. control.thread-pool does create threads when a thread pool is created, but that's not the case here.

Can you test one thing? Set the environment variable GC_MARKERS to 1. It turns off parallel GC.

zmyrgel commented 7 years ago

It doesn't seem to help:

tmy@phobos src $ GC_MARKERS=1 ./gosh -ftest -I./test ../test/control-mine.scm  
^C^C^CReceived too many signals before processing them.  Exitting for the emergency...
tmy@phobos src $ export GC_MARKERS=1
tmy@phobos src $ ./gosh -ftest -I./test ../test/control-mine.scm               
^C^C^CReceived too many signals before processing them.  Exitting for the emergency...
tmy@phobos src $ 
zmyrgel commented 7 years ago

And here's the above stuck test run with DEBUG_THREADS enabled in the GC:

tmy@phobos src $ ./gosh -ftest -I./test ../test/control-mine.scm  
Creating thread 0x9da620201e0
Stopping the world from 0x9da620201e0
World stopped from 0x9da620201e0
Pushing stacks from thread 0x9da620201e0
Stack for thread 0x9da620201e0 = [0x7f7ffffbded8,0x7f7ffffbf000)
World starting
World started
Stopping the world from 0x9da620201e0
World stopped from 0x9da620201e0
Pushing stacks from thread 0x9da620201e0
Stack for thread 0x9da620201e0 = [0x7f7ffffbdca8,0x7f7ffffbf000)
World starting
World started
Stopping the world from 0x9da620201e0
World stopped from 0x9da620201e0
Pushing stacks from thread 0x9da620201e0
Stack for thread 0x9da620201e0 = [0x7f7ffffbc118,0x7f7ffffbf000)
World starting
World started
Stopping the world from 0x9da620201e0
World stopped from 0x9da620201e0
Pushing stacks from thread 0x9da620201e0
Stack for thread 0x9da620201e0 = [0x7f7ffffbb448,0x7f7ffffbf000)
World starting
World started
Stopping the world from 0x9da620201e0
World stopped from 0x9da620201e0
Pushing stacks from thread 0x9da620201e0
Stack for thread 0x9da620201e0 = [0x7f7ffffbc0d8,0x7f7ffffbf000)
World starting
World started
About to start new thread from thread 0x9da620201e0
Started thread 0x9da86c9fc38
Starting thread 0x9da86c9fc38, pid = 32641, sp = 0x9da46a93890
Creating thread 0x9da86c9fc38
start_routine = 0x9da2bdab1c0
About to start new thread from thread 0x9da620201e0
Started thread 0x9daf1df5838
Starting thread 0x9daf1df5838, pid = 32641, sp = 0x9dabe00c770
Creating thread 0x9daf1df5838
start_routine = 0x9da2bdab1c0
Finishing thread 0x9daf1df5838
Called GC_thread_exit_proc on 0x9daf1df5838, gc_thread = 0x9d82781b840
Unregistering thread 0x9daf1df5838, gc_thread = 0x9d82781b840, n_threads = 3
Deleting thread 0x9daf1df5838, n_threads = 3
Stopping the world from 0x9da620201e0
Sending suspend signal to 0x9da86c9fc38
zmyrgel commented 7 years ago

Did a bit of testing. With GC_DONT_GC=1 the tests pass, so this seems to be a GC issue. Here's the output when running the control tests with GC_PRINT_STATS: output.txt

The last line is me aborting the test with "^C^C^C" combo.

shirok commented 7 years ago

Thanks. It is consistent with the hypothesis above that it's stuck in the nanosleep loop in GC_lock(). The fact that the GC test itself passes means either Gauche does something wrong, or we stepped on some corner-case bug.

The way I'd proceed is complicated. The thing is that the GC thread sends a suspend signal to all other threads and waits for every thread to say "hey, I'm stopping", but somehow some threads fail to respond. So either the suspend signal is lost or it is ignored. We could put a debug stub in the signal handler for the suspend signal and see which thread fails to react. If I had time I'd set up a VM running OpenBSD to delve into this myself, but I don't see that happening in the near future...

zmyrgel commented 7 years ago

When I was testing with the reduced test, it only ran 2 threads. I also ran the tests with the GC_RETRY_SIGNALS flag, so it sends the suspend signal multiple times, but it still didn't proceed. So I'd say the signal is not lost; most likely it's ignored. I'll try to get a meaningful backtrace from the threads to see if it gives some clue about what's causing the issue.

zmyrgel commented 7 years ago

Ok, I ran the test in gdb and it got stuck. I interrupted it and ran "thread apply all bt": debug.txt

Hmm, so thread 1 seems to initiate garbage collection from Scm_Cons, but in thread 2 Scm_Acons takes the GC lock, and they wait for each other?

shirok commented 7 years ago

Thanks, it does seem to be the ignored-signal case. Thread 1 has already got the lock, sends out the suspend signal to the other thread, and waits on the semaphore for everybody to stop. Thread 2 is somehow oblivious to that and naively tries to get the lock, finds it already locked, and spins on it. The fact that GC_RETRY_SIGNALS didn't help suggests we're blocking the suspend signal?