Open zmyrgel opened 7 years ago
Thanks. I don't have OpenBSD env at hand, but I'll appreciate patches.
One thing - does SIGDEF(SIGXFSZ, SIGDEF_ERROR)
have any adverse effect? The signals used for GC thread start/suspend are excluded from the pool of signals Gauche should care about; that is currently done in main.c (sig_setup). Probably I should add a comment in signal.c to explain it.
I can add the SIGTHR thing. Or would you care to submit a PR?
When I get stuck in threads, one thing I do is attach gdb and look at where each thread is waiting. It may give you a clue (or may not: gdb doesn't show the Scheme-level stack trace).
Added the SIGTHR thingy: https://github.com/shirok/Gauche/commit/272f6618410455b997a8c7f0d660b23f00f1f7bc FreeBSD appears to have it too, although I don't know if it's used in the same way as on OpenBSD.
Please keep adding stuff here as you find other clues. I'm going to try OpenBSD on a VM whenever I get a chance, but I can't promise a timeframe.
Yeah, you're correct that SIGXFSZ doesn't need any special casing, as it's handled in sig_setup.
The "system" tests are the first issue, so let's focus on them first. Not surprisingly, the failures are in the signal handling tests.
$ ktrace -di -t cs ./gosh -ftest -I../test system.scm
Testing system ...
Could you run the system.scm test with the GAUCHE_TEST_REPORT_ERROR environment variable set? It should dump a Scheme stack trace whenever the test catches an error.
Actually, a user program can't set a signal handler for SIGTHR, so I excluded it from the master sigmask in https://github.com/shirok/Gauche/commit/226da3ebda2cd4aacbdfa6c768ef076ec5da7f6d I think this works around the "sigaction failed when setting a sighandler for signal 32" error.
Ah, with that it's a bit better, with different errors:
tmy@phobos src $ ktrace -di -t cs ./gosh -ftest -I../test system.scm
Testing system ...
I shouldn't test these before the morning coffee has had time to kick in. I removed a bit too much from system.scm while producing a reduced test case. After restoring the original system.scm, all tests pass.
Now I'm in the 'expected' state where the thread tests pass most of the time and the control tests' 'thread-pool' test always gets stuck.
Can you tell which thread-pool test hangs?
tmy@phobos src $ ktrace -di ./gosh -ftest -I../test control.scm
Testing control ...
Hi,
Getting back to this again. I compiled the latest git version of Gauche on OpenBSD-current. The basic thread tests seem to pass, but it still hangs on the control tests just like above.
I did a little tweaking in the control tests, and I assume there might be a resource leak in there which causes the hang.
There are two test groups in control.scm, control.job and control.thread-pool. On a normal run the test hangs at the start of control.thread-pool. I noticed that if I comment out the control.thread-pool tests, the control.scm tests always pass. If I do the reverse and comment out the control.job tests and run just the control.thread-pool tests, I get a bunch of errors like " got #<
From this I assume that the first group doesn't leave the control.job internals "clean", and that's what causes the problem in the control.thread-pool tests on OpenBSD. As each group works when run individually, the problem seems to be in control.job.
To reduce this even further: if I comment out the "job-wait, job-kill" test, the control.scm tests pass most of the time. It seems the later "thread pool termination, expects terminated" test sometimes hangs.
Hmm, I reversed the order of the control.scm tests and moved the "(use control.job)" into the beginning of the control.thread-pool tests. Now the tests seem to always pass.
I guess there is something lurking in the control.job tests, either in the tests themselves, the job handling, or the threads, that causes the issue on OpenBSD.
Hi @zmyrgel, thanks for the investigation. It does seem the control.job test breaks something, but it's not obvious what.
It's a bit old-fashioned, but could you try this? Insert a debug stub in thread-pool.scm and run test/control.scm with the original settings (i.e. the ones that hang). What output do you get right before the hang?
diff --git a/lib/control/thread-pool.scm b/lib/control/thread-pool.scm
index 330a1bd..39d4486 100644
--- a/lib/control/thread-pool.scm
+++ b/lib/control/thread-pool.scm
@@ -77,7 +77,7 @@
(set! (~ pool'pool)
(list-tabulate (~ pool'size)
(lambda (_)
- (thread-start! (make-thread (cut worker pool)))))))
+ #?=(thread-start! #?=(make-thread (cut worker pool)))))))
(define (thread-pool-results pool) (~ pool'result-queue))
(define (thread-pool-shut-down? pool) (~ pool'shut-down))
On my Linux box, it looks like this:
<control.thread-pool>----------------------------------------------------------
testing bindings in #<module control.thread-pool> ... ok
#?="../lib/control/thread-pool.scm":80:(thread-start! (debug-print (make-thread (cut worker pool))))
#?="../lib/control/thread-pool.scm":80:(make-thread (cut worker pool))
#?- #<thread #f (4) new 0x13bb400>
#?- #<thread #f (4) runnable 0x13bb400>
#?="../lib/control/thread-pool.scm":80:(thread-start! (debug-print (make-thread (cut worker pool))))
#?="../lib/control/thread-pool.scm":80:(make-thread (cut worker pool))
#?- #<thread #f (5) new 0x14bfc00>
#?- #<thread #f (5) runnable 0x14bfc00>
#?="../lib/control/thread-pool.scm":80:(thread-start! (debug-print (make-thread (cut worker pool))))
#?="../lib/control/thread-pool.scm":80:(make-thread (cut worker pool))
#?- #<thread #f (6) new 0x14bf400>
#?- #<thread #f (6) runnable 0x14bf400>
#?="../lib/control/thread-pool.scm":80:(thread-start! (debug-print (make-thread (cut worker pool))))
#?="../lib/control/thread-pool.scm":80:(make-thread (cut worker pool))
#?- #<thread #f (7) new 0x151ac00>
#?- #<thread #f (7) runnable 0x151ac00>
#?="../lib/control/thread-pool.scm":80:(thread-start! (debug-print (make-thread (cut worker pool))))
#?="../lib/control/thread-pool.scm":80:(make-thread (cut worker pool))
#?- #<thread #f (8) new 0x151a400>
#?- #<thread #f (8) runnable 0x151a400>
test pool, expects (5 #t 5 #f) ==> ok
Oh, so now the test goes up to "test thread pool termination, expects terminated" and then hangs? Hmm, this gets weirder...
Ah, sorry. I've been toying with the control tests, so they weren't at the default settings. I reverted all local changes to the control tests and got the following output:
root@phobos src $ ./gosh -ftest -I../test control
Testing control ...
<control.job>------------------------------------------------------------------
testing bindings in #<module control.job> ... ok
test make-job, expects #t ==> ok
test job-touch! acknowledge (pre), expects #f ==> ok
test job-touch! acknowledge, expects #t ==> ok
test job-touch! start (pre), expects #f ==> ok
test job-touch! start, expects #t ==> ok
test job-touch! finish (pre), expects #f ==> ok
test job-touch! finish, expects #t ==> ok
test job-run! precondition, expects (#f #f) ==> ok
test job-run! postcondition, expects (done ok #t) ==> ok
test job-run! error condition, expects (error gosh) ==> ok
test job-run! killed, expects (killed test) ==> ok
test job-wait, job-kill, expects (killed foo) ==> ok
test job-wait and error, expects (error "bang") ==> ok
With the following diff the tests pass.
With the reverted source, does the output stop at "test job-wait and error"? Or did the copy & paste miss some portion?
I'm afraid that changing the test order only hides the real problem, so if possible, I'd like to nail it. In the HEAD code, your output suggests that (make-thread-pool 5) at line 80 is where the execution gets stuck. The #?= debug output I asked for is to see where in make-thread-pool the hang occurs.
Yeah, I didn't mean that this should be fixed by modifying the test; I just wanted to show exactly what kind of test seems to work. That's the full output I get. I'm attaching the full ktrace output as well: dump.txt
It seems the tests above "job-wait, job-kill" don't affect this. If I comment those out, the test still gets stuck after running the "job-wait and error" test:
tmy@phobos src $ ./gosh -ftest -I./test ../test/control.scm
Testing control ...
<control.job>------------------------------------------------------------------
testing bindings in #<module control.job> ... ok
test job-wait, job-kill, expects (killed foo) ==> ok
test job-wait and error, expects (error "bang") ==> ok
Hmm, in the above output the "job-wait and error" test seems to succeed, as it shows the ok line, but the next step won't even start, so it hangs outside the tests? Looking through the dump output, it seems to load the thread-control sources, and soon after that it suspends a thread which goes to sleep and never wakes up.
If I comment out the "job-wait, job-kill" test, the tests proceed until they hang on "test thread pool termination, expects terminated". Could changing the order of the tests somehow trigger the wakeup call for the background thread so it doesn't stay asleep?
According to your ktrace output, nanosleep is being called repeatedly, so it's not that all threads are deadlocked. At least some threads are awake, spinning and polling for some condition to be met. I can't put my finger on it yet, but that might be a clue.
The nanosleep loop also doubles the wait period each time, until it reaches about 16 ms. GC_lock in pthread_support.c does exactly that. (AO_pause in atomic_ops.c also uses nanosleep, but its wait period calculation is a bit different.) So I suspect some thread is hanging while holding the global GC lock.
Hmm, I'm trying to find the smallest code that triggers this.
(use gauche.test)
(use data.queue)
(use gauche.threads)
(use control.job)
(let* ([gate (make-mtqueue :max-length 0)]
[job (make-job (^[] (enqueue/wait! gate #t) (sys-sleep 10))
:waitable #t)]
[t1 (thread-start! (make-thread (^[] (job-run! job))))]
[t2 (thread-start! (make-thread (^[] (job-wait job))))])
(dequeue/wait! gate)
(job-mark-killed! job 'foo)
(list (thread-join! t2) (job-result job)))
(use control.thread-pool)
(display "DONE") (newline)
The above hangs on every run I tested. But if I remove the "(use control.thread-pool)" or "(use gauche.test)" parts it passes, which is pretty odd. Does using control.thread-pool or gauche.test allocate the hanging thread?
Nah, neither gauche.test nor control.thread-pool creates threads just by being loaded. Control.thread-pool does create threads when a thread pool is created, but that's not the case here.
Can you test one thing? Set the environment variable GC_MARKERS to 1. It turns off parallel GC.
Doesn't seem to help:
tmy@phobos src $ GC_MARKERS=1 ./gosh -ftest -I./test ../test/control-mine.scm
^C^C^CReceived too many signals before processing them. Exitting for the emergency...
tmy@phobos src $ export GC_MARKERS=1
tmy@phobos src $ ./gosh -ftest -I./test ../test/control-mine.scm
^C^C^CReceived too many signals before processing them. Exitting for the emergency...
tmy@phobos src $
And here's the above stuck test run with DEBUG_THREADS enabled in the GC:
tmy@phobos src $ ./gosh -ftest -I./test ../test/control-mine.scm
Creating thread 0x9da620201e0
Stopping the world from 0x9da620201e0
World stopped from 0x9da620201e0
Pushing stacks from thread 0x9da620201e0
Stack for thread 0x9da620201e0 = [0x7f7ffffbded8,0x7f7ffffbf000)
World starting
World started
Stopping the world from 0x9da620201e0
World stopped from 0x9da620201e0
Pushing stacks from thread 0x9da620201e0
Stack for thread 0x9da620201e0 = [0x7f7ffffbdca8,0x7f7ffffbf000)
World starting
World started
Stopping the world from 0x9da620201e0
World stopped from 0x9da620201e0
Pushing stacks from thread 0x9da620201e0
Stack for thread 0x9da620201e0 = [0x7f7ffffbc118,0x7f7ffffbf000)
World starting
World started
Stopping the world from 0x9da620201e0
World stopped from 0x9da620201e0
Pushing stacks from thread 0x9da620201e0
Stack for thread 0x9da620201e0 = [0x7f7ffffbb448,0x7f7ffffbf000)
World starting
World started
Stopping the world from 0x9da620201e0
World stopped from 0x9da620201e0
Pushing stacks from thread 0x9da620201e0
Stack for thread 0x9da620201e0 = [0x7f7ffffbc0d8,0x7f7ffffbf000)
World starting
World started
About to start new thread from thread 0x9da620201e0
Started thread 0x9da86c9fc38
Starting thread 0x9da86c9fc38, pid = 32641, sp = 0x9da46a93890
Creating thread 0x9da86c9fc38
start_routine = 0x9da2bdab1c0
About to start new thread from thread 0x9da620201e0
Started thread 0x9daf1df5838
Starting thread 0x9daf1df5838, pid = 32641, sp = 0x9dabe00c770
Creating thread 0x9daf1df5838
start_routine = 0x9da2bdab1c0
Finishing thread 0x9daf1df5838
Called GC_thread_exit_proc on 0x9daf1df5838, gc_thread = 0x9d82781b840
Unregistering thread 0x9daf1df5838, gc_thread = 0x9d82781b840, n_threads = 3
Deleting thread 0x9daf1df5838, n_threads = 3
Stopping the world from 0x9da620201e0
Sending suspend signal to 0x9da86c9fc38
I did a bit of testing. With GC_DONT_GC=1 the tests pass, so this seems to be a GC issue. Here's the output when running the control tests with GC_PRINT_STATS: output.txt
The last line is me aborting the test with "^C^C^C" combo.
Thanks. It is consistent with the hypothesis above that it's stuck in the nanosleep loop in GC_lock(). The fact that the GC test itself passes means either Gauche does something wrong, or we stepped on some corner-case bug.
The way I'd proceed is complicated. The thing is that the GC thread sends a suspend signal to all other threads and waits for every thread to say "hey, I'm stopping", but somehow some threads fail to respond. So either the suspend signal is lost or ignored. We could put a debug stub in the signal handler for the suspend signal and see which thread fails to react. If I had time I'd set up a VM running OpenBSD to delve into this myself, but I don't see that happening in the near future...
When I was testing with the reduced test, it only ran 2 threads. I also ran the tests with the GC_RETRY_SIGNALS flag so it sends the suspend signal multiple times, but it still didn't proceed. So I'd say the signal is not lost but most likely ignored. I'll try to get a meaningful backtrace from the threads; it might give some clue about what's causing the issue.
Ok, I ran the test in gdb and it got stuck. I interrupted it and ran "thread apply all bt": debug.txt
Hmm, so thread 1 seems to initiate garbage collection from Scm_Cons, but in thread 2 Scm_Acons takes the GC_lock, and they wait for each other?
Thanks, it does look like the ignored-signal case. Thread 1 has already got the lock, sends out the suspend signal to the other thread, and waits on the semaphore for everybody to stop. Thread 2 is somehow oblivious to that and naively tries to get the lock, finds it already locked, and spins on it. The fact that GC_RETRY_SIGNALS didn't work suggests we're blocking the suspend signal?
Making an issue out of this so it's easier to track progress. This was reported on the mailing list a while back. Gauche 0.9.5 doesn't currently work on OpenBSD; it compiles, but the tests won't pass.
configured as:
Gauche compiles, but the tests get stuck on "testing threads" in the "atom" tests part.
OpenBSD facts:
For one thing, Gauche src/signal.c lists
Considering OpenBSD uses that as the thread suspend signal, shouldn't it be:
Also, the OpenBSD thread implementation, rthreads, seems to use SIGTHR, but Gauche doesn't mention it in src/signal.c at all. Should Gauche include:
With the above changes Gauche compiles, and the tests pass until they lock up on the "system" test. This hangs so badly I can't abort the test without manually killing the gosh process. If I skip that test, I get all the way to the "Testing control" test, which hangs and needs to be interrupted. The thread tests pass most of the time, roughly 9 out of 10 runs, but the tests always get stuck on the "control" test.
The GC tests pass just fine.