Closed mihails-strasuns closed 10 years ago
Looks like worker threads are currently completely broken. For a simple application, just enabling this functionality makes the server hang after completing the initial TCP handshake, with no data written.
I've set up an Arch VM and can't reproduce it - both Ctrl+C and making requests work fine. It also works on Windows and Ubuntu.
I can reliably reproduce it on both a native desktop installation and a VirtualBox instance, with both 0.7.18 and master. dmd / dub installed from the repository. What was your VM setup?
Hm ok, did you try with vibe.d master? There have been a couple of fixes since then. I'll try 0.7.18 to be sure it's nothing system specific.
Both 0.7.18 and master
Oh sorry, overlooked that. So that's strange then. I'm testing on a fresh system: Up-to-date Arch Linux x86-64, DMD 2.064.2, vibe.d master, DUB master, everything "dub upgrade"d - running on VMware. Tried with your example code, as well as the "bench-http-server" example.
Did you install dmd using pacman, or use the one from the .zip? Another obvious difference is VMware vs. VirtualBox. It must be something in the environment, but I don't see anything weird at first glance. I'll investigate the actual crash dumps in more detail, but that will take some time.
The one from the zip file. I guess VMware vs. Virtual Box shouldn't make a difference on this level, but who knows.
Smells bad. Will retest with the binary from the .zip shortly.
Phew, luckily it still crashes :)
Ok, will report after more detailed investigation.
Looks like a corrupted fiber switch; all core dumps have garbage in the registers, like this:
```
(gdb) info registers
rax    0x7f3081b70ba8  139846311414696
rbx    0x7f3081b70b80  139846311414656
rcx    0x7f3081b70ba8  139846311414696
rdx    0x61fc28        6421544
rsi    0x0             0
rdi    0x0             0
rbp    0x7fff9e1310b0  0x7fff9e1310b0
rsp    0x7fff9e1310a0  0x7fff9e1310a0
r8     0x7f3081b70b10  139846311414544
r9     0x1             1
r10    0x0             0
r11    0x246           582
r12    0x28e4214       42877460
r13    0x28e4208       42877448
r14    0x28e4d18       42880280
r15    0xffffffff      4294967295
rip    0x61fc3c        0x61fc3c <vibe.core.file.getFileInfo()+276>
eflags 0x10206         [ PF IF RF ]
cs     0x33            51
ss     0x2b            43
ds     0x0             0
es     0x0             0
fs     0x0             0
gs     0x0             0
```
Will need to do a very detailed investigation of event loop / fiber code to proceed so this is likely to be put on pause for now =/
In `libevent2.d:427`:

```d
~this()
{
    foreach (ts; m_waiters)
        event_free(ts.event);
}
```

Should be:

```d
~this()
{
    onThreadShutdown();
}
```
Unfortunately it can't, because the class is GC-managed. This means that any other GC references (including the mutex in particular) might already be finalized before the destructor is called. Also, locking a mutex in a destructor is generally useless, because any competing thread will access the already destroyed (and possibly freed) object after the mutex is unlocked, still resulting in a crash. It needs to be ensured in the higher-level code that no one accesses the object anymore when its destructor is called.
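To illustrate the constraint described above, here is a hypothetical sketch (class and method names are made up, not vibe.d's actual API): a GC-invoked destructor must not touch other GC-managed objects, so native resources are released through an explicit, deterministic call instead.

```d
// Hypothetical sketch: release resources via an explicit shutdown()
// call made by higher-level code, not via the GC-invoked destructor.
class EventWaiter {
    private bool m_shutDown;

    // Called deterministically while the object is guaranteed to no
    // longer be in use by any thread.
    void shutdown() nothrow
    {
        if (m_shutDown) return;
        m_shutDown = true;
        // release native resources (event_free etc.) here
    }

    ~this()
    {
        // Other GC references (e.g. a mutex member) may already be
        // finalized at this point, so only plain fields are inspected.
        assert(m_shutDown, "shutdown() must run before finalization");
    }
}
```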
In this particular case (the call stack above), it also doesn't make sense for the destructor to be called, because the `ManualEvent` is still referenced by the global `vibe.core.core.st_workerTaskSignal` and the main thread also owns it. However, `Libevent2Driver.registerObject` and `unregisterObject` look suspicious: `m_ownedObjects` has to be protected by the mutex, because `Libevent2Object.~this` may be called by the GC in a different thread than the one it was created in (which is bad, but what can you do).
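The mutex-protected (un)register fix described above could look roughly like this. This is a sketch based on the member names mentioned in the thread; `m_ownedObjectsMutex` and the surrounding driver class are assumptions, not the actual vibe.d code.

```d
import core.sync.mutex : Mutex;

// Sketch: all access to m_ownedObjects happens under the mutex,
// because the GC may run Libevent2Object.~this in a thread other
// than the one that created the object.
private void registerObject(Libevent2Object obj)
{
    synchronized (m_ownedObjectsMutex)
        m_ownedObjects[obj] = true;
}

private void unregisterObject(Libevent2Object obj)
{
    synchronized (m_ownedObjectsMutex)
        m_ownedObjects.remove(obj);
}
```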
I'll set up the VM with multiple cores; thinking about it, that seems to be the most likely cause of the differences.
Still can't reproduce it, even with multiple cores. I'll push a fix for the (un)register methods.
Using master with that fix, I was reliably getting the same backtrace, with a crash on `m_hasher = k => k.toHash();` because `k` is null. The registers are still a bit weird, but that is likely just a coincidence. After hacking that to be `m_hasher = k => k ? k.toHash() : 0`, I got a triggered assert in `libevent2.d:475`: `assert(!amOwner());` Any suggestions?
I've understood the issue now and will prepare a fix. Windows and Linux work very differently; Windows in particular currently "sidesteps" the issue, because the break signal arrives in a special thread where `s_eventLoopRunning` is naturally `false` and `exit(1)` is used to terminate the process.
On Linux, several issues are working together. First, the default hasher in `HashMap` for class and pointer types needs to be adjusted to cope with null values. Second, `Libevent2ManualEvent.wait()` currently can't handle being called from outside of a task. Third (not happening here), `exitEventLoop` doesn't shut down the main event loop when called from a non-main thread, which is probably not what is desired. I still can't reproduce the crash, for partially unknown reasons, but fortunately that shouldn't matter anymore.
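For the first point, the null guard is essentially the same hack shown earlier in the thread; as a sketch (the real `HashMap` internals may differ), the default hasher for class and pointer keys would become:

```d
// Object.toHash() on a null class reference dereferences null and
// crashes, so the default hasher guards against it (sketch only):
m_hasher = (Key k) => k is null ? 0 : k.toHash();
```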
So basically, the thread shutdown procedure needs to be adjusted to work from any thread and to use a `core.sync.condition.Condition` instead of a `Libevent2ManualEvent`. Improving `Libevent2ManualEvent.wait()` would be even better, but that is a lot more involved.
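A minimal sketch of the `Condition`-based shutdown wait described above (the names `g_mutex`, `g_workersDone`, and the helper functions are hypothetical, not vibe.d's actual code): a plain `Condition` works from any thread, whereas the fiber-based `ManualEvent` assumes a running task context.

```d
import core.sync.condition : Condition;
import core.sync.mutex : Mutex;

__gshared Mutex g_mutex;
__gshared Condition g_workersDone;
__gshared int g_activeWorkers;

shared static this()
{
    g_mutex = new Mutex;
    g_workersDone = new Condition(g_mutex);
}

// Called by each worker thread right before it exits.
void notifyWorkerExit()
{
    synchronized (g_mutex)
        if (--g_activeWorkers == 0)
            g_workersDone.notifyAll();
}

// Called by the shutting-down thread (main or otherwise); no
// fiber/task context is required.
void waitForWorkers()
{
    synchronized (g_mutex)
        while (g_activeWorkers > 0)
            g_workersDone.wait();
}
```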
I have tested the fix. Good news: it doesn't crash. Bad news: it hangs forever upon exit :) Though, as far as I remember, you mentioned this as some inherent libevent2 issue; is that still true?
It also seems to ignore even SIGTERM now; only SIGKILL / SIGABRT work.
That sucks! I need to somehow be able to reproduce it. For me it shuts down in a perfectly clean way... Could there be any setting that influences how things are handled in your OS installation? Otherwise, you could try to put debug log statements on (almost) each line of `onSignal`, `exitEventLoop` and `runEventLoop` so that the thread interactions become visible (with `setLogFormat(FileLogger.Format.threadTime)`).
BTW, SIGINT and SIGTERM have been handled the same before, too. Did SIGINT crash for you, but SIGTERM didn't?
Pretty much a default Arch x64 VirtualBox instance here, single-CPU setup. The only additional packages installed from the repos are git and the dlang stuff, so I can't imagine any special local configuration or overly clever daemons. I will provide extra trace information once I hit some pause at work.
> BTW, SIGINT and SIGTERM have been handled the same before, too. Did SIGINT crash for you, but SIGTERM didn't?
I didn't check SIGTERM before; I noticed that only when trying to kill the process. Btw, it does print `Received signal XX. Shutting down.`, but nothing observable happens after that.
Wait a minute. Using SIGTERM just outputs that the signal was received a bunch of times and then hangs for me, too. I'm investigating.
There were still two issues (one shared variable that was thought to be thread-local, and the signal handler was taking too long to finish). The code is now also written to unify handling across worker threads and normal threads, which makes it easier to understand. In the current state, everything works reliably for me using either "kill" or Ctrl+C.
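The "signal handler taking too long" point above follows the usual async-signal-safety rule: do almost nothing in the handler itself. As a hypothetical illustration (this is a sketch, not the actual vibe.d implementation; `breakEventLoop` is a made-up name):

```d
import core.atomic : atomicLoad, atomicStore;

// The handler only flips an atomic flag; the event loop polls the
// flag and performs the actual shutdown outside signal context.
shared bool g_exitRequested = false;

extern (C) void onSignal(int sig) nothrow @nogc @system
{
    atomicStore(g_exitRequested, true);
}

// Inside the event loop tick:
// if (atomicLoad(g_exitRequested)) breakEventLoop();
```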
Yay, works for me! You can call yourself a "telepathic debugging master" now :) Thanks, that was partially blocking my attempt to create automatic perf tests for vibe.d; back to it now.
P.S. This has also fixed the issue with hanging requests when worker threads / request distribution are enabled.
That's some truly awesome multi-threading handling with atomics. I'm always impressed by it because it's so modern.
Phew! :)
Okay, now it's time for me to finally take a closer look at the session topics!
Using a trivial listener app:
Crashes after hitting Ctrl+C (Arch Linux x64).
Backtrace: