Closed mihails-strasuns closed 10 years ago
Looks like worker threads are currently completely broken. For a simple application, just enabling this functionality makes the server hang after completing the initial TCP handshake, with no data written.
I've set up an Arch VM and can't reproduce it - both Ctrl+C and making requests work fine. It also works on Windows and Ubuntu.
I can reliably reproduce it on both a native desktop installation and a VirtualBox instance, with both 0.7.18 and master. dmd / dub installed from the repository. What was your VM setup?
Hm ok, did you try with vibe.d master? There have been a couple of fixes since then. I'll try 0.7.18 to be sure it's nothing system specific.
Both 0.7.18 and master
Oh sorry, overlooked that. So that's strange then. I'm testing on a fresh system: Up-to-date Arch Linux x86-64, DMD 2.064.2, vibe.d master, DUB master, everything "dub upgrade"d - running on VMware. Tried with your example code, as well as the "bench-http-server" example.
Did you install dmd using pacman, or use the one from the .zip? Another obvious difference is VMware vs. VirtualBox. It must be something in the environment, but I don't see anything weird at first glance. I'll investigate the actual crash dumps in more detail, but that will take some time.
The one from the zip file. I guess VMware vs. Virtual Box shouldn't make a difference on this level, but who knows.
Smells bad. Will retest with the binary from the .zip shortly.
Phew, luckily it still crashes :)
Ok, will report after more detailed investigation.
Looks like a corrupted fiber switch; all core dumps have garbage in the registers, like this:
```
(gdb) info registers
rax    0x7f3081b70ba8  139846311414696
rbx    0x7f3081b70b80  139846311414656
rcx    0x7f3081b70ba8  139846311414696
rdx    0x61fc28        6421544
rsi    0x0             0
rdi    0x0             0
rbp    0x7fff9e1310b0  0x7fff9e1310b0
rsp    0x7fff9e1310a0  0x7fff9e1310a0
r8     0x7f3081b70b10  139846311414544
r9     0x1             1
r10    0x0             0
r11    0x246           582
r12    0x28e4214       42877460
r13    0x28e4208       42877448
r14    0x28e4d18       42880280
r15    0xffffffff      4294967295
rip    0x61fc3c        0x61fc3c <vibe.core.file.getFileInfo()+276>
eflags 0x10206         [ PF IF RF ]
cs     0x33            51
ss     0x2b            43
ds     0x0             0
es     0x0             0
fs     0x0             0
gs     0x0             0
```
Will need to do a very detailed investigation of event loop / fiber code to proceed so this is likely to be put on pause for now =/
In `libevent2.d:427`:

```d
~this()
{
    foreach (ts; m_waiters)
        event_free(ts.event);
}
```

Should be:

```d
~this()
{
    onThreadShutdown();
}
```
Unfortunately it can't, because the class is GC-managed. This means that any other GC references (including the mutex in particular) might already be finalized before the destructor is called. Also, locking a mutex in a destructor is generally useless, because any competing thread will access the already destroyed (and possibly freed) object after the mutex is unlocked, still resulting in a crash. It needs to be ensured in the higher-level code that no one accesses the object anymore when its destructor is called.
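To illustrate the constraint described above, here is a hypothetical sketch (class and method names are made up, not vibe.d's actual API): a GC-invoked destructor must not touch other GC-managed objects, so native resources are released through an explicit, deterministic call instead.

```d
// Hypothetical sketch: release resources via an explicit shutdown()
// call made by higher-level code, not via the GC-invoked destructor.
class EventWaiter {
    private bool m_shutDown;

    // Called deterministically while the object is guaranteed to no
    // longer be in use by any thread.
    void shutdown() nothrow
    {
        if (m_shutDown) return;
        m_shutDown = true;
        // release native resources (event_free etc.) here
    }

    ~this()
    {
        // Other GC references (e.g. a mutex member) may already be
        // finalized at this point, so only plain fields are inspected.
        assert(m_shutDown, "shutdown() must run before finalization");
    }
}
```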
In this particular case (the call stack above), it also doesn't make sense for the destructor to be called, because the `ManualEvent` is still referenced by the global `vibe.core.core.st_workerTaskSignal` and the main thread also owns it. However, `Libevent2Driver.registerObject` and `unregisterObject` look suspicious: `m_ownedObjects` has to be protected by the mutex, because `Libevent2Object.~this` may be called by the GC in a different thread than the one it was created in (which is bad, but what can you do).
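The mutex-protected (un)register fix described above could look roughly like this. This is a sketch based on the member names mentioned in the thread; `m_ownedObjectsMutex` and the surrounding driver class are assumptions, not the actual vibe.d code.

```d
import core.sync.mutex : Mutex;

// Sketch: all access to m_ownedObjects happens under the mutex,
// because the GC may run Libevent2Object.~this in a thread other
// than the one that created the object.
private void registerObject(Libevent2Object obj)
{
    synchronized (m_ownedObjectsMutex)
        m_ownedObjects[obj] = true;
}

private void unregisterObject(Libevent2Object obj)
{
    synchronized (m_ownedObjectsMutex)
        m_ownedObjects.remove(obj);
}
```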
I'll set up the VM with multiple cores; thinking about it, that seems to be the most likely cause of the differences.
Still can't reproduce it, even with multiple cores. I'll push a fix for the (un)register methods.
Using master with that fix, I was reliably getting the same backtrace, with a crash on `m_hasher = k => k.toHash();` because `k` is null. The registers are still a bit weird, but that is likely just a coincidence. After hacking that to be `m_hasher = k => k ? k.toHash() : 0`, I got a triggered assert in `libevent2.d:475`: `assert(!amOwner());` Any suggestions?
I've understood the issue now and will prepare a fix. Windows and Linux work very differently; Windows in particular currently "sidesteps" the issue, because the break signal arrives in a special thread where `s_eventLoopRunning` is naturally `false` and `exit(1)` is used to terminate the process.
On Linux, several issues are working together. First, the default hasher in `HashMap` for class and pointer types needs to be adjusted to cope with null values. Second, `Libevent2ManualEvent.wait()` currently can't handle being called from outside of a task. Third (not happening here), `exitEventLoop` doesn't shut down the main event loop when called from a non-main thread, which is probably not what is desired. I still can't reproduce the crash, for partially unknown reasons, but fortunately that shouldn't matter anymore.
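For the first point, the null guard is essentially the same hack shown earlier in the thread; as a sketch (the real `HashMap` internals may differ), the default hasher for class and pointer keys would become:

```d
// Object.toHash() on a null class reference dereferences null and
// crashes, so the default hasher guards against it (sketch only):
m_hasher = (Key k) => k is null ? 0 : k.toHash();
```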
So basically, the thread shutdown procedure needs to be adjusted to work from any thread and to use a `core.sync.condition.Condition` instead of a `Libevent2ManualEvent`. Improving `Libevent2ManualEvent.wait()` would be even better, but that is a lot more involved.
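A minimal sketch of the `Condition`-based shutdown wait described above (the names `g_mutex`, `g_workersDone`, and the helper functions are hypothetical, not vibe.d's actual code): a plain `Condition` works from any thread, whereas the fiber-based `ManualEvent` assumes a running task context.

```d
import core.sync.condition : Condition;
import core.sync.mutex : Mutex;

__gshared Mutex g_mutex;
__gshared Condition g_workersDone;
__gshared int g_activeWorkers;

shared static this()
{
    g_mutex = new Mutex;
    g_workersDone = new Condition(g_mutex);
}

// Called by each worker thread right before it exits.
void notifyWorkerExit()
{
    synchronized (g_mutex)
        if (--g_activeWorkers == 0)
            g_workersDone.notifyAll();
}

// Called by the shutting-down thread (main or otherwise); no
// fiber/task context is required.
void waitForWorkers()
{
    synchronized (g_mutex)
        while (g_activeWorkers > 0)
            g_workersDone.wait();
}
```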
I have tested the fix. Good news: it doesn't crash. Bad news: it hangs forever upon exit :) Though, as far as I remember, you mentioned this as some inherent libevent2 issue; is that still true?
It also seems to ignore even SIGTERM now; only SIGKILL / SIGABRT work.
That sucks! I need to somehow be able to reproduce it. For me it shuts down in a perfectly clean way... Could there be any setting that influences how things are handled in your OS installation? Otherwise, you could try to put debug log statements on (almost) each line of `onSignal`, `exitEventLoop` and `runEventLoop` so that the thread interactions become visible (with `setLogFormat(FileLogger.Format.threadTime)`).
BTW, SIGINT and SIGTERM have been handled the same before, too. Did SIGINT crash for you, but SIGTERM didn't?
Pretty much a default Arch x64 VirtualBox instance here, single-CPU setup. The only additional packages installed from the repos are git and the dlang stuff, so I can't imagine any special local configuration or overly clever daemons. I will provide extra trace information once I hit some pause at work.
> BTW, SIGINT and SIGTERM have been handled the same before, too. Did SIGINT crash for you, but SIGTERM didn't?
I didn't check SIGTERM before; I noticed that only when trying to kill the process. Btw, it does print `Received signal XX. Shutting down.`, but nothing observable happens after that.
Wait a minute. Using SIGTERM just outputs that the signal was received a bunch of times and then hangs for me, too. I'm investigating.
There were still two issues (one shared variable that was thought to be thread-local, and the signal handler was taking too long to finish). The code is now also written to unify handling across worker threads and normal threads, which makes it easier to understand. In the current state, everything works reliably for me using either "kill" or Ctrl+C.
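The "signal handler taking too long" point above follows the usual async-signal-safety rule: do almost nothing in the handler itself. As a hypothetical illustration (this is a sketch, not the actual vibe.d implementation; `breakEventLoop` is a made-up name):

```d
import core.atomic : atomicLoad, atomicStore;

// The handler only flips an atomic flag; the event loop polls the
// flag and performs the actual shutdown outside signal context.
shared bool g_exitRequested = false;

extern (C) void onSignal(int sig) nothrow @nogc @system
{
    atomicStore(g_exitRequested, true);
}

// Inside the event loop tick:
// if (atomicLoad(g_exitRequested)) breakEventLoop();
```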
Yay, works for me! You can call yourself a "telepathic debugging master" now :) Thanks, that was partially blocking my attempt to create automatic perf tests for vibe.d; back to it now.
P.S. This has also fixed the issue with hanging requests when worker threads / request distribution are enabled.
That's some truly awesome multi-threading handling with atomics. I'm always impressed by it because it's so modern.
Phew! :)
Okay, now it's time for me to finally take a closer look at the session topics!
Using a trivial listener app:
Crashes after hitting Ctrl+C (Arch Linux x64).
Backtrace: