ferrouswheel opened this issue 6 years ago
That backtrace is a typical "double free or malloc pool corruption" bug. It happens because the dependencies between the various shared libs in the atomspace (and/or in opencog) are messed up. What's happening is that various shlib dtors run twice (instead of once) because ... something somewhere gets confused. For example, deleting the cogserver might sometimes cause this. Deleting the attentionbank sometimes causes this.
The shared-library dtors that run twice are usually the ones that create the atom types .. something causes the atom types to be deleted twice. You can see this directly by putting a printf in the dtor and watching it get printed twice, or with a gdb breakpoint. You can view both stack traces, and say to yourself that they both look like perfectly valid, perfectly legit ways to get to the dtor, and so you have to poke around to see why ... and it's usually something goofy like: deleting the cogserver causes the atomspace to be deleted, which causes the attentionbank to be deleted, which causes the atomspace to be deleted a second time ... stuff like that. It's done by the dtors because there are global singleton shared pointers that hold these objects, which means they cannot get deleted until the shared library is unloaded .. but by the time the shared library is unloaded, the methods on the C++ objects are already gone (because some other needed shared lib has been unloaded...) .. circular dependency hell.
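As a concrete illustration of the "printf in the dtor" trick, here is a minimal, hypothetical sketch -- the class name is made up and is not the actual atom-type registration code -- that makes a second dtor run impossible to miss:

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical stand-in for whatever object registers the atom types.
class TypeRegistry
{
public:
    ~TypeRegistry()
    {
        static int runs = 0;
        ++runs;
        // Put a gdb breakpoint here to capture both backtraces.
        std::printf("~TypeRegistry() run #%d on %p\n", runs, (void*) this);
        assert(runs == 1 && "dtor ran twice -- double-teardown somewhere");
    }
};
```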
It's kind of tedious to track these down, and the (ugly) work-around is to just not delete things, and let glibc do the cleanup itself on process exit. In short, problems with the shared-lib plumbing. Best handled by someone with a lot of spare time on their hands, and lots of linker/loader experience (or who wants to learn how the linker/loader works, which is actually a rather large, hairy OS subsystem -- so not for the weak of heart).
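For what the "just don't delete it" workaround looks like in practice, a minimal sketch (with invented names, not the actual atomspace code): hold the singleton behind a pointer that is never freed, so no dtor can run at shlib-unload time and the OS reclaims the memory at process exit.

```cpp
// Illustrative only; names are made up.
class TypeRegistry { /* ... */ };

TypeRegistry& type_registry()
{
    // Constructed on first use, intentionally leaked: no static-destruction
    // or shared-library unload-order problems, at the cost of a benign
    // "leak" reported at process exit.
    static TypeRegistry* instance = new TypeRegistry();
    return *instance;
}
```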
Why it's intermittent, I'm not sure. It's possible (for example) that the cogserver gets one more command over the network, and so it thinks it has one more thing to do, but the atomspace is already gone. I thought I'd fixed all of these, by making sure that all network connections are closed, all command buffers flushed, and all threads done and exited, before allowing the cogserver to shut down. But I dunno. Maybe I missed a spot. Again -- it's really hairy code.
Things like reworking the agent-runner/system activity table could trigger this, e.g. because you are holding a pointer to the system activity table when the singleton has already been deleted. Stuff like that. Did you ever see this bug before reworking the system activity table code?
I have so far observed 3 types of failure:

testMultiStream:
    terminate called after throwing an instance of 'concurrent_queue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*>::Canceled'
    what(): std::exception

testDrain:
    ShellUTest.cxxtest:78: Error: Expected (281 <= reso.size()), found (281 > 280)

testControlC:
    ShellUTest.cxxtest:265: Error: Expected (end - start < 2), found (42 >= 2)
I've only observed that last one on CircleCI, never locally, but unfortunately it's the most frequent cause of failure there.
Error 1 happens if some thread attempts to use the queue after it has been canceled. The queue is canceled when the network connection is being closed. Guess: maybe there is some race, with more traffic coming across the network even though it's being shut down.
Error 2: given the sizes, it would seem that the final carriage-return was not delivered. Why? Dunno. Another race.
Error 3: The control-C message should be two bytes long. Why is it 42 bytes long? Dunno .. that's a lot of bytes that should not be there. Maybe they are an error message? Bytes that should have gone to someone else? Printing them out would provide a clue.
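To make the Error-1 race concrete, here is a minimal, self-contained sketch, assuming (as the backtrace suggests) a cogutil-style concurrent_queue whose cancel() makes any further push/pop throw a Canceled exception. This is illustrative only, not the actual cogserver code; the point is that if Canceled escapes a thread uncaught, the result is exactly the "terminate called after throwing ..." seen in testMultiStream.

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

// Toy cancelable queue; cancel() makes any further push/pop throw Canceled.
template <typename T>
class concurrent_queue
{
public:
    struct Canceled {};

    void push(const T& v)
    {
        std::lock_guard<std::mutex> lk(_mtx);
        if (_canceled) throw Canceled();
        _q.push(v);
        _cv.notify_one();
    }

    T pop()
    {
        std::unique_lock<std::mutex> lk(_mtx);
        _cv.wait(lk, [this] { return _canceled || !_q.empty(); });
        if (_canceled) throw Canceled();
        T v = _q.front();
        _q.pop();
        return v;
    }

    void cancel()
    {
        std::lock_guard<std::mutex> lk(_mtx);
        _canceled = true;
        _cv.notify_all();
    }

private:
    std::queue<T> _q;
    std::mutex _mtx;
    std::condition_variable _cv;
    bool _canceled = false;
};

int main()
{
    concurrent_queue<std::string> out;

    // "Network" thread keeps queueing replies while the connection closes.
    std::thread writer([&out] {
        try {
            for (int i = 0; i < 1000; i++) out.push("reply");
        } catch (const concurrent_queue<std::string>::Canceled&) {
            // Fine if caught; std::terminate (as in testMultiStream) if not.
            std::puts("writer hit Canceled");
        }
    });

    out.cancel();   // connection closed: cancel the queue mid-traffic
    writer.join();
}
```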
You might think that someone out there would have written a generic shell server that is fast, bullet-proof, and supports login passwords and ssh, etc. Then we could replace the cogserver with that. But I was never able to find such a thing.
Not sure it's related, but I noticed that ServerSocket::SetCloseAndDelete() doesn't always have time to complete, because the CogServer shuts down (leading the unit test to halt) before that.

OK, I understand why ServerSocket::SetCloseAndDelete() doesn't always complete: the connection handler is detached from the NetworkServer at its creation, see NetworkServer::listen():

    std::thread(&ConsoleSocket::handle_connection, ss).detach();

Then, when the connection stops, the while loop in ServerSocket::handle_connection() breaks, and it eventually calls its own destruction:

    delete this;

So if the program ends before that line completes, the socket never gets fully closed. I have no idea, however, whether that could be a problem, nor whether it is related to this issue.
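To illustrate the shutdown race being described, a toy sketch with made-up names (not the real classes): nothing ever joins a detached handler thread, so the process can reach exit, and start running static destructors, while the handler is still inside its loop or its final delete this.

```cpp
#include <chrono>
#include <cstdio>
#include <thread>

struct ConsoleSocketToy
{
    void handle_connection()
    {
        // Stand-in for the read/eval/print loop on the socket.
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        std::puts("handler done, self-deleting");
        delete this;   // never reached if the process exits first
    }
};

int main()
{
    auto* ss = new ConsoleSocketToy;
    std::thread(&ConsoleSocketToy::handle_connection, ss).detach();
    // main() returns immediately; nothing waits for the detached thread,
    // so static/global destructors run while the handler may still be live.
}
```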
I've added a sleep(1) in ShellUTest::tearDown(), and obviously now ServerSocket::SetCloseAndDelete() always completes. However, I don't know whether that actually fixes the bug (I ran 27 times without a crash, but that could just be luck, since the failure is rare).
Thanks @ngeiswei !! That's a very plausible explanation. The code was carefully crafted to try to avoid exactly this kind of race, but it is sufficiently complicated that perhaps a spot was missed. There are locks and flags and semaphores that get set and double-checked, to prevent out-of-order shutdown.
The only way I know of debugging this is decidedly old-school: put prints or loggers or traces at all the places that should be passed through during an orderly shutdown, and then decode what went out of order when it doesn't.
The cogserver destructor is supposed to wait for the network server destructor to finish. The network server destructor is supposed to wait for all threads to join. The threads are not supposed to join until they are all done. So, in principle, everything should have waited for ServerSocket to finish. ...
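A minimal sketch of that intended ordering, under the assumption that the server tracks live handler threads with a counter and its destructor blocks until the count drops to zero (class and member names here are invented):

```cpp
#include <condition_variable>
#include <mutex>

// Invented names; just the shape of "wait until every handler has finished".
class NetworkServerToy
{
public:
    void handler_started()
    {
        std::lock_guard<std::mutex> lk(_mtx);
        ++_live_handlers;
    }

    void handler_finished()
    {
        std::lock_guard<std::mutex> lk(_mtx);
        --_live_handlers;
        _cv.notify_all();
    }

    ~NetworkServerToy()
    {
        // Shutdown must not proceed while any detached handler is running.
        std::unique_lock<std::mutex> lk(_mtx);
        _cv.wait(lk, [this] { return _live_handlers == 0; });
    }

private:
    std::mutex _mtx;
    std::condition_variable _cv;
    int _live_handlers = 0;
};
```

If any handler exit path skips handler_finished(), the destructor waits forever; if the count is decremented before the socket has actually finished tearing down, the destructor returns too early. Either way, a missed spot of exactly this kind could produce the observed intermittent failures.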
Testing: I've got zero failures on 30 repeats over 90 minutes of testing. This is with boost-1.62.
I'm on Debian stable, with gcc --version reporting gcc (Debian 8.3.0-6) 8.3.0, ldd --version reporting ldd (Debian GLIBC 2.28-10) 2.28, and boost version 1.62.
I'm wondering if it's a glibc bug or a boost::asio bug.
5 failures out of 22 repeats with boost-1.67 -- so I guess it's a boost bug. Will double-check with boost-1.62 ... crap. Debian no longer has boost-1.62 ... can't test.
The boost-1.67 crashes include at least 4 different kinds: corrupted size vs. prev_size, Segmentation fault, double free detected in tcache, and corrupted size vs. prev_size.
Seems that Debian ONLY has boost-1.67, and without some easy way of getting back to a working version of boost, I'm not sure what to do ...
Presumably, it's a boost::asio bug .. boost::asio has been a headache for 5+ years ... the easiest solution would be to rewrite the code to just not use it ... which maybe isn't hard ...?
OK, as a boost::asio replacement, there is just plain-old asio: in Debian, libasio-dev; otherwise https://think-async.com/Asio/ .. but it seems that boost asio is just a copy/fork of this. Given how buggy and unstable asio has been over the past 5+ years, this feels like the wrong direction to go in...
If we instead assume that asio is not buggy, and that it is instead throwing some new kind of error that we are not yet handling, then network/ServerSocket.cc is the place to look. It already handles nine different kinds of asio errors; maybe there's a new, tenth kind of error ...
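Here is a hedged sketch of that kind of error triage -- not the actual ServerSocket.cc code, just the shape of it: recognize the known error codes, and give any unrecognized ("tenth") kind a log line and a clean close instead of letting it escape the handler thread.

```cpp
#include <boost/asio.hpp>
#include <iostream>

// Returns true if the connection should be closed. Only a sketch of the
// triage; the real list of handled errors lives in network/ServerSocket.cc.
bool handle_read_error(const boost::system::error_code& ec)
{
    namespace asio_err = boost::asio::error;

    if (!ec) return false;                              // no error, keep going
    if (ec == asio_err::eof) return true;               // remote closed cleanly
    if (ec == asio_err::connection_reset) return true;  // remote went away
    if (ec == asio_err::operation_aborted) return true; // we are shutting down
    // ... the other known cases would go here ...

    // Unrecognized ("tenth kind of") error: log it so the next intermittent
    // failure leaves a clue, then close the connection anyway.
    std::cerr << "unexpected asio error: " << ec.message() << std::endl;
    return true;
}
```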
Is this issue still current?
Well, boost asio is certainly still as buggy as ever. Debian-stable is still on boost-1.67 .... I'd like to see all uses of boost in opencog removed, but it's a long, hard task...
I mean, does this issue still occur? I haven't seen it for a long time.
Well, in comment https://github.com/opencog/cogserver/issues/5#issuecomment-560156702 I put it into a loop and saw 5 crashes out of 22 runs. Nothing has changed since then ... I can try the test again. Here's my shell script:
#! /bin/bash
date
cnt=0
for i in `seq 1 1000000`;
do
    ./tests/shell/ShellUTest >> /dev/null
    # ./tests/scm/MultiThreadUTest >> /dev/null
    # ./tests/rule-engine/backwardchainer/BackwardChainerUTest >> /dev/null
    rc=$?
    if [ $rc != 0 ]; then
        cnt=$((cnt+1))
        date
        # Halt on the first crash, save the logfile
        # exit 1;
    fi
    #if [ $(($i % 10)) == 0 ]; then
        echo duude iter $i result --------- rc = $rc ncrashes = $cnt
    #fi
done
I re-ran this, and am getting 8 crashes out of 48 runs, for a failure rate of 1 in 6. This is Debian-stable, boost 1.67, gcc 8.3.0 and glibc 2.28.
Other intermittent test failures include the pattern-miner SurprisingnessUTest .. between that and the rule-engine, I have to hit the CircleCI "re-run unit tests" button two, three, or four times before I can get a run that passes all tests. The intermittent failures are very annoying ...
I get sporadic failures with ShellUTest, but have only noticed them with a Debug build.
Compiler is the standard one for Ubuntu 16.04: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
Maybe related to opencog/opencog#2542? Though a cause was found and fixed in that issue.
The output is: