ferrouswheel opened this issue 6 years ago
That backtrace is a typical "double free or malloc pool corruption" bug. It happens because the dependencies between the various shared libs in the atomspace (and/or in opencog) are messed up. What's happening is that various shlib dtors run twice (instead of once) because ... something somewhere gets confused. For example, deleting the cogserver might sometimes cause this. Deleting the attentionbank sometimes causes this.
The shared-library dtors that run twice are usually the ones that create the atom types .. something causes the atom types to be deleted twice. You can see this directly by putting a printf in the dtor and watching it get printed twice, or with a gdb breakpoint. You can view both stack traces, and say to yourself that they both look like perfectly valid, perfectly legit ways to get to the dtor, and so you have to poke around to see why ... and it's usually something goofy like: deleting the cogserver causes the atomspace to be deleted, which causes the attentionbank to be deleted, which causes the atomspace to be deleted a second time ... stuff like that. It's done by the dtors because there are global singleton shared pointers that hold these objects, which means they cannot get deleted until the shared library is unloaded .. but by the time the shared library is unloaded, the methods on the C++ objects are already gone (because some other needed shared lib has been unloaded...) .. circular dependency hell.
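As a concrete illustration of the "printf in the dtor" trick, here is a minimal, hypothetical sketch -- the class name is made up and is not the actual atom-type registration code -- that makes a second dtor run impossible to miss:

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical stand-in for whatever object registers the atom types.
class TypeRegistry
{
public:
    ~TypeRegistry()
    {
        static int runs = 0;
        ++runs;
        // Put a gdb breakpoint here to capture both backtraces.
        std::printf("~TypeRegistry() run #%d on %p\n", runs, (void*) this);
        assert(runs == 1 && "dtor ran twice -- double-teardown somewhere");
    }
};
```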
It's kind of tedious to track these down, and the (ugly) work-around is to just not delete things, and let glibc do the cleanup itself on process exit. In short, problems with the shared-lib plumbing. Best handled by someone with a lot of spare time on their hands, and lots of linker/loader experience (or who wants to learn how the linker/loader works, which is actually a rather large, hairy OS subsystem -- so not for the weak of heart).
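For what the "just don't delete it" workaround looks like in practice, a minimal sketch (with invented names, not the actual atomspace code): hold the singleton behind a pointer that is never freed, so no dtor can run at shlib-unload time and the OS reclaims the memory at process exit.

```cpp
// Illustrative only; names are made up.
class TypeRegistry { /* ... */ };

TypeRegistry& type_registry()
{
    // Constructed on first use, intentionally leaked: no static-destruction
    // or shared-library unload-order problems, at the cost of a benign
    // "leak" reported at process exit.
    static TypeRegistry* instance = new TypeRegistry();
    return *instance;
}
```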
Why it's intermittent, I'm not sure. It's possible (for example) that the cogserver gets one more command over the network, and so it thinks it has one more thing to do, but the atomspace is already gone. I thought I'd fixed all of these, by making sure that all network connections are closed, all command buffers flushed, and all threads done and exited, before allowing the cogserver to shut down. But I dunno. Maybe I missed a spot. Again -- it's really hairy code.
Things like reworking the agent-runner/system activity table could trigger this, e.g. because you are holding a pointer to the system activity table when the singleton has already been deleted. Stuff like that. Did you ever see this bug before reworking the system activity table code?
I have so far observed 3 types of failure:

testMultiStream:
    terminate called after throwing an instance of 'concurrent_queue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*>::Canceled'
    what(): std::exception

testDrain:
    ShellUTest.cxxtest:78: Error: Expected (281 <= reso.size()), found (281 > 280)

testControlC:
    ShellUTest.cxxtest:265: Error: Expected (end - start < 2), found (42 >= 2)
I've only observed that last one on CircleCI, never locally, but unfortunately it's the most frequent cause of failure there.
Error 1 happens if some thread attempts to use the queue after it has been canceled. The queue is canceled when the network connection is being closed. Guess: maybe there is some race, with more traffic coming across the network even though it's being shut down.
Error 2: given the sizes, it would seem that the final carriage-return was not delivered. Why? Dunno. Another race.
Error 3: The control-C message should be two bytes long. Why is it 42 bytes long? Dunno .. that's a lot of bytes that should not be there. Maybe they are an error message? Bytes that should have gone to someone else? Printing them out would provide a clue.
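To make the Error-1 race concrete, here is a minimal, self-contained sketch, assuming (as the backtrace suggests) a cogutil-style concurrent_queue whose cancel() makes any further push/pop throw a Canceled exception. This is illustrative only, not the actual cogserver code; the point is that if Canceled escapes a thread uncaught, the result is exactly the "terminate called after throwing ..." seen in testMultiStream.

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

// Toy cancelable queue; cancel() makes any further push/pop throw Canceled.
template <typename T>
class concurrent_queue
{
public:
    struct Canceled {};

    void push(const T& v)
    {
        std::lock_guard<std::mutex> lk(_mtx);
        if (_canceled) throw Canceled();
        _q.push(v);
        _cv.notify_one();
    }

    T pop()
    {
        std::unique_lock<std::mutex> lk(_mtx);
        _cv.wait(lk, [this] { return _canceled || !_q.empty(); });
        if (_canceled) throw Canceled();
        T v = _q.front();
        _q.pop();
        return v;
    }

    void cancel()
    {
        std::lock_guard<std::mutex> lk(_mtx);
        _canceled = true;
        _cv.notify_all();
    }

private:
    std::queue<T> _q;
    std::mutex _mtx;
    std::condition_variable _cv;
    bool _canceled = false;
};

int main()
{
    concurrent_queue<std::string> out;

    // "Network" thread keeps queueing replies while the connection closes.
    std::thread writer([&out] {
        try {
            for (int i = 0; i < 1000; i++) out.push("reply");
        } catch (const concurrent_queue<std::string>::Canceled&) {
            // Fine if caught; std::terminate (as in testMultiStream) if not.
            std::puts("writer hit Canceled");
        }
    });

    out.cancel();   // connection closed: cancel the queue mid-traffic
    writer.join();
}
```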
You might think that someone out there would have written a generic shell server that is fast, bullet-proof, and supports login passwords and ssh, etc. Then we could replace the cogserver with that. But I was never able to find such a thing.
Not sure it's related, but I noticed that ServerSocket::SetCloseAndDelete() doesn't always have time to complete, because the CogServer shuts down (leading the unit test to halt) before that.

OK, I understand why ServerSocket::SetCloseAndDelete() doesn't always complete: the connection handler is detached from the NetworkServer at its creation, see NetworkServer::listen():

    std::thread(&ConsoleSocket::handle_connection, ss).detach();

Then, when the connection stops, the while loop in ServerSocket::handle_connection() breaks, and it eventually calls its own destruction:

    delete this;

So if the program ends before that line completes, the socket never gets fully closed. I have no idea, however, whether that could be a problem, nor whether it is related to this issue.
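To illustrate the shutdown race being described, a toy sketch with made-up names (not the real classes): nothing ever joins a detached handler thread, so the process can reach exit, and start running static destructors, while the handler is still inside its loop or its final delete this.

```cpp
#include <chrono>
#include <cstdio>
#include <thread>

struct ConsoleSocketToy
{
    void handle_connection()
    {
        // Stand-in for the read/eval/print loop on the socket.
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        std::puts("handler done, self-deleting");
        delete this;   // never reached if the process exits first
    }
};

int main()
{
    auto* ss = new ConsoleSocketToy;
    std::thread(&ConsoleSocketToy::handle_connection, ss).detach();
    // main() returns immediately; nothing waits for the detached thread,
    // so static/global destructors run while the handler may still be live.
}
```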
I've added a sleep(1) in ShellUTest::tearDown(), and obviously now ServerSocket::SetCloseAndDelete() always completes. However, I don't know whether that actually fixes the bug (I ran 27 times without a crash, but that could just be luck, since the failure is rare).
Thanks @ngeiswei !! That's a very plausible explanation. The code was carefully crafted to try to avoid exactly this kind of race, but it is sufficiently complicated that perhaps a spot was missed. There are locks and flags and semaphores that get set and double-checked, to prevent out-of-order shutdown.
The only way I know of debugging this is decidedly old-school: put prints or loggers or traces at all the places that should be passed through during an orderly shutdown, and then decode what went out of order when it doesn't.
The cogserver destructor is supposed to wait for the network server destructor to finish. The network server destructor is supposed to wait for all threads to join. The threads are not supposed to join until they are all done. So, in principle, everything should have waited for ServerSocket to finish. ...
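A minimal sketch of that intended ordering, under the assumption that the server tracks live handler threads with a counter and its destructor blocks until the count drops to zero (class and member names here are invented):

```cpp
#include <condition_variable>
#include <mutex>

// Invented names; just the shape of "wait until every handler has finished".
class NetworkServerToy
{
public:
    void handler_started()
    {
        std::lock_guard<std::mutex> lk(_mtx);
        ++_live_handlers;
    }

    void handler_finished()
    {
        std::lock_guard<std::mutex> lk(_mtx);
        --_live_handlers;
        _cv.notify_all();
    }

    ~NetworkServerToy()
    {
        // Shutdown must not proceed while any detached handler is running.
        std::unique_lock<std::mutex> lk(_mtx);
        _cv.wait(lk, [this] { return _live_handlers == 0; });
    }

private:
    std::mutex _mtx;
    std::condition_variable _cv;
    int _live_handlers = 0;
};
```

If any handler exit path skips handler_finished(), the destructor waits forever; if the count is decremented before the socket has actually finished tearing down, the destructor returns too early. Either way, a missed spot of exactly this kind could produce the observed intermittent failures.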
Testing: I've got zero failures on 30 repeats over 90 minutes of testing. This is with boost-1.62.
I'm on Debian stable, with gcc --version reporting gcc (Debian 8.3.0-6) 8.3.0, ldd --version reporting ldd (Debian GLIBC 2.28-10) 2.28, and boost version 1.62.
I'm wondering if it's a glibc bug or a boost::asio bug.
5 failures out of 22 repeats with boost-1.67 -- so I guess it's a boost bug. Will double-check with boost-1.62 ... crap. Debian no longer has boost-1.62 ... can't test.
The boost-1.67 crashes include at least 4 different kinds: corrupted size vs. prev_size, Segmentation fault, double free detected in tcache, and corrupted size vs. prev_size.
Seems that Debian ONLY has boost-1.67, and without some easy way of getting back to a working version of boost, I'm not sure what to do ...
Presumably, it's a boost::asio bug .. boost::asio has been a headache for 5+ years ... the easiest solution would be to rewrite the code to just not use it ... which maybe isn't hard ...?
OK, as a boost::asio replacement, there is just plain-old asio: in Debian, libasio-dev; otherwise https://think-async.com/Asio/ .. but it seems that boost asio is just a copy/fork of this. Given how buggy and unstable asio has been over the past 5+ years, this feels like the wrong direction to go in...
If we instead assume that asio is not buggy, and that it is instead throwing some new kind of error that we are not yet handling, then network/ServerSocket.cc is the place to look. It already handles nine different kinds of asio errors; maybe there's a new, tenth kind of error ...
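Here is a hedged sketch of that kind of error triage -- not the actual ServerSocket.cc code, just the shape of it: recognize the known error codes, and give any unrecognized ("tenth") kind a log line and a clean close instead of letting it escape the handler thread.

```cpp
#include <boost/asio.hpp>
#include <iostream>

// Returns true if the connection should be closed. Only a sketch of the
// triage; the real list of handled errors lives in network/ServerSocket.cc.
bool handle_read_error(const boost::system::error_code& ec)
{
    namespace asio_err = boost::asio::error;

    if (!ec) return false;                              // no error, keep going
    if (ec == asio_err::eof) return true;               // remote closed cleanly
    if (ec == asio_err::connection_reset) return true;  // remote went away
    if (ec == asio_err::operation_aborted) return true; // we are shutting down
    // ... the other known cases would go here ...

    // Unrecognized ("tenth kind of") error: log it so the next intermittent
    // failure leaves a clue, then close the connection anyway.
    std::cerr << "unexpected asio error: " << ec.message() << std::endl;
    return true;
}
```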
Is this issue still current?
Well, boost asio is certainly still as buggy as ever. Debian-stable is still on boost-1.67 .... I'd like to see all uses of boost in opencog removed, but it's a long, hard task...
I mean, does this issue still occur? I haven't seen it for a long time.
Well, in comment https://github.com/opencog/cogserver/issues/5#issuecomment-560156702 I put it into a loop and saw 5 crashes out of 22 runs. Nothing has changed since then ... I can try the test again. Here's my shell script:
#! /bin/bash
date
cnt=0
for i in `seq 1 1000000`;
do
    ./tests/shell/ShellUTest >> /dev/null
    # ./tests/scm/MultiThreadUTest >> /dev/null
    # ./tests/rule-engine/backwardchainer/BackwardChainerUTest >> /dev/null
    rc=$?
    if [ $rc != 0 ]; then
        cnt=$((cnt+1))
        date
        # Halt on the first crash, save the logfile
        # exit 1;
    fi
    #if [ $(($i % 10)) == 0 ]; then
        echo duude iter $i result --------- rc = $rc ncrashes = $cnt
    #fi
done
I re-ran this, and am getting 8 crashes out of 48 runs, for a failure rate of 1 in 6. This is Debian-stable, boost 1.67, gcc 8.3.0 and glibc 2.28.
Other intermittent test failures include the pattern-miner SurprisingnessUTest .. between that and the rule-engine, I have to hit the CircleCI "re-run unit tests" button two, three, or four times before I can get a run that passes all tests. The intermittent failures are very annoying ...
I get sporadic failures with ShellUTest, but have only noticed them with a Debug build.
Compiler is the standard one for Ubuntu 16.04: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
Maybe related to opencog/opencog#2542? Though a cause was found and fixed in that issue.
The output is: