Use OmniThreads in MultiDeviceTestContext

DrewDevereux commented 4 years ago

Per https://pytango.readthedocs.io/en/stable/howto.html#using-clients-with-multithreading

"When performing Tango I/O from user-created threads, there can be problems ... cppTango is currently (version 9.3.3) not directly thread safe when a user is using C++11 standard threads or threads different than omni threads. This lack of thread safety includes threads created from Python’s threading module ... The work-around when using threads which are not omni threads is to create an object of the C++ class omni_thread::ensure_self in the user thread."

and in the documentation of tango.test_context.MultiDeviceTestContext, it says:

Note: if the context will be used multiple times, it may seg fault if the thread mode is chosen.

From personal experience these segfaults in thread mode are quite common, to the extent that there isn't a lot of point in running MultiDeviceTestContext with process=False.

In thread mode, MultiDeviceTestContext uses python's threading module, and does not use the OmniThread work-around described above. Logic suggests that the segfaults that occur in threading mode might be eliminated by the workaround.

This merge request implements the OmniThread workaround into MultiDeviceTestContext.

The only detail to be noted is:

We were already throwing away our traceback object in multiprocessing mode; now we have to throw it away in both modes

Unfortunately I don't know how to add a unit test for this.

I can't even test it informally: I have code that is currently segfaulting when (and only when) MultiDeviceTestContext is in process=False mode. But when I spin up a docker image with editable installs of both tango and my repo, my code stubbornly refuses to segfault in there.

ajoubertza commented 4 years ago

Hi @DrewDevereux

Thanks for the suggestion. I doesn't solve the problem though. From my previous digging, it is in the cppTango layer. There is some global state that isn't cleaned up/re-initialised correctly which causes the seg fault.

The workaround are using process=True, as you know. Or using a multiprocess runner, like pytest --forked, to isolate each test. However, we can still only start one Tango server per process.

A simple way to cause the problem (which also fails with the code changes in this PR):

test_seg_fault.py

import tango
from tango.server import Device
from tango.test_context import DeviceTestContext

with DeviceTestContext(Device, process=False):
    pass
with DeviceTestContext(Device, process=False):  # this will crash
    pass

Trying to launch the second device causes the seg fault when the server is starting up. PyTango util.server_init() -> PyTango extension PyUtil::server_init() -> ... -> cppTango cppapi/server/dserver.h:117. The admin_device is null in Util::polling_configure.

I think we need the cppTango experts to come to the rescue!

More detailed stack trace (from a Conda environment):

(env-py3.8-tango9.3.4) root@75b9da6a7ea6:/opt/project/tests# gdb --args python test_seg_fault.py 
GNU gdb (Debian 8.2.1-2+b3) 8.2.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...done.
(gdb) r
Starting program: /opt/conda/envs/env-py3.8-tango9.3.4/bin/python test_test_context.py
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 9205]
[New Thread 0x7ff4a65a8700 (LWP 9206)]
[New Thread 0x7ff4a5da7700 (LWP 9207)]
[New Thread 0x7ff4a5591700 (LWP 9208)]
[New Thread 0x7ff4a4d90700 (LWP 9209)]
Can't create notifd event supplier. Notifd event not available
[New Thread 0x7ff497fff700 (LWP 9210)]
[New Thread 0x7ff4977fe700 (LWP 9211)]
[New Thread 0x7ff496ffd700 (LWP 9212)]
Ready to accept request
Ready to accept request
[New Thread 0x7ff4967fc700 (LWP 9213)]
[Thread 0x7ff496ffd700 (LWP 9212) exited]
[Thread 0x7ff497fff700 (LWP 9210) exited]
[Thread 0x7ff4977fe700 (LWP 9211) exited]
[Thread 0x7ff4967fc700 (LWP 9213) exited]
[Thread 0x7ff4a5591700 (LWP 9208) exited]
[Thread 0x7ff4a4d90700 (LWP 9209) exited]
[New Thread 0x7ff4a4d90700 (LWP 9214)]
[Thread 0x7ff4a65a8700 (LWP 9206) exited]

Thread 10 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ff4a4d90700 (LWP 9214)]
0x00007ff4b5cb3182 in Tango::DServer::get_poll_th_pool_size (this=0x0) at /usr/local/src/conda/cpptango-9.3.4rc6/cppapi/server/dserver.h:117
117 /usr/local/src/conda/cpptango-9.3.4rc6/cppapi/server/dserver.h: No such file or directory.
(gdb) bt
#0  0x00007ff4b5cb3182 in Tango::DServer::get_poll_th_pool_size (this=0x0) at /usr/local/src/conda/cpptango-9.3.4rc6/cppapi/server/dserver.h:117
#1  0x00007ff4b5cb32c0 in Tango::Util::polling_configure (this=0x7ff4a00016d0) at /usr/local/src/conda/cpptango-9.3.4rc6/cppapi/server/utils_polling.cpp:65
#2  0x00007ff4b5cad126 in Tango::Util::server_init (this=0x7ff4a00016d0, with_window=false) at /usr/local/src/conda/cpptango-9.3.4rc6/cppapi/server/utils.cpp:1875
#3  0x00007ff4b65e93ff in PyUtil::server_init(Tango::Util&, bool) () from /opt/project/tango/_tango.cpython-38-x86_64-linux-gnu.so
#4  0x00007ff4b65e7d43 in boost::python::objects::caller_py_function_impl<boost::python::detail::caller<void (*)(Tango::Util&), boost::python::default_call_policies, boost::mpl::vector2<void, Tango::Util&> > >::operator()(_object*, _object*) () from /opt/project/tango/_tango.cpython-38-x86_64-linux-gnu.so
...
#102 0x000055a5aecd2db1 in PyVectorcall_Call (callable=0x7fe6f2eda800, tuple=<optimized out>, kwargs=<optimized out>) at /tmp/build/80754af9/python_1596615979580/work/Objects/call.c:199
#103 0x000055a5aedcb5de in t_bootstrap (boot_raw=0x7fe6f2e50d20) at /tmp/build/80754af9/python_1596615979580/work/Modules/_threadmodule.c:1002
#104 0x000055a5aed783e8 in pythread_wrapper (arg=<optimized out>) at /tmp/build/80754af9/python_1596615979580/work/Python/thread_pthread.h:232
#105 0x00007fe7035b2fa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#106 0x00007fe7034e34cf in clone () from /lib/x86_64-linux-gnu/libc.so.6
(gdb)

DrewDevereux commented 3 years ago

Hi @ajoubertza , just a courtesy message to say thanks for the review. I intend to follow up on this PR but it might be a few weeks before I have time.

mliszcz commented 3 years ago

Trying to launch the second device causes the seg fault when the server is starting up.

Hi, cppTango is not cleaning up many things after the server is shut down, especially all the singleton instances are left intact. Attempting to re-use them (Util::instance() and the like) will return broken or half-deinitialized objects. I guess this use case was never considered in cppTango design. Currently after server returns from the main loop, the process should exit.

You may attempt to work-around this by resetting things. I was able to re-initialize a simple c++ device server without any crashes with the following code (it's just a PoC, note that there may be all sorts of bugs and memory leaks and it may not work with different cppTango version due to access to the private variables).

#include <tango.h>
#include <dserverclass.h>
#include <dserversignal.h>
#include <eventsupplier.h>

void reset_util();
void reset_dserver();
void reset_dserver_signal();
void reset_zmq_event_supplier();

int main(int argc, char** argv)
{
    Tango::Util *util = Tango::Util::init(argc, argv);
    util->server_init(false);
    util->server_run();

    // We need to reset at least these ...
    delete util;
    reset_util();
    reset_dserver();
    reset_dserver_signal();
    reset_zmq_event_supplier();

    // Another server loop
    util = Tango::Util::init(argc, argv);
    util->server_init(false);
    util->server_run();
    return 0;
}

template <auto ptr>
struct ResetUtil {
    friend void reset_util() {
        *ptr = nullptr;
    }
};

template <auto ptr>
struct ResetDServer {
    friend void reset_dserver() {
        *ptr = nullptr;
    }
};

template <auto ptr>
struct ResetDServerSignal {
    friend void reset_dserver_signal() {
        delete (**ptr).sig_th;
        *ptr = nullptr;
    }
};

template <auto ptr>
struct ResetZmqEventSupplier {
    friend void reset_zmq_event_supplier() {
        *ptr = nullptr;
    }
};

template struct ResetUtil<&Tango::Util::_instance>;
template struct ResetDServer<&Tango::DServerClass::_instance>;
template struct ResetDServerSignal<&Tango::DServerSignal::_instance>;
template struct ResetZmqEventSupplier<&Tango::ZmqEventSupplier::_instance>;

ajoubertza commented 3 years ago

Thanks for the detailed example, @mliszcz. I guess we can try it out on the PyTango extension and see what happens. Maybe if it works well, we can consider adding some reset methods to cppTango.

mliszcz commented 3 years ago

Thanks for the detailed example, @mliszcz. I guess we can try it out on the PyTango extension and see what happens. Maybe if it works well, we can consider adding some reset methods to cppTango.

The general idea is to destroy all the singleton instances (by calling destructor) and then reset the static member that points to the instance (because only then we will initialize them again). The problem is that sometimes the destructors are not defined or do not release resources allocated in constructor (e.g. DServerSignal does not stop signal handling thread). I just had one more look at the implementation and it seems that signal handler thread simply cannot be stopped as it lacks any exit handling.

I agree that the proper solution would be to change cppTango to clean-up its resources. But this most likely will break ABI and we can have that only in 9.4. Also, this cleanup would be hard to get right even by changing cppTango directly. Doing this from PyTango seems possible but even harder.

DrewDevereux commented 3 years ago

I think there are separate issues of

segfaults when creating MDTC twice in a single process
segfaults when using process=False, even when only run once in a process This was targetted at the latter. I tried to create a stress test that would illustrate the targetted problem, and perhaps show whether this PR fixes it, but I ran into a new bug, related to the keepalive thread.

I don't have time to chase that new bug right now, let alone progress this, so I will close this PR.

tango-controls / pytango

Use OmniThreads in MultiDeviceTestContext #387