Hi @DrewDevereux
Thanks for the suggestion. It doesn't solve the problem, though. From my previous digging, it is in the cppTango layer. There is some global state that isn't cleaned up/re-initialised correctly, which causes the seg fault.
The workarounds, as you know, are using process=True, or using a multiprocess runner like pytest --forked to isolate each test. However, we can still only start one Tango server per process.
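For reference, a minimal sketch of the process=True workaround applied to the same kind of repro (each context then starts its device server in a fresh subprocess, so the second start never sees the stale global state):

from tango.server import Device
from tango.test_context import DeviceTestContext

# With process=True each DeviceTestContext runs the device server in
# its own subprocess, so cppTango's globals are created from scratch.
with DeviceTestContext(Device, process=True):
    pass

with DeviceTestContext(Device, process=True):  # starts cleanly in a new process
    pass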
A simple way to cause the problem (which also fails with the code changes in this PR):
test_seg_fault.py
import tango
from tango.server import Device
from tango.test_context import DeviceTestContext

with DeviceTestContext(Device, process=False):
    pass

with DeviceTestContext(Device, process=False):  # this will crash
    pass
Trying to launch the second device causes the seg fault when the server is starting up.
The crash path is PyTango util.server_init() -> PyTango extension PyUtil::server_init() -> ... -> cppTango cppapi/server/dserver.h:117. The admin_device is null in Util::polling_configure.
I think we need the cppTango experts to come to the rescue!
More detailed stack trace (from a Conda environment):
(env-py3.8-tango9.3.4) root@75b9da6a7ea6:/opt/project/tests# gdb --args python test_seg_fault.py
GNU gdb (Debian 8.2.1-2+b3) 8.2.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...done.
(gdb) r
Starting program: /opt/conda/envs/env-py3.8-tango9.3.4/bin/python test_test_context.py
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 9205]
[New Thread 0x7ff4a65a8700 (LWP 9206)]
[New Thread 0x7ff4a5da7700 (LWP 9207)]
[New Thread 0x7ff4a5591700 (LWP 9208)]
[New Thread 0x7ff4a4d90700 (LWP 9209)]
Can't create notifd event supplier. Notifd event not available
[New Thread 0x7ff497fff700 (LWP 9210)]
[New Thread 0x7ff4977fe700 (LWP 9211)]
[New Thread 0x7ff496ffd700 (LWP 9212)]
Ready to accept request
Ready to accept request
[New Thread 0x7ff4967fc700 (LWP 9213)]
[Thread 0x7ff496ffd700 (LWP 9212) exited]
[Thread 0x7ff497fff700 (LWP 9210) exited]
[Thread 0x7ff4977fe700 (LWP 9211) exited]
[Thread 0x7ff4967fc700 (LWP 9213) exited]
[Thread 0x7ff4a5591700 (LWP 9208) exited]
[Thread 0x7ff4a4d90700 (LWP 9209) exited]
[New Thread 0x7ff4a4d90700 (LWP 9214)]
[Thread 0x7ff4a65a8700 (LWP 9206) exited]
Thread 10 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ff4a4d90700 (LWP 9214)]
0x00007ff4b5cb3182 in Tango::DServer::get_poll_th_pool_size (this=0x0) at /usr/local/src/conda/cpptango-9.3.4rc6/cppapi/server/dserver.h:117
117 /usr/local/src/conda/cpptango-9.3.4rc6/cppapi/server/dserver.h: No such file or directory.
(gdb) bt
#0 0x00007ff4b5cb3182 in Tango::DServer::get_poll_th_pool_size (this=0x0) at /usr/local/src/conda/cpptango-9.3.4rc6/cppapi/server/dserver.h:117
#1 0x00007ff4b5cb32c0 in Tango::Util::polling_configure (this=0x7ff4a00016d0) at /usr/local/src/conda/cpptango-9.3.4rc6/cppapi/server/utils_polling.cpp:65
#2 0x00007ff4b5cad126 in Tango::Util::server_init (this=0x7ff4a00016d0, with_window=false) at /usr/local/src/conda/cpptango-9.3.4rc6/cppapi/server/utils.cpp:1875
#3 0x00007ff4b65e93ff in PyUtil::server_init(Tango::Util&, bool) () from /opt/project/tango/_tango.cpython-38-x86_64-linux-gnu.so
#4 0x00007ff4b65e7d43 in boost::python::objects::caller_py_function_impl<boost::python::detail::caller<void (*)(Tango::Util&), boost::python::default_call_policies, boost::mpl::vector2<void, Tango::Util&> > >::operator()(_object*, _object*) () from /opt/project/tango/_tango.cpython-38-x86_64-linux-gnu.so
...
#102 0x000055a5aecd2db1 in PyVectorcall_Call (callable=0x7fe6f2eda800, tuple=<optimized out>, kwargs=<optimized out>) at /tmp/build/80754af9/python_1596615979580/work/Objects/call.c:199
#103 0x000055a5aedcb5de in t_bootstrap (boot_raw=0x7fe6f2e50d20) at /tmp/build/80754af9/python_1596615979580/work/Modules/_threadmodule.c:1002
#104 0x000055a5aed783e8 in pythread_wrapper (arg=<optimized out>) at /tmp/build/80754af9/python_1596615979580/work/Python/thread_pthread.h:232
#105 0x00007fe7035b2fa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#106 0x00007fe7034e34cf in clone () from /lib/x86_64-linux-gnu/libc.so.6
(gdb)
Hi @ajoubertza, just a courtesy message to say thanks for the review. I intend to follow up on this PR but it might be a few weeks before I have time.
Trying to launch the second device causes the seg fault when the server is starting up.
Hi,
cppTango does not clean up many things after the server is shut down; in particular, all the singleton instances are left intact. Attempting to re-use them (Util::instance() and the like) will return broken or half-deinitialized objects. I guess this use case was never considered in cppTango's design. Currently, after the server returns from the main loop, the process is expected to exit.
You may attempt to work around this by resetting things. I was able to re-initialize a simple C++ device server without any crashes with the following code (it's just a PoC; note that there may be all sorts of bugs and memory leaks, and it may not work with a different cppTango version due to the access to private variables).
#include <tango.h>
#include <dserverclass.h>
#include <dserversignal.h>
#include <eventsupplier.h>

// Reset functions are injected below via friend definitions.
void reset_util();
void reset_dserver();
void reset_dserver_signal();
void reset_zmq_event_supplier();

int main(int argc, char** argv)
{
    Tango::Util *util = Tango::Util::init(argc, argv);
    util->server_init(false);
    util->server_run();

    // We need to reset at least these ...
    delete util;
    reset_util();
    reset_dserver();
    reset_dserver_signal();
    reset_zmq_event_supplier();

    // Another server loop
    util = Tango::Util::init(argc, argv);
    util->server_init(false);
    util->server_run();

    return 0;
}

// Explicitly instantiating these templates with a pointer to a private
// static _instance member is allowed (access checks do not apply to
// explicit instantiation arguments); the friend definitions inject the
// reset functions declared above into the enclosing namespace, where
// they can null out the private singleton pointers.
template <auto ptr>
struct ResetUtil {
    friend void reset_util() {
        *ptr = nullptr;
    }
};

template <auto ptr>
struct ResetDServer {
    friend void reset_dserver() {
        *ptr = nullptr;
    }
};

template <auto ptr>
struct ResetDServerSignal {
    friend void reset_dserver_signal() {
        // Also free the signal handling thread object before clearing
        // the singleton pointer.
        delete (**ptr).sig_th;
        *ptr = nullptr;
    }
};

template <auto ptr>
struct ResetZmqEventSupplier {
    friend void reset_zmq_event_supplier() {
        *ptr = nullptr;
    }
};

template struct ResetUtil<&Tango::Util::_instance>;
template struct ResetDServer<&Tango::DServerClass::_instance>;
template struct ResetDServerSignal<&Tango::DServerSignal::_instance>;
template struct ResetZmqEventSupplier<&Tango::ZmqEventSupplier::_instance>;
Thanks for the detailed example, @mliszcz. I guess we can try it out on the PyTango extension and see what happens. Maybe if it works well, we can consider adding some reset methods to cppTango.
The general idea is to destroy all the singleton instances (by calling their destructors) and then reset the static members that point to the instances (because only then will they be initialized again). The problem is that sometimes the destructors are not defined, or do not release resources allocated in the constructor (e.g. DServerSignal does not stop the signal handling thread). I just had one more look at the implementation, and it seems that the signal handler thread simply cannot be stopped, as it lacks any exit handling.
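To make the reset pattern concrete, here is a minimal Python sketch of the same idea (hypothetical names, not the cppTango or PyTango API): destroy the instance and clear the static pointer, so the next instance() call rebuilds it from scratch.

class Util:
    _instance = None  # static member holding the singleton

    @classmethod
    def instance(cls):
        # A fresh object is constructed only while _instance is unset.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

def reset_util():
    # Drop the old object and clear the pointer so that the next
    # Util.instance() call returns a fresh, fully initialized one.
    Util._instance = None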
I agree that the proper solution would be to change cppTango to clean up its resources. But that would most likely break the ABI, so we can have it only in 9.4. Also, this cleanup would be hard to get right even by changing cppTango directly. Doing it from PyTango seems possible, but even harder.
I think there are separate issues of
I don't have time to chase that new bug right now, let alone progress this, so I will close this PR.
Per https://pytango.readthedocs.io/en/stable/howto.html#using-clients-with-multithreading, and in the documentation of tango.test_context.MultiDeviceTestContext, it says:

From personal experience, these segfaults in thread mode are quite common, to the extent that there isn't a lot of point in running MultiDeviceTestContext with process=False.

In thread mode, MultiDeviceTestContext uses Python's threading module and does not use the OmniThread work-around described above. Logic suggests that the segfaults that occur in threading mode might be eliminated by that workaround. This merge request implements the OmniThread workaround in MultiDeviceTestContext.
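For reference, the workaround from the linked howto wraps the body of any user-created thread in tango.EnsureOmniThread() so that the thread is registered with omniORB while it makes Tango calls. A minimal sketch (the device name sys/tg_test/1 is just a placeholder):

import threading
import tango

def thread_task():
    # Register this Python thread with omniORB for the duration of the
    # block, making Tango calls from it safe.
    with tango.EnsureOmniThread():
        dev = tango.DeviceProxy("sys/tg_test/1")  # placeholder device
        print(dev.state())

thread = threading.Thread(target=thread_task)
thread.start()
thread.join()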
The only detail to be noted is: unfortunately, I don't know how to add a unit test for this. I can't even test it informally: I have code that currently segfaults when (and only when) MultiDeviceTestContext is in process=False mode. But when I spin up a Docker image with editable installs of both tango and my repo, my code stubbornly refuses to segfault there.