Hi Zibi,
Have you tried calling the content of the `__del__` method (unsubscribe events) explicitly from the thread, before `__del__` is called? Do you get the same error?
It is not defined when the GC calls `__del__`, so it is better not to place any serious code there.
I've started looking at this. Don't know what the cause of the crash is yet, but I don't think `unsubscribe_event` would be executed in Python 2. The `PyTangoDevice` and `PyTangoAttribute` create a circular reference. Prior to Python 3.4, unreachable objects with `__del__` methods that are in a reference cycle are not released (they are "uncollectable"), so the `__del__` method isn't called. Newer versions of Python will release those objects and call `__del__`.
Python 2.7 (`__del__` not called, and weakref is still alive at the end):
$ python
Python 2.7.6 (default, Nov 13 2018, 12:45:42)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import gc
>>> import weakref
>>>
>>> class PyTangoDevice(object):
...     def __init__(self, name):
...         print("PyTangoDevice.__init__")
...         self._state = self.getAttribute("state")
...     def getAttribute(self, name):
...         return PyTangoAttribute(name, self)
...
>>> class PyTangoAttribute(object):
...     def __init__(self, name, dev):
...         print("PyTangoAttribute.__init__")
...         self._dev = dev
...     def __del__(self):
...         print("PyTangoAttribute.__del__")
...
>>> dev = PyTangoDevice('abc')
PyTangoDevice.__init__
PyTangoAttribute.__init__
>>> wr = weakref.ref(dev)
>>> wr
<weakref at 0x7f3842fb4f18; to 'PyTangoDevice' at 0x7f3842fcb890>
>>> dev = None
>>> wr
<weakref at 0x7f3842fb4f18; to 'PyTangoDevice' at 0x7f3842fcb890>
>>> gc.collect()
4
>>> gc.garbage
[<__main__.PyTangoAttribute object at 0x7f3842fcb910>]
>>> wr
<weakref at 0x7f3842fb4f18; to 'PyTangoDevice' at 0x7f3842fcb890>
>>>
Python 3.4 (`__del__` is called, and weakref is dead at the end):
$ python3
Python 3.4.3 (default, Nov 12 2018, 22:25:49)
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import gc
>>> import weakref
>>>
>>> class PyTangoDevice(object):
...     def __init__(self, name):
...         print("PyTangoDevice.__init__")
...         self._state = self.getAttribute("state")
...     def getAttribute(self, name):
...         return PyTangoAttribute(name, self)
...
>>> class PyTangoAttribute(object):
...     def __init__(self, name, dev):
...         print("PyTangoAttribute.__init__")
...         self._dev = dev
...     def __del__(self):
...         print("PyTangoAttribute.__del__")
...
>>> dev = PyTangoDevice('abc')
PyTangoDevice.__init__
PyTangoAttribute.__init__
>>> wr = weakref.ref(dev)
>>> wr
<weakref at 0x7ffa87a6bc28; to 'PyTangoDevice' at 0x7ffa87a7c048>
>>> dev = None
>>> wr
<weakref at 0x7ffa87a6bc28; to 'PyTangoDevice' at 0x7ffa87a7c048>
>>> gc.collect()
PyTangoAttribute.__del__
4
>>> gc.garbage
[]
>>> wr
<weakref at 0x7ffa87a6bc28; dead>
>>>
Hi @reszelaz,
Looking at the backtrace in the https://github.com/reszelaz/test-tango-py3 README, it looks like an exception is thrown when trying to acquire a Tango monitor in your case ("Not able to acquire serialization (dev, class or process) monitor"). Only zmq::error_t exceptions seem to be caught in this case in the DelayEvent constructor. We could probably also catch errors related to Tango subscription monitor acquisition, and then it would probably not crash, but one question remains: why do you get this timeout when trying to acquire this subscription monitor in your case?
First of all, thanks @jkotan, @ajoubertza and @bourtemb for looking into this! Also sorry for the long silence from my side - this was due to holidays and also the rush to finish the migration of Sardana to Py3.
Have you tried to call the content of the `__del__` method (unsubscribe events) explicitly from the thread before `__del__` is called? Do you get the same error?

I have done explicit unsubscribes and, from my tests, it looks like the problem disappears. You can see the corresponding code in tango-test-py3/explicit_unsub. This avoids the unsubscription in the `__del__` method.

It is not defined when gc calls `__del__` so it is better not to place there any serious code.

Taurus attributes subscribe to configuration events in `__init__` and unsubscribe in `__del__`. These are singleton objects kept in the factory until someone uses them. Whenever they are not needed, they should be disposable by the GC. Maybe Python context management could be an alternative for that, but I imagine this to be a bigger refactoring. Anyway, I pass this suggestion to @cpascual so he can evaluate it. Thanks!
why do you get this timeout when trying to acquire this subscription monitor in your case?
We observe this problem when iterating over attributes and continuously subscribing and unsubscribing from events. The thing is that the unsubscription is triggered by the garbage collector. I suspect these unsubscriptions somehow collide with the subscriptions of the other attributes.
Thanks @bourtemb and @reszelaz for working on this today.
As promised, here is a summary of our findings:
TAURUS #3 subscribe_event, attr state, thread: <JobThread(Thread-3, started daemon 140624837715712)>
DBG Locking DelayEvent #2: thread: 0
DBG:3795 get_monitor try
DBG:3797 get_monitor ok
TAURUS #4 unsubscribe_event, id 542, thread: <JobThread(Thread-3, started daemon 140624837715712)>
DBG Locking DelayEvent #3: thread: 0x7fe5a8045350
DBG:3795 get_monitor try
DBG: DelayEvent: Not able to acquire serialization (dev, class or process) monitor
Tango exception
Severity = ERROR
Error reason = API_CommandTimedOut
Desc : Not able to acquire serialization (dev, class or process) monitor
Origin : TangoMonitor::get_monitor
DBG Locking DelayEvent #3: thread: 0x7fe5a8045350
subscribe_event is not finished yet but unsubscribe_event is called from Taurus. unsubscribe_event fails to acquire monitor (in DelayEvent). Both subscribe and unsubscribe are called from the same python Thread object. During subscribe, no omni_thread is configured.
Hypothesis 1: unsubscribe_event is called at some point due to a GC run in a dedicated thread (but an incorrect thread name is printed). Maybe the GC thread is starved until the monitor timeout expires.
Hypothesis 2: there is a deadlock in the monitor code, since it assumes that the threads are omni-orb threads. Two different non-omni-orb threads are considered by the monitor to be the same thread.
Anyway, calling subscribe/unsubscribe_event from non-omni-orb threads (like Python threads) is not supported by the cppTango monitor (we should probably open an issue for fixing it / printing a warning / documenting it).
With the patch below I am unable to reproduce the issue. It's not complete: you should still delete the `ensure_self` object when the thread terminates. The proposed solution is to expose a similar API in PyTango.
```diff
diff --git a/taurus/PyDsExpClient.py b/taurus/PyDsExpClient.py
index a88a924..7c2c59c 100644
--- a/taurus/PyDsExpClient.py
+++ b/taurus/PyDsExpClient.py
@@ -7,6 +7,20 @@ from taurus.core.tango import TangoDevice
 import PyTango
 
+import cffi
+ffibuilder = cffi.FFI()
+ffibuilder.cdef("void* omni_thread_ensure_self();")
+lib = ffibuilder.verify(r"""
+    #include <omnithread.h>
+    #include <cstdio>
+    void* omni_thread_ensure_self()
+    {
+        return new omni_thread::ensure_self;
+    }
+    """,
+    source_extension='.cpp',
+    libraries=["omnithread"]
+)
 
 DEV_NAME_PATTERN = "test/pydsexp/{}"
@@ -30,6 +44,7 @@ class JobThread(threading.Thread):
         self.dev = dev
 
     def run(self):
+        es = lib.omni_thread_ensure_self()
         for i in range(100):
             if self.dev and self.dev._stop_flag:
                 break
```
Thanks to you two for investigating this!
I just tried to run the `JobThread.run` method directly in the Start command (without starting a new thread) and the problem disappears. So, if subscribe/unsubscribe are called from the omniORB thread, everything works. I think it also discards Hypothesis 2, because if the finalizer `__del__` were called from another thread, it would still need to crash. Do you agree?
Then a question to the PyTango maintainers: could we have a convenient API for calling omni_thread_ensure_self?
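For reference, the convenient API that PyTango later exposed for this is a context manager, `tango.EnsureOmniThread` (it is used further down in this thread). A minimal usage sketch, with an example device and attribute name (not taken from this issue):

```python
import threading
import tango

def worker():
    # Keep the dummy omni thread alive for the whole lifetime of this
    # Python thread, so cppTango sees a stable omni thread ID.
    with tango.EnsureOmniThread():
        dp = tango.DeviceProxy("sys/tg_test/1")   # example TangoTest device
        cb = tango.utils.EventCallback()          # prints received events
        eid = dp.subscribe_event("double_scalar",
                                 tango.EventType.CHANGE_EVENT, cb)
        # ... do some work while events arrive ...
        dp.unsubscribe_event(eid)

t = threading.Thread(target=worker)
t.start()
t.join()
```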
@bourtemb
Yesterday we talked about adding `omni_orb::ensure_self es{}` in cppTango at the top of `subscribe_event()` and `unsubscribe_event()`. This will result in a different omni thread ID being assigned to the same physical thread during each call to one of those methods. The problem is that e.g. TangoMonitor relies on the omni thread ID to detect re-entrant locking, so we may trigger some unexpected side effects.
How about:
1. creating a thread-local `omni_orb::ensure_self` object at global scope in `devapi_base.cpp` or `event.cpp`, or as a static member of DeviceProxy:

namespace Tango
{
thread_local omni_thread::ensure_self global_omni_thread_self{};
// additionally may be declared 'extern' in some header file
// ...

2. accessing this object at the very beginning of each DeviceProxy method to ensure that it gets created:

void EventConsumer::unsubscribe_event(int event_id)
{
    global_omni_thread_self; // new, in each method
    // continue as usual

I think this should solve this issue (without any changes in PyTango) as well as any issues you experience with C++11 threads.
@bourtemb another option is to gradually remove the dependency on the omni_thread API, with C++11 threads or some Tango-specific class wrapping the low-level threading interface. For instance, replace `omni_thread::self()->id()` with:
#ifdef LINUX
pthread_t id = pthread_self();
#else
DWORD id = GetCurrentThreadId();
#endif
@bourtemb Yesterday we talked about adding `omni_orb::ensure_self es{}` in cppTango at the top of `subscribe_event()` and `unsubscribe_event()`. This will result in a different omni thread ID being assigned to the same physical thread during each call to one of those methods. The problem is that e.g. TangoMonitor relies on the omni thread ID to detect re-entrant locking, so we may trigger some unexpected side effects.
I agree with you. I think we should not do that.
How about:
1. creating thread-local `omni_orb::ensure_self` object at global scope in `devapi_base.cpp` or `event.cpp` or as a static member of DeviceProxy:

```c++
namespace Tango
{
thread_local omni_thread::ensure_self global_omni_thread_self{};
// additionally may be declared 'extern' in some header file
// ...
```
This thread_local feature might be very useful indeed in our use case.
2. access this object at the very beginning of each DeviceProxy method to ensure that it gets created:

```c++
void EventConsumer::unsubscribe_event(int event_id)
{
    global_omni_thread_self; // new, in each method
    // continue as usual
```
I think this should solve this issue (without any changes in PyTango) as well as any issues you experience with C++11 threads.
I think it's a good idea. It might actually be enough to declare it at global scope and to initialize it there. It is maybe not needed to add this `global_omni_thread_self;` line at the beginning of each method: as long as the ensure_self object is created for the lifetime of the current thread, I think it's fine, and if you declare and initialize it at global scope, it might be OK...
Tests are needed to verify the behaviour.
@bourtemb another option is to gradually remove the dependency on the omni_thread API, with C++11 threads or some Tango-specific class wrapping the low-level threading interface. For instance, replace `omni_thread::self()->id()` with:
#ifdef LINUX
pthread_t id = pthread_self();
This works when using pthreads, but would this work with standard C++11 threads?
I think it's a good idea. It might actually be enough to declare it at global scope and to initialize it there. It is maybe not needed to add this `global_omni_thread_self;` line at the beginning of each method: as long as the ensure_self object is created for the lifetime of the current thread, I think it's fine, and if you declare and initialize it at global scope, it might be OK... Tests are needed to verify the behaviour.
According to this SO question: https://stackoverflow.com/questions/24253584/when-is-a-thread-local-global-variable-initialized you must explicitly access the global variable from each thread to ensure that it gets created, i.e. the standard only requires that it must be created before first use.
#ifdef LINUX pthread_t id = pthread_self();
This works when using pthreads, but would this work with standard C++11 threads?
C++11 threads are just convenient wrappers around basic threading libraries. On linux systems it is probably always pthreads (not sure about BSD, Solaris and others), e.g. on Ubuntu 18.04:
/usr/include/c++/7/thread 74: typedef __gthread_t native_handle_type;
/usr/include/x86_64-linux-gnu/c++/7/bits/gthr-posix.h 47:typedef pthread_t __gthread_t;
See e.g. an example how to use pthread API to change priority of C++11 thread: https://en.cppreference.com/w/cpp/thread/thread/native_handle
The idea is to have a simple class like TangoThreadId, which will store the native thread id during construction and overload operator== for comparison. But I'm not sure if, without creating "ensure_self", you can access omni_mutex or omni_condition.
Let me answer myself:
But I'm not sure if, without creating "ensure_self", you can access omni_mutex or omni_condition.
class _OMNITHREAD_NTDLL_ omni_mutex {
public:
omni_mutex();
~omni_mutex();
inline void lock() { OMNI_MUTEX_LOCK_IMPLEMENTATION }
inline void unlock() { OMNI_MUTEX_UNLOCK_IMPLEMENTATION }
...
#define OMNI_MUTEX_LOCK_IMPLEMENTATION \
pthread_mutex_lock(&posix_mutex);
#define OMNI_MUTEX_TRYLOCK_IMPLEMENTATION \
return !pthread_mutex_trylock(&posix_mutex);
Similar for NT threads.
So we need omni-thread friendly threads (ensure_self) probably only for ID access and thread local storage.
Hi Michal,
I finally could try your patch proposed in https://github.com/tango-controls/pytango/issues/292#issuecomment-540334032. I tried it in the Sardana code (this is where we originally found the issue) and not in the simplified example from the initial comment of this issue.
I tried it in two different places:
es = lib.omni_thread_ensure_self() in taurus.core.util.threadpool.Worker. Sardana just uses this kind of worker threads and macros are actually executed by them.
Note that the second scenario may cause this line to be executed more than once by the same thread. The first one does not have this risk.
Then of course I reverted our workaround commit so we again use disposable `taurus.Attribute` objects. The workaround was basically that we were directly using disposable PyTango proxies (this avoids the subscription to the configuration events done by Taurus).
Unfortunately our test suite hangs using this workaround - I ran it 5 times, always with the same result. The same test suite without doing anything works correctly (we have run it for months in CI and I also repeated it now on my testing platform). Also, if I just apply your code (es = lib.omni_thread_ensure_self()) but do not revert our workaround commit, the test suite works correctly.
Just to complement the previous comment, here is the branch on which I tried it. The last commit applies the workaround as explained in point 2. The previous commit is the revert of our workaround.
Hi @reszelaz, thanks for the detailed information!
I'm trying to reproduce the issue using the Sardana test suite but I do not see any crash or deadlock. I have: cppTango (tango-9-lts), pytango (v9.3.1 from git), itango (latest from pypi), taurus (latest from pypi), sardana (branch workaround_pytango292 from your fork), all running on Python 3. I have started Pool (demo1), MacroServer (demo1), ran sar_demo from spock and then ran the sardanatestsuite script. Some tests are failing, which is probably a misconfiguration in my environment, but other than that no process crashed or hung.
Ran 307 tests in 513.127s
FAILED (errors=41, skipped=13, expected failures=2)
Is this the correct way to reproduce the issue? Which test case from Sardana test suite shows the problem?
I will answer with more details on Monday (I'm out of the office now). Similar problems happen with #318. I investigated a little bit and I suspect that the events stop working at some point. I was even able to reproduce it without running the whole test suite. We may be hitting another bug. But don't worry, I will keep you posted with something easier to reproduce (hopefully next week). And many thanks for trying it yourself!
Hi again, sorry for the long silence! I just checked the new PR #327 proposed by @ajoubertza and the behavior is exactly the same as with the initial workaround proposed by @mliszcz and #318 (now rejected in favor of #327). As I said previously, I was able to reproduce it with just a subset of tests - the whole sardanatestsuite is not necessary. So, the said tests just hang:
zreszela@pc255:~/workspace/sardana (workaround_pytango292)> python3 -m unittest sardana.taurus.core.tango.sardana.test.test_pool.TestMeasurementGroup
MainThread WARNING 2020-02-05 11:38:57,014 TaurusRootLogger: <frozen importlib._bootstrap>:222: DeprecationWarning: taurus.external.unittest is deprecated since 4.3.2. Use unittest instead
MainThread INFO 2020-02-05 11:38:57,319 Starter: Starting server Pool/unittest1_1...
/home/zreszela/workspace/taurus/lib/taurus/core/tango/starter.py:107: ResourceWarning: unclosed file <_io.BufferedWriter name='/dev/null'>
self.start()
MainThread INFO 2020-02-05 11:39:00,352 Starter: Server Pool/unittest1_1 has been started
MainThread WARNING 2020-02-05 11:39:02,097 TaurusRootLogger: /home/zreszela/workspace/sardana/src/sardana/taurus/core/tango/sardana/pool.py:220: DeprecationWarning: _get_value is deprecated since 4.0. Use .rvalue instead
v = evt_value.value
MainThread INFO 2020-02-05 11:39:02,106 pc255.cells.es:10000.b95a5064-4803-11ea-bc3a-f8b156a48f79: Configuration changed
MainThread WARNING 2020-02-05 11:39:02,107 TaurusRootLogger: TangoAttributeNameValidator.getParams() is deprecated. Use getUriGroups() instead.
MainThread INFO 2020-02-05 11:39:02,485 Starter: Deleted device pool/demo1/2
MainThread INFO 2020-02-05 11:39:02,485 Starter: Hard killing server Pool/unittest1_1...
MainThread INFO 2020-02-05 11:39:02,496 Starter: Server Pool/unittest1_1 has been stopped
MainThread INFO 2020-02-05 11:39:02,499 Starter: Deleted Server Pool/unittest1_1
.MainThread INFO 2020-02-05 11:39:02,509 Starter: Starting server Pool/unittest1_1...
MainThread INFO 2020-02-05 11:39:05,522 Starter: Server Pool/unittest1_1 has been started
MainThread INFO 2020-02-05 11:39:06,877 pc255.cells.es:10000.bc3c3770-4803-11ea-bc3a-f8b156a48f79: Configuration changed
MainThread WARNING 2020-02-05 11:39:06,879 TaurusRootLogger: TangoAttributeNameValidator.getParams() is deprecated. Use getUriGroups() instead.
MainThread WARNING 2020-02-05 11:39:06,897 TaurusRootLogger: TangoAttributeNameValidator.getParams() is deprecated. Use getUriGroups() instead.
MainThread INFO 2020-02-05 11:39:07,234 Starter: Deleted device pool/demo1/2
MainThread INFO 2020-02-05 11:39:07,235 Starter: Hard killing server Pool/unittest1_1...
MainThread INFO 2020-02-05 11:39:07,246 Starter: Server Pool/unittest1_1 has been stopped
MainThread INFO 2020-02-05 11:39:07,249 Starter: Deleted Server Pool/unittest1_1
.MainThread INFO 2020-02-05 11:39:07,258 Starter: Starting server Pool/unittest1_1...
MainThread INFO 2020-02-05 11:39:09,268 Starter: Server Pool/unittest1_1 has been started
MainThread INFO 2020-02-05 11:39:10,640 pc255.cells.es:10000.be73242c-4803-11ea-bc3a-f8b156a48f79: Configuration changed
MainThread WARNING 2020-02-05 11:39:10,642 TaurusRootLogger: TangoAttributeNameValidator.getParams() is deprecated. Use getUriGroups() instead.
MainThread WARNING 2020-02-05 11:39:10,662 TaurusRootLogger: TangoAttributeNameValidator.getParams() is deprecated. Use getUriGroups() instead.
MainThread INFO 2020-02-05 11:39:11,030 Starter: Deleted device pool/demo1/2
MainThread INFO 2020-02-05 11:39:11,030 Starter: Hard killing server Pool/unittest1_1...
MainThread INFO 2020-02-05 11:39:11,039 Starter: Server Pool/unittest1_1 has been stopped
MainThread INFO 2020-02-05 11:39:11,041 Starter: Deleted Server Pool/unittest1_1
.MainThread INFO 2020-02-05 11:39:11,049 Starter: Starting server Pool/unittest1_1...
MainThread INFO 2020-02-05 11:39:14,062 Starter: Server Pool/unittest1_1 has been started
MainThread INFO 2020-02-05 11:39:15,449 pc255.cells.es:10000.c158a87e-4803-11ea-bc3a-f8b156a48f79: Configuration changed
MainThread WARNING 2020-02-05 11:39:15,451 TaurusRootLogger: TangoAttributeNameValidator.getParams() is deprecated. Use getUriGroups() instead.
MainThread WARNING 2020-02-05 11:39:15,466 TaurusRootLogger: TangoAttributeNameValidator.getParams() is deprecated. Use getUriGroups() instead.
MainThread INFO 2020-02-05 11:39:15,898 Starter: Deleted device pool/demo1/2
MainThread INFO 2020-02-05 11:39:15,898 Starter: Hard killing server Pool/unittest1_1...
MainThread INFO 2020-02-05 11:39:15,910 Starter: Server Pool/unittest1_1 has been stopped
MainThread INFO 2020-02-05 11:39:15,913 Starter: Deleted Server Pool/unittest1_1
.MainThread INFO 2020-02-05 11:39:15,921 Starter: Starting server Pool/unittest1_1...
MainThread INFO 2020-02-05 11:39:18,936 Starter: Server Pool/unittest1_1 has been started
MainThread INFO 2020-02-05 11:39:20,318 pc255.cells.es:10000.c43f79a0-4803-11ea-bc3a-f8b156a48f79: Configuration changed
MainThread WARNING 2020-02-05 11:39:20,320 TaurusRootLogger: TangoAttributeNameValidator.getParams() is deprecated. Use getUriGroups() instead.
MainThread WARNING 2020-02-05 11:39:20,336 TaurusRootLogger: TangoAttributeNameValidator.getParams() is deprecated. Use getUriGroups() instead.
Dummy-1 INFO 2020-02-05 11:39:34,079 pc255.cells.es:10000.c43f79a0-4803-11ea-bc3a-f8b156a48f79.configuration: Activating polling. Reason: API_EventTimeout
As we can see, at the very end, the client (python -m unittest ....) receives "API_EventTimeout".
Before continuing, let me explain a little bit what this TestMeasurementGroup does. There is just one test definition and we execute it 6 times with different parameters (different experimental channels in the MeasurementGroup). Every test execution creates a Pool device server (until there is a problem with one of the tests it will always use the same instance name: unittest1_1) and populates it with test devices. Then it creates a MeasurementGroup with the experimental channels specified as parameters and makes an acquisition. All of this is repeated 6 times. But it hangs before finishing...
When the tests are hung, the Pool device server apparently does not hang. Its devices respond to the state queries (also from the MeasurementGroup device).
So, now I use tango.EnsureOmniThread() applied in Worker.run:
def run(self):
    with tango.EnsureOmniThread():
        get = self.pool.jobs.get
        while True:
            cmd, args, kw, callback, th_id, stack = get()
            if cmd:
                self.busy = True
                self.cmd = cmd.__name__
                try:
                    if callback:
                        callback(cmd(*args, **kw))
                    else:
                        cmd(*args, **kw)
                except:
                    orig_stack = "".join(format_list(stack))
                    self.error("Uncaught exception running job '%s' called "
                               "from thread %s:\n%s",
                               self.cmd, th_id, orig_stack, exc_info=1)
                finally:
                    self.busy = False
                    self.cmd = ''
            else:
                self.pool.workers.remove(self)
                return
We also tried this workaround at one of the beamlines with much more complicated macros than these tests. The result was that the Sardana servers were hanging after prior reports of serialization monitor errors.
Hi @reszelaz. This is quite a difficult problem to solve!
The Taurus thread pool implementation looks correct. I searched the Sardana code for "thread" and I do see a few other places where standard `threading.Thread` objects are created, e.g. `PoolMonitor` and `DummyEventSource`. Could any other threads (not using the Taurus thread pool) be involved in the code being run in this test? Maybe there are others that need `EnsureOmniThread`.
Thanks Anton for looking into it again! I have advanced a little bit with the debugging, let's see if the following information is helpful to you...
Actually the tests (client side) that I use to reproduce this issue don't use threads. The server (Pool) which is used in these tests uses threads, but I suppose it does not matter. The `PoolMonitor` thread is started in the Pool but then it is immediately suspended. There are threads launched in the acquisition process. `DummyEventSource` is a helper class but it is not used in these tests. So, IMO, the `tango.EnsureOmniThread()` in the Taurus thread pool is not necessary. I have performed tests with and without it and the results are the same.
I have put some debug prints in PyTango: `__DeviceProxy__subscribe_event_attrib` and `__DeviceProxy__unsubscribe_event`. Curiously, the `API_EventTimeout` happens after `__del__` (and the unsubscribes) are called by the garbage collector in the middle of an event subscription (in this example, while subscribing to the NbStarts attribute configuration event, unsubscribe is called on the configuration and change events of the state attributes of two other devices):
test_count_5 (sardana.taurus.core.tango.sardana.test.test_pool.TestMeasurementGroup)
count with Tango attribute ...
MainThread INFO 2020-02-07 15:27:37,592 Starter: Starting server Pool/unittest1_1...
MainThread INFO 2020-02-07 15:27:39,602 Starter: Server Pool/unittest1_1 has been started
__DeviceProxy.subscribe_event_attrib(state): before self.__subscribe_event
__DeviceProxy.subscribe_event_attrib(state): after self.__subscribe_event; id = 63
__DeviceProxy.subscribe_event_attrib(state): before self.__subscribe_event
__DeviceProxy.subscribe_event_attrib(state): after self.__subscribe_event; id = 64
__DeviceProxy.subscribe_event_attrib(configuration): before self.__subscribe_event
__DeviceProxy.subscribe_event_attrib(configuration): after self.__subscribe_event; id = 65
__DeviceProxy.subscribe_event_attrib(configuration): before self.__subscribe_event
MainThread INFO 2020-02-07 15:27:40,931 pc255.cells.es:10000.ff4670cc-49b5-11ea-aa43-f8b156a48f79: Configuration changed
__DeviceProxy.subscribe_event_attrib(configuration): after self.__subscribe_event; id = 66
MainThread WARNING 2020-02-07 15:27:40,933 TaurusRootLogger: TangoAttributeNameValidator.getParams() is deprecated. Use getUriGroups() instead.
__DeviceProxy.subscribe_event_attrib(state): before self.__subscribe_event
__DeviceProxy.subscribe_event_attrib(state): after self.__subscribe_event; id = 67
__DeviceProxy.subscribe_event_attrib(state): before self.__subscribe_event
__DeviceProxy.subscribe_event_attrib(state): after self.__subscribe_event; id = 68
MainThread WARNING 2020-02-07 15:27:40,948 TaurusRootLogger: TangoAttributeNameValidator.getParams() is deprecated. Use getUriGroups() instead.
__DeviceProxy.subscribe_event_attrib(state): before self.__subscribe_event
__DeviceProxy.subscribe_event_attrib(state): after self.__subscribe_event; id = 69
__DeviceProxy.subscribe_event_attrib(state): before self.__subscribe_event
__DeviceProxy.subscribe_event_attrib(state): after self.__subscribe_event; id = 70
__DeviceProxy.subscribe_event_attrib(IntegrationTime): before self.__subscribe_event
__DeviceProxy.subscribe_event_attrib(IntegrationTime): after self.__subscribe_event; id = 71
__DeviceProxy.subscribe_event_attrib(IntegrationTime): before self.__subscribe_event
__DeviceProxy.subscribe_event_attrib(IntegrationTime): after self.__subscribe_event; id = 72
__DeviceProxy.subscribe_event_attrib(Moveable): before self.__subscribe_event
__DeviceProxy.subscribe_event_attrib(Moveable): after self.__subscribe_event; id = 73
__DeviceProxy.subscribe_event_attrib(Moveable): before self.__subscribe_event
__DeviceProxy.subscribe_event_attrib(Moveable): after self.__subscribe_event; id = 74
__DeviceProxy.subscribe_event_attrib(NbStarts): before self.__subscribe_event
__DeviceProxy__unsubscribe_event(67): before self.__unsubscribe_event
__DeviceProxy__unsubscribe_event(67): after self.__unsubscribe_event
__DeviceProxy__unsubscribe_event(68): before self.__unsubscribe_event
__DeviceProxy__unsubscribe_event(68): after self.__unsubscribe_event
__DeviceProxy__unsubscribe_event(69): before self.__unsubscribe_event
__DeviceProxy__unsubscribe_event(69): after self.__unsubscribe_event
__DeviceProxy__unsubscribe_event(70): before self.__unsubscribe_event
__DeviceProxy__unsubscribe_event(70): after self.__unsubscribe_event
__DeviceProxy.subscribe_event_attrib(NbStarts): after self.__subscribe_event; id = 75
__DeviceProxy.subscribe_event_attrib(NbStarts): before self.__subscribe_event
__DeviceProxy.subscribe_event_attrib(NbStarts): after self.__subscribe_event; id = 76
Dummy-1 INFO 2020-02-07 15:27:56,918 pc255.cells.es:10000.ff4670cc-49b5-11ea-aa43-f8b156a48f79.configuration: Activating polling. Reason: API_EventTimeout
From the output of all the tests you can see that in the case of the previous tests the garbage collection does not happen while we subscribe. I also attach a backtrace of all threads from the hung test process.
You should be able to reproduce it very easily. It happens almost always when running:
python3 -m unittest sardana.taurus.core.tango.sardana.test.test_pool.TestMeasurementGroup -v
However, sometimes it luckily finishes even after throwing some API_EventTimeouts, but I would say that about 80% of the time it hangs the test process.
It is enough to use:
I have a vague idea of what is happening. I think that three attributes are involved in this issue; let's call them attr1, attr2 and attr3. And the flow of actions is:
Do you think this may cause troubles?
Hi Zibi,
have you tried to put `with tango.EnsureOmniThread():` into the `__del__` method? It is executed in a different thread.
Hi Jan!
Thanks for the hint! I will check it on Monday. Initially I discarded this because the thread that executes `push_event` comes from Tango and I was assuming it is an omni thread. I will keep you updated!
Cheers!
I was also thinking about which thread the `__del__` methods will be called in. I wanted to modify your test script to include the thread name in the subscription and unsubscription print messages, but I couldn't get a Docker environment set up that ran tests at all.
Instead, I made a simple test script (no Tango code involved) that creates objects with reference cycles from a number of threads, with random sleeps in between. The `__del__` methods get called from random threads, in batches. Only sometimes is the deletion done in the same thread that created the object. Python is free to call the GC after any bytecode operation.
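For illustration, a minimal sketch of that kind of experiment (not the actual script used here; names and counts are made up). Each thread creates cyclic garbage and ordinary allocations, and the `__del__` calls end up on whichever thread happens to trigger a collection:

```python
import random
import threading
import time

class Cyclic:
    """Object in a reference cycle, so only the cyclic GC can free it."""
    def __init__(self):
        self.self_ref = self
    def __del__(self):
        print("__del__ on", threading.current_thread().name)

def worker():
    for _ in range(200):
        Cyclic()                         # becomes cyclic garbage immediately
        junk = [[] for _ in range(50)]   # allocations that may trigger the GC
        time.sleep(random.uniform(0, 0.005))

threads = [threading.Thread(target=worker, name="worker-%d" % i)
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```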
There's an interesting PEP related to this that hasn't been implemented yet (deferred state): https://www.python.org/dev/peps/pep-0556/ They talk about reentrancy being a problem. Even better is this blog article: https://codewithoutrules.com/2017/08/16/concurrency-python/ which warns against doing too much in `__del__`.
have you tried to put `with tango.EnsureOmniThread():` into the `__del__` method?

This might be dangerous, since the dummy omniORB ID will only last as long as the `__del__` method. If the same thread calls `__del__` again later it would get a new ID, which could affect the locks in cppTango that are based on the omniORB thread ID.
You can also use the new `tango.is_omni_thread()` if you just want to check what kind of thread it is.
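For example, assuming a PyTango version that provides both helpers, a quick check from a plain Python thread could look like this:

```python
import threading
import tango

def check():
    print("omni thread?", tango.is_omni_thread())      # typically False here
    with tango.EnsureOmniThread():
        print("omni thread?", tango.is_omni_thread())  # True inside the context

threading.Thread(target=check).start()
```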
@ajoubertza, thanks for the nice readings. It looks like the second article describes what is going on with Taurus/Sardana in a simple way. In py2, because of the cyclic dependencies, the content of `__del__` is not called. Contrary to py2, py3 with its gc can remove objects with cyclic dependencies, but it is done asynchronously (in a "random" thread). It would be nice to find and remove those cyclic dependencies (if they exist) but it does not guarantee to solve the problem. Thus one could:
- move the `__del__` content to another method in order not to call unsubscribe from the gc (my solution from July '19, but it requires refactoring of the taurus/sardana code)
- use `tango.EnsureOmniThread()` and `tango.is_omni_thread()` (may be "dangerous")
- make `tango.EnsureOmniThread()` work always with any thread (this is complicated, e.g. it may require static bookkeeping metainfo about omni threads)

Does anyone know any other option?
Another option is not to unsubscribe at all when your client objects are finalised - that is what was happening under Python 2, since `__del__` wasn't called. Won't the device eventually give up sending the events if the device proxy that subscribed is no longer connected?
Hi all,
Thanks Jan and Anton for investigating this issue. I see that you advanced a lot! I will try to answer in the chronological order:
I have tried to use `with tango.EnsureOmniThread()` in `__del__` but the result is exactly the same. Anyway, Anton already discouraged its usage. Also, I think it should not be necessary because the event consumer thread should already be an omni thread.
I was also thinking about which thread the `__del__` methods will be called in. I wanted to modify your test script to include the thread name in the subscription and unsubscription print messages,

I did this last Friday; based on it I made my vague interpretation of what is going on. There are just two threads: MainThread and Dummy-1. Dummy-1 is the one which executes the event callbacks. Then, if two conditions are met:
it hangs the tests. See the logs with thread names from the test execution.
but I couldn't get a Docker environment set up that ran tests at all.
You can try using the reszelaz/sardana-test docker image and follow the instructions from the "How to develop sardana using this image" section. I just updated the image so it uses the latest taurus from develop. So, basically you will need to do something like this:
docker pull reszelaz/sardana-test
docker run -d --name=sardana-test -h sardana-test -v /home/zreszela/workspace/sardana:/sardana reszelaz/sardana-test
docker exec sardana-test bash -c "cd /sardana && python3 setup.py develop"
docker exec sardana-test bash -c "python3 -m unittest sardana.taurus.core.tango.sardana.test.test_pool.TestMeasurementGroup -v"
Note: /home/zreszela/workspace/sardana is the path where I cloned sardana; you will need to change it. Remember to use the branch I mentioned above.
In my case, to reproduce this issue on docker I was forced to add more tests to the `TestMeasurementGroup` class. Basically I doubled them by copying and pasting the already defined tests. With this I see API_EventTimeout at almost every execution.
Whenever you reproduce it, to repeat it again, you will need to kill the hung processes (server & tests) and clean the DB:
docker exec sardana-test bash -c "killall -9 /usr/bin/python3"
docker exec sardana-test bash -c "python3 -c 'import tango; db = tango.Database(); [db.delete_server(server) for server in db.get_server_list(\"Pool/unittest*\")]'"
I just had time to read the above blog article. Based on it, I wonder: what is the lock that we deadlock on? It must come from Tango, right? I have made some investigation and I would rather discard this PyTango lock: https://github.com/tango-controls/pytango/blob/5865b6f0a53f2264df1d1443147b2e34b7399fc5/tango/device_proxy.py#L1073
Please correct me if I'm wrong, but if we are just using the MainThread and the Dummy-1 thread (Tango event consumer) and it still hangs (or produces API_EventTimeouts), then maybe we are hitting some other problem and not the one that can be fixed by the `tango.EnsureOmniThread` context manager?
Another option is not to unsubscribe at all when your client objects are finalised - that is what was happening under Python 2, since `__del__` wasn't called. Won't the device eventually give up sending the events if the device proxy that subscribed is no longer connected?

In cppTango, when a DeviceProxy object is deleted, it unsubscribes from the events already subscribed via this DeviceProxy object. This was added in 2016 by @taurel, probably for a very good reason(?!). So do you suggest modifying cppTango in order to remove this unsubscription in the DeviceProxy destructor? ~DeviceProxy calls unsubscribe_all_events() (https://github.com/tango-controls/cppTango/blob/9.3.3/cppapi/client/devapi_base.cpp#L2641), which invokes several methods that take a lock which is also taken during event subscription: https://github.com/tango-controls/cppTango/blob/9.3.3/cppapi/client/devapi_base.cpp#L2678 and https://github.com/tango-controls/cppTango/blob/9.3.3/cppapi/client/devapi_base.cpp#L2641
Here are the lines where the lock is taken: In ZmqEventConsumer::get_subscribed_event_ids(): https://github.com/tango-controls/cppTango/blob/9.3.3/cppapi/client/zmqeventconsumer.cpp#L3431 In EventConsumer::unsubscribe_event(int): https://github.com/tango-controls/cppTango/blob/9.3.3/cppapi/client/event.cpp#L1870 and https://github.com/tango-controls/cppTango/blob/9.3.3/cppapi/client/event.cpp#L1908
In EventConsumer::subscribe_event: https://github.com/tango-controls/cppTango/blob/9.3.3/cppapi/client/event.cpp#L1246
I think what's happening in your use case is very similar to what is described in https://codewithoutrules.com/2017/08/16/concurrency-python/ in the "Reentrancy!" section.
As far as I understand, the main problem comes from the fact that we enter into a situation where the garbage collector is invoked in the middle of an operation which has taken the lock (subscribe_event) and has not yet released it. The garbage collector is trying to delete a DeviceProxy object, so trying to unsubscribe to all the events which were subscribed via this DeviceProxy. This unsubscribe operation is also trying to acquire the lock but it cannot get it because the lock could not be released by the code which was interrupted by the garbage collector and this code is waiting for the garbage collector cycle to end (it will never end). So we are indeed in a deadlock.
This is quite an annoying problem because the problem should not be present in C++. It looks like we cannot use any lock in the destructors.
The warning in the Python3 documentation is also clear on the topic:
__del__() can be invoked when arbitrary code is being executed, including from any arbitrary thread. If __del__() needs to take a lock or invoke any other blocking resource, it may deadlock as the resource may already be taken by the code that gets interrupted to execute __del__().
This blog post seems to discuss a very similar problem in the MongoDB python driver:
https://emptysqua.re/blog/pypy-garbage-collection-and-a-deadlock/
Their solution is to use a background thread that periodically checks a list for cleanup tasks. The `__del__` method only appends tasks to that list. Here is a toy example:
https://github.com/ajdavis/what-to-expect-when-youre-expiring/blob/master/example_123_lock.py
That solution is used since 2015, so it seems to be stable:
https://github.com/mongodb/mongo-python-driver/blob/master/pymongo/mongo_client.py#L707
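A minimal sketch of that pattern with made-up names (this is not PyTango/Taurus API): `__del__` takes no lock and makes no Tango call, it only schedules the work for a dedicated thread that performs the unsubscription later:

```python
import threading
import time

_pending = []   # cleanup callables; list.append()/pop() are atomic under the GIL

def _cleanup_worker():
    while True:
        while _pending:
            task = _pending.pop()
            try:
                task()        # e.g. proxy.unsubscribe_event(event_id)
            except Exception:
                pass          # the subscription may already be gone
        time.sleep(0.5)

threading.Thread(target=_cleanup_worker, daemon=True).start()

class AttributeWrapper:
    """Stand-in for a client-side attribute object owning one subscription."""
    def __init__(self, proxy, event_id):
        self._proxy = proxy
        self._event_id = event_id
    def __del__(self):
        # Only append to the list here; never call unsubscribe_event directly.
        _pending.append(
            lambda p=self._proxy, i=self._event_id: p.unsubscribe_event(i))
```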
Another solution might be to start a one shot timer in the `__del__` method:
import threading

lock = threading.Lock()

class C:
    def __del__(self):
        def target():
            print('getting lock')
            with lock:
                print('releasing lock')
        # target()  # this deadlocks
        t = threading.Timer(0, target)
        t.start()

c = C()
with lock:
    del c
It seems to work in this simple test case, but I don't know if it is safe in general.
Thanks Reynald for the Tango C++ details! And Tim for joining this discussion! Things are getting clearer in my mind now.
Another option is not to unsubscribe at all when your client objects are finalised - that is what was happening under Python 2, since del wasn't called. Won't the device eventually give up sending the events if the device proxy that subscribed is no longer connected?
In cppTango, when a DeviceProxy object is deleted, it unsubscribes to the events already subscribed via this DeviceProxy object. This was added in 2016 by @taurel , probably for a very good reason(?!). So do you suggest to modify cppTango in order to remove this unsubscription in DeviceProxy destructor?
I think that Anton here referred to removing the unsubscribes from `taurus.Attribute.__del__` and not from `~DeviceProxy`?
Also about:
that is what was happening under Python 2, since del wasn’t called
Actually I'm not sure that in Python 2 these were not called. I have vague memories that they were called, but maybe not so often, and maybe much later in time. But I think it is not worth investigating this. If you agree, let's focus on Python 3.
So, what I tried is what Anton said (or at least what I understood :)) - to not call the unsubscribes in `taurus.Attribute.__del__`. And it was still hanging... Here I think we hit what Reynald explains in https://github.com/tango-controls/pytango/issues/292#issuecomment-584627536 about `~DeviceProxy` - it unsubscribes all events when we are destructing it. I have not confirmed it by placing prints in Tango C++, but it makes sense to me. So, I would discard this as a solution for Taurus.
Reynald (@bourtemb), since I'm not so fluent in C++, could you clarify for me whether, in either of these occasions:
- we subscribed and unsubscribed from all events on the DeviceProxy
- we never subscribed to any events on the DeviceProxy
it will still try to get the lock when destructing the DeviceProxy?
I'm almost sure, but please also confirm: the lock is global per client, and not per DeviceProxy?
Now, a question more for the Python experts. When the GC calls `__del__` methods in one of the threads and one of these `__del__` methods waits on a lock, will Python not switch to another thread until the GC finishes, so that there is no chance that the lock will be released? This would explain an observation that is still unclear to me.
The comments below Tim's first link somehow confirm this:
GC doesn't automatically run on a separate thread - it hijacks whichever thread is active at the moment GC is triggered. But if the thread that happens to be running GC needs a lock that's held by a different thread, then there's a deadlock.
But since there are contradictory comments below, and they also mention the PyPy implementation, I'm not 100% sure yet.
I think that comment is wrong. It should deadlock when the `__del__` method is executed on the same thread that holds the lock. Here is a test for that:
import threading
import gc
from time import sleep

lock = threading.Lock()

def target():
    print('target')
    with lock:
        print("thread has lock")
        sleep(1)
        gc.collect()  # this deadlocks
    print("thread released lock")

t = threading.Thread(target=target)
t.start()

class C(object):
    def __del__(self):
        print('gc getting lock')
        with lock:
            print('gc releasing lock')
            pass

c = C()
c.self = c  # create cyclic reference
del c
# gc.collect()  # this doesn't deadlock

t.join()
Depending on the thread where `gc.collect` is called, this script deadlocks or not.
Edit:
The methods `subscribe_event` and `unsubscribe_event` of `PyTango.DeviceProxy` seem to acquire a `threading.Lock`. Thus, when the `__del__` method calling the `unsubscribe_event` method is executed by the GC on a thread that is just executing the part of these methods that holds the lock, it deadlocks.
Reynald (@bourtemb), since I'm not so fluent in C++, could you clarify for me whether, in either of these occasions:
- we subscribed and unsubscribed from all events on the DeviceProxy
- we never subscribed to any events on the DeviceProxy
it will still try to get the lock when destructing the DeviceProxy?
~DeviceProxy() calls unsubscribe_all_events()
unsubscribe_all_events() calls ZmqEventConsumer::get_subscribed_event_ids()
ZmqEventConsumer::get_subscribed_event_ids() takes ZmqEventConsumer map_modification_lock, as ReaderLock
This should not be blocking unless another thread currently holds a writer lock on this lock.
Another thread could hold map_modification_lock as a writer lock if the client is currently trying to subscribe to an event or trying to unsubscribe from an already subscribed event.
Please note that in some cases, the Tango keep alive thread might try periodically to subscribe to events which are not yet connected. This can happen when the client tries to subscribe to some events with the stateless flag set to true.
I'm almost sure, but please also confirm, the lock is global per client, and not per DeviceProxy?
I confirm, this ZmqEventConsumer::map_modification_lock lock is global. It is a lock associated to the ZmqEventConsumer object instance, which is unique per client.
To clarify, `PyTango.DeviceProxy` has an additional `threading.Lock` called `_subscribed_events_lock`:
https://github.com/tango-controls/pytango/blob/develop/tango/device_proxy.py#L1067
which is acquired during `subscribe_event` and `unsubscribe_event`, e.g.:
https://github.com/tango-controls/pytango/blob/develop/tango/device_proxy.py#L1291
Calling one of these methods in a `__del__` method of an object that has cyclic references leads to the observed deadlocks, because the garbage collector can potentially execute the `__del__` method on any Python thread after any instruction*, as illustrated by the above toy example where the GC is triggered manually in different threads.
*Actually, the GC of CPython only runs after instructions that allocate, so avoiding allocations in code paths that hold the lock is a possible but difficult workaround.
There are at least the following solutions or workarounds:
1. Remove the `_subscribed_events_lock` in PyTango
2. Avoid `__del__` methods that call `unsubscribe_event`:
   a. Do not unsubscribe, as this seems to be no problem for Python < 3.4, where cyclic references were never garbage collected and thus the `__del__` methods were not executed
   b. Use a background thread like MongoDB
   c. Try out the one shot timer I proposed above
3. Avoid the reference cycles so that the `__del__` method is executed deterministically on CPython
method is executed deterministically on CPythonI think that Anton here referred to remove the unsubscribes from taurus.Attribute.del and not from ~DeviceProxy?
Yes, @reszelaz I meant remove the Python calls to unsubscribe in taurus.Attribute.__del__
.
Thanks for the great analysis, everyone!
Considering @schooft's suggestions:
So, what I tried is what Anton said (or at least what I understood:) - to not call unsubscribes in taurus.Attribute.del. And it was still hanging... Here I think we hit what Reynald explains in #292 (comment) about ~DeviceProxy
When this issue was first posted, I added a call to gc.collect at the end of the outer loop in JobThread.run from the original example and it didn't crash or hang anymore.
- Yes, this could work, although we would have to disable the GC, and it might be hard to find a good time to periodically call collect.
I didn't mean to disable the GC but to not allocate any objects in the problematic code paths as done for MongoDB version 2. As far as I understand, the GC is only triggered when new python objects are allocated except for some special occasions like shutdown.
When this issue was first posted, I added a call to gc.collect at the end of the outer loop in JobThread.run from the original example and it didn't crash or hang anymore.
This makes sense. The GC only actually runs when allocations exceed certain thresholds. By running it regularly, the times when the allocations exceed the thresholds are changed. If there are only few allocations, the GC might not run at all outside the manual invocations.
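Not a fix, but to make the threshold behaviour concrete, the counters and thresholds can be inspected and tuned via the gc module:

```python
import gc

print(gc.get_threshold())  # e.g. (700, 10, 10): gen-0 collection after ~700 net allocations
print(gc.get_count())      # objects allocated minus deallocated since the last collection

# Raising the generation-0 threshold (or calling gc.disable()) postpones automatic
# collections; a manual gc.collect() resets the counters at a moment of our choosing.
gc.set_threshold(100000, 10, 10)
gc.collect()
```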
- c) Not sure how often these objects get deleted - if there are hundreds per second, the one-shot timer approach might create a huge number of threads.
Good point.
I agree that there is likely another problem. The backtrace attached in https://github.com/tango-controls/pytango/issues/292#issuecomment-583457593 does not contain any calls from the garbage collector. At least in the toy example below, I can see the following frames:
#17 0x000055d91f9c3f4f in slot_tp_finalize () at ../Objects/typeobject.c:6343
#18 0x000055d91f966abf in finalize_garbage (collectable=0x7fc3d493a5d0) at ../Modules/gcmodule.c:808
#19 collect () at ../Modules/gcmodule.c:1003
#20 0x000055d91f968089 in collect_with_callback (generation=0) at ../Modules/gcmodule.c:1119
#21 collect_generations () at ../Modules/gcmodule.c:1142
#22 _PyObject_GC_Alloc (basicsize=<optimized out>, use_calloc=0) at ../Modules/gcmodule.c:1708
#23 _PyObject_GC_Malloc () at ../Modules/gcmodule.c:1718
#24 _PyObject_GC_New () at ../Modules/gcmodule.c:1730
#25 0x000055d91fa0be3c in PyList_New () at ../Objects/listobject.c:159
This is the toy example that triggers the GC on a random thread by allocating many objects:
import threading

lock = threading.Lock()

class C(object):
    def __del__(self):
        print('gc getting lock on', threading.current_thread().getName())
        with lock:
            print('gc releasing lock')

def target():
    print('target')
    c2 = C()
    c2.self = c2
    c2 = None
    l2 = []
    with lock:
        print("thread has lock", threading.current_thread().getName())
        for i in range(10000):
            l2.append([])  # create allocations to trigger gc
    print("thread released lock")

t = threading.Thread(target=target)
t.start()

c = C()
c.self = c
c = None
l = []
for i in range(1000):
    l.append([])  # create allocations to trigger gc

t.join()
On my PC this deadlocks roughly every second time. It is noteworthy that at least one garbage collectable object (not necessarily the problematic one) needs to be created on the thread that holds the lock, otherwise I didn't manage to trigger the GC on that thread.
@reszelaz Could you also create a backtrace with `thread apply all py-bt` to see in which Python code the threads are stuck? (You might need to install the `python3-dbg` package on Debian. Alternatively, you could try `py-spy dump --native --pid $pid`, see https://github.com/benfred/py-spy/)
Edit: The backtrace above is now the correct one for the toy example.
I got a backtrace myself for the PyTango test case (I am not sure if my system is setup correctly as I have a lot of boost error handling stuff in the full backtrace):
Thread 11 (Thread 0x7f0296ffd700 (LWP 6955)):
Traceback (most recent call first):
Waiting for the GIL
File "/usr/lib/python3/dist-packages/tango/device_proxy.py", line 1305, in __DeviceProxy__unsubscribe_event
self.__unsubscribe_event(event_id)
File "/usr/lib/python3/dist-packages/tango/green.py", line 109, in run
return fn(*args, **kwargs)
File "/usr/lib/python3/dist-packages/tango/green.py", line 195, in greener
return executor.run(fn, args, kwargs, wait=wait, timeout=timeout)
File "PyDsExpClient.py", line 141, in __del__
Garbage-collecting
<built-in method write of _io.TextIOWrapper object at remote 0x7f02baa5d630>
<built-in method print of module object at remote 0x7f02baab15e8>
File "PyDsExpClient.py", line 145, in push_event
File "PyDsExpClient.py", line 106, in __call__
File "/usr/lib/python3/dist-packages/tango/green.py", line 92, in submit
return fn(*args, **kwargs)
File "/usr/lib/python3/dist-packages/tango/green.py", line 210, in greener
return executor.submit(fn, *args, **kwargs)
File "/usr/lib/python3/dist-packages/tango/device_proxy.py", line 1252, in __DeviceProxy__subscribe_event_attrib
attr_name, event_type, cb, filters, stateless, extract_as)
File "/usr/lib/python3/dist-packages/tango/device_proxy.py", line 1188, in __DeviceProxy__subscribe_event
return __DeviceProxy__subscribe_event_attrib(self, *args, **kwargs)
File "/usr/lib/python3/dist-packages/tango/green.py", line 109, in run
return fn(*args, **kwargs)
File "/usr/lib/python3/dist-packages/tango/green.py", line 195, in greener
return executor.run(fn, args, kwargs, wait=wait, timeout=timeout)
File "PyDsExpClient.py", line 131, in __init__
File "PyDsExpClient.py", line 118, in getAttribute
File "PyDsExpClient.py", line 163, in run
File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()
File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
self._bootstrap_inner()
So, while in the print function in `PyTangoAttribute.push_event`, the garbage collector is triggered and calls the `__del__` method of a `PyTangoAttribute` (the garbage collector only collects attributes with the name "state"; the "attr1" attributes have no cyclic references and are deleted immediately after initializing a new one).
The relevant backtrace that leads to the call of `PyTangoAttribute.push_event`:
#110 0x00007f02b8b93d22 in PyCallBackPushEvent::push_event(Tango::AttrConfEventData*) () from /usr/lib/python3/dist-packages/tango/_tango.cpython-35m-x86_64-linux-gnu.so
#111 0x00007f02b8100c2f in Tango::EventConsumer::get_fire_sync_event(Tango::DeviceProxy*, Tango::CallBack*, Tango::EventQueue*, Tango::EventType, std::__cxx11::basic_string<char, std::char_traits<char>, std::all---Type <return> to continue, or q <return> to quit---
ocator<char> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, Tango::event_callback&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)
() from /usr/lib/x86_64-linux-gnu/libtango.so.9
#112 0x00007f02b8102492 in Tango::EventConsumer::connect_event(Tango::DeviceProxy*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, Tango::EventType, Tango::CallBack*, Tango::EventQueue*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, int) () from /usr/lib/x86_64-linux-gnu/libtango.so.9
#113 0x00007f02b8103de4 in Tango::EventConsumer::subscribe_event(Tango::DeviceProxy*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, Tango::EventType, Tango::CallBack*, Tango::EventQueue*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, bool) () from /usr/lib/x86_64-linux-gnu/libtango.so.9
#114 0x00007f02b8104350 in Tango::EventConsumer::subscribe_event(Tango::DeviceProxy*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, Tango::EventType, Tango::CallBack*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, bool) ()
from /usr/lib/x86_64-linux-gnu/libtango.so.9
#115 0x00007f02b8017312 in Tango::DeviceProxy::subscribe_event(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, Tango::EventType, Tango::CallBack*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, bool) ()
from /usr/lib/x86_64-linux-gnu/libtango.so.9
The relevant backtrace from within the `__del__` method:
#2 0x00007f02b834ddf9 in Tango::TangoMonitor::wait(long) ()
from /usr/lib/x86_64-linux-gnu/libtango.so.9
#3 0x00007f02b8107ffc in Tango::TangoMonitor::get_monitor() ()
from /usr/lib/x86_64-linux-gnu/libtango.so.9
#4 0x00007f02b81269b0 in Tango::DelayEvent::DelayEvent(Tango::EventConsumer*) ()
from /usr/lib/x86_64-linux-gnu/libtango.so.9
#5 0x00007f02b80fd9ce in Tango::EventConsumer::unsubscribe_event(int) ()
from /usr/lib/x86_64-linux-gnu/libtango.so.9
#6 0x00007f02b801e577 in Tango::DeviceProxy::unsubscribe_event(int) ()
from /usr/lib/x86_64-linux-gnu/libtango.so.9
The full backtrace backtrace.txt
This was one of the runs. During other runs, I didn't get a deadlock/crash but instead this exception:
PyTangoAttribute.__del__: before unsubscribe state
Exception ignored in: <bound method PyTangoAttribute.__del__ of <__main__.PyTangoAttribute object at 0x7f263243a5c0>>
Traceback (most recent call last):
File "PyDsExpClient.py", line 141, in __del__
self._dev.dp.unsubscribe_event(id_)
File "/usr/lib/python3/dist-packages/tango/green.py", line 195, in greener
return executor.run(fn, args, kwargs, wait=wait, timeout=timeout)
File "/usr/lib/python3/dist-packages/tango/green.py", line 109, in run
return fn(*args, **kwargs)
File "/usr/lib/python3/dist-packages/tango/device_proxy.py", line 1305, in __DeviceProxy__unsubscribe_event
self.__unsubscribe_event(event_id)
PyTango.EventSystemFailed: (DevError(desc = 'Failed to unsubscribe event, the event id specified does not correspond with any known one', origin = 'EventConsumer::get_event_system_for_event_id()', reason = 'API_EventNotFound', severity = tango._tango.ErrSeverity.ERR),)
This was the `__del__` method of a rather old attribute object which should have been deleted many iterations earlier. Hope this helps.
To trigger the above error reliably, just call `gc.collect` in `PyTangoAttribute.push_event`.
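For clarity, the modification meant here is just a forced collection inside the callback. A sketch against the `PyTangoAttribute` class from the test example (the method body shown is hypothetical, not the repository's code):

```python
import gc

class PyTangoAttribute(object):
    # __init__, __del__ and the event subscription as in the original example
    def push_event(self, event):
        # Forcing a collection while the event consumer thread is inside the
        # callback makes the GC run __del__ (and thus unsubscribe_event) at
        # exactly the point where the deadlock appears, so the failure becomes
        # reproducible on every run.
        gc.collect()
        print("push_event:", getattr(event, "attr_name", event))
```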
Hi all,
Tim, sorry that I did not send you the backtrace before, but this morning I rebooted my PC and afterwards the Sardana tests that I used to run started to behave differently. At the beginning it was driving me crazy, but now I think it helped me to advance a little bit...
So, after the reboot, the earlier mentioned test module was reproducibly hanging at the very first test. It was very much aligned with this observation https://github.com/tango-controls/pytango/issues/292#issuecomment-583533606 but this time the event consumer thread was not even involved. Then I thought that I was hitting the potential deadlock of the GC waiting on `_subscribed_events_lock` as you explained. So, I added some more logs around the lock. But I think I was not hitting this one. Then, simplifying the test, I came to this example. Could it be the root cause of this issue? It would be nice if someone could test it and confirm whether it is reproducible outside of my environment.
Now, regarding your tests from https://github.com/tango-controls/pytango/issues/292#issuecomment-586361482. I'm not sure, but I think that for this one it is necessary to use the `tango.EnsureOmniThread()` context manager, since there we are using threads which are not omni threads. But better if someone else confirms that.
Also, regarding the potential deadlock on the `_subscribed_events_lock` lock: this lock was already under revision in https://github.com/tango-controls/pytango/issues/185#issuecomment-367382820, so maybe we could now review again whether it is needed? If it is finally needed, Tim's idea to change this particular PyTango code to ensure that the GC won't be triggered sounds good to me. The GC cannot enter on this thread while we are already executing Tango C++ calls, right? Also, I discard changing the lock to an `RLock`, because we may mess up the map, right?
At some point it came to my mind that instead of calling the unsubscribes in `__del__` we could call them in weakref callbacks or finalizers (`taurus.Factory` stores weakrefs to `taurus.Attribute`). But I imagine these must behave similarly to `__del__` - they will be called by the GC process (at a random moment and on a random thread), right?
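For reference, a weakref.finalize-based variant would look roughly like the sketch below (the class and attribute names are stand-ins). The registered callback is still executed when the GC clears the cycle, i.e. on whatever thread triggered the collection, so the same threading caveat applies unless the cleanup is also delegated elsewhere; the advantage is that an explicit, deterministic dispose() becomes easy:

```python
import weakref

class EventSubscription:
    """Stand-in for a taurus/pytango attribute wrapper owning one subscription."""
    def __init__(self, proxy, event_id):
        self._proxy = proxy
        self._event_id = event_id
        # The callback must not reference self (that would keep the object
        # alive); pass the proxy and the event id explicitly instead.
        self._finalizer = weakref.finalize(self, proxy.unsubscribe_event, event_id)

    def dispose(self):
        # Explicit, deterministic cleanup from a thread we control.
        self._finalizer()
```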
Thanks again for all your help in trying to fix this issue!
It would be nice if someone could test it and confirm if it is reproducible outside of my environment.
I tried the example, and got an almost identical result (the only differences are in the API_EventTimeout text). My test environment is Docker + Conda + Python 3.7.5 + cppTango 9.3.2 + your pytango branch.
(env-py3.7-tango9.3.2) root@d2d458b3bc7d:/opt/pytango/examples/issue-292/test-tango-unsub# python3 client1.py
MainThread:__DeviceProxy.__subscribe_event_attrib(attr1): before self.__subscribe_event
2020-02-15 13:47:27.278990 TEST/DEVICE1/1 ATTR1 ATTR_CONF label='attr1'; unit=''
MainThread:__DeviceProxy.__subscribe_event_attrib(attr1): after self.__subscribe_event; id = 1
MainThread:__DeviceProxy.__subscribe_event(1): before event_map_lock acquire
MainThread:__DeviceProxy.__subscribe_event(1): after event_map_lock acquire
MainThread:__DeviceProxy.__subscribe_event(1): after event_map_lock release
MainThread:__DeviceProxy.__subscribe_event_attrib(attr1): before self.__subscribe_event
2020-02-15 13:47:28.293413 TEST/DEVICE2/1 ATTR1 CHANGE [ATTR_VALID] 0.0
MainThread:__DeviceProxy.__subscribe_event_attrib(attr1): after self.__subscribe_event; id = 2
MainThread:__DeviceProxy.__subscribe_event(2): before event_map_lock acquire
MainThread:__DeviceProxy.__subscribe_event(2): after event_map_lock acquire
MainThread:__DeviceProxy.__subscribe_event(2): after event_map_lock release
MainThread:__DeviceProxy.__subscribe_event_attrib(attr2): before self.__subscribe_event
MainThread:__DeviceProxy__unsubscribe_event(1): before event_map_lock acquire
MainThread:__DeviceProxy__unsubscribe_event(1): after event_map_lock acquire
MainThread:__DeviceProxy__unsubscribe_event(1): after event_map_lock release
MainThread:__DeviceProxy__unsubscribe_event(1): before self.__unsubscribe_event
MainThread:__DeviceProxy__unsubscribe_event(1): after self.__unsubscribe_event
MainThread:__DeviceProxy.__subscribe_event_attrib(attr2): after self.__subscribe_event; id = 3
MainThread:__DeviceProxy.__subscribe_event(3): before event_map_lock acquire
MainThread:__DeviceProxy.__subscribe_event(3): after event_map_lock acquire
MainThread:__DeviceProxy.__subscribe_event(3): after event_map_lock release
2020-02-15 13:47:39.264995 TEST/DEVICE2/1 ATTR1 CHANGE [API_EventTimeout] Event channel is not responding anymore, maybe the server or event system is down
2020-02-15 13:47:39.266084 TEST/DEVICE2/1 ATTR1 CHANGE [ATTR_VALID] 0.0
ERROR: API_EventTimeout
2020-02-15 13:47:49.280982 TEST/DEVICE2/1 ATTR1 CHANGE [API_EventTimeout] Event channel is not responding anymore, maybe the server or event system is down
2020-02-15 13:47:49.282081 TEST/DEVICE2/1 ATTR1 CHANGE [ATTR_VALID] 0.0
ERROR: API_EventTimeout
2020-02-15 13:47:59.292498 TEST/DEVICE2/1 ATTR1 CHANGE [API_EventTimeout] Event channel is not responding anymore, maybe the server or event system is down
2020-02-15 13:47:59.293321 TEST/DEVICE2/1 ATTR1 CHANGE [ATTR_VALID] 0.0
ERROR: API_EventTimeout
@reszelaz Thank you for this great example case!
On my machine I get the same error. I also implemented the same client in C++ and the error occurs there as well.
Thanks for the tests! With Tim's example in C++ I have already reported it: https://github.com/tango-controls/cppTango/issues/686.
Hi all, I was just reviewing old issues and I think that this one can be closed already. We are happy with the fix in cppTango.
Hi PyTango experts,
I would like to report an issue that we found in the Sardana project after migrating it to Python 3.
First of all, sorry for the complexity of the examples (one uses pure PyTango and the other one Taurus) to reproduce it, but I was not able to reduce them further. Also sorry if the problem is finally not in PyTango but in Tango. But I do not have the knowledge of how to reproduce it with Tango C++.
I think all the necessary information and the steps to reproduce it are in https://github.com/reszelaz/test-tango-py3.
If you have any questions, don't hesitate to ask.
Cheers, Zibi