paulscherrerinstitute / pcaspy

Portable Channel Access Server in Python
BSD 3-Clause "New" or "Revised" License
32 stars 24 forks source link

deadlock with updates in epics base 3.14.12.6-rc1 and 3.15.5-rc1 #29

Closed xiaoqiangwang closed 7 years ago

xiaoqiangwang commented 7 years ago

Currenly all pcaspy applications assume that they can call Driver::updatePVs() at a separate thread. And the main thread will pick this request and send monitor events to clients.

In the release of 3.14.12.6-rc1 and 3.15.5-rc1, the pcas server receives an update to support dynamic length array. The affecting change is the additional call to [chan.getPVI().nativeCount()] (https://github.com/epics-base/epics-base/commit/07434172319922b287ce6277b89ce2e82bced3cb#diff-4eafaaeaa6480eaed1cfd866d63cd0adR886) inside casStrmClient::monitorResponse.

When the deadlock happens, here is the backtrace from both main thread 1 and python thread 2,

Thread 2 (Thread 0x7fbd87537700 (LWP 15445)):
#0  0x00000035ebe0e334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00000035ebe095f3 in _L_lock_892 () from /lib64/libpthread.so.0    
#2  0x00000035ebe094d7 in pthread_mutex_lock () from /lib64/libpthread.so.0                                                                                       
#3  0x00007fbd879d1026 in mutexLock (pmutex=0x1c7c230) at ../../../src/libCom/osi/os/posix/osdMutex.c:44                                                          
#4  epicsMutexOsdLock (pmutex=0x1c7c230) at ../../../src/libCom/osi/os/posix/osdMutex.c:116                                                                       
#5  0x00007fbd879c9d9c in epicsMutex::lock (this=<value optimized out>) at ../../../src/libCom/osi/epicsMutex.cpp:236                                             
#6  0x00007fbd880b1b9f in epicsGuard (this=<value optimized out>, monitorList=..., select=..., event=...) at ../../../../include/epicsGuard.h:68                  
#7  casEventSys::postEvent (this=<value optimized out>, monitorList=..., select=..., event=...) at ../../../../src/cas/generic/casEventSys.cc:334                 
#8  0x00007fbd880aee1c in postEvent (this=0x20c7db0, select=..., event=...) at ../../../../src/cas/generic/casCoreClient.h:167                                    
#9  postEvent (this=0x20c7db0, select=..., event=...) at ../../../../src/cas/generic/chanIntfForPV.h:71                                                           
#10 casPVI::postEvent (this=0x20c7db0, select=..., event=...) at ../../../../src/cas/generic/casPVI.cc:282                                                        
#11 0x00007fbd88502199 in PV::postEvent (this=0x1e95270, mask=<value optimized out>, value=...) at pcaspy/pv.cpp:96                                               
#12 0x00007fbd884f65d9 in _wrap_PV_postEvent (args=<value optimized out>) at pcaspy/casdef_wrap.cpp:9740                                                          
#13 0x00000036006d55b6 in PyEval_EvalFrameEx () from /usr/lib64/libpython2.6.so.1.0                                                                               
#14 0x00000036006d7647 in PyEval_EvalCodeEx () from /usr/lib64/libpython2.6.so.1.0                                                                                
#15 0x00000036006d5a94 in PyEval_EvalFrameEx () from /usr/lib64/libpython2.6.so.1.0                                                                               
#16 0x00000036006d6b7f in PyEval_EvalFrameEx () from /usr/lib64/libpython2.6.so.1.0                                                                               
#17 0x00000036006d6b7f in PyEval_EvalFrameEx () from /usr/lib64/libpython2.6.so.1.0                                                                               
#18 0x00000036006d7647 in PyEval_EvalCodeEx () from /usr/lib64/libpython2.6.so.1.0                                                                                
#19 0x000000360066ad9d in ?? () from /usr/lib64/libpython2.6.so.1.0                                                                                               
#20 0x0000003600643c63 in PyObject_Call () from /usr/lib64/libpython2.6.so.1.0                                                                                    
#21 0x00000036006d4460 in PyEval_EvalFrameEx () from /usr/lib64/libpython2.6.so.1.0                                                                               
#22 0x00000036006d6b7f in PyEval_EvalFrameEx () from /usr/lib64/libpython2.6.so.1.0                                                                               
#23 0x00000036006d6b7f in PyEval_EvalFrameEx () from /usr/lib64/libpython2.6.so.1.0                                                                               
#24 0x00000036006d7647 in PyEval_EvalCodeEx () from /usr/lib64/libpython2.6.so.1.0                                                                                
#25 0x000000360066aca0 in ?? () from /usr/lib64/libpython2.6.so.1.0                                                                                               
#26 0x0000003600643c63 in PyObject_Call () from /usr/lib64/libpython2.6.so.1.0                                                                                    
#27 0x00000036006566af in ?? () from /usr/lib64/libpython2.6.so.1.0                                                                                               
#28 0x0000003600643c63 in PyObject_Call () from /usr/lib64/libpython2.6.so.1.0                                                                                    
#29 0x00000036006cfc83 in PyEval_CallObjectWithKeywords () from /usr/lib64/libpython2.6.so.1.0                                                                    
#30 0x00000036007017aa in ?? () from /usr/lib64/libpython2.6.so.1.0                                                                                               
#31 0x00000035ebe07aa1 in start_thread () from /lib64/libpthread.so.0                                                                                             
#32 0x00000035eb2e893d in clone () from /lib64/libc.so.6                                                                                                          

Thread 1 (Thread 0x7fbd92e08700 (LWP 15443)):
#0  0x00000035ebe0e334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00000035ebe095f3 in _L_lock_892 () from /lib64/libpthread.so.0    
#2  0x00000035ebe094d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007fbd879d1026 in mutexLock (pmutex=0x20783e0) at ../../../src/libCom/osi/os/posix/osdMutex.c:44
#4  epicsMutexOsdLock (pmutex=0x20783e0) at ../../../src/libCom/osi/os/posix/osdMutex.c:116             
#5  0x00007fbd879c9d9c in epicsMutex::lock (this=<value optimized out>) at ../../../src/libCom/osi/epicsMutex.cpp:236
#6  0x00007fbd880adbc8 in epicsGuard (this=0x20c7db0) at ../../../../include/epicsGuard.h:68                         
#7  casPVI::nativeCount (this=0x20c7db0) at ../../../../src/cas/generic/casPVI.cc:521                                
#8  0x00007fbd880a9ad4 in casStrmClient::monitorResponse (this=0x2112900, guard=..., chan=..., msg=..., desc=..., completionStatus=0)
    at ../../../../src/cas/generic/casStrmClient.cc:886                                                                              
#9  0x00007fbd880b23a3 in casMonitor::response (this=<value optimized out>, guard=<value optimized out>, client=<value optimized out>, 
    value=<value optimized out>) at ../../../../src/cas/generic/casMonitor.cc:61                                                       
#10 0x00007fbd880b24e7 in casMonitor::executeEvent (this=0x21152d0, client=..., ev=..., value=<value optimized out>, clientGuard=..., evGuard=...)
    at ../../../../src/cas/generic/casMonitor.cc:110                                                                                              
#11 0x00007fbd880b1fd2 in casEventSys::process (this=0x2112928, casClientGuard=...) at ../../../../src/cas/generic/casEventSys.cc:133             
#12 0x00007fbd880b75d9 in eventSysProcess (this=0x2112b48) at ../../../../src/cas/generic/casCoreClient.h:176
#13 casStreamEvWakeup::expire (this=0x2112b48) at ../../../../src/cas/generic/st/casStreamOS.cc:150
#14 0x00007fbd879d6943 in timerQueue::process (this=0x1e126f8, currentTime=...) at ../../../src/libCom/timer/timerQueue.cpp:139
#15 0x00007fbd879be388 in fdManager::process (this=0x7fbd87beea40, delay=<value optimized out>) at ../../../src/libCom/fdmgr/fdManager.cpp:127
#16 0x00007fbd884f1e17 in _wrap_process (args=<value optimized out>) at pcaspy/casdef_wrap.cpp:11076
#17 0x00000036006d59d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython2.6.so.1.0
#18 0x00000036006d6b7f in PyEval_EvalFrameEx () from /usr/lib64/libpython2.6.so.1.0
#19 0x00000036006d7647 in PyEval_EvalCodeEx () from /usr/lib64/libpython2.6.so.1.0
#20 0x00000036006d7722 in PyEval_EvalCode () from /usr/lib64/libpython2.6.so.1.0
#21 0x00000036006f1b9c in ?? () from /usr/lib64/libpython2.6.so.1.0
#22 0x00000036006f1c70 in PyRun_FileExFlags () from /usr/lib64/libpython2.6.so.1.0
#23 0x00000036006f315c in PyRun_SimpleFileExFlags () from /usr/lib64/libpython2.6.so.1.0
#24 0x00000036006ff892 in Py_Main () from /usr/lib64/libpython2.6.so.1.0
#25 0x00000035eb21ed5d in __libc_start_main () from /lib64/libc.so.6
#26 0x0000000000400649 in _start ()
xiaoqiangwang commented 7 years ago

Here is how the deadlock happens,

main thread               python update thread
casEventSys::process      casPVI::postEvent
casPVI::nativeCount       casEventSys::postEvent
bhill-slac commented 7 years ago

Hi Xiaoqiang, Do you have a pcaspy example that I can run to reproduce this? Thanks,

xiaoqiangwang commented 7 years ago

Hi Bruce,

I used the simscope.py as the test program.

After building pcaspy 0.6.2 or the current HEAD in the repo,

$ cd example
# make a soft link to the current build
$ ln -s ../build/lib.linux-x86_64-2.6/pcaspy .
$ python simscope.py

At another terminal, launch the MEDM panel, change the Update time to 0.01 second, and click Run.

medm -x -macro P=MTEST simscope.adl

After a few seconds (in my case), the sinuous wave stops updating. And now if you attach gdb to the PCASpy program, and exec command (gdb) thread apply all bt. The traceback shows two threads are stuck in pthread_mutex_lock, as show in the first post.

xiaoqiangwang commented 7 years ago

The official release of EPICS base 3.14.12.6 has fixed this deadlock.

till-s commented 5 years ago

Note that the fix mentioned above only remedies one particular race condition but there are more...

I filed a bug report here

https://bugs.launchpad.net/epics-base/+bug/1830957

xiaoqiangwang commented 5 years ago

The deadlock you discovered is indeed serious.

https://bugs.launchpad.net/epics-base/+bug/1830957

In theory the deadlock should happen soon if I do "caput -c " on a fast changing PV. I used simscopy.py as the test program (updating in 100Hz) and ran the "caput -c " in 50Hz.

count=1;while [ 1 ]; do caput -c MTEST:MinValue $((count++));sleep 0.02; done

However it cannot be reproduced. In what usage do you see this deadlock, in a cagateway?