Open bfrk opened 4 years ago
Probably fixed in commit 8b9359, which is included from 2.8.11 on. Please update and try again.
Unfortunately, upgrading did not help. I have tried with 2.8.16. The core dump says:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f914d054db1 in epicsMutexOsdLock (pmutex=0x56130bec9230)
at ../../../src/libCom/osi/os/posix/osdMutex.c:127
127 {
[Current thread is 1 (Thread 0x7f9147dfb700 (LWP 20781))]
(gdb) bt
#0 0x00007f914d054db1 in epicsMutexOsdLock (pmutex=0x56130bec9230)
at ../../../src/libCom/osi/os/posix/osdMutex.c:127
#1 0x00007f914d04dd59 in epicsMutexLock (pmutexNode=<optimized out>)
at ../../../src/libCom/osi/epicsMutex.cpp:143
#2 0x00007f914d04df0c in epicsMutex::lock (this=<optimized out>)
at ../../../src/libCom/osi/epicsMutex.cpp:244
#3 0x00007f914d9e86b8 in StreamCore::MutexLock::MutexLock (_stream=0x56130bec8bb0,
this=<synthetic pointer>) at ../StreamCore.h:121
#4 StreamCore::readCallback (this=0x56130bec8bb0, status=StreamIoNoReply, input=0x0,
size=0) at ../StreamCore.cc:956
#5 0x00007f914d9e3b54 in StreamCore::evalIn (this=this@entry=0x56130bec8bb0)
at ../StreamCore.cc:919
#6 0x00007f914d9e8804 in StreamCore::readCallback (this=0x56130bec8bb0,
status=<optimized out>, input=<optimized out>, size=<optimized out>)
at ../StreamCore.cc:1107
#7 0x00007f914d9e3b54 in StreamCore::evalIn (this=this@entry=0x56130bec8bb0)
at ../StreamCore.cc:919
#8 0x00007f914d9e8804 in StreamCore::readCallback (this=0x56130bec8bb0,
status=<optimized out>, input=<optimized out>, size=<optimized out>)
at ../StreamCore.cc:1107
[...more of the same...]
#1268 0x00007f914d9e8804 in StreamCore::readCallback (this=0x56130bec8bb0,
status=<optimized out>, input=<optimized out>, size=<optimized out>)
at ../StreamCore.cc:1107
#1269 0x00007f914d9cc57d in StreamBusInterface::readCallback (this=0x56130bec9310,
size=0, input=0x0, status=StreamIoNoReply) at ../StreamBusInterface.h:121
#1270 AsynDriverInterface::readHandler (this=this@entry=0x56130bec9310)
at ../AsynDriverInterface.cc:1054
#1271 0x00007f914d9ce7d8 in AsynDriverInterface::handleRequest (this=0x56130bec9310)
at ../AsynDriverInterface.cc:1513
#1272 0x00007f914e659909 in portThread (pport=0x56130be7b7c0)
at ../../asyn/asynDriver/asynManager.c:902
#1273 0x00007f914d052b40 in start_routine (arg=0x56130be7bd70)
at ../../../src/libCom/osi/os/posix/osdThread.c:403
#1274 0x00007f914b96a4a4 in start_thread (arg=0x7f9147dfb700) at pthread_create.c:456
#1275 0x00007f914bc68d0f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
I will have a look. I assume your record is an ai? What is your device sending? Is it TCP or serial?
Luckily(?) it crashed almost immediately after I enabled asyn tracing (it usually takes a few minutes after restart). These are the IOC's last words:
SIOC66C>asynSetTraceIOMask PS3 0 0x2
SIOC66C>asynSetTraceMask PS3 0 0x9
SIOC66C>2020/08/20 13:00:18.737 192.168.21.139:4003 read 69
@\005@\0200\002RD\003\025\0201\004`\005`\0200\002|0062|2|1|0|1241|0|0.7665E-1 \003\033\0201\004\004A\005A\0200\002RD\003\025\0201\004a\005
2020/08/20 13:00:18.740 192.168.21.139:4003 read 3
a\0200
2020/08/20 13:00:18.750 192.168.21.139:4003 read 1
\002
2020/08/20 13:00:18.789 192.168.21.139:4003 read 32
|0062|2|9|1|189|400|0.1377E-1 \003
2020/08/20 13:00:18.802 192.168.21.139:4003 read 2
\0201
2020/08/20 13:00:18.807 192.168.21.139:4003 read 1
\004
2020/08/20 13:00:18.817 192.168.21.139:4003 read 1
\004
2020/08/20 13:00:18.827 192.168.21.139:4003 read 2
B\005
2020/08/20 13:00:19.929 192.168.21.139:4003 read 107
B\0200\002RD\003\025\0201\004b\005b\0200\002|0062|2|1|0|1260|0|0.8232E-1 \003\021\0201\004\004C\005C\0200\002RD\003\025\0201\004c\005c\0200\002|0062|2|9
2020/08/20 13:00:19.930 192.168.21.139:4003 read 2
D\005
2020/08/20 13:00:21.031 192.168.21.139:4003 read 164
D\0200\002RD\003\025\0201\004d\005d\0200\002|0062|2|1|0|1206|0|0.7924E-1 \003\022\0201\004\004E\005E\0200\002RD\003\025\0201\004e\005e\0200\002|0062|2|9
2020/08/20 13:00:22.133 192.168.21.139:4003 read 117
G\0200\002RD\003\025\0201\004g\005g\0200\002|0062|2|9|1|229|400|0.3521E-2 \003-\0201\004\004H\005H\0200\002RD\003\025\0201\004h\005h\0200\002|0062|2|
2020/08/20 13:00:22.145 192.168.21.139:4003 read 5
\0201\004i\005
2020/08/20 13:00:22.196 192.168.21.139:4003 read 3
i\0200
2020/08/20 13:00:22.202 192.168.21.139:4003 read 1
\002
2020/08/20 13:00:22.247 192.168.21.139:4003 read 33
|0062|2|9|1|188|400|-0.2593E-2 \003\000
2020/08/20 13:00:22.261 192.168.21.139:4003 read 3
\0201\004
2020/08/20 13:00:22.273 192.168.21.139:4003 read 1
\004
2020/08/20 13:00:24.376 192.168.21.139:4003 read 111
P\005P\0200\002RD\003\025\0201\004p\005p\0200\002|0062|2|1|0|1264|0|0.6131E-1 \003\033\0201\004\004Q\005Q\0200\002RD\003\025\0201\004q\005q\0200\002|0062|2
2020/08/20 13:00:29.487 192.168.21.139:4003 read 70
\004R\005R\0200\002RD\003\025\0201\004r\005r\0200\002|0062|2|1|0|1262|0|0.8247E-1 \003\021\0201\004\004S\005S\0200\002RD\003\025\0201\004s\005
2020/08/20 13:00:29.525 192.168.21.139:4003 read 3
s\0200
2020/08/20 13:00:29.531 192.168.21.139:4003 read 1
\002
2020/08/20 13:00:29.574 192.168.21.139:4003 read 32
|0062|2|9|1|240|400|0.4661E-2 \003\"
2020/08/20 13:00:29.586 192.168.21.139:4003 read 3
\0201\004
2020/08/20 13:00:29.598 192.168.21.139:4003 read 1
\004
2020/08/20 13:00:29.608 192.168.21.139:4003 read 2
T\005
2020/08/20 13:00:29.703 192.168.21.139:4003 read 3
T\0200
2020/08/20 13:00:29.713 192.168.21.139:4003 read 5
\002RD\003\025
2020/08/20 13:00:29.727 192.168.21.139:4003 read 5
\0201\004t\005
2020/08/20 13:00:29.775 192.168.21.139:4003 read 3
t\0200
2020/08/20 13:00:29.781 192.168.21.139:4003 read 1
\002
2020/08/20 13:00:29.824 192.168.21.139:4003 read 32
|0062|2|10|0|1258|0|0.7930E-1 \003,
2020/08/20 13:00:29.843 192.168.21.139:4003 read 3
\0201\004
2020/08/20 13:00:29.853 192.168.21.139:4003 read 1
\004
2020/08/20 13:00:29.865 192.168.21.139:4003 read 2
U\005
2020/08/20 13:00:29.872 192.168.21.139:4003 read 3
U\0200
2020/08/20 13:00:29.889 192.168.21.139:4003 read 5
\002RD\003\025
2020/08/20 13:00:29.904 192.168.21.139:4003 read 5
\0201\004u\005
2020/08/20 13:00:29.953 192.168.21.139:4003 read 3
u\0200
2020/08/20 13:00:29.959 192.168.21.139:4003 read 1
\002
2020/08/20 13:00:30.002 192.168.21.139:4003 read 32
|0062|2|9|1|205|400|0.1771E-1 \003%
2020/08/20 13:00:30.018 192.168.21.139:4003 read 3
\0201\004
2020/08/20 13:00:30.029 192.168.21.139:4003 read 1
\004
@@@ Current time: [2020-08-20 13:00:31]
@@@ Child process is shutting down, a new one will be restarted shortly
The device talks serial but it is connected via terminal server, like this:
serial via drvAsynIPPortConfigure ("PS3", "192.168.21.139:4003",0,0,0)
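For reference, the asyn/StreamDevice wiring for such a setup typically looks like the sketch below. This is illustrative only: the record name RADZC:B3S2 and protocol readSonde appear in the gdb dump later in this thread, but the protocol file name sonde.proto is a made-up placeholder, and st.cmd/.db fragments are shown together for brevity.

```
# st.cmd fragment: terminal-server port (as quoted in this thread)
drvAsynIPPortConfigure("PS3", "192.168.21.139:4003", 0, 0, 0)

# .db fragment: an ai record with SCAN "I/O Intr", as discussed below
# (sonde.proto is a hypothetical protocol file name)
record(ai, "RADZC:B3S2") {
    field(DTYP, "stream")
    field(INP,  "@sonde.proto readSonde PS3")
    field(SCAN, "I/O Intr")
}
```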
I should perhaps mention that the actual protocol used by the device is something we are currently in the process of reverse engineering. The protocol file as listed above was the first thing we tried, because it worked (more or less) a few years ago on similar devices. I guess that by accident we happened to hit a code path that is not used under normal circumstances.
My analysis so far:
StreamCore::evalIn() calls StreamCore::readCallback() if there is (still/already) data in the input buffer at the beginning of executing the in command.
StreamCore::readCallback() calls StreamCore::evalIn() for SCAN="I/O Intr" records in case of a read timeout (unterminated = incomplete input line), in order to restart the (should-be-blocking) in command.
However, it should do so with the old input discarded (unparsedInput = false), and evalIn only calls back to readCallback if (unparsedInput). I do not understand yet how this can end up in a recursion loop...
Do you still have the core dump? Can you print *this? Can you try to switch off optimization? Is this a normal PC or some embedded system?
Please run again with var StreamDebug 1.
Do you still have the core dump? Can you print *this?
Yes and yes. I guess the one you are interested in is at the bottom:
(gdb) up
#1 0x00007f914d04dd59 in epicsMutexLock (pmutexNode=<optimized out>)
at ../../../src/libCom/osi/epicsMutex.cpp:143
143 epicsMutexOsdLock(pmutexNode->id);
(gdb) print *this
No symbol "this" in current context.
(gdb) up
#2 0x00007f914d04df0c in epicsMutex::lock (this=<optimized out>)
at ../../../src/libCom/osi/epicsMutex.cpp:244
244 epicsMutexLockStatus status = epicsMutexLock ( this->id );
(gdb) print *this
value has been optimized out
(gdb) up
#3 0x00007f914d9e86b8 in StreamCore::MutexLock::MutexLock (_stream=0x56130bec8bb0,
this=<synthetic pointer>) at ../StreamCore.h:121
121 { _stream->lockMutex(); }
(gdb) print *this
$1 = {stream = 0x56130bec8bb0}
(gdb) up
#4 StreamCore::readCallback (this=0x56130bec8bb0, status=StreamIoNoReply, input=0x0,
size=0) at ../StreamCore.cc:956
956 MutexLock lock(this);
(gdb) print *this
$2 = {<StreamProtocolParser::Client> = {<No data fields>}, <StreamBusInterface::Client> = {<No data fields>}, next = 0x56130bec9720, static first = 0x56130beaf7f0,
streamname = 0x56130bfb38f0 "RADZC:B3S2", flags = 261, protocolname = {
local = "readSonde\000\061\000\060\000\061\062\066\060\000\060", '\000' <repeats 43 times>, len = 20, cap = 64, offs = 0, buffer = 0x56130bec8be0 "readSonde"},
lockTimeout = 5000, writeTimeout = 100, replyTimeout = 1000, readTimeout = 100,
pollPeriod = 1000, maxInput = 0, inTerminatorDefined = true,
outTerminatorDefined = true, inTerminator = {
local = "\003", '\000' <repeats 62 times>, len = 1, cap = 64, offs = 0,
buffer = 0x56130bec8c78 "\003"}, outTerminator = {
local = "\003", '\000' <repeats 62 times>, len = 1, cap = 64, offs = 0,
buffer = 0x56130bec8cd8 "\003"}, separator = {local = '\000' <repeats 63 times>,
len = 0, cap = 64, offs = 0, buffer = 0x56130bec8d38 ""}, commands = {
local = "\001%*/\033|0062\033|2\033|1\033|0\033|1260\033|0\033|/%f", '\000' <repeats 30 times>, len = 33, cap = 64, offs = 0,
buffer = 0x56130bec8d98 "\001%*/\033|0062\033|2\033|1\033|0\033|1260\033|0\033|/%f"}, onInit = {local = '\000' <repeats 63 times>, len = 0, cap = 64, offs = 0,
buffer = 0x56130bec8df8 ""}, onWriteTimeout = {local = '\000' <repeats 63 times>,
len = 0, cap = 64, offs = 0, buffer = 0x56130bec8e58 ""}, onReplyTimeout = {
local = '\000' <repeats 63 times>, len = 0, cap = 64, offs = 0,
buffer = 0x56130bec8eb8 ""}, onReadTimeout = {local = '\000' <repeats 63 times>,
len = 0, cap = 64, offs = 0, buffer = 0x56130bec8f18 ""}, onMismatch = {
local = '\000' <repeats 63 times>, len = 0, cap = 64, offs = 0,
buffer = 0x56130bec8f78 ""},
commandIndex = 0x56130bec8d99 "%*/\033|0062\033|2\033|1\033|0\033|1260\033|0\033|/%f", activeCommand = 1 '\001', outputLine = {local = '\000' <repeats 63 times>, len = 0,
cap = 64, offs = 0, buffer = 0x56130bec8fe8 ""}, inputBuffer = {
local = '\000' <repeats 63 times>, len = 5, cap = 256, offs = 103,
---Type <return> to continue, or q <return> to quit---
buffer = 0x7f9130000a10 "T\020\060\002RD\003\025\020\061\004t\005t\020\060\002|0062|2|10|0|1258|0|0.7997E-1 \003!\020\061\004\004U\005U\020\060\002RD\003\025\020\061\004u\005u\020\060\002|0062|2|9|1|205|400|0.1094E-1 \003)\020\061\004\004"}, inputLine = {
local = "\025\020\061\004u\005u\020\060\002|0062|2|9|1|205|400|0.1094E-1 ", '\000' <repeats 23 times>, len = 40, cap = 64, offs = 0,
buffer = 0x56130bec90a8 "\025\020\061\004u\005u\020\060\002|0062|2|9|1|205|400|0.1094E-1 "}, consumedInput = 0, runningHandler = StreamCore::Success, fieldAddress = {
local = '\000' <repeats 63 times>, len = 0, cap = 64, offs = 0,
buffer = 0x56130bec9118 ""}, lastInputStatus = StreamIoNoReply,
unparsedInput = true}
Can you try to switch off optimization?
I think so, but it's not quite that simple. Will try.
Is this a normal PC or some embedded system?
As far as I know it's a "blade" (whatever that means exactly; I can ask if it is important), so neither a normal PC nor really something embedded. This is the same hardware that most of our servers run on. Definitely linux-x86_64.
Please run again with var StreamDebug 1.
Will do.
Thanks. I see unparsedInput = true, which caused the branch in evalIn to be taken that calls readCallback. There are indeed 40 bytes in inputBuffer. This is in frame #4. But in frame #6 we are in line 1107, just after:
unparsedInput = false;
inputBuffer.clear();
Thus nothing should be in inputBuffer and unparsedInput should be false. Except, of course, if something else has re-filled the buffer in another thread in the meantime. However, the only method doing that is readCallback() itself. And we hold the semaphore, thus no other thread can call readCallback() at the moment.
Hmm...
Frame #1269 calls StreamCore::readCallback the first time with status=StreamIoNoReply, which has been stored in this->lastInputStatus. This should end up taking the branch case StreamIoNoReply in line 985 and then, as we are in AsyncMode (this->flags=261=0x105, AsyncMode=0x0004), continue to line 991, calling evalIn() in order to restart the protocol (assuming inputBuffer is empty), and then return. Instead we somehow end up in line 1107 in frame #1268, as if we had got input. I don't understand how.
Maybe the optimizer has decided to replace lines 991/992 with the equivalent of a "goto line 1107" (which executes the same code). In that case the stack dump would be misleading and we are recursing from line 991, not 1107. Switching on var streamDebug 1 should show the difference.
I am currently having trouble reproducing the crash; working on it. When/if I can reproduce it, I will make the changes you suggested and follow up.
I have finally gotten around to reproducing the crash with the requested changes: optimizations disabled and var streamDebug 1 set. The log of the IOC output with var streamDebug is too large to attach here (35 MiB). You can download both the backtrace and the log from https://www-csr.bessy.de/control/scratch/TRACE and https://www-csr.bessy.de/control/scratch/LOG.
We have a situation where our IOC crashes inside libstream.so. There are lots of communication errors right from the start. The protocol file is:
The core dump suggests that it runs into a stack overflow due to an endless recursion between StreamCore::readCallback and StreamCore::evalIn, and so on and so on (> 1300 stack entries). The StreamDevice version is 2.8.10.