The same happens when using omrelp. The queues seem to be the root problem, as it keeps retrying properly when I remove the queues.
can you change the queue type to "linkedList" and check if the same situation appears?
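For reference, a rough sketch of what that change could look like on the forwarding action (parameter values are purely illustrative; everything else stays as in your config):
action(
type="omfwd"
protocol="tcp"
target="127.0.0.1"
port="10514"
queue.type="LinkedList" # in-memory queue instead of queue.type="Disk"
queue.size="1000" # event limit of the in-memory queue
action.resumeRetryCount="-1"
)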
It ignored the event limit. Otherwise it seems to be working.
# grep 'pszRawMsg' /queue.0000000* | wc -l
304764
# time tcpflood -T relp-plain -t 127.0.0.1 -p 601 -m 10 -M "Should not get accepted"
00000relp connect failed with return 10015
error in trying to open connection i=0
error opening connections
real 1m30.150s
user 0m0.004s
sys 0m0.007s
Edit:
7329.897089364:omfwd queue:DAwpool/w0: omfwd queue: queue.c: ConsumerDA:qqueueEnqMsg item (45) returned with error state: '-2105'
7329.897101396:omfwd queue:DAwpool/w0: omfwd queue[DA]: queue.c: doEnqSingleObject: queue FULL - waiting 2000ms to drain.
7331.897354105:omfwd queue:DAwpool/w0: omfwd queue[DA]: queue.c: doEnqSingleObject: cond timeout, dropping message!
7331.897412922:omfwd queue:DAwpool/w0: omfwd queue[DA]: queue.c: EnqueueMsg advised worker start
7331.897423834:omfwd queue:DAwpool/w0: omfwd queue: queue.c: ConsumerDA:qqueueEnqMsg item (46) returned with error state: '-2105'
7331.897436529:omfwd queue:DAwpool/w0: omfwd queue[DA]: queue.c: doEnqSingleObject: queue FULL - waiting 2000ms to drain.
7333.897633764:omfwd queue:DAwpool/w0: omfwd queue[DA]: queue.c: doEnqSingleObject: cond timeout, dropping message!
7333.897689231:omfwd queue:DAwpool/w0: omfwd queue[DA]: queue.c: EnqueueMsg advised worker start
7333.897700204:omfwd queue:DAwpool/w0: omfwd queue: queue.c: ConsumerDA:qqueueEnqMsg item (47) returned with error state: '-2105'
7333.897752648:omfwd queue:DAwpool/w0: omfwd queue[DA]: queue.c: doEnqSingleObject: queue FULL - waiting 2000ms to drain.
7335.897917947:omfwd queue:DAwpool/w0: omfwd queue[DA]: queue.c: doEnqSingleObject: cond timeout, dropping message!
7335.897968156:omfwd queue:DAwpool/w0: omfwd queue[DA]: queue.c: EnqueueMsg advised worker start
7335.897978084:omfwd queue:DAwpool/w0: omfwd queue: queue.c: ConsumerDA:qqueueEnqMsg item (48) returned with error state: '-2105'
7335.897989754:omfwd queue:DAwpool/w0: omfwd queue[DA]: queue.c: doEnqSingleObject: queue FULL - waiting 2000ms to drain.
It ignored the event limit.
That limit is only for the memory part of a DA queue. The disk part does not have any limit other than configured max disk space.
I need to check whether delay mode is properly handled by disk queues.
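For context, the two limits are separate parameters of the same queue. A rough sketch with illustrative values (a disk-assisted queue is an in-memory queue that spills to disk once queue.filename is set):
action(
type="omfwd"
target="127.0.0.1"
port="10514"
queue.type="LinkedList" # the memory part of the DA queue
queue.size="1000" # event limit, applies to the memory part only
queue.filename="queue" # enables the disk (DA) part
queue.maxdiskspace="128m" # the only cap on the disk part
)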
@StrongestNumber9 I think I know what happens. To move forward quickly, could you apply a minimal code change? In this line https://github.com/rsyslog/rsyslog/blob/master/plugins/imrelp/imrelp.c#L242 change LIGHT_DELAY to FULL_DELAY. I think that solves the issue you see.
We are very conservative with using FULL_DELAY mode, as it blocks the pipeline and that is usually not the expected result. If FULL_DELAY works for you, I'll add a configuration option for it.
If you cannot patch, let me know. I will then need to craft a testbench test first, which takes some extra time.
On Thu, 31 Oct 2019, Strongest Number 9 wrote:
Expected behavior
Imrelp gets event, runs it through rulesets and acks NOT OK if queues are full, informing the client that your data couldn't be dealt with.
I believe that your expectations are incorrect.
imrelp will ack a message if it is able to add it to the main queue. It doesn't matter if an output is failing and cannot deliver the message, or if an action queue is full. It doesn't run the message through the rulesets at all before doing the ack; it just checks whether it can add it to the main queue (or the ruleset queue, if the input is bound to a ruleset).
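To make that concrete, a rough sketch (names and sizes are illustrative): with a setup like the one below, the RELP ack only means "the message was accepted into the 'incoming' ruleset queue"; it says nothing about whether omfwd later manages to deliver it.
module(load="imrelp")
ruleset(name="incoming" queue.type="LinkedList" queue.size="10000") {
action(type="omfwd" protocol="tcp" target="127.0.0.1" port="10514")
}
input(type="imrelp" port="601" ruleset="incoming")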
David Lang
I used this patch and it doesn't seem to work
diff --git a/plugins/imrelp/imrelp.c b/plugins/imrelp/imrelp.c
index 29f9d7dfc..8f95064a2 100644
--- a/plugins/imrelp/imrelp.c
+++ b/plugins/imrelp/imrelp.c
@@ -241,7 +241,7 @@ onSyslogRcv(void *pUsr, uchar *pHostname, uchar *pIP, uchar *msg, size_t lenMsg)
CHKiRet(msgConstruct(&pMsg));
MsgSetInputName(pMsg, inst->pInputName);
MsgSetRawMsg(pMsg, (char*)msg, lenMsg);
- MsgSetFlowControlType(pMsg, eFLOWCTL_LIGHT_DELAY);
+ MsgSetFlowControlType(pMsg, eFLOWCTL_FULL_DELAY);
MsgSetRuleset(pMsg, inst->pBindRuleset);
pMsg->msgFlags = PARSE_HOSTNAME | NEEDS_PARSING;
# time tcpflood -T relp-plain -t 127.0.0.1 -p 601 -m 10 -M "I should not be accepted"
00001 open connections
starting run 1
Sending 10 messages.
00000010 messages sent
runtime: 0.000
00001 close connections
End of tcpflood Run
real 0m20.016s
user 0m0.004s
sys 0m0.006s
09:18:39.670350 IP (tos 0x0, ttl 64, id 57619, offset 0, flags [DF], proto TCP (6), length 67)
localhost.syslog-conn > localhost.42430: Flags [P.], cksum 0xfe37 (incorrect -> 0x9b0b), seq 164:179, ack 491, win 350, options [nop,nop,TS val 2925994888 ecr 2925992888], length 15
0x0000: 4500 0043 e113 4000 4006 5b9f 7f00 0001 E..C..@.@.[.....
0x0010: 7f00 0001 0259 a5be fe5c 85da a20c 675f .....Y...\....g_
0x0020: 8018 015e fe37 0000 0101 080a ae67 2388 ...^.7.......g#.
0x0030: ae67 1bb8 3620 7273 7020 3620 3230 3020 .g..6.rsp.6.200.
0x0040: 4f4b 0a OK.
09:18:39.670372 IP (tos 0x0, ttl 64, id 338, offset 0, flags [DF], proto TCP (6), length 52)
localhost.42430 > localhost.syslog-conn: Flags [.], cksum 0xfe28 (incorrect -> 0x9dbb), ack 179, win 342, options [nop,nop,TS val 2925994888 ecr 2925994888], length 0
0x0000: 4500 0034 0152 4000 4006 3b70 7f00 0001 E..4.R@.@.;p....
0x0010: 7f00 0001 a5be 0259 a20c 675f fe5c 85e9 .......Y..g_.\..
0x0020: 8010 0156 fe28 0000 0101 080a ae67 2388 ...V.(.......g#.
0x0030: ae67 2388 .g#.
OK, looks like I need to craft a testbench test.
Please note, however, that (as David said https://github.com/rsyslog/rsyslog/issues/3941#issuecomment-548540932) RELP guards the transport. It makes no guarantees beyond transport. Most importantly, there is no relationship between
imrelp -> queue -> omrelp
If you need a system with an at-least-once delivery guarantee across the relay chain, you need to go for a message queueing system that is designed for this purpose. rsyslog does not provide these guarantees but rather targets sufficient (but not perfect) reliability for the majority of security logging and network management tasks.
That said, I still think that when the message is flagged as FULL_DELAY, it should push back to the sender when the queue goes above queue.fullDelayMark (but that's it, nothing more).
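For reference, fullDelayMark is an ordinary queue parameter. A sketch with illustrative values (whether the disk part of a queue honors it is exactly the open point here):
action(
type="omfwd"
target="127.0.0.1"
port="10514"
queue.type="LinkedList"
queue.size="1000"
queue.fullDelayMark="900" # above this fill level, messages flagged FULL_DELAY are held back
)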
I now see the issue. While the main queue is direct, we still have two queues involved - main and action. When the main queue enqueues a message into the action queue, it does not apply the delay mode setting. This is to guard against the whole system becoming stalled.
I will now add two more config settings:
When both are enabled, the input can even be blocked by "later" actions. This is pretty dangerous in general, thus it will not be the default. It will block the whole system, which is exactly what you want here. These settings will also delay rsyslog shutdown when it is in a blocking state.
@StrongestNumber9 I have crafted PR #3943 to permit configuring the blocking behavior. The commits include documentation for the new options. Basically, you can now do:
$LocalHostName test
$AbortOnUncleanConfig on
$PreserveFQDN on
module(load="imrelp")
global(
workDirectory="/"
maxMessageSize="256k"
)
main_queue(queue.type="Direct")
input(type="imrelp" port="601" ruleset="incoming" MaxDataSize="256k" flowcontrol="full")
ruleset(name="incoming") {
action
(
type="omfwd"
name="omfwd"
RebindInterval="10000"
protocol="tcp"
target="127.0.0.1"
port="10514"
queue.size="1000"
queue.type="Disk"
queue.takeFlowCtlFromMsg="on"
queue.checkpointinterval="65536"
queue.filename="queue"
queue.maxfilesize="32m"
queue.saveonshutdown="on"
queue.syncqueuefiles="off"
queue.maxdiskspace="128m"
ResendLastMSGOnReconnect="on"
action.ResumeInterval="60"
action.resumeRetryCount="-1"
)
}
Note the settings flowcontrol="full" and queue.takeFlowCtlFromMsg="on".
This is as close as you can get without setting all queues to direct mode.
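For comparison, a rough sketch of the fully direct variant (illustrative, and not what I recommend): it pushes back all the way to the sender, but you lose the disk buffer entirely, so any omfwd outage immediately stalls the input.
main_queue(queue.type="Direct")
ruleset(name="incoming") {
action(
type="omfwd"
protocol="tcp"
target="127.0.0.1"
port="10514"
queue.type="Direct" # no buffering: the submitting thread runs the action itself
action.resumeRetryCount="-1" # retry forever, blocking the caller while the target is down
)
}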
@StrongestNumber9 can you apply the patches to your version and also give it a try?
I am testing the patch and have run into some problems. I'll let you know once I have verified that they are not caused by our environment or patches.
Hi, I managed to fix the problems in our environment. Now it randomly hangs on shutdown when I use your configuration.
3597.852974579:omfwd queue:Reg/w0: omfwd.c: TCPSendInit FAILED with -2027.
3597.852988152:omfwd queue:Reg/w0: modules.c: file netstrms.c released module 'lmnsd_ptcp', reference count now 0
3597.852995269:omfwd queue:Reg/w0: modules.c: module 'lmnsd_ptcp' has zero reference count, unloading...
3597.853002043:omfwd queue:Reg/w0: modules.c: Unloading module lmnsd_ptcp
3597.853013793:omfwd queue:Reg/w0: modules.c: file nsd_ptcp.c released module 'lmnetstrms', reference count now 2
3597.853020137:omfwd queue:Reg/w0: modules.c: module lmnetstrms is currently in use by file omfwd.c
3597.853026534:omfwd queue:Reg/w0: modules.c: module lmnetstrms is currently in use by file omfwd.c
3597.856986661:omfwd queue:Reg/w0: omfwd.c: omfwd: doTryResume 127.0.0.1 iRet -2027
3597.857020089:omfwd queue:Reg/w0: ../action.c: actionCallCommitTransaction[omfwd] state: rtry mod commitTransaction returned -2007
3597.857028847:omfwd queue:Reg/w0: ../action.c: action[omfwd] transitioned to state: rtry
3597.857038789:omfwd queue:Reg/w0: errmsg.c: Called LogMsg, msg: action 'omfwd' suspended (module 'builtin:omfwd'), retry 0. There should be messages before this one giving the reason for suspension.
rsyslog internal message (4,-2007): action 'omfwd' suspended (module 'builtin:omfwd'), retry 0. There should be messages before this one giving the reason for suspension. [v8.38.0 try http://www.rsyslog.com/e/2007 ]
3597.857060797:omfwd queue:Reg/w0: ../action.c: actionCommit[omfwd]: in retry loop, iRet -2007
3597.857068047:omfwd queue:Reg/w0: ../action.c: actionCommit[omfwd]: actionDoRetry returned 0
3597.857074011:omfwd queue:Reg/w0: ../action.c: actionTryCommit[omfwd] enter
3597.857082049:omfwd queue:Reg/w0: ../action.c: actionTryResume: action[omfwd] state: rtry, next retry (if applicable): 0 [now 1572943597]
3597.857088702:omfwd queue:Reg/w0: ../action.c: doTransaction: have commitTransaction IF, using that, pWrkrInfo 0x555555864710
3597.857106669:omfwd queue:Reg/w0: ../action.c: entering actionCallCommitTransaction[omfwd], state: rtry, nMsgs 1
3597.857114767:omfwd queue:Reg/w0: omfwd.c: TCPSendInit CREATE
3597.857132422:omfwd queue:Reg/w0: obj.c: caller requested object 'nsd_ptcp', not found (iRet -3003)
3597.857149518:omfwd queue:Reg/w0: modules.c: Requested to load module 'lmnsd_ptcp'
3597.857162873:omfwd queue:Reg/w0: modules.c: loading module 'lmnsd_ptcp.so'
3597.863485450:omfwd queue:Reg/w0: modules.c: source file nsd_ptcp.c requested reference for module 'lmnetstrms', reference count now 3
3597.863552856:main thread : errmsg.c: Called LogMsg, msg: omfwd queue: need to do hard cancellation
rsyslog internal message (4,-3000): omfwd queue: need to do hard cancellation [v8.38.0]
3597.863667782:omfwd queue:Reg/w0: wti.c: omfwd queue:Reg/w0: cancelation cleanup handler called.
3597.863699516:omfwd queue:Reg/w0: queue.c: DeleteProcessedBatch: etry 0 state 0
3597.863729375:omfwd queue:Reg/w0: omfwd queue: queue.c: queue nearly full (1000 entries), but could not drop msg (iRet: 0, severity 7)
3597.863761230:omfwd queue:Reg/w0: omfwd queue: queue.c: doEnqSingleObject: queue FULL - configured for immediate discarding QueueSize=1000 MaxQueueSize=1000 sizeOnDisk=469000 sizeOnDiskMax=134217728
3597.863769568:omfwd queue:Reg/w0: queue.c: DeleteProcessedBatch: error -2105 re-enqueuing unprocessed data element - discarded
3597.863796809:omfwd queue:Reg/w0: queue.c: DeleteProcessedBatch: we deleted 0 objects and enqueued 1 objects
3597.863810570:omfwd queue:Reg/w0: queue.c: doDeleteBatch: delete batch from store, new sizes: log 999, phys 999
3597.863818889:omfwd queue:Reg/w0: wti.c: omfwd queue:Reg/w0: done cancelation cleanup handler.
3597.863836520:omfwd queue:Reg/w0: wtp.c: omfwd queue:Reg: Worker thread 555555864b60 requested to be cancelled.
3597.863851045:omfwd queue:Reg/w0: wtp.c: omfwd queue:Reg: Worker thread 555555864b60, terminated, num workers now 0
3597.863871581:main thread : wti.c: cooperative worker termination failed, using cancellation...
3597.863904562:main thread : wti 0x555555864b60: wti.c: canceling worker thread
3597.863920589:omfwd queue:Reg/w0: debug.c: destructor for debug call stack 0x7fffe80008c0 called
3597.863988355:main thread : omfwd queue: queue.c: worker threads terminated, remaining queue size log 999, phys 999.
3597.864010367:main thread : omfwd queue: queue.c: persisting queue to disk, 999 entries...
3597.864026950:main thread : stream.c: file stream /out/queue.qi.tmp params: flush interval 0, async write 0
3597.865193457:main thread : strm 0x555555862690: stream.c: strmFlushinternal: file 14(/out/queue.00000001) flush, buflen 0 (no need to flush)
3597.865225594:main thread : stream.c: strmSerialize: pThis->prevLineSegment (nil)
3597.865274337:main thread : strm 0x555555862bd0: stream.c: strmFlushinternal: file -1(queue) flush, buflen 0 (no need to flush)
3597.865293099:main thread : stream.c: strmSerialize: pThis->prevLineSegment (nil)
3597.865306566:main thread : strm 0x55555586a090: stream.c: file -1(/out/queue.qi.tmp) closing, bDeleteOnClose 0
3597.865314439:main thread : strm 0x55555586a090: stream.c: strmFlushinternal: file -1(/out/queue.qi.tmp) flush, buflen 559
3597.865322678:main thread : strm 0x55555586a090: stream.c: file -1(/out/queue.qi.tmp) doWriteInternal: bFlush 0
3597.865329779:main thread : stream.c: strmPhysWrite, stream 0x55555586a090, len 559
[Thread 0x7fffeffff700 (LWP 149) exited]
3597.865436170:main thread : stream.c: file '/out/queue.qi.tmp' opened as #12 with mode 384
3597.865456275:main thread : strm 0x55555586a090: stream.c: opened file '/out/queue.qi.tmp' for WRITE as 12
3597.865487387:main thread : strm 0x55555586a090: stream.c: file 12 write wrote 559 bytes
3597.865533651:main thread : strm 0x555555862690: stream.c: file 14(/out/queue.00000001) closing, bDeleteOnClose 0
3597.865560355:main thread : strm 0x555555862690: stream.c: strmFlushinternal: file 14(/out/queue.00000001) flush, buflen 0 (no need to flush)
3597.865575084:main thread : strm 0x555555862930: stream.c: file 15(/out/queue.00000001) closing, bDeleteOnClose 0
3597.865592250:main thread : strm 0x555555862bd0: stream.c: file -1(queue) closing, bDeleteOnClose 0
3597.865620212:main thread : ruleset.c: destructing ruleset 0x5555558580e0, name 0x555555842bf0
3597.865636107:main thread : ruleset.c: destructing ruleset 0x555555864d50, name 0x5555558643a0
3597.865644395:main thread : rsyslogd.c: all primary multi-thread sources have been terminated - now doing aux cleanup...
3597.865659320:main thread : rsyslogd.c: destructing current config...
3597.865667871:main thread : rsconf.c: calling freeCnf(0x5555558567d0) for module 'builtin:omfile'
3597.865683218:main thread : rsconf.c: calling freeCnf(0x55555583ffd0) for module 'builtin:ompipe'
3597.865692144:main thread : rsconf.c: calling freeCnf(0x555555841a10) for module 'builtin:omfwd'
3597.865700874:main thread : rsconf.c: calling freeCnf(0x555555863df0) for module 'imrelp'
3597.865817198:main thread : parser.c: destructing parser 'rsyslog.rfc5424'
3597.865833410:main thread : parser.c: destructing parser 'rsyslog.rfc3164'
3597.865840794:main thread : pmrfc3164.c: pmrfc3164: free parser instance 0x555555841f90
Hi, I managed to fix the problems in our environment. Now it randomly hangs on shutdown when I use your configuration.
yeah, that's why I caution against using this option. There was a reason that we do not use delay mode when going to the DA queue. I am pretty sure the problem now is that you run into a deadlock sooner or later when the queue is still busy. I can try to find a tie-break in this case and then force-cancel the thread, but that ultimately also leads to message loss. I would suggest I take another couple of hours to look at this and report back on whether it turns out to become another large refactoring. What do you think?
One idea: I could turn off handling of light delay mode during shutdown. But then again you lose the message currently being processed, just as in the beginning.
Is it possible to make shutdown stop the inputs, wait for the action to finish retrying, and then close the queue once the retried message has been re-inserted?
Well, the core problem is that the queue is full. So the enqueue waits until there is room. The action is unable to process any further messages, so the queue will never become less full. As such it needs to wait, eternally per the config settings, until there is room.
One thing I could do is ignore the queue size limit during shutdown. But then again you run into issues if, e.g., you run out of disk space. Nevertheless, this could work for more cases.
The bottom line is that at some points in processing you actually need to discard some messages if circumstances demand it. With a zero-loss system (which rsyslog is not) you can go a long way, but even with such a transactional message queueing system the log generator must at some point decide whether to cease processing or lose some logs. Otherwise it's simply impossible.
Would it be possible to set the queue limit to something like "queue.size - action.batchsize" for inputs so there will always be space left for the action to re-insert on-the-fly content? We are aware that there are circumstances where data may eventually get lost but avoiding it when in normal operation mode is desirable.
avoiding it when in normal operation mode is desirable.
This is something rsyslog tries hard to do. But a shutdown during stalled connection to a remote site with a full queue IMHO is out of normal operations mode.
Would it be possible to set the queue limit to something like "queue.size - action.batchsize" for inputs so there will always be space left for the action to re-insert on-the-fly content?
Doesn't work, because multiple inputs can submit data to the same queue. What we can do is ignore the queue size limitation (at least I think so). Maybe I can give that a try; it's hopefully not too much work for an experiment.
I think it may be a good idea to "let ripen" potential solutions for a day or two. I would still say increasing the queue size on shutdown might be the best option. At that point, inputs are already terminated, so no new messages would come in.
I agree with letting it ripen for a few days. Increasing the queue size on shutdown also works for us.
@StrongestNumber9 can you reliably reproduce the "bad timing"? I have not managed to reproduce it. In any case, can you mail me a full debug log (actually, the part from the point where the imrelp input is delayed for the first time would be sufficient)?
I'd like to see in more depth how the events are timed.
I managed to reproduce it fairly reliably by sending one message to the queues, then shutting down rsyslog and restarting it using rsyslogd -f rsyslog.conf -dn 2>&1 | tee output.log. I sent the full log to the email address on your profile.
Was the receiver active during your procedure? I guess "no", but want to confirm.
Not in the log I sent you, but the hang happened in both cases as long as there were events in the queues. Shutdown was clean if the queues were empty and no events had been received.
I made an experimental commit to my PR, switching the queue default. Maybe some existing tests catch it as well (would be best).
I reviewed the debug log you sent. It's interesting: actually, no RELP message was handled in any way in that case.
Would it be possible for you to re-try with current master branch plus the new patches and see if the problems during shutdown persist? I just wanted to rule out that it has something to do with your old version plus custom patches. Once I know it also happens to the "regular" code base, it's possibly easier to check what is going on.
My experiment with the changed default did not bring up any issues. It's not yet 100% finished, but sufficient environments, including CentOS, did pass. I'll still monitor it.
I tested the v8.1910.0 version with only #3943 patched in and I still get the hang, but it does not happen consistently. I sent you more detailed information and logs from our environment.
thx again
Just as a note: I can't reproduce it using just -n. I am not sure if -d enables some race condition or if -d is the culprit itself.
We suspect that thread "omfwd queue:Reg/w0" dies within ReleaseObj and therefore the lock is never released, causing a deadlock.
0332.820284601:main thread : operatingstate.c: osf: MSG omfwd queue: need to do hard cancellation: 0332.820287316:omfwd queue:Reg/w0: obj.c: pIf->ifIsLoaded is set to 1rsyslog internal message (4,-3000): omfwd queue: need to do hard cancellation [v8.1910.0]
Debug prints added:
diff --git a/runtime/obj.c b/runtime/obj.c
index ad76ba9ee..3d3fc05e9 100644
--- a/runtime/obj.c
+++ b/runtime/obj.c
@@ -1261,19 +1261,22 @@ ReleaseObj(const char *srcFile, uchar *pObjName, uchar *pObjFile, interface_t *p
DEFiRet;
objInfo_t *pObjInfo;
- /* dev debug only dbgprintf("source file %s releasing object '%s',
- ifIsLoaded %d\n", srcFile, pObjName, pIf->ifIsLoaded); */
+ dbgprintf("source file %s releasing object '%s', ifIsLoaded %d\n", srcFile, pObjName, pIf->ifIsLoaded);
pthread_mutex_lock(&mutObjGlobalOp);
- if(pObjFile == NULL)
+ if(pObjFile == NULL) {
+ dbgprintf("pObjFile == NULL");
FINALIZE; /* if it is not a lodable module, we do not need to do anything... */
-
+ }
if(pIf->ifIsLoaded == 0) {
+ dbgprintf("pIf->ifIsLoaded == 0");
FINALIZE; /* we are not loaded - this is perfectly OK... */
} else if(pIf->ifIsLoaded == 2) {
+ dbgprintf("pIf->ifIsLoaded == 2");
pIf->ifIsLoaded = 0; /* clean up */
FINALIZE; /* we had a load error and can not/must not continue */
}
+ dbgprintf("pIf->ifIsLoaded is set to %d", pIf->ifIsLoaded);
CHKiRet(FindObjInfo((const char*)pObjName, &pObjInfo));
@StrongestNumber9 thx - that's a good catch. I'll have a look hopefully after noon.
FYI: the debug log you sent also has some interesting insight ... too early to comment in depth, but I wanted to let you know.
@StrongestNumber9 which version of librelp do you use? Is it plain or does it have custom patches? -- Just trying to set the baseline.
All used libraries are vanilla without patches.
liblogging v1.0.6, libestr v0.1.10, libfastjson v0.99.8, liblognorm v2.0.4, librelp v1.2.17
@StrongestNumber9 I think there is a problem in imrelp that at least contributes to the issue here. See https://github.com/rsyslog/librelp/issues/175. While I think it is not the sole cause, it would be great if you could apply a patch to librelp once I have it ready. It should go on top of 1.4.0, but would most probably also work with older source versions.
@StrongestNumber9 librelp PR looks good - would be great if you could apply https://github.com/rsyslog/librelp/pull/176/files
@StrongestNumber9 Good news! I just accidentally reproduced the hang in my environment. Once I had it, I crafted a script that reproduces it by running enough iterations until it fails. So it looks like I can work on the "real thing" now.
Side-note: I have the librelp patch active, so it does not solve that case. I would still recommend applying it (and giving it a try), as without the fix the likelihood of problems is greater.
We suspect that thread "omfwd queue:Reg/w0" dies within ReleaseObj and therefore lock is never released causing a deadlock.
I can rule this out now: the problem happens even when the mutex is not locked and unlocked at all. While it was still present, I also saw a hang at a different location.
I now think it is more likely due to the debugging instrumentation. I emit debug messages at places and at times where it is not totally safe to do so. The alternative is to not emit them, so I decided to do it even though it is not 100% correct. After all, debug mode is for developers who chase bugs. I will try to verify this in more depth. I have not seen any real complaint from valgrind/helgrind, but I will probably do another session with TSAN tomorrow.
As far as I can see, the commits I made influence timing and thus the problem occurs only with them, but it looks like they are correct. Still to be confirmed, of course.
@StrongestNumber9 It looks like I found the culprit. It's actually debug code, like I suspected. I use a write, which is not signal-safe, during a signal handler. This is useful for debugging, but obviously causes occasional problems. I will now remove these calls for the time being, but I cannot rule out that other similar issues exist. If I remove all of them completely, debugging becomes much harder. So the rest will stay.
@StrongestNumber9 pls apply https://github.com/rgerhards/rsyslog/commit/6518a92e387cb68a048aa0e6e4c844e8baf97591 to your version. For me this made the problem reliably go away (at least for several hundred test runs).
TSAN also complained before the fix but not after.
Expected behavior
Imrelp gets event, runs it through rulesets and acks NOT OK if queues are full, informing the client that your data couldn't be dealt with.
Actual behavior
Imrelp gets event, runs it through rulesets and acks OK even if omfwd queue is full and target is not answering. This causes silent loss of events at a rate of roughly one event every 2 seconds. In this case omfwd doesn't respect the action.resumeInterval and action.resumeRetryCount settings.
Steps to reproduce the behavior
rsyslog.conf
Commands executed
tcpdump confirming accepted and ack'd events
Events seem to be limited by the drain interval. It also doesn't realize it is already full at that point.
Environment