rsyslog / rsyslog


Imrelp discards events and acks them OK even if omfwd can't forward them and queue is full #3941

Closed StrongestNumber9 closed 5 years ago

StrongestNumber9 commented 5 years ago

Expected behavior

Imrelp gets an event, runs it through the rulesets, and acks NOT OK if the queues are full, informing the client that its data couldn't be dealt with.

Actual behavior

Imrelp gets an event, runs it through the rulesets, and acks OK even if the omfwd queue is full and the target is not answering. This causes silent loss of events at a rate of roughly one event every 2 seconds. In this case omfwd doesn't respect the action.resumeInterval and action.resumeRetryCount settings.

Steps to reproduce the behavior

rsyslog.conf

$LocalHostName test
$AbortOnUncleanConfig on
$PreserveFQDN on

module(load="imrelp")

global(
    workDirectory="/"
    maxMessageSize="256k"
)
main_queue(queue.type="Direct")
input(type="imrelp" port="601" ruleset="incoming" MaxDataSize="256k")

ruleset(name="incoming") {
    action
    (
        type="omfwd"
        name="omfwd"
        RebindInterval="10000"
        protocol="tcp"
        target="127.0.0.1"
        port="10514"
        queue.size="1000"
        queue.type="Disk"
        queue.checkpointinterval="65536"
        queue.filename="queue"
        queue.maxfilesize="32m"
        queue.saveonshutdown="on"
        queue.syncqueuefiles="off"
        queue.maxdiskspace="128m"
        ResendLastMSGOnReconnect="on"
        action.ResumeInterval="60"
        action.resumeRetryCount="-1"
    )
}

Commands executed

[root@example /]# rsyslogd -f /rsyslog.conf -N9
rsyslogd: version 8.38.0, config validation run (level 9), master config /rsyslog.conf
rsyslogd: End of config validation run. Bye.

[root@example /]# tcpflood -T relp-plain -t 127.0.0.1 -p 601 -m 1000 -M "Filling the queue"
00001 open connections
starting run 1
Sending 1000 messages.
00001000 messages sent
runtime: 0.416
00001 close connections
End of tcpflood Run

[root@example /]# grep 'pszRawMsg' /queue.00000001 | wc -l
1000

[root@example /]# tcpflood -T relp-plain -t 127.0.0.1 -p 601 -m 10 -M "Should not get accepted"
00001 open connections
starting run 1
Sending 10 messages.
00000010 messages sent
runtime: 0.001
00001 close connections
End of tcpflood Run

[root@example /]# time tcpflood -T relp-plain -t 127.0.0.1 -p 601 -m 10 -M "Should not get accepted"
00001 open connections
starting run 1
Sending 10 messages.
00000010 messages sent
runtime: 0.000
00001 close connections
End of tcpflood Run

real    0m20.018s
user    0m0.005s
sys     0m0.005s

[root@example /]# grep 'pszRawMsg' /queue.00000001 | wc -l
1000

tcpdump confirming accepted and ack'd events

12:15:20.530852 IP (tos 0x0, ttl 64, id 62265, offset 0, flags [DF], proto TCP (6), length 398)
    localhost.59550 > localhost.syslog-conn: Flags [P.], cksum 0xff82 (incorrect -> 0xc2e8), seq 135:481, ack 104, win 342, options [nop,nop,TS val 2850199909 ecr 2850199909], length 346
        0x0000:  4500 018e f339 4000 4006 482e 7f00 0001  E....9@.@.H.....
        0x0010:  7f00 0001 e89e 0259 ef70 047d 1a87 cc2d  .......Y.p.}...-
        0x0020:  8018 0156 ff82 0000 0101 080a a9e2 9965  ...V...........e
        0x0030:  a9e2 9965 3320 7379 736c 6f67 2032 3420  ...e3.syslog.24.
        0x0040:  5368 6f75 6c64 206e 6f74 2067 6574 2061  Should.not.get.a
        0x0050:  6363 6570 7465 640a 0a34 2073 7973 6c6f  ccepted..4.syslo
        0x0060:  6720 3234 2053 686f 756c 6420 6e6f 7420  g.24.Should.not.
        0x0070:  6765 7420 6163 6365 7074 6564 0a0a 3520  get.accepted..5.
        0x0080:  7379 736c 6f67 2032 3420 5368 6f75 6c64  syslog.24.Should
        0x0090:  206e 6f74 2067 6574 2061 6363 6570 7465  .not.get.accepte
        0x00a0:  640a 0a36 2073 7973 6c6f 6720 3234 2053  d..6.syslog.24.S
        0x00b0:  686f 756c 6420 6e6f 7420 6765 7420 6163  hould.not.get.ac
        0x00c0:  6365 7074 6564 0a0a 3720 7379 736c 6f67  cepted..7.syslog
        0x00d0:  2032 3420 5368 6f75 6c64 206e 6f74 2067  .24.Should.not.g
        0x00e0:  6574 2061 6363 6570 7465 640a 0a38 2073  et.accepted..8.s
        0x00f0:  7973 6c6f 6720 3234 2053 686f 756c 6420  yslog.24.Should.
        0x0100:  6e6f 7420 6765 7420 6163 6365 7074 6564  not.get.accepted
        0x0110:  0a0a 3920 7379 736c 6f67 2032 3420 5368  ..9.syslog.24.Sh
        0x0120:  6f75 6c64 206e 6f74 2067 6574 2061 6363  ould.not.get.acc
        0x0130:  6570 7465 640a 0a31 3020 7379 736c 6f67  epted..10.syslog
        0x0140:  2032 3420 5368 6f75 6c64 206e 6f74 2067  .24.Should.not.g
        0x0150:  6574 2061 6363 6570 7465 640a 0a31 3120  et.accepted..11.
        0x0160:  7379 736c 6f67 2032 3420 5368 6f75 6c64  syslog.24.Should
        0x0170:  206e 6f74 2067 6574 2061 6363 6570 7465  .not.get.accepte
        0x0180:  640a 0a31 3220 636c 6f73 6520 300a       d..12.close.0.
12:15:20.530861 IP (tos 0x0, ttl 64, id 15111, offset 0, flags [DF], proto TCP (6), length 52)
    localhost.syslog-conn > localhost.59550: Flags [.], cksum 0xfe28 (incorrect -> 0x29d8), ack 481, win 350, options [nop,nop,TS val 2850199909 ecr 2850199909], length 0
        0x0000:  4500 0034 3b07 4000 4006 01bb 7f00 0001  E..4;.@.@.......
        0x0010:  7f00 0001 0259 e89e 1a87 cc2d ef70 05d7  .....Y.....-.p..
        0x0020:  8010 015e fe28 0000 0101 080a a9e2 9965  ...^.(.........e
        0x0030:  a9e2 9965                                ...e
12:15:22.490630 IP (tos 0x0, ttl 64, id 15112, offset 0, flags [DF], proto TCP (6), length 67)
    localhost.syslog-conn > localhost.59550: Flags [P.], cksum 0xfe37 (incorrect -> 0x1baa), seq 104:119, ack 481, win 350, options [nop,nop,TS val 2850201868 ecr 2850199909], length 15
        0x0000:  4500 0043 3b08 4000 4006 01ab 7f00 0001  E..C;.@.@.......
        0x0010:  7f00 0001 0259 e89e 1a87 cc2d ef70 05d7  .....Y.....-.p..
        0x0020:  8018 015e fe37 0000 0101 080a a9e2 a10c  ...^.7..........
        0x0030:  a9e2 9965 3220 7273 7020 3620 3230 3020  ...e2.rsp.6.200.
        0x0040:  4f4b 0a                                  OK.
12:15:22.531118 IP (tos 0x0, ttl 64, id 62266, offset 0, flags [DF], proto TCP (6), length 52)
    localhost.59550 > localhost.syslog-conn: Flags [.], cksum 0xfe28 (incorrect -> 0x1a5a), ack 119, win 342, options [nop,nop,TS val 2850201909 ecr 2850201868], length 0
        0x0000:  4500 0034 f33a 4000 4006 4987 7f00 0001  E..4.:@.@.I.....
        0x0010:  7f00 0001 e89e 0259 ef70 05d7 1a87 cc3c  .......Y.p.....<
        0x0020:  8010 0156 fe28 0000 0101 080a a9e2 a135  ...V.(.........5
        0x0030:  a9e2 a10c                                ....
12:15:24.491314 IP (tos 0x0, ttl 64, id 15113, offset 0, flags [DF], proto TCP (6), length 67)
    localhost.syslog-conn > localhost.59550: Flags [P.], cksum 0xfe37 (incorrect -> 0x0afa), seq 119:134, ack 481, win 350, options [nop,nop,TS val 2850203869 ecr 2850201909], length 15
        0x0000:  4500 0043 3b09 4000 4006 01aa 7f00 0001  E..C;.@.@.......
        0x0010:  7f00 0001 0259 e89e 1a87 cc3c ef70 05d7  .....Y.....<.p..
        0x0020:  8018 015e fe37 0000 0101 080a a9e2 a8dd  ...^.7..........
        0x0030:  a9e2 a135 3320 7273 7020 3620 3230 3020  ...53.rsp.6.200.
        0x0040:  4f4b 0a                                  OK.

Events seem to be limited by the drain interval. It also doesn't realize it is already full at that point.

4375.494549994:imrelp.c       : pmrfc3164.c: Message will now be parsed by the legacy syslog parser (offAfterPRI=0, lenMsg=23.
4375.494559145:imrelp.c       : datetime.c: ParseTIMESTAMP3339: invalid year: 0, pszTS: 'h'
4375.494570733:imrelp.c       : parser.c: Parser 'rsyslog.rfc3164' returned 0
4375.494578474:imrelp.c       : ruleset.c: processBATCH: batch of 1 elements must be processed
4375.494596386:imrelp.c       : ruleset.c: processBATCH: next msg 0: Should not get accepted
4375.494669481:imrelp.c       : rainerscript.c:     ACTION 0 [builtin:omfwd:action(type="builtin:omfwd" ...)]
4375.494736898:imrelp.c       : ruleset.c: executing action 0
4375.494744543:imrelp.c       : ../action.c: action 'omfwd': called, logging to builtin:omfwd (susp 0/0, direct q 0)
4375.494752853:imrelp.c       : omfwd queue: queue.c: queue nearly full (1000 entries), but could not drop msg (iRet: 0, severity 5)
4375.494760040:imrelp.c       : omfwd queue: queue.c: doEnqSingleObject: queue FULL - waiting 2000ms to drain.

Environment

rsyslogd 8.38.0, compiled with:
        PLATFORM:                               x86_64-unknown-linux-gnu
        PLATFORM (lsb_release -d):
        FEATURE_REGEXP:                         Yes
        GSSAPI Kerberos 5 support:              No
        FEATURE_DEBUG (debug build, slow code): Yes
        32bit Atomic operations supported:      Yes
        64bit Atomic operations supported:      Yes
        memory allocator:                       system default
        Runtime Instrumentation (slow code):    No
        uuid support:                           Yes
        systemd support:                        No
        Number of Bits in RainerScript integers: 64

See http://www.rsyslog.com for more information.
CentOS Linux release 7.7.1908 (Core)
Linux example 4.18.0-80.7.1.el8_0.x86_64 #1 SMP Sat Aug 3 15:14:00 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
StrongestNumber9 commented 5 years ago

The same happens when using omrelp. The queues seem to be the root problem, as it keeps retrying properly when I remove the queues.

rgerhards commented 5 years ago

Can you change the queue type to "LinkedList" and check if the same situation appears?

StrongestNumber9 commented 5 years ago

It ignored the event limit. Otherwise it seems to be working.

# grep 'pszRawMsg' /queue.0000000* | wc -l
304764

# time tcpflood -T relp-plain -t 127.0.0.1 -p 601 -m 10 -M "Should not get accepted"
00000relp connect failed with return 10015
error in trying to open connection i=0
error opening connections

real    1m30.150s
user    0m0.004s
sys     0m0.007s

Edit:

7329.897089364:omfwd queue:DAwpool/w0: omfwd queue: queue.c: ConsumerDA:qqueueEnqMsg item (45) returned with error state: '-2105'
7329.897101396:omfwd queue:DAwpool/w0: omfwd queue[DA]: queue.c: doEnqSingleObject: queue FULL - waiting 2000ms to drain.
7331.897354105:omfwd queue:DAwpool/w0: omfwd queue[DA]: queue.c: doEnqSingleObject: cond timeout, dropping message!
7331.897412922:omfwd queue:DAwpool/w0: omfwd queue[DA]: queue.c: EnqueueMsg advised worker start
7331.897423834:omfwd queue:DAwpool/w0: omfwd queue: queue.c: ConsumerDA:qqueueEnqMsg item (46) returned with error state: '-2105'
7331.897436529:omfwd queue:DAwpool/w0: omfwd queue[DA]: queue.c: doEnqSingleObject: queue FULL - waiting 2000ms to drain.
7333.897633764:omfwd queue:DAwpool/w0: omfwd queue[DA]: queue.c: doEnqSingleObject: cond timeout, dropping message!
7333.897689231:omfwd queue:DAwpool/w0: omfwd queue[DA]: queue.c: EnqueueMsg advised worker start
7333.897700204:omfwd queue:DAwpool/w0: omfwd queue: queue.c: ConsumerDA:qqueueEnqMsg item (47) returned with error state: '-2105'
7333.897752648:omfwd queue:DAwpool/w0: omfwd queue[DA]: queue.c: doEnqSingleObject: queue FULL - waiting 2000ms to drain.
7335.897917947:omfwd queue:DAwpool/w0: omfwd queue[DA]: queue.c: doEnqSingleObject: cond timeout, dropping message!
7335.897968156:omfwd queue:DAwpool/w0: omfwd queue[DA]: queue.c: EnqueueMsg advised worker start
7335.897978084:omfwd queue:DAwpool/w0: omfwd queue: queue.c: ConsumerDA:qqueueEnqMsg item (48) returned with error state: '-2105'
7335.897989754:omfwd queue:DAwpool/w0: omfwd queue[DA]: queue.c: doEnqSingleObject: queue FULL - waiting 2000ms to drain.
rgerhards commented 5 years ago

It ignored the event limit.

That limit is only for the memory part of a DA queue. The disk part does not have any limit other than the configured max disk space.
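
In config terms, the split looks roughly like this (illustrative sketch of a DA queue; parameter values are examples):

# Illustrative DA (disk-assisted) queue: queue.size bounds only the
# in-memory part, while the disk part (enabled by queue.filename)
# is bounded solely by queue.maxdiskspace.
action(
    type="omfwd"
    protocol="tcp"
    target="127.0.0.1"
    port="10514"
    queue.type="LinkedList"
    queue.filename="queue_da"
    queue.size="1000"
    queue.maxdiskspace="128m"
)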

I need to check if delay mode is not properly handled by disk queues.

rgerhards commented 5 years ago

@StrongestNumber9 I think I know what happens. To move forward quickly, could you apply a minimal change to your code: in this line https://github.com/rsyslog/rsyslog/blob/master/plugins/imrelp/imrelp.c#L242 change LIGHT_DELAY to FULL_DELAY. I think that solves the issue you see.
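
In code, the requested change is the flow-control class set in onSyslogRcv() (the reporter's full diff appears in a later comment):

/* plugins/imrelp/imrelp.c, onSyslogRcv(): switch from light- to
 * full-delay flow control, so a full downstream queue blocks the
 * input instead of the message being acked and later dropped */
MsgSetFlowControlType(pMsg, eFLOWCTL_FULL_DELAY);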

We are very conservative about using FULL_DELAY mode, as it blocks the pipeline and this is usually not an expected result. If FULL_DELAY works for you, I'll add a configuration option to make this behavior configurable.

If you cannot patch, let me know. I will then need to craft a testbench test first, which takes some extra time.

davidelang commented 5 years ago

On Thu, 31 Oct 2019, Strongest Number 9 wrote:

Expected behavior

Imrelp gets an event, runs it through the rulesets, and acks NOT OK if the queues are full, informing the client that its data couldn't be dealt with.

I believe your expectations are incorrect.

imrelp will ack a message if it is able to add it to the main queue. It doesn't matter if an output is failing and unable to deliver the message, or if an action queue is full. The message doesn't run through the rulesets at all before the ack; imrelp just checks whether it can add it to the main queue (or the ruleset queue, if the input is bound to a ruleset).

David Lang
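
In config terms, the gate for the ack is whichever queue the input enqueues into. A hypothetical sketch (values illustrative) that gives the ruleset its own bounded queue, making that enqueue the ack gate:

# Hypothetical sketch: with a bounded ruleset queue, the RELP ack
# reflects only whether the message fit into this queue; nothing
# downstream of it is consulted.
ruleset(name="incoming"
        queue.type="LinkedList"
        queue.size="1000") {
    action(type="omfwd" protocol="tcp" target="127.0.0.1" port="10514")
}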

StrongestNumber9 commented 5 years ago

I applied this patch and it doesn't seem to work:

diff --git a/plugins/imrelp/imrelp.c b/plugins/imrelp/imrelp.c
index 29f9d7dfc..8f95064a2 100644
--- a/plugins/imrelp/imrelp.c
+++ b/plugins/imrelp/imrelp.c
@@ -241,7 +241,7 @@ onSyslogRcv(void *pUsr, uchar *pHostname, uchar *pIP, uchar *msg, size_t lenMsg)
        CHKiRet(msgConstruct(&pMsg));
        MsgSetInputName(pMsg, inst->pInputName);
        MsgSetRawMsg(pMsg, (char*)msg, lenMsg);
-       MsgSetFlowControlType(pMsg, eFLOWCTL_LIGHT_DELAY);
+       MsgSetFlowControlType(pMsg, eFLOWCTL_FULL_DELAY);
        MsgSetRuleset(pMsg, inst->pBindRuleset);
        pMsg->msgFlags  = PARSE_HOSTNAME | NEEDS_PARSING;

# time tcpflood -T relp-plain -t 127.0.0.1 -p 601 -m 10 -M "I should not be accepted"
00001 open connections
starting run 1
Sending 10 messages.
00000010 messages sent
runtime: 0.000
00001 close connections
End of tcpflood Run

real    0m20.016s
user    0m0.004s
sys     0m0.006s

09:18:39.670350 IP (tos 0x0, ttl 64, id 57619, offset 0, flags [DF], proto TCP (6), length 67)
    localhost.syslog-conn > localhost.42430: Flags [P.], cksum 0xfe37 (incorrect -> 0x9b0b), seq 164:179, ack 491, win 350, options [nop,nop,TS val 2925994888 ecr 2925992888], length 15
        0x0000:  4500 0043 e113 4000 4006 5b9f 7f00 0001  E..C..@.@.[.....
        0x0010:  7f00 0001 0259 a5be fe5c 85da a20c 675f  .....Y...\....g_
        0x0020:  8018 015e fe37 0000 0101 080a ae67 2388  ...^.7.......g#.
        0x0030:  ae67 1bb8 3620 7273 7020 3620 3230 3020  .g..6.rsp.6.200.
        0x0040:  4f4b 0a                                  OK.
09:18:39.670372 IP (tos 0x0, ttl 64, id 338, offset 0, flags [DF], proto TCP (6), length 52)
    localhost.42430 > localhost.syslog-conn: Flags [.], cksum 0xfe28 (incorrect -> 0x9dbb), ack 179, win 342, options [nop,nop,TS val 2925994888 ecr 2925994888], length 0
        0x0000:  4500 0034 0152 4000 4006 3b70 7f00 0001  E..4.R@.@.;p....
        0x0010:  7f00 0001 a5be 0259 a20c 675f fe5c 85e9  .......Y..g_.\..
        0x0020:  8010 0156 fe28 0000 0101 080a ae67 2388  ...V.(.......g#.
        0x0030:  ae67 2388                                .g#.
rgerhards commented 5 years ago

OK, looks like I need to craft a testbench test.

Please note, however, that (as David said https://github.com/rsyslog/rsyslog/issues/3941#issuecomment-548540932) RELP guards the transport. It makes no guarantees beyond transport. Most importantly, there is no relationship between

imrelp -> queue -> omrelp

If you need a system with an at-least-once delivery guarantee across the relay chain, you need to go for a message queueing system that is designed for this purpose. rsyslog does not provide these guarantees; rather, it targets sufficient (but not perfect) reliability for the majority of security logging and network management tasks.

That said, I still think that when the message is flagged as FULL_DELAY, it should push back to the sender when the queue goes above queue.fullDelayMark (but that's it, nothing more).
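
For reference, queue.fullDelayMark is the watermark above which FULL_DELAY-flagged messages are held back; a hypothetical sketch with illustrative values:

# Illustrative values: with 1000 slots, FULL_DELAY messages start to
# be delayed once 800 slots are in use, well before the hard-full
# condition that currently drops messages.
action(
    type="omfwd"
    protocol="tcp"
    target="127.0.0.1"
    port="10514"
    queue.type="LinkedList"
    queue.size="1000"
    queue.fullDelayMark="800"
)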

rgerhards commented 5 years ago

I now see the issue. While the main queue is direct, we still have two queues involved: main and action. When the main queue enqueues a message to the action queue, it does not apply the delay mode setting. This is to guard against the whole system becoming stalled.

I will now add two more config settings: an input parameter (flowcontrol) to request full-delay flow control for received messages, and a queue parameter (queue.takeFlowCtlFromMsg) to make the queue honor the flow-control setting carried by the message.

When both are enabled, the input can even be blocked by "later" actions. This is pretty dangerous in general, thus it will not be the default. It blocks the whole system, which is exactly what you want here. These settings will also delay rsyslog shutdown when in a blocking state.

rgerhards commented 5 years ago

@StrongestNumber9 I have crafted PR #3943 to permit configuring the blocking behavior. The commits include documentation for the new options. Basically, you can now do:

$LocalHostName test
$AbortOnUncleanConfig on
$PreserveFQDN on

module(load="imrelp")

global(
    workDirectory="/"
    maxMessageSize="256k"
)
main_queue(queue.type="Direct")
input(type="imrelp" port="601" ruleset="incoming" MaxDataSize="256k"  flowcontrol="full")

ruleset(name="incoming") {
    action
    (
        type="omfwd"
        name="omfwd"
        RebindInterval="10000"
        protocol="tcp"
        target="127.0.0.1"
        port="10514"
        queue.size="1000"
        queue.type="Disk"
        queue.takeFlowCtlFromMsg="on"
        queue.checkpointinterval="65536"
        queue.filename="queue"
        queue.maxfilesize="32m"
        queue.saveonshutdown="on"
        queue.syncqueuefiles="off"
        queue.maxdiskspace="128m"
        ResendLastMSGOnReconnect="on"
        action.ResumeInterval="60"
        action.resumeRetryCount="-1"
    )
}

Note the two new settings:

flowcontrol="full"
queue.takeFlowCtlFromMsg="on"

This is as close as you can get without setting all queues to direct mode.
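
For comparison, a sketch of the all-direct alternative mentioned above (illustrative only): direct queues buffer nothing and hand the action's result back to the input synchronously.

# Illustrative all-direct variant: no buffering anywhere, so any
# action failure is visible to the enqueuing input immediately.
main_queue(queue.type="Direct")
ruleset(name="incoming") {
    action(
        type="omfwd"
        protocol="tcp"
        target="127.0.0.1"
        port="10514"
        queue.type="Direct"
    )
}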

rgerhards commented 5 years ago

@StrongestNumber9 can you apply the patches to your version and also give it a try?

StrongestNumber9 commented 5 years ago

I am testing the patch and had some problems. I'll let you know once I have verified that they are not caused by our environment or patches.

StrongestNumber9 commented 5 years ago

Hi, I managed to fix the problems in our environment. Now it randomly hangs on shutdown when I use your configuration.

3597.852974579:omfwd queue:Reg/w0: omfwd.c: TCPSendInit FAILED with -2027.
3597.852988152:omfwd queue:Reg/w0: modules.c: file netstrms.c released module 'lmnsd_ptcp', reference count now 0
3597.852995269:omfwd queue:Reg/w0: modules.c: module 'lmnsd_ptcp' has zero reference count, unloading...
3597.853002043:omfwd queue:Reg/w0: modules.c: Unloading module lmnsd_ptcp
3597.853013793:omfwd queue:Reg/w0: modules.c: file nsd_ptcp.c released module 'lmnetstrms', reference count now 2
3597.853020137:omfwd queue:Reg/w0: modules.c:   module lmnetstrms is currently in use by file omfwd.c
3597.853026534:omfwd queue:Reg/w0: modules.c:   module lmnetstrms is currently in use by file omfwd.c
3597.856986661:omfwd queue:Reg/w0: omfwd.c: omfwd: doTryResume 127.0.0.1 iRet -2027
3597.857020089:omfwd queue:Reg/w0: ../action.c: actionCallCommitTransaction[omfwd] state: rtry mod commitTransaction returned -2007
3597.857028847:omfwd queue:Reg/w0: ../action.c: action[omfwd] transitioned to state: rtry
3597.857038789:omfwd queue:Reg/w0: errmsg.c: Called LogMsg, msg: action 'omfwd' suspended (module 'builtin:omfwd'), retry 0. There should be messages before this one giving the reason for suspension.
rsyslog internal message (4,-2007): action 'omfwd' suspended (module 'builtin:omfwd'), retry 0. There should be messages before this one giving the reason for suspension. [v8.38.0 try http://www.rsyslog.com/e/2007 ]
3597.857060797:omfwd queue:Reg/w0: ../action.c: actionCommit[omfwd]: in retry loop, iRet -2007
3597.857068047:omfwd queue:Reg/w0: ../action.c: actionCommit[omfwd]: actionDoRetry returned 0
3597.857074011:omfwd queue:Reg/w0: ../action.c: actionTryCommit[omfwd] enter
3597.857082049:omfwd queue:Reg/w0: ../action.c: actionTryResume: action[omfwd] state: rtry, next retry (if applicable): 0 [now 1572943597]
3597.857088702:omfwd queue:Reg/w0: ../action.c: doTransaction: have commitTransaction IF, using that, pWrkrInfo 0x555555864710
3597.857106669:omfwd queue:Reg/w0: ../action.c: entering actionCallCommitTransaction[omfwd], state: rtry, nMsgs 1
3597.857114767:omfwd queue:Reg/w0: omfwd.c: TCPSendInit CREATE
3597.857132422:omfwd queue:Reg/w0: obj.c: caller requested object 'nsd_ptcp', not found (iRet -3003)
3597.857149518:omfwd queue:Reg/w0: modules.c: Requested to load module 'lmnsd_ptcp'
3597.857162873:omfwd queue:Reg/w0: modules.c: loading module 'lmnsd_ptcp.so'
3597.863485450:omfwd queue:Reg/w0: modules.c: source file nsd_ptcp.c requested reference for module 'lmnetstrms', reference count now 3
3597.863552856:main thread    : errmsg.c: Called LogMsg, msg: omfwd queue: need to do hard cancellation
rsyslog internal message (4,-3000): omfwd queue: need to do hard cancellation [v8.38.0]
3597.863667782:omfwd queue:Reg/w0: wti.c: omfwd queue:Reg/w0: cancelation cleanup handler called.
3597.863699516:omfwd queue:Reg/w0: queue.c: DeleteProcessedBatch: etry 0 state 0
3597.863729375:omfwd queue:Reg/w0: omfwd queue: queue.c: queue nearly full (1000 entries), but could not drop msg (iRet: 0, severity 7)
3597.863761230:omfwd queue:Reg/w0: omfwd queue: queue.c: doEnqSingleObject: queue FULL - configured for immediate discarding QueueSize=1000 MaxQueueSize=1000 sizeOnDisk=469000 sizeOnDiskMax=134217728
3597.863769568:omfwd queue:Reg/w0: queue.c: DeleteProcessedBatch: error -2105 re-enqueuing unprocessed data element - discarded
3597.863796809:omfwd queue:Reg/w0: queue.c: DeleteProcessedBatch: we deleted 0 objects and enqueued 1 objects
3597.863810570:omfwd queue:Reg/w0: queue.c: doDeleteBatch: delete batch from store, new sizes: log 999, phys 999
3597.863818889:omfwd queue:Reg/w0: wti.c: omfwd queue:Reg/w0: done cancelation cleanup handler.
3597.863836520:omfwd queue:Reg/w0: wtp.c: omfwd queue:Reg: Worker thread 555555864b60 requested to be cancelled.
3597.863851045:omfwd queue:Reg/w0: wtp.c: omfwd queue:Reg: Worker thread 555555864b60, terminated, num workers now 0
3597.863871581:main thread    : wti.c: cooperative worker termination failed, using cancellation...
3597.863904562:main thread    : wti 0x555555864b60: wti.c: canceling worker thread
3597.863920589:omfwd queue:Reg/w0: debug.c: destructor for debug call stack 0x7fffe80008c0 called
3597.863988355:main thread    : omfwd queue: queue.c: worker threads terminated, remaining queue size log 999, phys 999.
3597.864010367:main thread    : omfwd queue: queue.c: persisting queue to disk, 999 entries...
3597.864026950:main thread    : stream.c: file stream /out/queue.qi.tmp params: flush interval 0, async write 0
3597.865193457:main thread    : strm 0x555555862690: stream.c: strmFlushinternal: file 14(/out/queue.00000001) flush, buflen 0 (no need to flush)
3597.865225594:main thread    : stream.c: strmSerialize: pThis->prevLineSegment (nil)
3597.865274337:main thread    : strm 0x555555862bd0: stream.c: strmFlushinternal: file -1(queue) flush, buflen 0 (no need to flush)
3597.865293099:main thread    : stream.c: strmSerialize: pThis->prevLineSegment (nil)
3597.865306566:main thread    : strm 0x55555586a090: stream.c: file -1(/out/queue.qi.tmp) closing, bDeleteOnClose 0
3597.865314439:main thread    : strm 0x55555586a090: stream.c: strmFlushinternal: file -1(/out/queue.qi.tmp) flush, buflen 559
3597.865322678:main thread    : strm 0x55555586a090: stream.c: file -1(/out/queue.qi.tmp) doWriteInternal: bFlush 0
3597.865329779:main thread    : stream.c: strmPhysWrite, stream 0x55555586a090, len 559
[Thread 0x7fffeffff700 (LWP 149) exited]
3597.865436170:main thread    : stream.c: file '/out/queue.qi.tmp' opened as #12 with mode 384
3597.865456275:main thread    : strm 0x55555586a090: stream.c: opened file '/out/queue.qi.tmp' for WRITE as 12
3597.865487387:main thread    : strm 0x55555586a090: stream.c: file 12 write wrote 559 bytes
3597.865533651:main thread    : strm 0x555555862690: stream.c: file 14(/out/queue.00000001) closing, bDeleteOnClose 0
3597.865560355:main thread    : strm 0x555555862690: stream.c: strmFlushinternal: file 14(/out/queue.00000001) flush, buflen 0 (no need to flush)
3597.865575084:main thread    : strm 0x555555862930: stream.c: file 15(/out/queue.00000001) closing, bDeleteOnClose 0
3597.865592250:main thread    : strm 0x555555862bd0: stream.c: file -1(queue) closing, bDeleteOnClose 0
3597.865620212:main thread    : ruleset.c: destructing ruleset 0x5555558580e0, name 0x555555842bf0
3597.865636107:main thread    : ruleset.c: destructing ruleset 0x555555864d50, name 0x5555558643a0
3597.865644395:main thread    : rsyslogd.c: all primary multi-thread sources have been terminated - now doing aux cleanup...
3597.865659320:main thread    : rsyslogd.c: destructing current config...
3597.865667871:main thread    : rsconf.c: calling freeCnf(0x5555558567d0) for module 'builtin:omfile'
3597.865683218:main thread    : rsconf.c: calling freeCnf(0x55555583ffd0) for module 'builtin:ompipe'
3597.865692144:main thread    : rsconf.c: calling freeCnf(0x555555841a10) for module 'builtin:omfwd'
3597.865700874:main thread    : rsconf.c: calling freeCnf(0x555555863df0) for module 'imrelp'
3597.865817198:main thread    : parser.c: destructing parser 'rsyslog.rfc5424'
3597.865833410:main thread    : parser.c: destructing parser 'rsyslog.rfc3164'
3597.865840794:main thread    : pmrfc3164.c: pmrfc3164: free parser instance 0x555555841f90
rgerhards commented 5 years ago

Hi, I managed to fix the problems in our environment. Now it randomly hangs on shutdown when I use your configuration.

Yeah, that's why I caution against using this option. There was a reason we did not use delay mode when going to the DA queue. I am pretty sure the problem now is that you run into a deadlock sooner or later while the queue is still busy. I can try to find a tie-break in this case and then force-cancel the thread, but that ultimately also leads to message loss. I would suggest I take another couple of hours to look at this and provide feedback on whether it turns out to become another large refactoring. What do you think?

rgerhards commented 5 years ago

One idea: I could turn off handling of light delay mode during shutdown. But then again you lose the message currently being processed, just like in the beginning.

StrongestNumber9 commented 5 years ago

Is it possible to make shutdown stop the inputs, wait for the action to finish retrying, and then close the queue once the retried message has been re-inserted?

rgerhards commented 5 years ago

Well, the core problem is that the queue is full, so the enqueue waits until there is room. The action is unable to process any further messages, so the queue will never become less full. As such, the enqueue needs to wait (eternally, per the config settings) until there is room.

One thing I could do is ignore the queue size limit during shutdown. But then again, you run into issues if, e.g., you run out of disk space. Nevertheless, this could work for more cases.

The bottom line is that at some points in processing you actually need to discard some messages if circumstances demand it. A zero-loss system (which rsyslog is not) gets you a long way, but even with such a transactional message queueing system, the log generator must at some point decide whether to cease processing or to lose some logs. Otherwise it's simply impossible.
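
As context for the waiting behavior: the "queue FULL - waiting 2000ms to drain" lines in the debug logs above correspond to the queue's enqueue timeout, which is itself configurable. A hypothetical sketch (2000 ms is, as far as I know, the default):

# Illustrative: how long a producer blocks on a full queue before the
# "cond timeout, dropping message!" path is taken.
action(
    type="omfwd"
    protocol="tcp"
    target="127.0.0.1"
    port="10514"
    queue.type="Disk"
    queue.filename="queue"
    queue.size="1000"
    queue.timeoutEnqueue="2000"
)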

StrongestNumber9 commented 5 years ago

Would it be possible to set the queue limit for inputs to something like "queue.size - action.batchsize", so there is always space left for the action to re-insert in-flight content? We are aware that there are circumstances where data may eventually get lost, but avoiding that in normal operation is desirable.

rgerhards commented 5 years ago

avoiding that in normal operation is desirable.

This is something rsyslog tries hard to do. But a shutdown during a stalled connection to a remote site, with a full queue, is IMHO outside normal operation.

Would it be possible to set the queue limit for inputs to something like "queue.size - action.batchsize", so there is always space left for the action to re-insert in-flight content?

That doesn't work, because multiple inputs can submit data to the same queue. What we can do is ignore the queue size limitation (at least I think so). Maybe I can give that a try; hopefully it's not too much work for an experiment.

rgerhards commented 5 years ago

I think it may be a good idea to let potential solutions "ripen" for a day or two. I would still say increasing the queue size on shutdown might be the best option. At that point, inputs are already terminated, so no new messages would come in.
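
A rough sketch of that idea (hypothetical code; the struct and field names are assumptions, not actual rsyslog internals):

/* Hypothetical sketch, not rsyslog code: field names iMaxQueueSize
 * and iDeqBatchSize are assumptions for illustration. */
struct queue {
    int iMaxQueueSize;   /* configured size cap */
    int iDeqBatchSize;   /* largest batch a worker dequeues at once */
};

/* Called once shutdown has begun and all inputs are terminated: no
 * new messages can arrive, so the extra headroom only serves the
 * re-insertion of a cancelled worker's unprocessed batch. */
static void queueRelaxLimitForShutdown(struct queue *q)
{
    q->iMaxQueueSize += q->iDeqBatchSize;
}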

StrongestNumber9 commented 5 years ago

I agree with letting it ripen for a few days. Increasing the queue size on shutdown also works for us.

rgerhards commented 5 years ago

@StrongestNumber9 can you reliably reproduce the "bad timing"? I do not manage to trigger it. In any case, can you mail me a full debug log? (Actually, starting from the point where the imrelp input is delayed for the first time would be sufficient.)

I'd like to see in more depth what the timing of events looks like.

StrongestNumber9 commented 5 years ago

I managed to reproduce it fairly reliably by sending one message into the queues, then shutting rsyslog down and starting it up again using rsyslogd -f rsyslog.conf -dn 2>&1 | tee output.log. I sent the full log to the email on your profile.

rgerhards commented 5 years ago

Was the receiver active during your procedure? I guess "no", but I want to confirm.

StrongestNumber9 commented 5 years ago

Not in the log I sent you, but the hang happened in both cases as long as there were events in the queues. Shutdown was clean if the queues were empty and no events had been received.

rgerhards commented 5 years ago

I made an experimental commit to my PR, switching the queue default. Maybe some existing tests will catch it as well (that would be best).

rgerhards commented 5 years ago

I reviewed the debug log you sent. Interestingly, no RELP message was handled at all in that case.

Would it be possible for you to retry with the current master branch plus the new patches and see if the problems during shutdown persist? I want to rule out that it has something to do with your old version plus custom patches. Once I know it also happens with the "regular" code base, it will be easier to check what is going on.

My experiment with the changed default did not bring up any issues. It's not yet 100% finished, but enough environments, including CentOS, passed. I'll keep monitoring it.

StrongestNumber9 commented 5 years ago

I tested v8.1910.0 with only #3943 patched in, and I still get the hang, but it does not happen consistently. I sent you more detailed information and logs from our environment.

rgerhards commented 5 years ago

thx again

StrongestNumber9 commented 5 years ago

Just as a note: I can't reproduce it using just -n. I am not sure if -d enables some race condition or if -d is the culprit itself.

StrongestNumber9 commented 5 years ago

We suspect that the thread "omfwd queue:Reg/w0" dies within ReleaseObj, and therefore the lock is never released, causing a deadlock.

0332.820284601:main thread    : operatingstate.c: osf: MSG omfwd queue: need to do hard cancellation: 0332.820287316:omfwd queue:Reg/w0: obj.c: pIf->ifIsLoaded is set to 1rsyslog internal message (4,-3000): omfwd queue: need to do hard cancellation [v8.1910.0]

Debug prints added:

diff --git a/runtime/obj.c b/runtime/obj.c
index ad76ba9ee..3d3fc05e9 100644
--- a/runtime/obj.c
+++ b/runtime/obj.c
@@ -1261,19 +1261,22 @@ ReleaseObj(const char *srcFile, uchar *pObjName, uchar *pObjFile, interface_t *p
        DEFiRet;
        objInfo_t *pObjInfo;

-       /* dev debug only dbgprintf("source file %s releasing object '%s',
-       ifIsLoaded %d\n", srcFile, pObjName, pIf->ifIsLoaded); */
+       dbgprintf("source file %s releasing object '%s', ifIsLoaded %d\n", srcFile, pObjName, pIf->ifIsLoaded);
        pthread_mutex_lock(&mutObjGlobalOp);

-       if(pObjFile == NULL)
+       if(pObjFile == NULL) {
+        dbgprintf("pObjFile == NULL");
                FINALIZE; /* if it is not a lodable module, we do not need to do anything... */
-
+    }
        if(pIf->ifIsLoaded == 0) {
+        dbgprintf("pIf->ifIsLoaded == 0");
                FINALIZE; /* we are not loaded - this is perfectly OK... */
        } else if(pIf->ifIsLoaded == 2) {
+        dbgprintf("pIf->ifIsLoaded == 2");
                pIf->ifIsLoaded = 0; /* clean up */
                FINALIZE; /* we had a load error and can not/must not continue */
        }
+    dbgprintf("pIf->ifIsLoaded is set to %d", pIf->ifIsLoaded);

        CHKiRet(FindObjInfo((const char*)pObjName, &pObjInfo));
rgerhards commented 5 years ago

@StrongestNumber9 thx - that's a good catch. I'll have a look hopefully after noon.

rgerhards commented 5 years ago

FYI: the debug log you sent also has some interesting insight ... too early to comment in depth, but I wanted to let you know.

rgerhards commented 5 years ago

@StrongestNumber9 which version of librelp do you use? Is it plain or does it have custom patches? -- Just trying to set the baseline.

StrongestNumber9 commented 5 years ago

All used libraries are vanilla without patches.

liblogging v1.0.6
libestr v0.1.10
libfastjson v0.99.8
liblognorm v2.0.4
librelp v1.2.17

rgerhards commented 5 years ago

@StrongestNumber9 I think there is a problem in imrelp that at least contributes to the issue here. See https://github.com/rsyslog/librelp/issues/175. While I think it is not the sole cause, it would be great if you could apply a patch to librelp once I have it ready. It should go on top of 1.4.0, but will most probably also work with older source versions.

rgerhards commented 5 years ago

@StrongestNumber9 the librelp PR looks good; it would be great if you could apply https://github.com/rsyslog/librelp/pull/176/files

rgerhards commented 5 years ago

@StrongestNumber9 Good news! I just accidentally reproduced the hang in my environment. Once I had it, I could craft a script that reproduces it by running enough iterations until it fails. So it looks like I can work on the "real thing" now.

Side note: I have the librelp patch active, so it does not solve this case. I would still recommend applying it (and giving it a try), as without the fix the likelihood of problems is greater.

rgerhards commented 5 years ago

We suspect that thread "omfwd queue:Reg/w0" dies within ReleaseObj and therefore lock is never released causing a deadlock.

I can rule this out now: the problem happens even when the mutex is not locked and unlocked at all. While it was still present, I also saw a hang at a different location.

I now think it is more likely due to the debugging instrumentation. I emit debug messages at places and at times where it is not totally safe to do so. The alternative is to not emit them at all, so I decided to do it even though it is not 100% correct. After all, debug mode is for developers chasing bugs. I will try to verify this in more depth. I have not seen any real complaint from valgrind/helgrind, but I will probably do another session with TSAN tomorrow.

As far as I can see, the commits I made influence timing and thus the problem occurs only with them, but it looks like they are correct. Still to be confirmed, of course.

rgerhards commented 5 years ago

@StrongestNumber9 It looks like I found the culprit. It's actually debug code, like I suspected. I use a write that is not signal-safe inside a signal handler. This is useful for debugging but obviously causes occasional problems. I will remove those calls for the time being, but I cannot rule out that other similar issues exist. If I removed all of them completely, debugging would become much harder, so the rest will stay.
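
For background on the constraint at play: only async-signal-safe functions may be called from a signal handler, and a buffered or lock-taking write path (as a dbgprintf-style helper typically is) is not on that list and can deadlock the interrupted thread. A minimal generic illustration (plain C, not rsyslog code):

#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Generic illustration, not rsyslog code. A raw write(2) is on the
 * POSIX async-signal-safe list; stdio and lock-taking debug helpers
 * are not, and can deadlock if the interrupted thread already holds
 * their lock. */
static void handler(int sig)
{
    (void)sig;
    static const char msg[] = "signal received\n";
    write(STDERR_FILENO, msg, sizeof(msg) - 1);   /* safe */
    /* fprintf(stderr, "signal\n");                  unsafe here */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handler;
    sigaction(SIGINT, &sa, NULL);
    pause();   /* wait for a signal to arrive */
    return 0;
}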

rgerhards commented 5 years ago

@StrongestNumber9 pls apply https://github.com/rgerhards/rsyslog/commit/6518a92e387cb68a048aa0e6e4c844e8baf97591 to your version. For me this made the problem reliably go away (at least for several hundred test runs).

TSAN also complained before the fix but not after.

lock[bot] commented 4 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.