The following may have happened:
1) If we do not set the segmentMaxAge value, it defaults to retentionMaxAge. So we have 2 segments (today's and yesterday's).
2) From the code comment: "Compaction is applied to all segments up to but excluding the active (last) segment or the provided HW." So compaction only worked on yesterday's segment, and that segment contains yesterday's last cursor value.
3) If the new leader has problems with sync, it drops today's segment and uses yesterday's (with the old cursor value).
4) After the old leader reconnects, it syncs its log with the new leader.
I have set segmentMaxAge to 1 hour and will keep watching. Right now the __cursors stream takes its segmentMaxAge value from the streams config. Maybe the value could be settable in the cursors config block?
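For reference, a sketch of the relevant liftbridge.yml settings as I understand them from the Liftbridge config docs (key names assumed from the docs, values only illustrative):

# Log settings applied to all streams, which today includes __cursors.
streams:
  retention.max.age: 24h   # segment.max.age falls back to this when it is unset
  segment.max.age: 1h      # roll segments hourly so compaction covers recent data

# Cursors settings. As far as I can tell there is no per-cursors segment.max.age
# here, which is what the question above is asking about.
cursors:
  stream.partitions: 1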
It's possible this is what happened. However, if a node is not able to replicate it should be removed from the in-sync replica set (ISR) and not added back until it is caught up. In this case, it should not be eligible for becoming the leader. Did you see logs relating to a node being removed from the ISR for cursors partition?
We found the reason: the OOM killer on the host. One of the three hosts had half the memory of the others, and that host was the __cursors partition leader :( After the OOM killer kills the process, the cursors are rolled back. To reproduce it: run kill -9 on the liftbridge process that is the __cursors leader.
After increasing the memory, the problem was solved. Many thanks for the help.
PS: After writing this message, the problem repeated. On the client side:
2021-02-07 05:44:26,040 [noc.core.liftbridge.base] Failed to set cursor: <AioRpcError of RPC that terminated with:
status = StatusCode.FAILED_PRECONDITION
details = "Server not leader for cursors partition 0"
debug_error_string = "{"created":"@1612665866.040024535","description":"Error received from peer ipv4:10.36.129.86:9292","file":"src/core/lib/surface/call.cc","file_line":1061,"grpc_message":"Server not leader for cursors partition 0","grpc_status":9}"
>
But the liftbridge process was not killed. In the log of the __cursors leader:
Feb 07 05:35:24 mevent04 liftbridge[6994]: time="2021-02-07 05:35:24" level=error msg="Replica mevent03 for partition [subject=liftbridge-default.cursors, stream=__cursors, partition=0] exceeded max lag time (last seen: 17.208723228s, last caught up: 17.208723228s), removing from ISR"
It turns out that the situation occurs when the __cursors stream leader changes.
Interesting discovery. Liftbridge does not currently provide a config for forcing flush/fsync to disk, instead opting for replication to provide durability. Since cursors are written with AckPolicy_ALL (i.e. message is committed), I wouldn't expect a stepback even in the event of an OOM on a node. I'll keep this open to investigate.
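To make the "committed" part concrete, here is a minimal client-side sketch, assuming the go-liftbridge v2 API (the address and stream name are made up): with AckPolicyAll the publish does not receive an ack until the message has been replicated to the in-sync replicas.

package main

import (
	"context"
	"log"

	lift "github.com/liftbridge-io/go-liftbridge/v2"
)

func main() {
	// Illustrative address, not one of the hosts from this issue.
	client, err := lift.Connect([]string{"localhost:9292"})
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// With AckPolicyAll the ack comes back only after the message is committed,
	// i.e. replicated to the ISR, not merely written by the partition leader.
	ack, err := client.Publish(context.Background(), "example-stream", []byte("value"),
		lift.AckPolicyAll(),
	)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("committed at offset %d", ack.Offset())
}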
I assume that the LRU cache is to blame for the situation. The following sequence of events:
1) We have a 3-node cluster: node1, node2, node3. The cursor leader is node2.
2) SetCursor saves the offset to node2's cache: c.cache.Add(string(cursorKey), cursor.Offset)
3) The cursor leader changes to node3; the Liftbridge process on node2 keeps its saved cache state until it is restarted.
4) New cursor offsets are now written through node3 and land only in node3's cache.
5) When node2 becomes the cursor leader again, a fetch cursor query returns the old value from its cache (if offset, ok := c.cache.Get(string(cursorKey)); ok {) until SetCursor is called again.
I tried deleting and recreating the __cursors stream on a 3-node cluster:
[root@db noc]# ./noc liftbridge fetch-cursor --name cursor-test --stream test-stream1
2021-11-20 13:45:58,622 [noc.core.dcs.base] Resolved near service liftbridge to ['10.36.129.21:9292']
-1
[root@db noc]# ./noc liftbridge set-cursor --name cursor-test --stream test-stream1 --offset 200
2021-11-20 13:46:28,785 [noc.core.dcs.base] Resolved near service liftbridge to ['10.36.129.20:9292']
[root@db noc]# ./noc liftbridge fetch-cursor --name cursor-test --stream test-stream1
2021-11-20 13:46:32,184 [noc.core.dcs.base] Resolved near service liftbridge to ['10.36.129.20:9292']
201
[root@db noc]# ./noc liftbridge delete-stream --name __cursors
reload node1 to recreate the stream
[root@db noc]# ./noc liftbridge fetch-cursor --name cursor-test --stream test-stream1
2021-11-20 13:53:52,859 [noc.core.dcs.base] Resolved near service liftbridge to ['10.36.129.20:9292']
201
[root@mstage-db noc]# ./noc liftbridge subscribe --name __cursors | grep test
^C
[root@mstage-db noc]#
The cursor is not in the stream, but the cursor offset is still returned by the query. I think if you do a soft leader transfer, you can see the previous cursor value.
I'm thinking we need to clear the cache on cursor leader failover.
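A minimal sketch of that idea, using hashicorp/golang-lru like the cache calls quoted above (this is not the actual Liftbridge code or the eventual patch, just an illustration of purging the cache when a node gains cursor leadership):

package main

import (
	"fmt"
	"log"

	lru "github.com/hashicorp/golang-lru"
)

// cursorManager is a toy stand-in for the server-side cursor manager.
type cursorManager struct {
	cache *lru.Cache
}

// becomeCursorLeader is a hypothetical hook invoked when this node is promoted
// to leader of the __cursors partition. Purging forces the next fetch to read
// the latest committed offset from the log instead of a possibly stale cache.
func (c *cursorManager) becomeCursorLeader() {
	c.cache.Purge()
}

func (c *cursorManager) setCursor(key string, offset int64) {
	c.cache.Add(key, offset)
}

func (c *cursorManager) fetchCursor(key string) (int64, bool) {
	if offset, ok := c.cache.Get(key); ok {
		return offset.(int64), true
	}
	// Cache miss: the real server would scan the __cursors log here.
	return -1, false
}

func main() {
	cache, err := lru.New(1024)
	if err != nil {
		log.Fatal(err)
	}
	c := &cursorManager{cache: cache}

	c.setCursor("cursor-test", 100) // cached while this node was leader

	// Leadership moves to another node, the cursor advances there,
	// then leadership comes back to this node:
	c.becomeCursorLeader()

	if _, ok := c.fetchCursor("cursor-test"); !ok {
		fmt.Println("cache purged on failover; reading offset from the __cursors log")
	}
}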
Fixed in #395.
Cursor value graph attached (+2 hours relative to the log; the Y axis uses log10).
We have a Liftbridge cluster with 3 nodes: mevent02, mevent03, mevent04. Our clients use cursors to store the offset of the last processed message.
If the cluster starts rebuilding its topology (a lost connection between nodes, or one node rebooting), the cursor value steps back to an old value (approximately a day old). This causes already-processed messages to be repeated when a client reconnects (it fetches the old cursor position). We analyzed our own client but did not find any problem with its cursor handling. How can an old value end up stored in the __cursors partition? In the attached log files, messages like
Jan 23 18:06:27 mevent02 liftbridge[776]: time="2021-01-23 18:06:27" level=debug msg="Finished compacting log [subject=liftbridge-default.cursors, stream=__cursors, partition=0]\n\tMessages Removed: 0\n\tSegments: 2 -> 2\n\tDuration: 13.455916659s"
are very confusing. I will be grateful for a hint.
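For context, the cursor operations involved look roughly like this with the Go client (a sketch assuming the go-liftbridge v2 API; the real client here is the noc service shown in the logs, so the names, address, and offsets below are only illustrative):

package main

import (
	"context"
	"fmt"
	"log"

	lift "github.com/liftbridge-io/go-liftbridge/v2"
)

func main() {
	// Illustrative address, not one of the mevent hosts.
	client, err := lift.Connect([]string{"localhost:9292"})
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	ctx := context.Background()

	// Save the offset of the last processed message under a named cursor.
	if err := client.SetCursor(ctx, "my-consumer", "test-stream1", 0, 200); err != nil {
		log.Fatal(err)
	}

	// On reconnect, resume from the stored cursor.
	offset, err := client.FetchCursor(ctx, "my-consumer", "test-stream1", 0)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("resume from offset", offset+1)
}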
Attached: client SetCursor code and the liftbridge.yml config.
Server logs: liftbridge_mevent02_log.zip, liftbridge_mevent03_log.zip, liftbridge_mevent04_log.zip
PS: The graph is +2 hours relative to the log, and the Y axis uses log10.
After the previous trouble (17.01.2021) we dropped the stream.