liftbridge-io / liftbridge

Lightweight, fault-tolerant message streams.
https://liftbridge.io
Apache License 2.0

Stepback cursor value when problems on cluster #319

Closed · aversant closed this issue 2 years ago

aversant commented 3 years ago
Liftbridge Version:        v1.4.1
Server ID:                 mevent04
Namespace:                 liftbridge-default
Default Retention Policy:  [Age: 1 day, Compact: false]
Default Partition Pausing: disabled

[Graph: cursor value over time. Timestamps are +2 hours relative to the logs; the Y axis uses a log10 scale.]

We have a Liftbridge cluster with 3 nodes: mevent02, mevent03, mevent04. Our clients use cursors to store the last processed message offset.

If the cluster starts rebuilding its topology (connection lost between nodes, or one node reboots), the cursor value steps back to an old value (approximately a day old). This causes already-processed messages to be reprocessed when a client reconnects (it fetches the old cursor position). We analyzed our client but found no problem with its cursor handling. How can the __cursors partition end up storing an old value?

In the attached log files, messages like Jan 23 18:06:27 mevent02 liftbridge[776]: time="2021-01-23 18:06:27" level=debug msg="Finished compacting log [subject=liftbridge-default.cursors, stream=__cursors, partition=0]\n\tMessages Removed: 0\n\tSegments: 2 -> 2\n\tDuration: 13.455916659s" are very confusing.

I would be grateful for a hint.

SetCursor:

Jan 23 16:47:15 mevent04 liftbridge[3769]: time="2021-01-23 16:47:15" level=debug msg="api: SetCursor [stream=events.TULA, partition=0, cursorId=classifier, offset=39614]"
Jan 23 16:49:04 mevent04 liftbridge[3769]: time="2021-01-23 16:49:04" level=debug msg="api: SetCursor [stream=events.TULA, partition=0, cursorId=classifier, offset=39615]"
Jan 23 16:49:06 mevent04 liftbridge[3769]: time="2021-01-23 16:49:06" level=debug msg="api: SetCursor [stream=events.TULA, partition=2, cursorId=classifier, offset=39195]"
.......

Jan 23 17:16:36 mevent04 systemd[1]: liftbridge.service: main process exited, code=killed, status=9/KILL
Cursor Leader mevent04 -> mevent03

Jan 23 17:16:56 mevent03 liftbridge[1074]: time="2021-01-23 17:16:56" level=debug msg="api: SetCursor [stream=events.TULA, partition=0, cursorId=classifier, offset=33294]"
Jan 23 17:17:25 mevent03 liftbridge[1074]: time="2021-01-23 17:17:25" level=debug msg="api: SetCursor [stream=events.TULA, partition=0, cursorId=classifier, offset=34017]"
Jan 23 17:17:25 mevent03 liftbridge[1074]: time="2021-01-23 17:17:25" level=debug msg="api: SetCursor [stream=events.TULA, partition=3, cursorId=classifier, offset=32868]"
Jan 23 17:17:25 mevent03 liftbridge[1074]: time="2021-01-23 17:17:25" level=debug msg="api: SetCursor [stream=events.TULA, partition=0, cursorId=classifier, offset=34018]"

liftbridge.yml

clustering:
    min.insync.replicas: 2
    raft.bootstrap.peers:
    - mevent02
    - mevent03
    - mevent04
    replica.fetch.timeout: 7s
    server.id: mevent03
cursors:
    stream.auto.pause.time: 0
    stream.partitions: 1
data:
    dir: /var/lib/liftbridge
host: X.X.X.85
logging:
    level: debug
    raft: true
nats.servers:
- nats://X.X.X.84:4222
- nats://X.X.X.85:4222
- nats://X.X.X.86:4222
streams:
    compact.enabled: false
    retention.max:
        age: 24h

Server logs: liftbridge_mevent02_log.zip liftbridge_mevent03_log.zip liftbridge_mevent04_log.zip


After a previous incident (2021-01-17) we dropped the stream. [screenshot]

aversant commented 3 years ago

The following may have happened:

  1. If we do not set segmentMaxAge, it defaults to retentionMaxAge, so we have 2 segments (today's and yesterday's).
  2. From a code comment: "Compaction is applied to all segments up to but excluding the active (last) segment or the provided HW." So compaction only processes yesterday's segment, and that segment contains yesterday's last cursor value.
  3. If the new leader has problems with sync, it drops today's segment and uses yesterday's (with the old cursor value).
  4. After the old leader reconnects, it syncs its log with the new leader.

I set segmentMaxAge to 1 hour and will keep watching. Currently the __cursors stream takes its segmentMaxAge value from the stream config. Maybe it could be configurable in the cursors config block?

tylertreat commented 3 years ago

It's possible this is what happened. However, if a node is not able to replicate, it should be removed from the in-sync replica set (ISR) and not added back until it is caught up. In that case, it should not be eligible to become the leader. Did you see logs relating to a node being removed from the ISR for the cursors partition?

aversant commented 3 years ago

We found the reason: the OOM killer on the host... One of the three hosts had half the memory of the others, and that host was the __cursors partition leader :( After the OOM killer killed the process, the cursors were rolled back. To reproduce: run kill -9 on the liftbridge process that is the __cursors leader.

After increasing the memory, the problem was solved. Many thanks for the help.

PS. After writing the message above, the problem repeated. The client-side error:

2021-02-07 05:44:26,040 [noc.core.liftbridge.base] Failed to set cursor: <AioRpcError of RPC that terminated with:
        status = StatusCode.FAILED_PRECONDITION
        details = "Server not leader for cursors partition 0"
        debug_error_string = "{"created":"@1612665866.040024535","description":"Error received from peer ipv4:10.36.129.86:9292","file":"src/core/lib/surface/call.cc","file_line":1061,"grpc_message":"Server not leader for cursors partition 0","grpc_status":9}"
>

But the liftbridge process was not killed. In the __cursors leader's log:

Feb 07 05:35:24 mevent04 liftbridge[6994]: time="2021-02-07 05:35:24" level=error msg="Replica mevent03 for partition [subject=liftbridge-default.cursors, stream=__cursors, partition=0] exceeded max lag time (last seen: 17.208723228s, last caught up: 17.208723228s), removing from ISR"

It turns out that the situation occurs whenever the __cursors stream leader changes.

tylertreat commented 3 years ago

Interesting discovery. Liftbridge does not currently provide a config for forcing flush/fsync to disk, instead opting for replication to provide durability. Since cursors are written with AckPolicy_ALL (i.e. message is committed), I wouldn't expect a stepback even in the event of an OOM on a node. I'll keep this open to investigate.

aversant commented 2 years ago

I assume that the LRU cache is to blame for the situation. The sequence of events:

We have a 3-node cluster: node1, node2, node3. The cursor leader is node2.

  1. Each SetCursor saves the offset to the cache: c.cache.Add(string(cursorKey), cursor.Offset)
  2. When there are problems on the cluster (for example, a NATS timeout), cursor leadership moves to node3.
  3. The cache is local to the Liftbridge process, so node2 keeps its state until the process restarts.
  4. New cursor offsets are saved to node3's cache.
  5. If cursor leadership moves back to node2, a fetch-cursor query returns the old value from the cache (if offset, ok := c.cache.Get(string(cursorKey)); ok {) until SetCursor is called again.

I tried removing and recreating the __cursors stream on a 3-node cluster:

[root@db noc]# ./noc liftbridge fetch-cursor --name cursor-test --stream test-stream1
2021-11-20 13:45:58,622 [noc.core.dcs.base] Resolved near service liftbridge to ['10.36.129.21:9292']
-1

[root@db noc]# ./noc liftbridge set-cursor --name cursor-test --stream test-stream1 --offset 200
2021-11-20 13:46:28,785 [noc.core.dcs.base] Resolved near service liftbridge to ['10.36.129.20:9292']

[root@db noc]# ./noc liftbridge fetch-cursor --name cursor-test --stream test-stream1
2021-11-20 13:46:32,184 [noc.core.dcs.base] Resolved near service liftbridge to ['10.36.129.20:9292']
201

[root@db noc]#  ./noc liftbridge delete-stream --name __cursors

reload node1 to recreate the stream

[root@db noc]# ./noc liftbridge fetch-cursor --name cursor-test --stream test-stream1
2021-11-20 13:53:52,859 [noc.core.dcs.base] Resolved near service liftbridge to ['10.36.129.20:9292']
201

[root@mstage-db noc]# ./noc liftbridge subscribe --name __cursors | grep test
^C
[root@mstage-db noc]# 

The cursor is no longer in the stream, but the cursor offset is still returned by the query. I think if you trigger a soft leader transfer, you can see the previous cursor value.

tylertreat commented 2 years ago

I'm thinking we need to clear the cache on cursor leader failover.

tylertreat commented 2 years ago

Fixed in #395.