liftbridge-io / liftbridge

Lightweight, fault-tolerant message streams.
https://liftbridge.io
Apache License 2.0

Liftbridge runs into unrecoverable issues after x days running in an Azure k8s cluster #243

Closed nfoerster closed 4 years ago

nfoerster commented 4 years ago

Liftbridge Version: 1.2.0

Hello,

this is the second time we have run into unrecoverable issues with the Liftbridge deployment in our k8s cluster. We have 3 NATS pods and 3 Liftbridge pods running:

NAME                             READY   STATUS             RESTARTS   AGE
nats-default-1               1/1     Running            0          28d
nats-default-2               1/1     Running            0          28d
nats-default-3               1/1     Running            0          28d
liftbridge-0                     0/1     CrashLoopBackOff   20         84m
liftbridge-1                     1/1     Running            20         78m
liftbridge-2                     0/1     CrashLoopBackOff   23         93m
nats-operator-669766fdfd-6jfq2   1/1     Running            1          28d

As shown above, only one pod is still running after the issue occurred, but that pod also fails to work properly. There are different errors in the logs, but the root cause seems to be a mismatch in the leader epochs:

time="2020-08-12 07:50:03" level=debug msg="raft: vote granted: from=cluster-liftbridge-1 term=99925 tally=1\n"
time="2020-08-12 07:50:04" level=error msg="] raft: failed to make requestVote RPC: target=\"{Voter cluster-liftbridge-0 cluster-liftbridge-0}\" error=\"nats: timeout\"\n"
time="2020-08-12 07:50:04" level=error msg="] raft: failed to make requestVote RPC: target=\"{Voter cluster-liftbridge-2 cluster-liftbridge-2}\" error=\"nats: timeout\"\n"
time="2020-08-12 07:50:05" level=warning msg=" raft: Election timeout reached, restarting election\n"
time="2020-08-12 07:50:05" level=info msg=" raft: entering candidate state: node=\"Node at cluster-liftbridge-1 [Candidate]\" term=99926\n"
time="2020-08-12 07:50:05" level=debug msg="raft: votes: needed=2\n"
time="2020-08-12 07:50:05" level=debug msg="raft: vote granted: from=cluster-liftbridge-1 term=99926 tally=1\n"
time="2020-08-12 07:50:05" level=error msg="] raft: failed to make requestVote RPC: target=\"{Voter cluster-liftbridge-0 cluster-liftbridge-0}\" error=\"nats: timeout\"\n"
time="2020-08-12 07:50:05" level=error msg="] raft: failed to make requestVote RPC: target=\"{Voter cluster-liftbridge-2 cluster-liftbridge-2}\" error=\"nats: timeout\"\n"
time="2020-08-12 07:50:06" level=warning msg=" raft: Election timeout reached, restarting election\n"
time="2020-08-12 07:50:06" level=info msg=" raft: entering candidate state: node=\"Node at cluster-liftbridge-1 [Candidate]\" term=99927\n"
time="2020-08-12 07:50:06" level=debug msg="raft: votes: needed=2\n"
time="2020-08-12 07:50:06" level=debug msg="raft: vote granted: from=cluster-liftbridge-0 term=99927 tally=1\n"
time="2020-08-12 07:50:06" level=debug msg="raft: vote granted: from=cluster-liftbridge-1 term=99927 tally=2\n"
time="2020-08-12 07:50:06" level=info msg=" raft: election won: tally=2\n"
time="2020-08-12 07:50:06" level=info msg=" raft: entering leader state: leader=\"Node at cluster-liftbridge-1 [Leader]\"\n"
time="2020-08-12 07:50:06" level=info msg=" raft: added peer, starting replication: peer=cluster-liftbridge-0\n"
time="2020-08-12 07:50:06" level=info msg=" raft: added peer, starting replication: peer=cluster-liftbridge-2\n"
time="2020-08-12 07:50:06" level=info msg="Server became metadata leader, performing leader promotion actions"
time="2020-08-12 07:50:06" level=info msg=" raft: pipelining replication: peer=\"{Voter cluster-liftbridge-0 cluster-liftbridge-0}\"\n"
time="2020-08-12 07:50:06" level=debug msg="fsm: Replaying Raft log..."
time="2020-08-12 07:50:06" level=debug msg="fsm: Created stream [name=mqtt, subject=mqtt, partitions=1]"
panic: failed to add stream to metadata store: failed to create commit log: failed to read leader epoch offsets file: duplicate leader epoch 5599

goroutine 72 [running]:
github.com/liftbridge-io/liftbridge/server.(*Server).Apply(0xc0002940e0, 0xc000314f00, 0x0, 0x0)
    /workspace/server/fsm.go:111 +0x407
github.com/hashicorp/raft.(*Raft).runFSM.func1(0xc000fdb4d0)
    /go/pkg/mod/github.com/hashicorp/raft@v1.1.2/fsm.go:90 +0x2aa
github.com/hashicorp/raft.(*Raft).runFSM.func2(0xc0006f6000, 0x40, 0x40)
    /go/pkg/mod/github.com/hashicorp/raft@v1.1.2/fsm.go:113 +0x75
github.com/hashicorp/raft.(*Raft).runFSM(0xc000490000)
    /go/pkg/mod/github.com/hashicorp/raft@v1.1.2/fsm.go:219 +0x3a9
github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc000490000, 0xc00031a8c0)
    /go/pkg/mod/github.com/hashicorp/raft@v1.1.2/state.go:146 +0x5d
created by github.com/hashicorp/raft.(*raftState).goFunc
    /go/pkg/mod/github.com/hashicorp/raft@v1.1.2/state.go:144 +0x66

So far this issue has been unrecoverable. Every pod has its own persistent volume claim, storing its Raft and stream data persistently. If we restart a broken pod (instance 0 or 1), the other currently running pod crashes. However, the third pod (instance 2) crashes immediately with a different error.

This is our liftbridge configuration:

liftbridge.yaml: |
    listen: 0.0.0.0:9292
    host: liftbridge-clients.nats
    port: 9292

    logging:
      level: debug
      recovery: true
      raft: true

    nats.servers:
      - "nats://nats-default.nats:4222"

    streams:
      retention.max:
        bytes: 20937965568
    clustering:
      min.insync.replicas: 2

The logs are attached:

logs-from-liftbridge-in-liftbridge-2 (1).txt
logs-from-liftbridge-in-liftbridge-2.txt
logs-from-liftbridge-in-liftbridge-1 (1).txt
logs-from-liftbridge-in-liftbridge-1.txt
logs-from-liftbridge-in-liftbridge-0 (1).txt
logs-from-liftbridge-in-liftbridge-0.txt

Do you have any clue about the error or how to recover the deployment? Thank you in advance.

LaPetiteSouris commented 4 years ago

Just out of curiosity, do you use any persistent volume mounts for the Liftbridge pods?

nfoerster commented 4 years ago

Yes we do, every pod has a persistent volume claim in Azure (screenshot attached).

LaPetiteSouris commented 4 years ago

Yes we do, every pod has a persistent volume claim in Azure (screenshot attached).

Is there any chance we can take a look inside the data-dir folder? In your configuration I do not see it specified in the mount, so I presume it is at the default location "/tmp/liftbridge/<namespace>".

I guess it may help to check out the content inside.

nfoerster commented 4 years ago

Yes, we can look inside. In /data we see the folder raft, containing raft.db, and the folder streams, containing the stream data. Which file are you interested in? (screenshot attached)

tylertreat commented 4 years ago

Sorry that you're experiencing this. Can you provide the contents of the leader-epoch-checkpoint file for the particular stream partition on the node that is crashing? It will be in the partition data directory.

LaPetiteSouris commented 4 years ago

Yes, we can look inside. In /data we see the folder raft, containing raft.db, and the folder streams, containing the stream data. Which file are you interested in?

Judging from your logs, it looks like it is tmp/liftbridge/<namespace>/streams/mqtt/1, as your stream is mqtt and on partition 1.

nfoerster commented 4 years ago

Yes, we can look inside. In /data we see the folder raft, containing raft.db, and the folder streams, containing the stream data. Which file are you interested in?

Judging from your logs, it looks like it is tmp/liftbridge/<namespace>/streams/mqtt/1, as your stream is mqtt and on partition 1.

Hmm, there should be no partition 1 in any of the streams; all data is partitioned to 0.

Sorry that you're experiencing this. Can you provide the contents of the leader-epoch-checkpoint file for the particular stream partition on the node that is crashing? It will be in the partition data directory.

I found the problematic lines by searching for the number 5585 in the file for the stream stream_meters15:

0
3
5585 3965
120 3085
5585 4425

Files are attached.

I changed the names of the files before attaching them here to <streamname>_<podnumber>_leader-epoch-checkpoint.txt and <streamname>_<podnumber>_replication-offset-checkpoint.txt.

mqtt_0_leader-epoch-checkpoint.txt
mqtt_0_replication-offset-checkpoint.txt
stream_meters15_0_leader-epoch-checkpoint.txt
stream_meters15_0_replication-offset-checkpoint.txt
mqtt_1_leader-epoch-checkpoint.txt
mqtt_1_replication-offset-checkpoint.txt
stream_meters15_1_leader-epoch-checkpoint.txt
stream_meters15_1_replication-offset-checkpoint.txt
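
For reference, the leader-epoch-checkpoint contents above appear to consist of a format-version line, an entry-count line, and then one "epoch offset" pair per line; that layout is inferred from the pasted file, not from the Liftbridge source. A small throwaway checker along the following lines can flag entries that break the expected ordering (the file-path argument and file name are hypothetical):

// checkepochs.go: an offline sanity check for a leader-epoch-checkpoint file,
// assuming the layout described above (this is not the Liftbridge reader itself).
// Hypothetical usage: go run checkepochs.go <path-to-leader-epoch-checkpoint>
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	f, err := os.Open(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	lineNo := 0
	maxEpoch, maxOffset := int64(-1), int64(-1)
	for scanner.Scan() {
		lineNo++
		// Skip the first two lines, which appear to be a format version and an entry count.
		if lineNo <= 2 {
			continue
		}
		fields := strings.Fields(scanner.Text())
		if len(fields) != 2 {
			continue
		}
		epoch, errE := strconv.ParseInt(fields[0], 10, 64)
		offset, errO := strconv.ParseInt(fields[1], 10, 64)
		if errE != nil || errO != nil {
			continue
		}
		// Every entry should carry a strictly greater epoch and offset than any entry before it.
		if epoch <= maxEpoch || offset <= maxOffset {
			fmt.Printf("line %d: entry {epoch:%d, offset:%d} does not increase over {epoch:%d, offset:%d}\n",
				lineNo, epoch, offset, maxEpoch, maxOffset)
		}
		if epoch > maxEpoch {
			maxEpoch = epoch
		}
		if offset > maxOffset {
			maxOffset = offset
		}
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}

Run against the stream_meters15 checkpoint shown above, this flags both the out-of-order 120 3085 entry and the repeated epoch 5585.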

tylertreat commented 4 years ago

@nfoerster Thanks for providing the epoch checkpoint contents. Admittedly, this is a strange problem. The issue is due to the duplicate entry for the leader epoch 5585. The bigger issue though is the 120 3085 entry in between. That generally should not be possible since we only allow adding entries with greater epoch and offset, but it's possible there is a bug that is allowing this entry to be made.
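
For context, that invariant amounts to roughly the following guard when a new entry is added to the leader epoch cache; this is a minimal sketch under that assumption, not the actual Liftbridge implementation, and the type and function names are made up:

package epochcache

import "fmt"

// leaderEpochOffset pairs a leader epoch with the start offset of the first
// message written under that epoch (illustrative type).
type leaderEpochOffset struct {
	epoch  int64
	offset int64
}

// assign appends a new entry only if both its epoch and its offset are strictly
// greater than those of the latest cached entry, which is the ordering a
// well-formed leader-epoch-checkpoint file has to preserve.
func assign(cache []leaderEpochOffset, epoch, offset int64) ([]leaderEpochOffset, error) {
	if n := len(cache); n > 0 {
		last := cache[n-1]
		if epoch <= last.epoch || offset <= last.offset {
			return cache, fmt.Errorf(
				"rejected entry {epoch:%d, offset:%d}: not greater than {epoch:%d, offset:%d}",
				epoch, offset, last.epoch, last.offset)
		}
	}
	return append(cache, leaderEpochOffset{epoch: epoch, offset: offset}), nil
}

With a guard like that in place, the 120 3085 entry should never have been appended, which is why a bug somewhere in that path (or in how the file is rewritten) is the likely suspect.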

Since you have debug logs enabled, there should be logs indicating these epoch entries. They start with Updated log leader epoch. For example:

DEBU[2020-08-13 10:56:37] Updated log leader epoch. New: {epoch:5, offset:-1}, Previous: {epoch:0, offset:-1} for log [subject=foo, stream=foo-stream, partition=0]. Cache now contains 1 entry.

Do you see these logs on the nodes that crash leading up to the crash?

tylertreat commented 4 years ago

FYI, I did make a small fix after reviewing the leader epoch caching code (https://github.com/liftbridge-io/liftbridge/pull/245). I'm not 100% certain this will fix the issue you're seeing without more information, but if you're able, it would be worth a try. To get the cluster into a working state, you'll need to delete the leader-epoch-checkpoint files on the failed nodes.

nfoerster commented 4 years ago

Do you see these logs on the nodes that crash leading up to the crash?

Unfortunately not; that's a big issue.

Thank you very much for the error description and the supplied patch. We will integrate the patch and also store all logs at debug level, so if the issue occurs again, we can investigate further. If that happens, we will reopen the issue.