Closed: nfoerster closed this issue 4 years ago
Just out of curiosity, do you use any persistent volume mounts for the Liftbridge pods?
Yes we do, every pod has a persistent volume claim in Azure:
Is there any chance that we can take a look inside the data-dir folder? In your configuration I do not see that you specified it in the mount, so I presume it defaults to "/tmp/liftbridge/<namespace>". I guess it may help to check out the contents inside.
Yes, we can look inside. In /data we see the folders raft (containing raft.db) and streams (containing the stream data). Which file are you interested in?
Sorry that you're experiencing this. Can you provide the contents of the leader-epoch-checkpoint
file for the particular stream partition on the node that is crashing? It will be in the partition data directory.
Judging from your logs, it looks like tmp/liftbridge/<namespace>/streams/mqtt/1, as your stream is mqtt and it is on partition 1.
Hmm, there should be no partition 1 in any of the streams; all data is written to partition 0.
I found the problematic lines by searching for the number 5585 in the leader-epoch-checkpoint file for stream stream_meters15:
0
3
5585 3965
120 3085
5585 4425
Files are attached.
I renamed the files before attaching them here to <streamname>_<podnumber>_leader-epoch-checkpoint.txt and <streamname>_<podnumber>_replication-offset-checkpoint.txt.
mqtt_0_leader-epoch-checkpoint.txt mqtt_0_replication-offset-checkpoint.txt stream_meters15_0_leader-epoch-checkpoint.txt stream_meters15_0_replication-offset-checkpoint.txt mqtt_1_leader-epoch-checkpoint.txt mqtt_1_replication-offset-checkpoint.txt stream_meters15_1_leader-epoch-checkpoint.txt stream_meters15_1_replication-offset-checkpoint.txt
@nfoerster Thanks for providing the epoch checkpoint contents. Admittedly, this is a strange problem. The issue is due to the duplicate entry for leader epoch 5585. The bigger issue, though, is the 120 3085 entry in between. That generally should not be possible, since we only allow adding entries with a greater epoch and offset, but it's possible there is a bug that is allowing this entry to be made.
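For reference, here is a minimal Go sketch (not Liftbridge's actual implementation) that reads a checkpoint file and flags entries whose epoch or offset does not strictly increase. The layout it assumes (a version line, an entry-count line, then one "epoch offset" pair per line) is inferred from the contents pasted above.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: checkpoint-check <leader-epoch-checkpoint>")
		os.Exit(1)
	}
	f, err := os.Open(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	line := 0
	maxEpoch, maxOffset := int64(-1), int64(-1)
	for scanner.Scan() {
		line++
		// Assumed layout: line 1 is a version, line 2 is the entry count,
		// every following line is an "epoch offset" pair.
		if line <= 2 {
			continue
		}
		var epoch, offset int64
		if _, err := fmt.Sscanf(scanner.Text(), "%d %d", &epoch, &offset); err != nil {
			fmt.Printf("line %d: cannot parse %q: %v\n", line, scanner.Text(), err)
			continue
		}
		// Each entry should have a strictly greater epoch and offset than
		// every entry before it; anything else is suspect.
		if epoch <= maxEpoch || offset <= maxOffset {
			fmt.Printf("line %d: suspect entry {epoch:%d, offset:%d} after {epoch:%d, offset:%d}\n",
				line, epoch, offset, maxEpoch, maxOffset)
			continue
		}
		maxEpoch, maxOffset = epoch, offset
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

Run against the stream_meters15 file above, this would flag both the 120 3085 entry and the repeated epoch 5585.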
Since you have debug logs enabled, there should be logs indicating these epoch entries. They start with Updated log leader epoch.
For example:
DEBU[2020-08-13 10:56:37] Updated log leader epoch. New: {epoch:5, offset:-1}, Previous: {epoch:0, offset:-1} for log [subject=foo, stream=foo-stream, partition=0]. Cache now contains 1 entry.
Do you see these logs on the nodes that crash leading up to the crash?
FYI, I did make a small fix after reviewing the leader epoch caching code (https://github.com/liftbridge-io/liftbridge/pull/245). I'm not 100% certain this will fix the issue you're seeing without more information, but if you're able, it would be worth a try. To get the cluster into a working state, you'll need to delete the leader-epoch-checkpoint
files on the failed nodes.
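In case it helps, here is a minimal sketch of that cleanup step; it is not an official tool, and the /data path and the idea of running it only while the node is stopped are assumptions based on this thread:

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

func main() {
	// Adjust to wherever the data-dir volume is mounted (e.g. /data).
	dataDir := "/data"
	err := filepath.WalkDir(dataDir, func(path string, d fs.DirEntry, walkErr error) error {
		if walkErr != nil {
			return walkErr
		}
		// Remove every leader-epoch-checkpoint file under the data dir,
		// as suggested above. Run only while the node is stopped.
		if !d.IsDir() && d.Name() == "leader-epoch-checkpoint" {
			fmt.Println("removing", path)
			return os.Remove(path)
		}
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```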
Do you see these logs on the nodes that crash leading up to the crash?
Unfortunately not, that's a big issue.
Thank you very much for the error description and the supplied patch. We will integrate the patch and also store all logs at debug level, so if the issue occurs again we can investigate further. If that happens, we will reopen the issue.
Liftbridge Version: 1.2.0
Hello,
this is the second time we have run into unrecoverable issues with the Liftbridge deployment in our k8s cluster. We have 3 NATS pods and 3 Liftbridge pods running:
As shown above, only one pod is still running after the issue occurred, and the other two pods in the cluster also fail to work. There are different errors in the logs, but the root cause seems to be an epoch selection mismatch:
So far this issue has been unrecoverable. Each pod has its own persistent volume claim, storing its Raft and stream data persistently. If you restart a broken pod (instance 0 or 1), the other currently running pod will crash. The third pod (instance 2), however, crashes directly with a different error.
This is our Liftbridge configuration:
The logs are attached. logs-from-liftbridge-in-liftbridge-2 (1).txt logs-from-liftbridge-in-liftbridge-2.txt logs-from-liftbridge-in-liftbridge-1 (1).txt logs-from-liftbridge-in-liftbridge-1.txt logs-from-liftbridge-in-liftbridge-0 (1).txt logs-from-liftbridge-in-liftbridge-0.txt
Do you have any clue about the error or how to recover the deployment? Thank you in advance.