nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

NATS KeyValue Corruption #5644

Closed (jrovira-kumori closed 2 weeks ago)

jrovira-kumori commented 1 month ago

Observed behavior

JetStream sometimes does not recover correctly on its own after a forced termination. The original error was found after an OOM kill. Restarting the corrupted instance did not fix the issue; only removing its persistent volume and then restarting it did.

Expected behavior

After a forced termination of the stream leader, the KV bucket should always remain consistent.
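One way to make "consistent" concrete is to compare what each replica holds for the bucket. The following is a minimal sketch, not part of the repro: `kv_snapshots_match` is a hypothetical helper, and it assumes each replica's contents have already been dumped to a text file with one `key revision value` line per key. How those dumps are produced is outside the sketch.

```shell
# Hypothetical helper: succeed only if two replica snapshots hold the same entries.
# Assumed input format (not from the issue): one "key revision value" line per key.
kv_snapshots_match() {
  # Sort both snapshots so the order in which keys were dumped does not matter.
  sorted_a=$(LC_ALL=C sort "$1") || return 2
  sorted_b=$(LC_ALL=C sort "$2") || return 2
  # Any differing key, revision, or value makes the comparison fail.
  [ "$sorted_a" = "$sorted_b" ]
}
```

Calling `kv_snapshots_match replica1.txt replica2.txt` after the forced termination would then return a non-zero status if the two replicas disagree on any entry.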

Server and client version

Server: 2.10.17 Client: 0.1.4

Host environment

OS: Xubuntu 22.04.4 LTS x86_64
Kernel: 6.5.0-41-generic
CPU: AMD Ryzen 7 7730U with Radeon Graphics (16) @ 4.546GHz
GPU: AMD ATI 04:00.0 Barcelo
Memory: 31392MiB

Steps to reproduce

I uploaded a minimal reproducible example at jrovira-kumori/NATS-KV-Corruption. Requires docker and bash.

git clone https://github.com/jrovira-kumori/NATS-KV-Corruption
cd NATS-KV-Corruption
./run.sh

derekcollison commented 1 month ago

I'm on a plane at the moment, so it's hard to run your scripts and Docker over not-so-great wifi. I will once I'm back on the ground.

If possible, could you swap in the 2.10.18-RC2 image we have prepped and see if the issue is still there?

docker pull synadia/nats-server:2.10.18-RC.2-alpine3.20

jrovira-kumori commented 1 month ago

I have run the repro multiple times and the issue persists with synadia/nats-server:2.10.18-RC.2-alpine3.20. I also checked with nats:2.9.25-alpine with the same results.

jrovira-kumori commented 1 month ago

Any update on this issue?

Thank you for your time.

katrinwab commented 3 weeks ago

We have the same problem. Our last test was with nats-server:2.10.18.

bfoxstudio commented 3 weeks ago

We have the same problem. Our last test was with nats-server:2.10.18.

MauriceVanVeen commented 2 weeks ago

Have been using your reproducible example extensively in tracking down (part of) the issue. Huge kudos, @jrovira-kumori! :tada:

With the above example code and fix I'm not able to reproduce the issue anymore. However, we're continuing the investigation since not all cases of desynced replicas are solved. With the fix it should be more reliable though.

jrovira-kumori commented 1 week ago

I have tried the latest nightly synadia/nats-server:nightly-20240819 and it works like a charm!

Thank you very much for taking the time to deal with this issue.