k3s-io / kine

Run Kubernetes on MySQL, Postgres, sqlite, dqlite, not etcd.
Apache License 2.0

"TTL event watch failed to get start revision" #289

Open tamalsaha opened 3 months ago

tamalsaha commented 3 months ago

I have been running a k8s extension apiserver whose data is stored in kine sqlite, with kine running as a sidecar. After running nonstop for 8 hours, kine failed, all data is gone, and the only error I see in the kine logs is "TTL event watch failed to get start revision". What might be the cause of this issue? I have generally seen kine result in these types of stability issues before. Any help will be appreciated.

brandond commented 3 months ago

> all data is gone

It sounds like you forgot to put the database file for the kine sidecar on a real volume, and it was instead stored on tmpfs or simply within the container filesystem and was discarded when the sidecar restarted. You've not provided any logs so I can't say why it failed, but I can recommend that you keep the database file on a volume next time, if you want it to persist across container restarts.
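One rough way to see which filesystem a database file actually lives on is to walk up from its directory to the nearest mount point. The sketch below uses only the Python standard library; the helper name `mount_point_of` and the example path are illustrative assumptions, not part of kine. Run inside the sidecar container, a result of `/` would suggest the file sits on the container filesystem rather than on a mounted volume.

```python
import os


def mount_point_of(path):
    """Return the nearest mount point at or above `path`.

    Hypothetical helper for illustration; not part of kine.
    """
    path = os.path.realpath(path)
    # Walk up the directory tree until we hit a mount point.
    # The filesystem root is always a mount point, so this terminates.
    while not os.path.ismount(path):
        path = os.path.dirname(path)
    return path


if __name__ == "__main__":
    # "/var/data" is the volume path mentioned in this thread; adjust as needed.
    print(mount_point_of("/var/data"))
```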

> I have generally seen kine result in these types of stability issues before.

We have not seen stability issues like this. Kine is used daily by thousands of k3s users without issue.

tamalsaha commented 3 months ago

@brandond, thanks for the quick response.

I believe I am keeping the data inside a PVC (mounted at /var/data). You can see my Helm chart here: https://github.com/kubeops/installer/blob/master/charts/scanner/templates/statefulset.yaml#L192-L211

When this happened, I tried to recover by restarting the kine pod. That did not fix the error. I had to stop the kine pod, delete the PVC, create a fresh PVC, and restart the kine pod to get everything back online. Obviously the data was lost. I am using DigitalOcean's Kubernetes service, so the PVC was backed by their cloud storage. Not sure if that helps. From the looks of it, the sqlite.db file got corrupted somehow and a fresh start was the only solution.
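If corruption is suspected, the database file can be checked before the PVC is deleted. The following is a minimal sketch using Python's standard `sqlite3` module and SQLite's `PRAGMA integrity_check`. The function name and the example path `/var/data/sqlite.db` are assumptions for illustration, and it is probably safest to run this against a copy of the file while kine is stopped.

```python
import sqlite3


def check_sqlite_integrity(db_path):
    """Return the messages from PRAGMA integrity_check.

    A result of ["ok"] means SQLite found no corruption. Hypothetical
    helper for illustration; not part of kine.
    """
    # Open read-only through a URI so the check cannot modify the file.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    except sqlite3.DatabaseError as exc:
        # Raised, for example, when the file is not a valid database at all.
        return [str(exc)]
    finally:
        conn.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    # Assumed path; point this at a copy of the real database file.
    for message in check_sqlite_integrity("/var/data/sqlite.db"):
        print(message)
```

Any output other than a single `ok` line would be worth attaching to the report alongside the kine logs.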

brandond commented 3 months ago

That yaml would have been useful information to include in the original report.

Without actual logs from kine, it's hard to say what might have been going on. You haven't even included the full "TTL event watch failed to get start revision" error message; it should have included an error cause as part of the message.

If you can provide full logs, or steps to reproduce, please do so. Otherwise I'm liable to close this out due to insufficient information.