Closed. jrovira-kumori closed this 3 months ago.
Thanks for the contribution! Please sign-off your commit for DCO.
@bruth can you comment if this is the best way to do this?
Done!
fwiw the bootstrap key is internal to k3s, I'm not sure this is the correct thing to create in kine.
nats should already create a `/registry/health` key on startup. Do we just need a second write to that key to increment the revision to 1?
https://github.com/k3s-io/kine/blob/37736729c40f1e7ad18521d1eb5c2a13e314ebde/pkg/drivers/nats/backend.go#L116-L125
The `/registry/health` key is never added in the minimal repro or the other tests I have done. It has only worked when I was debugging the Kine process and added a breakpoint. There seems to be some kind of race condition which causes the error `Failed to create health check key: context deadline exceeded`.
Here is an example of the Kine logs from the repro.
level=debug msg="using config &nats.Config{clientURL:\"nats://nats:4222\", clientOptions:[]nats.Option(nil), revHistory:0xa, bucket:\"kine\", replicas:1, slowThreshold:500000000, noEmbed:true, dontListen:false, serverConfig:\"\", stdoutLogging:false, host:\"nats\", port:4222, dataDir:\"\"}"
level=info msg="connecting to nats://nats:4222"
level=info msg="using bucket: kine"
level=info msg="metrics server is starting to listen at :8080"
level=info msg="starting metrics server path /metrics"
level=info msg="bucket initialized: kine"
level=error msg="Failed to create health check key: context deadline exceeded"
level=info msg="Kine available at http://127.0.0.1:2379"
level=trace msg="LIST /registry/apiextensions.k8s.io/customresourcedefinitions/, start=/registry/apiextensions.k8s.io/customresourcedefinitions/, limit=10001, rev=0 => rev=0, kvs=0, err=<nil>, duration=2.334µs"
level=trace msg="LIST key=/registry/apiextensions.k8s.io/customresourcedefinitions/, end=/registry/apiextensions.k8s.io/customresourcedefinitions/, revision=0, currentRev=0 count=0, limit=10000"
level=trace msg="COUNT /registry/apiextensions.k8s.io/customresourcedefinitions/, rev=0 => rev=0, count=0, err=<nil>, duration=461ns"
level=trace msg="LIST COUNT key=/registry/apiextensions.k8s.io/customresourcedefinitions/, end=/registry/apiextensions.k8s.io/customresourcedefinitions/, revision=0, currentRev=0 count=0"
level=trace msg="COUNT /registry/events/, rev=0 => rev=0, count=0, err=<nil>, duration=592ns"
level=trace msg="LIST COUNT key=/registry/events/, end=/registry/events/, revision=0, currentRev=0 count=0"
level=trace msg="COUNT /registry/resourcequotas/, rev=0 => rev=0, count=0, err=<nil>, duration=290ns"
level=trace msg="LIST COUNT key=/registry/resourcequotas/, end=/registry/resourcequotas/, revision=0, currentRev=0 count=0"
level=trace msg="LIST /registry/resourcequotas/, start=/registry/resourcequotas/, limit=10001, rev=0 => rev=0, kvs=0, err=<nil>, duration=7.494µs"
level=trace msg="LIST key=/registry/resourcequotas/, end=/registry/resourcequotas/, revision=0, currentRev=0 count=0, limit=10000"
level=trace msg="COUNT /registry/secrets/, rev=0 => rev=0, count=0, err=<nil>, duration=310ns"
level=trace msg="LIST COUNT key=/registry/secrets/, end=/registry/secrets/, revision=0, currentRev=0 count=0"
> fwiw the bootstrap key is internal to k3s, I'm not sure this is the correct thing to create in kine.
The constant string `"bootstrap"` in the commits is meaningless. I believe any string would work here, as the calls are made to the nats-go KeyValue API directly. It should also have no effect, as the value is immediately deleted.
> level=error msg="Failed to create health check key: context deadline exceeded"
It sounds like the root cause here is that the health check key isn't getting created. Rather than creating and deleting another key, let's just fix the bug that's causing that key to not get created?
The `context deadline exceeded` error was coming from `github.com/nats-io/nats.go/jetstream@v1.31.0`. The `getMsg()` method wraps the context in a 5-second deadline when no other deadline is defined.
I was unable to identify why the method times out after the stream is created in NATS. Maybe someone with more in-depth knowledge of its inner workings can clarify. Anyhow, adding some retry logic to the `Start()` method seems to resolve the issues that were showing up.
I have removed the previous bootstrap key creation when initializing the NATS backend in favour of the retry logic.
@brandond closed the PR by mistake...
Will take a look today.
The retry does work, but I am digging into why, once the KV bucket is created, the first write seemingly times out. It may be that the bucket is actually not yet ready to receive writes, but that should not be the case so I will dig a bit deeper on it.
@bruth any update on this? I don't think we should fix this on the kine side if the issue is in nats.
The latest version of the NATS Go client appears to fix the issue. At least I can't reproduce the bug with it. I will open a PR to update the NATS deps for Kine.
I have tried the updated version from #281. From my testing, the issue still persists.
It can be reproduced easily with just `nats-server` (Docker) and the `kine` binary. In a separate terminal, run NATS:
docker run --rm --network host nats:2.10.11 -js -DVV
Then compile and run the updated `kine` version:
git clone https://github.com/nats-io/kine
cd kine
git switch update-nats-versions
go build .
./kine --debug --endpoint "nats://?noEmbed=true"
Kine will display the same error.
ERRO[2024-03-04T10:08:56.261135233+01:00] Failed to create health check key: context deadline exceeded
Full logs:
Any update on this? We are also facing a similar issue.
Found the root cause and updated my previous PR: https://github.com/k3s-io/kine/pull/281. @jrovira-kumori feel free to validate on your end.
I am happy to confirm that the issue has been resolved! I have tested it again with `kube-apiserver` (just in case) and it runs like a charm.
Thanks for the help!
Fix for #274.
I am guessing this issue did not appear with K3s because the first command it runs when initializing is a `CREATE /bootstrap/...`. Therefore, future `COUNT ...` calls would always return `!= 0`. This is not the case with the Kubernetes kube-apiserver, which reads before writing anything and gets stuck in a loop of `illegal resource version from storage: 0`. This PR fixes the issue with the NATS driver by incrementing the `BucketRevision` to 1, creating and deleting a bootstrap key if a `BucketRevision` of 0 is detected at startup.
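The create-then-delete bump described above can be sketched like this (a minimal illustration against a made-up `kv` interface and in-memory fake, not the real nats.go KeyValue types or kine's code):

```go
package main

import "fmt"

// kv is a tiny stand-in for the subset of a KeyValue API used here;
// the real driver talks to JetStream, this is only for illustration.
type kv interface {
	Revision() uint64 // current bucket revision (0 when the bucket is empty)
	Create(key string, value []byte) error
	Delete(key string) error
}

// bumpRevision sketches the fix: if the bucket revision is still 0 at
// startup, create and immediately delete a throwaway key so the
// revision becomes non-zero and reads no longer see resource version 0.
func bumpRevision(b kv) error {
	if b.Revision() != 0 {
		return nil
	}
	if err := b.Create("bootstrap", nil); err != nil {
		return err
	}
	return b.Delete("bootstrap")
}

// fakeKV is an in-memory stand-in used only to demonstrate the logic:
// every write operation advances the revision by one.
type fakeKV struct{ rev uint64 }

func (f *fakeKV) Revision() uint64            { return f.rev }
func (f *fakeKV) Create(string, []byte) error { f.rev++; return nil }
func (f *fakeKV) Delete(string) error         { f.rev++; return nil }

func main() {
	b := &fakeKV{}
	if err := bumpRevision(b); err != nil {
		panic(err)
	}
	// The create and delete each advance the revision, so it is now past 0.
	fmt.Println(b.Revision())
}
```

The key insight is that the key's content never matters: only the side effect of advancing the bucket revision past 0 does, which is why the value is deleted immediately.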