rancher / rke2

https://docs.rke2.io/
Apache License 2.0

RKE2 fails to start using NATS with Kine #6186

Open sdemura opened 1 month ago

sdemura commented 1 month ago

Environmental Info:

RKE2 Version:

>rke2 --version
rke2 version v1.28.9+rke2r1 (07bf87f9118c1386fa73f660142cc28b5bef1886)
go version go1.21.9 X:boringcrypto

Node(s) CPU architecture, OS, and Version:

> uname -a
Linux jammy-01 5.15.0-107-generic #117-Ubuntu SMP Fri Apr 26 12:26:49 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

single node

Describe the bug:

After seeing https://nats.io/blog/exploring-nats-as-a-backend-for-k3s/ I was hopeful this would work with the new Kine support, but it appears it doesn't.

Steps To Reproduce:

Run NATS externally:

> ./nats-server -js
[4892] 2024/06/12 17:55:32.196742 [INF] Starting nats-server
[4892] 2024/06/12 17:55:32.196803 [INF]   Version:  2.10.14
[4892] 2024/06/12 17:55:32.196804 [INF]   Git:      [31af767]
[4892] 2024/06/12 17:55:32.196810 [INF]   Name:     NCOCM4TEGBQHNZQGEIUJLBXICIOEVNGN4KP5IWJMJCAFVPJ4V3DV4VZY
[4892] 2024/06/12 17:55:32.196813 [INF]   Node:     VAUHWSkw
[4892] 2024/06/12 17:55:32.196816 [INF]   ID:       NCOCM4TEGBQHNZQGEIUJLBXICIOEVNGN4KP5IWJMJCAFVPJ4V3DV4VZY
[4892] 2024/06/12 17:55:32.197008 [INF] Starting JetStream
[4892] 2024/06/12 17:55:32.197093 [INF]     _ ___ _____ ___ _____ ___ ___   _   __  __
[4892] 2024/06/12 17:55:32.197097 [INF]  _ | | __|_   _/ __|_   _| _ \ __| /_\ |  \/  |
[4892] 2024/06/12 17:55:32.197098 [INF] | || | _|  | | \__ \ | | |   / _| / _ \| |\/| |
[4892] 2024/06/12 17:55:32.197099 [INF]  \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_|  |_|
[4892] 2024/06/12 17:55:32.197100 [INF]
[4892] 2024/06/12 17:55:32.197101 [INF]          https://docs.nats.io/jetstream
[4892] 2024/06/12 17:55:32.197102 [INF]
[4892] 2024/06/12 17:55:32.197103 [INF] ---------------- JETSTREAM ----------------
[4892] 2024/06/12 17:55:32.197107 [INF]   Max Memory:      8.71 GB
[4892] 2024/06/12 17:55:32.197109 [INF]   Max Storage:     54.67 GB
[4892] 2024/06/12 17:55:32.197111 [INF]   Store Directory: "/tmp/nats/jetstream"
[4892] 2024/06/12 17:55:32.197112 [INF] -------------------------------------------
[4892] 2024/06/12 17:55:32.197324 [INF]   Starting restore for stream '$G > KV_kine'
[4892] 2024/06/12 17:55:32.197440 [INF]   Restored 1 messages for stream '$G > KV_kine' in 0s
[4892] 2024/06/12 17:55:32.197519 [INF] Listening for client connections on 0.0.0.0:4222
[4892] 2024/06/12 17:55:32.197598 [INF] Server is ready

Configure rke2 to use the external NATS server and explicitly set noEmbed:

> grep datastore-endpoint /etc/rancher/rke2/config.yaml
datastore-endpoint: nats://?noEmbed
> sudo rke2 server --debug
WARN[0000] not running in CIS mode
INFO[0000] Applying Pod Security Admission Configuration
INFO[0000] Starting rke2 v1.28.9+rke2r1 (07bf87f9118c1386fa73f660142cc28b5bef1886)
INFO[0000] Starting temporary kine to reconcile with datastore
DEBU[0000] using config &nats.Config{clientURL:"nats://localhost:4222", clientOptions:[]nats.Option(nil), revHistory:0xa, bucket:"kine", replicas:1, slowThreshold:500000000, noEmbed:false, dontListen:false, serverConfig:"", stdoutLogging:false, host:"localhost", port:4222, dataDir:""}
INFO[0000] connecting to nats://localhost:4222
INFO[0000] using bucket: kine
INFO[0000] bucket initialized: kine
INFO[0000] Kine available at unix://kine.sock
ERRO[0001] btree watcher error: context canceled
INFO[0001] generated self-signed CA certificate CN=rke2-client-ca@1718216055: notBefore=2024-06-12 18:14:15.144914536 +0000 UTC notAfter=2034-06-10 18:14:15.144914536 +0000 UTC
INFO[0001] certificate CN=system:admin,O=system:masters signed by CN=rke2-client-ca@1718216055: notBefore=2024-06-12 18:14:15 +0000 UTC notAfter=2025-06-12 18:14:15 +0000 UTC
INFO[0001] certificate CN=system:rke2-supervisor,O=system:masters signed by CN=rke2-client-ca@1718216055: notBefore=2024-06-12 18:14:15 +0000 UTC notAfter=2025-06-12 18:14:15 +0000 UTC
INFO[0001] certificate CN=system:kube-controller-manager signed by CN=rke2-client-ca@1718216055: notBefore=2024-06-12 18:14:15 +0000 UTC notAfter=2025-06-12 18:14:15 +0000 UTC
INFO[0001] certificate CN=system:kube-scheduler signed by CN=rke2-client-ca@1718216055: notBefore=2024-06-12 18:14:15 +0000 UTC notAfter=2025-06-12 18:14:15 +0000 UTC
INFO[0001] certificate CN=system:apiserver,O=system:masters signed by CN=rke2-client-ca@1718216055: notBefore=2024-06-12 18:14:15 +0000 UTC notAfter=2025-06-12 18:14:15 +0000 UTC
INFO[0001] certificate CN=system:kube-proxy signed by CN=rke2-client-ca@1718216055: notBefore=2024-06-12 18:14:15 +0000 UTC notAfter=2025-06-12 18:14:15 +0000 UTC
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x1fe6e6e]

goroutine 217 [running]:
github.com/k3s-io/kine/pkg/drivers/nats.(*KeyValue).btreeWatcher(0xc000cfec00, {0x46a4638, 0xc000cf6f00})
        /go/pkg/mod/github.com/k3s-io/kine@v0.11.7/pkg/drivers/nats/kv.go:324 +0xee
github.com/k3s-io/kine/pkg/drivers/nats.NewKeyValue.func1()
        /go/pkg/mod/github.com/k3s-io/kine@v0.11.7/pkg/drivers/nats/kv.go:521 +0x65
created by github.com/k3s-io/kine/pkg/drivers/nats.NewKeyValue in goroutine 1
        /go/pkg/mod/github.com/k3s-io/kine@v0.11.7/pkg/drivers/nats/kv.go:517 +0x146

It appears that rke2 is ignoring the NATS parameter: I'd expect noEmbed to be true here, but the config shows noEmbed:false:

DEBU[0000] using config &nats.Config{clientURL:"nats://", clientOptions:[]nats.Option(nil), revHistory:0xa, bucket:"kine", replicas:1, slowThreshold:500000000, noEmbed:false, dontListen:false, serverConfig:"", 

Also wondering if it has to do with RKE2's use of unixs:// instead of unix:// for the Kine socket.
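
To make the expectation above concrete, here is a minimal Go sketch of how a key-only query parameter such as `noEmbed` can be detected in an endpoint like `nats://?noEmbed` using the standard `net/url` package. The `hasFlag` helper is hypothetical and written only for illustration; it is not taken from kine or rke2, and the real parsing code may behave differently.

```go
package main

import (
	"fmt"
	"net/url"
)

// hasFlag is a hypothetical helper (not kine's code). It reports whether
// the named query parameter is present in the endpoint, with or without
// a value, e.g. "?noEmbed" or "?noEmbed=true".
func hasFlag(endpoint, name string) (bool, error) {
	u, err := url.Parse(endpoint)
	if err != nil {
		return false, err
	}
	// Values.Has only checks for the key, so a bare "?noEmbed" counts.
	return u.Query().Has(name), nil
}

func main() {
	for _, ep := range []string{"nats://?noEmbed", "nats://localhost:4222"} {
		ok, err := hasFlag(ep, "noEmbed")
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("%s noEmbed=%v\n", ep, ok)
	}
}
```

If the endpoint string reached the driver unchanged, a check like this would report the flag as present for `nats://?noEmbed`, which is why `noEmbed:false` in the debug output looks like the parameter was dropped or rewritten somewhere before the driver saw it.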

brandond commented 1 month ago

ERRO[0001] btree watcher error: context canceled

Not sure what that's about...

@bruth are you available to take a look at this?

bruth commented 1 month ago

Indeed, will check it out.

brandond commented 1 month ago

I will say that rke2 and k3s do something goofy when TLS is enabled, which is inherited from etcd. They start up once using a plaintext listener to extract encrypted bootstrap data that includes the CA certs, then shut down and start up again using the configured certs extracted the first time. That message is probably related to that, but I don't know if it's also related to the failure to properly configure the nats client. I'm not sure we've actually tested anything except sqlite in rke2 yet.