Kampe opened this issue 2 years ago

kubectl describe output:

statefulset:
Name: nats-edge
Namespace: default
CreationTimestamp: Mon, 18 Apr 2022 14:58:09 -0500
Selector: app=nats-edge
Labels: app=nats-edge
app.kubernetes.io/instance=messaging
Annotations: <none>
Replicas: 3 desired | 3 total
Update Strategy: RollingUpdate
Partition: 0
Pods Status: 3 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: app=nats-edge
Annotations: kubectl.kubernetes.io/restartedAt: 2022-04-19T15:06:18-05:00
Service Account: nats-server
Containers:
nats:
Image: nats:2.7.4
Ports: 4222/TCP, 7422/TCP, 6222/TCP, 8222/TCP, 7777/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
Command:
/nats-server
--config
/etc/nats-config/nats.conf
Requests:
cpu: 0
Liveness: http-get http://:8222/ delay=10s timeout=5s period=10s #success=1 #failure=3
Readiness: http-get http://:8222/ delay=10s timeout=5s period=10s #success=1 #failure=3
Environment:
POD_NAME: (v1:metadata.name)
POD_NAMESPACE: (v1:metadata.namespace)
CLUSTER_ADVERTISE: $(POD_NAME).nats-edge.$(POD_NAMESPACE).svc
LEAFNODE_TOKEN: <set to the key 'token' in secret 'nats-creds'> Optional: false
Mounts:
/data/jetstream from nats-js-sts-vol (rw)
/etc/nats-config from config-volume (rw)
/var/run/nats from pid (rw)
reloader:
Image: natsio/nats-server-config-reloader:0.6.3
Port: <none>
Host Port: <none>
Command:
nats-server-config-reloader
-pid
/var/run/nats/nats.pid
-config
/etc/nats-config/nats.conf
Environment: <none>
Mounts:
/etc/nats-config from config-volume (rw)
/var/run/nats from pid (rw)
metrics:
Image: natsio/prometheus-nats-exporter:0.9.1
Port: 7777/TCP
Host Port: 0/TCP
Args:
-connz
-routez
-serverz
-subz
-gatewayz
-varz
-channelz
-prefix=nats
-use_internal_server_id
-DV
http://localhost:8222
Environment: <none>
Mounts: <none>
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: nats-edge-config
Optional: false
pid:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
Volume Claims:
Name: nats-js-sts-vol
StorageClass:
Labels: <none>
Annotations: <none>
Capacity: 2Gi
Access Modes: [ReadWriteOnce]
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulDelete 2m43s statefulset-controller delete Pod nats-edge-2 in StatefulSet nats-edge successful
Normal SuccessfulCreate 2m41s (x2 over 24h) statefulset-controller create Pod nats-edge-2 in StatefulSet nats-edge successful
Normal SuccessfulDelete 2m21s statefulset-controller delete Pod nats-edge-1 in StatefulSet nats-edge successful
Normal SuccessfulCreate 2m11s (x2 over 24h) statefulset-controller create Pod nats-edge-1 in StatefulSet nats-edge successful
pod:
Name: nats-edge-1
Namespace: default
Priority: 0
Node: k3s-leader/192.168.1.139
Start Time: Tue, 19 Apr 2022 15:06:50 -0500
Labels: app=nats-edge
controller-revision-hash=nats-edge-5bb9f677cd
statefulset.kubernetes.io/pod-name=nats-edge-1
Annotations: kubectl.kubernetes.io/restartedAt: 2022-04-19T15:06:18-05:00
Status: Running
IP: 10.42.0.170
IPs:
IP: 10.42.0.170
Controlled By: StatefulSet/nats-edge
Containers:
nats:
Container ID: containerd://be6c774e089d619efffd07358ffc31e00895731e18d83ff41339be12704e24f1
Image: nats:2.7.4
Image ID: docker.io/library/nats@sha256:6f86d5bac0f78b01cf5f9f707008403b0792ff5029817f8afc23c895c5215cc6
Ports: 4222/TCP, 7422/TCP, 6222/TCP, 8222/TCP, 7777/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
Command:
/nats-server
--config
/etc/nats-config/nats.conf
State: Running
Started: Tue, 19 Apr 2022 15:06:51 -0500
Ready: True
Restart Count: 0
Requests:
cpu: 0
Liveness: http-get http://:8222/ delay=10s timeout=5s period=10s #success=1 #failure=3
Readiness: http-get http://:8222/ delay=10s timeout=5s period=10s #success=1 #failure=3
Environment:
POD_NAME: nats-edge-1 (v1:metadata.name)
POD_NAMESPACE: default (v1:metadata.namespace)
CLUSTER_ADVERTISE: $(POD_NAME).nats-edge.$(POD_NAMESPACE).svc
LEAFNODE_TOKEN: <set to the key 'token' in secret 'nats-creds'> Optional: false
Mounts:
/data/jetstream from nats-js-sts-vol (rw)
/etc/nats-config from config-volume (rw)
/var/run/nats from pid (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8ccx9 (ro)
reloader:
Container ID: containerd://6a4cfc309b6425d2d21dc26411caf509cc77b8191d7d3d258eb6114efacccfc9
Image: natsio/nats-server-config-reloader:0.6.3
Image ID: docker.io/natsio/nats-server-config-reloader@sha256:b310e8766c507a20dae7c634b81b49b47682e2ec985c0cd8b3c746660febb9fe
Port: <none>
Host Port: <none>
Command:
nats-server-config-reloader
-pid
/var/run/nats/nats.pid
-config
/etc/nats-config/nats.conf
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Tue, 19 Apr 2022 15:12:37 -0500
Finished: Tue, 19 Apr 2022 15:12:37 -0500
Ready: False
Restart Count: 6
Environment: <none>
Mounts:
/etc/nats-config from config-volume (rw)
/var/run/nats from pid (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8ccx9 (ro)
metrics:
Container ID: containerd://b31489fbcbe29bb6daa6ac9b42c2bb7ad3961de4a54de7af9c59cc40cd57f8e5
Image: natsio/prometheus-nats-exporter:0.9.1
Image ID: docker.io/natsio/prometheus-nats-exporter@sha256:4665cdc7e749548ee9999764f29665aaeeecac95f310319016764ba2ab25e1c1
Port: 7777/TCP
Host Port: 0/TCP
Args:
-connz
-routez
-serverz
-subz
-gatewayz
-varz
-channelz
-prefix=nats
-use_internal_server_id
-DV
http://localhost:8222
State: Running
Started: Tue, 19 Apr 2022 15:06:51 -0500
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8ccx9 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
nats-js-sts-vol:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: nats-js-sts-vol-nats-edge-1
ReadOnly: false
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: nats-edge-config
Optional: false
pid:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-8ccx9:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 7m30s default-scheduler Successfully assigned default/nats-edge-1 to k3s-leader
Normal Pulled 7m29s kubelet Container image "nats:2.7.4" already present on machine
Normal Created 7m29s kubelet Created container nats
Normal Started 7m29s kubelet Started container nats
Normal Pulled 7m29s kubelet Container image "natsio/prometheus-nats-exporter:0.9.1" already present on machine
Normal Created 7m29s kubelet Created container metrics
Normal Started 7m29s kubelet Started container metrics
Normal Pulled 6m43s (x4 over 7m29s) kubelet Container image "natsio/nats-server-config-reloader:0.6.3" already present on machine
Normal Created 6m43s (x4 over 7m29s) kubelet Created container reloader
Normal Started 6m43s (x4 over 7m29s) kubelet Started container reloader
Warning BackOff 2m24s (x26 over 7m28s) kubelet Back-off restarting failed container
Thanks @Kampe for the report, looking into this...
Excellent, thank you!
More interestingly, sometimes one or two pods of the StatefulSet manage to come up:
nats-edge-0 3/3 Running 0 24h
nats-edge-1 2/3 CrashLoopBackOff 7 15m
nats-edge-2 3/3 Running 0 15m
Hi @Kampe would it be possible to try setting the following in your Helm chart to check if this fixes it for you?
reloader:
enabled: true
image: wallyqs/nats-server-config-reloader:fclose
We've had great preliminary success with this image in a few environments!
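For anyone applying the suggestion above, a minimal sketch of the override via `helm upgrade` (the release name `messaging` is a guess from the `app.kubernetes.io/instance` label in the describe output; adjust to your install):

```shell
# Hypothetical release name; points the reloader sidecar at the
# test image suggested above instead of the default one.
helm upgrade messaging nats/nats \
  --set reloader.enabled=true \
  --set reloader.image=wallyqs/nats-server-config-reloader:fclose
```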
I'm facing the very same issue in a KinD cluster running on Linux. I already tried both your ad-hoc image wallyqs/nats-server-config-reloader:fclose and the official 0.7.0 build.
I managed to override the container's command to sleep 3600, connect into it, and run the original command nats-server-config-reloader -pid /var/run/nats/nats.pid -config /etc/nats-config/nats.conf under strace.
The result is that the "too many open files" error comes from the inotify_init1(IN_CLOEXEC) = -1 EMFILE (No file descriptors available) syscall.
So (at least in my case) the root cause is not a missing f.Close but something much earlier, in:
configWatcher, err := fsnotify.NewWatcher()
which actually calls unix.InotifyInit1(unix.IN_CLOEXEC).
EDIT: I checked my sysctl settings. I have fs.inotify.max_user_watches = 204800. I increased it to 1M, but it didn't help. What really helped was setting fs.inotify.max_user_instances=256 (it was 128). Now the reloader starts without further issues.
Hope it helps.
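For reference, a sketch of checking and persisting those limits (the value 256 is the one from this comment; the /etc/sysctl.d file name is an arbitrary choice):

```shell
# Inspect the current inotify limits
sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches

# Raise the per-user instance limit on the running kernel
sudo sysctl -w fs.inotify.max_user_instances=256

# Persist the setting across reboots (file name is arbitrary)
echo 'fs.inotify.max_user_instances=256' | sudo tee /etc/sysctl.d/90-inotify.conf
```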
The issue still appears on 0.11.0 and the current master branch (commit d7f2a3b).
the NATS install that comes bundled with TestKube also suffers from the same issue:
pod/testkube-test-nats-0 2/3 CrashLoopBackOff 25 (3m8s ago) 54m
$ kubectl logs -n testkube-test pod/testkube-test-nats-0 reloader
2023/08/10 23:57:24 Starting NATS Server Reloader v0.7.4
Error: too many open files
This is installed by helm chart version: testkube-1.13.33
I tried increasing ulimit -n (max open file descriptors) from 1024 to 4096, but it made no difference.
Anything I can do to further narrow it down?
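One way to narrow it down is to compare how many inotify instances are in use on the node against the kernel limit, since that (rather than plain open files) is what the earlier strace showed being exhausted. A sketch, assuming Linux and a readable /proc; run it on the affected node, e.g. from a node shell or a privileged debug pod:

```shell
# Each fd in /proc/*/fd that symlinks to anon_inode:inotify is one
# inotify instance; errors for unreadable processes are discarded.
used=$(find /proc/*/fd -lname 'anon_inode:inotify' 2>/dev/null | wc -l)
limit=$(cat /proc/sys/fs/inotify/max_user_instances)
echo "inotify instances in use: $used (per-user limit: $limit)"
```

If the count is at or near the limit, raising fs.inotify.max_user_instances is the likely fix.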
Same issue here. It seems random and I can get 1-3 pods running, but it was hard to do and I'm not sure why. The host has been configured with a large open file allowance.
I ran into this issue as well running NATS 2.10.11 (NATS Server Reloader v0.14.1) on a k3s v1.28.3 cluster on Ubuntu 22.04 nodes. Only one of my nodes had the problem, and I noticed a high number of pod restarts for a deployment on that same node. After resolving the issue with that deployment and restarting the node, everything has been working fine. I suspect the excessive pod restarts left too many files open.
I know that others have discussed ulimits. I just wanted to point out that my k3s processes run as a systemd service, because Ubuntu has switched from SysV to systemd. systemd runs each service with its own open file limits and ignores ulimit. Here is a link to Red Hat documentation; it will likely pertain to any Linux distribution using systemd. I hope someone else finds this helpful.
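For k3s under systemd, the per-service file limit can be raised with a drop-in unit. A sketch only: the unit name k3s.service, drop-in file name, and limit value are assumptions, not taken from the Red Hat article:

```shell
# Create a drop-in that overrides the service's open-file limit
sudo mkdir -p /etc/systemd/system/k3s.service.d
sudo tee /etc/systemd/system/k3s.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=1048576
EOF

# Reload unit files and restart the service to apply the new limit
sudo systemctl daemon-reload
sudo systemctl restart k3s
```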
I'll just add a few notes from my case. For me, after setting all the open file limits (systemd, pam, etc.) as permissive as possible, I was still running into the exact same error. Turns out it was caused by fs.inotify.max_user_instances. On my system that one was quite low; bumping it helped.
Posting this here as the code seems to live in this repo: https://github.com/nats-io/k8s/issues/488