henrygd / beszel

Lightweight server monitoring hub with historical data, docker stats, and alerts.
MIT License
2.69k stars 85 forks source link

ERROR Error encoding stats err="json: unsupported value: NaN" #280

Open Toudahl opened 4 days ago

Toudahl commented 4 days ago

Hi,

One of my agents , running in docker is reporting some errors.

I am not sure how to go about debugging them it, or even fixing it.

2024/11/09 17:02:35 INFO Detected root device name=mmcblk0p2
2024/11/09 17:02:35 INFO Detected network interface name=eth0 sent=818916 recv=989631
2024/11/09 17:02:35 INFO Starting SSH server address=:45876
2024/11/09 17:15:14 ERROR Error encoding stats err="json: unsupported value: NaN"
2024/11/09 17:15:14 ERROR Error encoding stats err="json: unsupported value: NaN"
2024/11/09 17:22:35 ERROR Error encoding stats err="json: unsupported value: NaN"
2024/11/09 17:22:35 ERROR Error encoding stats err="json: unsupported value: NaN"
2024/11/09 17:26:56 ERROR Error encoding stats err="json: unsupported value: NaN"
2024/11/09 17:26:56 ERROR Error encoding stats err="json: unsupported value: NaN"
2024/11/09 17:51:05 ERROR Error encoding stats err="json: unsupported value: NaN"
2024/11/09 17:51:05 ERROR Error encoding stats err="json: unsupported value: NaN"
2024/11/09 17:59:25 ERROR Error encoding stats err="json: unsupported value: NaN"
2024/11/09 17:59:25 ERROR Error encoding stats err="json: unsupported value: NaN"
2024/11/09 18:05:25 ERROR Error encoding stats err="json: unsupported value: NaN"
2024/11/09 18:05:25 ERROR Error encoding stats err="json: unsupported value: NaN"
2024/11/09 18:05:44 ERROR Error encoding stats err=EOF
henrygd commented 4 days ago

Please add LOG_LEVEL=debug to the agent.

What Docker version and device / OS are you using?

Toudahl commented 4 days ago

I have enabled debug logging, and I am waiting for the error to reappear.

I am running docker 24.0.6, debian 11 on a rpi3b+.

The agent is version 0.7.2.

Apologies for not including this before.

Toudahl commented 4 days ago
2024/11/10 22:27:17 DEBUG Getting stats
2024/11/10 22:27:17 DEBUG Temperature sensors="[{\"sensorKey\":\"cpu_thermal\",\"temperature\":62.3,\"sensorHigh\":0,\"sensorCritical\":110}]"
2024/11/10 22:27:17 DEBUG sysinfo data="{Hostname:rpi3bp-01 KernelVersion:6.1.21-v8+ Cores:4 Threads:4 CpuModel:Cortex-A53 Uptime:107421 Cpu:0.33 MemPct:19.53 DiskPct:21.68 Bandwidth:0 AgentVersion:0.7.2}"
2024/11/10 22:27:17 DEBUG System stats data="{Stats:{Cpu:0.33 MaxCpu:0 Mem:0.89 MemUsed:0.17 MemPct:19.53 MemBuffCache:0.49 MemZfsArc:0 Swap:0.1 SwapUsed:0 DiskTotal:14.32 DiskUsed:2.97 DiskPct:21.68 DiskReadPs:0.05 DiskWritePs:0 MaxDiskReadPs:0 MaxDiskWritePs:0 NetworkSent:0 NetworkRecv:0 MaxNetworkSent:0 MaxNetworkRecv:0 Temperatures:map[cpu_thermal:62.3] ExtraFs:map[]} Info:{Hostname:rpi3bp-01 KernelVersion:6.1.21-v8+ Cores:4 Threads:4 CpuModel:Cortex-A53 Uptime:107421 Cpu:0.33 MemPct:19.53 DiskPct:21.68 Bandwidth:0 AgentVersion:0.7.2} Containers:[]}"
2024/11/10 22:27:17 DEBUG Getting stats
2024/11/10 22:27:17 DEBUG Temperature sensors="[{\"sensorKey\":\"cpu_thermal\",\"temperature\":62.3,\"sensorHigh\":0,\"sensorCritical\":110}]"
2024/11/10 22:27:17 DEBUG sysinfo data="{Hostname:rpi3bp-01 KernelVersion:6.1.21-v8+ Cores:4 Threads:4 CpuModel:Cortex-A53 Uptime:107421 Cpu:66.67 MemPct:19.53 DiskPct:21.68 Bandwidth:0.16 AgentVersion:0.7.2}"
2024/11/10 22:27:17 DEBUG System stats data="{Stats:{Cpu:66.67 MaxCpu:0 Mem:0.89 MemUsed:0.17 MemPct:19.53 MemBuffCache:0.49 MemZfsArc:0 Swap:0.1 SwapUsed:0 DiskTotal:14.32 DiskUsed:2.97 DiskPct:21.68 DiskReadPs:0 DiskWritePs:0 MaxDiskReadPs:0 MaxDiskWritePs:0 NetworkSent:0.12 NetworkRecv:0.04 MaxNetworkSent:0 MaxNetworkRecv:0 Temperatures:map[cpu_thermal:62.3] ExtraFs:map[]} Info:{Hostname:rpi3bp-01 KernelVersion:6.1.21-v8+ Cores:4 Threads:4 CpuModel:Cortex-A53 Uptime:107421 Cpu:66.67 MemPct:19.53 DiskPct:21.68 Bandwidth:0.16 AgentVersion:0.7.2} Containers:[]}"
2024/11/10 22:27:18 DEBUG Docker stats data="[0x40000e64d0 0x400006a310]"
2024/11/10 22:27:18 DEBUG Extra filesystems data=map[]
2024/11/10 22:27:18 DEBUG Docker stats data="[0x400006a310 0x40000e64d0]"
2024/11/10 22:27:18 DEBUG Extra filesystems data=map[]
2024/11/10 22:27:18 ERROR Error encoding stats err="json: unsupported value: NaN"
2024/11/10 22:27:18 ERROR Error encoding stats err="json: unsupported value: NaN"
henrygd commented 4 days ago

Thanks. Is it really logging everything twice or was that just a timing coincidence when you started it up?

Can you please run docker stats --no-stream and check if it shows zero values for anything?

I'd recommend upgrading Docker if possible. There's a known bug in 24 that I had to specifically add a workaround for. I don't know if related but it's a good idea to update either way.

Toudahl commented 4 days ago

I have now updated to docker 27 - i must have forgotten this rpi when i updated my cluster. The stats show 0% cpu for beszel, and 0 in network I/O for both containers running on this pi.

I have this in my /boot/cmdline.txt console=serial0,115200 console=tty1 root=PARTUUID=202246d5-02 rootfstype=ext4 fsck.repair=yes rootwait cgroup_enable=cpuset cgroup_enable=memory cgroup_memory=1

The logs are showing up twice, both in my syslog server - and when i inspect the logs manually using the docker command. It is happening consistently, not just close to a startup of the agent.

henrygd commented 3 days ago

The issue may be that multiple instances of the hub are requesting metrics from the agent at the same time.

There's no lock coordination between multiple instances atm. Try to make sure only one total instance is running.

Toudahl commented 3 days ago

There is only one hub running. image

On the hub i am actually seeing this

infrastructure_beszel.1.a6vox3xn33lp@rpi4-02    | 2024/11/11 19:10:38 failed to get active systems err context canceled; failed query: SELECT `@@__invalidCollectionModelOrIdentifier`.* FROM `@@__invalidCollectionModelOrIdentifier` WHERE status = 'up'
henrygd commented 3 days ago

I think this is the same issue as #216 actually. You can also reference this thread about swarm: #17.

It seems to be possible to get working properly, but I haven't set up swarm yet to figure it out.

With swarm, it would be better for the agents to use an internal timer and send stats to the hub, rather than how it's set up currently where the hub requests from the agent. I want to add that option but it's not going to be in the very near future.

I think what's going wrong is that the hub connections are not being handled by the correct nodes, and sometimes one node is being assigned two connections.

Re: failed to get active systems error - that's the first time I've seen someone hit that. Are you using the arm7 image for the hub by any chance? There's an open issue about that image having a problem with the DB (#133) but I haven't been able to troubleshoot it because I don't have an arm7 device.

It may go away if you can figure out a working configuration with the swarm nodes, otherwise you may want to delete the beszel_data folder and start over if it continues to happen.

Toudahl commented 2 days ago

Sorry for the slow reply time.

Some more details of my system.

I have 1 swarm, with 4 nodes. These are all running debian 11, and docker 27. The hosts them self are 3 x rpi4 and 1 x rpi4b This is where the screengrab is from. As you can see it is running 1 hub, and 4 agents using host network. The hub is connecting to the agents through host ip. So no docker networking involved there.

The original problem i posted about takes places on a separate pi, with details as described.

So aside from the agent running on the rpi3b+, the hub and other agents are running on arm.

Based on this information, I do not think it is likely that the problem is caused by swarm. But if you wish to test it your self, you can easily make a swarm "cluster" with just one node. Perhaps you can trigger the issues there.