Closed: leewillis77 closed this issue 4 weeks ago.
Seems to be caused by rabbitmq (quick google search), which is not utilized by teslamate. Could that be?
@swiffer There's nothing else running on the box other than teslamate, and the processes in question look like they're coming from docker:
systemd(1)-+-agetty(455)
|-agetty(456)
|-alsactl(331)
|-avahi-daemon(344)---avahi-daemon(356)
|-colord(1279)-+-{colord}(1280)
| `-{colord}(1282)
|-containerd(454)-+-{containerd}(458)
| |-{containerd}(459)
| |-{containerd}(460)
| |-{containerd}(461)
| |-{containerd}(462)
| |-{containerd}(480)
| |-{containerd}(481)
| |-{containerd}(482)
| |-{containerd}(498)
| |-{containerd}(812)
| |-{containerd}(813)
| `-{containerd}(814)
|-containerd-shim(828)-+-tini(889)-+-beam.smp(1041)-+-erl_child_setup(1438)---inet_gethost(1486)---inet_gethost(1487)
| | | |-{beam.smp}(1435)
| | | |-{beam.smp}(1436)
| | | |-{beam.smp}(1437)
| | | |-{beam.smp}(1439)
| | | |-{beam.smp}(1440)
| | | |-{beam.smp}(1441)
| | | |-{beam.smp}(1442)
| | | |-{beam.smp}(1443)
| | | |-{beam.smp}(1444)
| | | |-{beam.smp}(1445)
| | | |-{beam.smp}(1446)
| | | |-{beam.smp}(1447)
| | | |-{beam.smp}(1448)
| | | |-{beam.smp}(1449)
| | | |-{beam.smp}(1450)
| | | |-{beam.smp}(1451)
| | | |-{beam.smp}(1452)
| | | |-{beam.smp}(1453)
| | | |-{beam.smp}(1454)
| | | |-{beam.smp}(1455)
| | | |-{beam.smp}(1456)
| | | |-{beam.smp}(1457)
| | | `-{beam.smp}(1458)
I do have DISABLE_MQTT=true
in my docker compose since I'm not using any MQTT services if that helps.
OK, RabbitMQ is based on Erlang, and beam.smp is the Erlang VM (which TeslaMate itself runs on), so that's where it's coming from. I wonder if #4296 is causing this.
@leewillis77 can you try setting ERL_MAX_PORTS?
Hi;

Thanks for the suggestion. I added the following under environment in my docker-compose.yml:

- ERL_MAX_PORTS=1024

It didn't make any difference - was that the correct way to make the change? E.g.
services:
  teslamate:
    image: teslamate/teslamate:latest
    restart: always
    environment:
      - DATABASE_USER=teslamate
      - DATABASE_NAME=teslamate
      - DATABASE_HOST=database
      - DISABLE_MQTT=true
      - ERL_MAX_PORTS=1024
The thread you linked to seems to be concerned with high memory usage, but the problem I'm experiencing is high CPU, not high memory.
I have the same issue. Running in Docker on a Raspberry Pi. Completely fresh install of Teslamate made today. Version 1.31.0
@leewillis77 yes, you are right, sorry
Could one of you try ghcr.io/teslamate-org/teslamate/teslamate:pr-4300?

That is the latest build before upgrading elixir / erlang.
I would if I knew how :)
Just replace image in docker compose file, see here
I tried that, actually:
$ docker compose pull
[+] Pulling 3/3
✘ grafana Error context canceled 0.5s
✘ teslamate Error Head "https://ghcr.io/v2/teslamate-org/teslamate/teslamate/manifests/pr-4300": denied 0.5s
✘ database Error context canceled
okay, please try again with
image: ghcr.io/teslamate-org/teslamate:pr-4300
the contrib doc had an issue...
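For anyone else who wants to test: a minimal sketch of where that line goes in docker-compose.yml (assuming the standard compose layout from the TeslaMate install docs; only the image line changes, everything else stays as it is):

services:
  teslamate:
    # temporarily pin to the pre-upgrade PR build instead of teslamate/teslamate:latest
    image: ghcr.io/teslamate-org/teslamate:pr-4300
    restart: always
    # keep your existing environment, ports and volumes unchanged

Afterwards, docker compose pull followed by docker compose up -d picks up the pinned image.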
beam.smp CPU usage is minimal (1.3%) with ghcr.io/teslamate-org/teslamate:pr-4300 vs > 100% as originally reported with the latest release.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
14533 10000 20 0 156660 84144 6616 S 1.3 4.4 0:07.23 beam.smp
@leewillis77 - thanks for testing. I guess someone else needs to step in, but it seems to be caused by the erlang/elixir update @JakobLichterfeld
For now it should be fine to stay on pr-4300 until a fix is on the horizon.
I can confirm the same with image: ghcr.io/teslamate-org/teslamate:pr-4300 (Version 1.30.2-dev).

beam.smp CPU usage has dropped significantly, to around 1.1%.
Out of curiosity - could you check pr-4303 as well, just to make sure whether it's caused by the dep updates or by erlang?
> Out of curiosity - could you check pr-4303 as well, just to make sure whether it's caused by the dep updates or by erlang?
beam.smp CPU is high (over 100%) with pr-4303.
I'm unfortunately not particularly knowledgeable about the erlang ecosystem, but looking back at the pstree output I posted in https://github.com/teslamate-org/teslamate/issues/4307#issuecomment-2440094006, which references inet_gethost, I wonder if https://github.com/erlang/otp/issues/8961 may be related?
Same issue for me; I switched to pr-4300 and it fixed the problem for now.
It sounds like in that linked issue the process is just hanging, not using lots of CPU.
Beam is a VM. You will probably have problems debugging this with system tools; I think you will need beam tools.
One such tool is the erlang observer. https://blog.sequinstream.com/how-we-used-elixirs-observer-to-hunt-down-bottlenecks/
Having said that, it may be fiddly to get this working in a Docker container. Either the observer needs to have X access somehow (e.g. https://github.com/c0b/docker-elixir/issues/19#issuecomment-257637757), or you need to run a local beam that connects to the remote beam somehow (https://stackoverflow.com/questions/42164298/using-the-erlang-observer-app-with-a-remote-elixir-phoenix-server-inside-docker). Either way likely to be fiddly.
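If getting the observer GUI running turns out to be too fiddly, a rough text-only alternative (a minimal sketch, assuming you can attach an IEx shell to the running node; how you attach depends on the image) is to rank processes by reductions:

# Top 10 processes by cumulative reductions (roughly, work done).
# A process stuck in a hot loop climbs this list quickly; run it twice
# a few seconds apart and compare to see who is actually burning CPU.
Process.list()
|> Enum.map(fn pid ->
  {pid, Process.info(pid, [:reductions, :registered_name, :current_function])}
end)
|> Enum.reject(fn {_pid, info} -> is_nil(info) end)
|> Enum.sort_by(fn {_pid, info} -> -info[:reductions] end)
|> Enum.take(10)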
Also: it seems that our nix flake doesn't compile in the observer. I tried some things, but so far they have all been failures. Later: Oh, wait, this fixes it:
diff --git a/mix.exs b/mix.exs
index d159eefd..afa1c98a 100644
--- a/mix.exs
+++ b/mix.exs
@@ -27,7 +27,7 @@ defmodule TeslaMate.MixProject do
def application do
[
mod: {TeslaMate.Application, []},
- extra_applications: [:logger, :runtime_tools]
+ extra_applications: [:logger, :runtime_tools, :wx, :observer]
]
end
(solution copied from https://stackoverflow.com/a/76760603)
I can reproduce the high CPU usage. Apart from the high CPU usage, the system runs fine with multiple docker containers on a Raspberry Pi 3B+, and it is currently logging a charge without problems.
Thanks for your investigation!
> Later: Oh, wait, this fixes it:
--> PR: #4311
Perhaps we can disable beam's busy waiting: https://stressgrid.com/blog/beam_cpu_usage/
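In compose terms that could look roughly like the sketch below (the flag values come from the linked article; I have not verified that the TeslaMate release start script picks up ERL_FLAGS, so treat this as an experiment, not a fix):

services:
  teslamate:
    environment:
      # disable scheduler busy waiting for normal, dirty-CPU and dirty-IO schedulers
      - ERL_FLAGS=+sbwt none +sbwtdcpu none +sbwtdio none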
My running theory is that somewhere we have a loop or recursion, the upgrades we did accidentally broke its exit condition, and as a result the loop runs forever and never exits.
Wondering:
- Does memory usage go up too, or is it only cpu usage?
- What is the last thing logged?
- Does teslamate appear to be working correctly regardless, or is it stuck somewhere?
- Is it correctly publishing (and republishing) MQTT values? (unlikely to be a problem if MQTT not even configured)
@JakobLichterfeld Running with pr-4311 teslamate doesn't work (is inaccessible on port 4000) and the logs are filled with this:
teslamate-1 | =CRASH REPORT==== 28-Oct-2024::08:34:00.781993 ===
teslamate-1 | crasher:
teslamate-1 | initial call: application_master:init/3
teslamate-1 | pid: <0.2471.0>
teslamate-1 | registered_name: []
teslamate-1 | exception exit: {{shutdown,
teslamate-1 | {failed_to_start_child,on_load,
teslamate-1 | {on_load_function_failed,gl,
teslamate-1 | {error,
teslamate-1 | {load_failed,
teslamate-1 | "Failed to load NIF library /opt/app/lib/wx-2.4.2/priv/erl_gl: 'libGLU.so.1: cannot open shared object file: No such
file or directory'"}}}}},
teslamate-1 | {kernel,start,[normal,[]]}}
teslamate-1 | in function application_master:init/3 (application_master.erl, line 143)
teslamate-1 | ancestors: [application_controller,<0.10.0>]
teslamate-1 | message_queue_len: 1
teslamate-1 | messages: [{'EXIT',<0.2472.0>,normal}]
teslamate-1 | links: [<0.2470.0>]
teslamate-1 | dictionary: []
teslamate-1 | trap_exit: true
teslamate-1 | status: running
teslamate-1 | heap_size: 987
teslamate-1 | stack_size: 29
teslamate-1 | reductions: 100
teslamate-1 | neighbours:
teslamate-1 |
teslamate-1 | =INFO REPORT==== 28-Oct-2024::08:34:00.782144 ===
teslamate-1 | application: kernel
teslamate-1 | exited: {{shutdown,
teslamate-1 | {failed_to_start_child,on_load,
teslamate-1 | {on_load_function_failed,gl,
teslamate-1 | {error,
teslamate-1 | {load_failed,
teslamate-1 | "Failed to load NIF library /opt/app/lib/wx-2.4.2/priv/erl_gl: 'libGLU.so.1: cannot open shared object file:
No such file or directory'"}}}}},
teslamate-1 | {kernel,start,[normal,[]]}}
teslamate-1 | type: permanent
teslamate-1 |
teslamate-1 | Kernel pid terminated (application_controller) ("{application_start_failure,kernel,{{shutdown,{failed_to_start_child,on_load,{on_load_function
_failed,gl,{error,{load_failed,\"Failed to load NIF library /opt/app/lib/wx-2.4.2/priv/erl_gl: 'libGLU.so.1: cannot open shared object file: No such file or
directory'\"}}}}},{kernel,start,[normal,[]]}}}")
teslamate-1 |
teslamate-1 | Crash dump is being written to: erl_crash.dump...done
> Wondering:
> - Does memory usage go up too, or is it only cpu usage?

On my system I saw only CPU going up for beam.smp, directly after starting the teslamate container. This is also the case with pr-4300, but only for a short time; then it goes down. With the official release it stays high all the time.

> - What is the last thing logged?

I only checked the normal logs from the container, which show no error messages.

> - Does teslamate appear to be working correctly regardless, or is it stuck somewhere?

Teslamate seems to be working correctly; only the CPU usage of the process is high.

> - Is it correctly publishing (and republishing) MQTT values? (unlikely to be a problem if MQTT not even configured)

I use the ABRP extension and I saw values being published to it in the logs while beam.smp had this high CPU load.
I don't know if it helps - I'm running v1.31.0 (official image) for a day now without resource spikes (incl. mqtt to home-assistant) 🤔 but on fairly powerful hardware (i5-14500).
teslamate/teslamate:latest - ~ 24h and 26 min driving
teslamate:pr-4300 - ~24h and 12 min driving
no meaningful differences here (docker 27.3.1, ubuntu 24.04.1, amd64) - so far only users on RPi are affected, right?
Maybe a problem with Raspbian and latest code? What version are people using? Bookworm? Bullseye? Or Older?
It could also be an ARM-specific issue with the latest beam.
I propose that somebody create a pull request with the change to the Dockerfile reverted:
FROM elixir:1.16.2-otp-26 AS builder
We test that. Assuming that test succeeds, we then test:
FROM elixir:1.17.2-otp-26 AS builder
We should test the following also, but I suspect we can't because this is not a valid option:
FROM elixir:1.16.2-otp-27 AS builder
My suspicion is we will find otp-26 good and otp-27 bad, but would be good to prove this is indeed the case.
On the other hand, if the problem is the elixir version, not the erlang version, we probably should try 1.17.3 also.
> Maybe a problem with Raspbian and latest code? What version are people using? Bookworm? Bullseye? Or Older?
> It could also be an ARM-specific issue with the latest beam.

Bullseye here.
Have not updated my Pi in months (do not judge me, ha). TeslaMate is the only thing running on it. Like others, after updating to 1.31.0, beam.smp shot up and chewed through CPU.
Updated my comment with results on amd64 - works without any increase in CPU time here. I guess having some exploration builds as proposed by brianmay is the best way to quickly see what is introducing the issue?
> Maybe a problem with Raspbian and latest code? What version are people using? Bookworm? Bullseye? Or Older?
> It could also be an ARM-specific issue with the latest beam.

I am on Bullseye too.
> Maybe a problem with Raspbian and latest code? What version are people using? Bookworm? Bullseye? Or Older?
> It could also be an ARM-specific issue with the latest beam.

Good point, I hadn't thought of that. armv7l, Raspbian buster here.
> I propose that somebody create a pull request with the change to the Dockerfile reverted

I will be the somebody :-) (it will be a bit tricky to get a CI build, as we prevent it if there are changes to the Dockerfile)
Could someone affected try these three please?

image: ghcr.io/teslamate-org/teslamate:pr-4314 (1.16.3-otp-26)
image: ghcr.io/teslamate-org/teslamate:pr-4315 (1.17.3-otp-26)
image: ghcr.io/teslamate-org/teslamate:pr-4316 (1.17.3-otp-27)

I tried these three as we know that 1.16.2-otp-26 is working and 1.17.2-otp-27 is not.
@JakobLichterfeld - I've been faster 😝
ghcr.io/teslamate-org/teslamate:pr-4314 not affected, CPU load in normal ranges
ghcr.io/teslamate-org/teslamate:pr-4315 not affected, same as 4300 or 4314
ghcr.io/teslamate-org/teslamate:pr-4316 definitely affected, CPU load is jumping between 30% and 90%

I also checked the latest official release again; there the load is jumping between 90% and 140%, which is more than I can see in the pr-4316 test build!
OK, so it looks like OTP 27 is causing the issue on ARM.

@baresel could you provide more details about your environment and hardware (OS, Docker version, hardware)?
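For anyone else reporting in, these standard commands gather that information on a Pi (nothing TeslaMate-specific):

cat /etc/os-release           # OS / Raspbian release
uname -srm                    # kernel version and architecture
docker --version              # Docker version
cat /proc/device-tree/model   # Raspberry Pi model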
OS Version: "Raspbian GNU/Linux 11 (bullseye)", Kernel: Linux 6.1.21-v8+ Docker Version: Docker version 20.10.5+dfsg1, build 55c4c88 Hardware: Raspi 4B, 4 GB RAM
Same here, 4314, 4315 are fine, 4316 has high load.
OS Version: "Raspbian GNU/Linux 11 (bullseye)", Kernel: Linux 6.1.70-v7+ Docker Version: Docker version 27.3.1, build ce12230 Hardware: Raspi 3B+, 1 GB RAM
Hello, 4314, 4315 are fine, 4316 has high load.
OS Version: "Raspbian GNU/Linux 12 (bookworm) aarch64 ", Kernel: Linux 6.1.21-v8+ Docker Version: Docker version 27.3.1, build ce12230 Hardware: Raspi 4B+, 8 GB RAM Rev 1.4
I assume the OTP 27 docker image has different compiler flags set. I suggest a patch release downgrading to 1.17.3-otp-26, even though I could not find any negative effect besides the "high" load shown in top on my Raspberry Pi 3B+ (edit: corrected the otp version).
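In the Dockerfile that would simply mean pinning the builder stage back to the OTP 26 base image, along the lines of (sketch; the rest of the Dockerfile stays unchanged):

FROM elixir:1.17.3-otp-26 AS builder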
What I do find interesting (but maybe completely unrelated): RabbitMQ does not support OTP 27, as it has significant performance regressions in their use cases.

https://www.rabbitmq.com/docs/which-erlang#erlang-27-support

However, we seem to be affected on aarch64 only.
@JakobLichterfeld you mean 1.17.3-otp-26?
> What I do find interesting (but maybe completely unrelated): RabbitMQ does not support OTP 27, as it has significant performance regressions in their use cases.

OK, that's a reason to stay with 1.17.3-otp-26 (edit: corrected the otp version).
Why not 1.17.3-otp-26? (will update my existing PR).
> @JakobLichterfeld you mean 1.17.3-otp-26?

Yeah, sorry for the typo :-D (edited above to avoid confusion)
> Why not 1.17.3-otp-26? (will update my existing PR).

Yeah, 1.17.3-otp-26.
We will need to update the docs, the CI, the nix flake etc. as well
> We will need to update the docs, the CI, the nix flake etc. as well
I will open a branch for it.
I had a theory that maybe the problem was otp-27 doesn't work correctly with bullseye. But I think the above tests clearly show bookworm has the same issue.
Just a detail I found interesting.
https://github.com/erlang/otp/issues/8322
OTP27 has various optimizations, wonder if one of these broke things for arm. https://www.erlang.org/blog/optimizations/
Is there an existing issue for this?
What happened?
I've upgraded to v1.31.0 and after the upgrade I noticed that there's a beam.smp process chewing up CPU time:
The system is idle currently (car is asleep, and neither teslamate nor grafana are open on any devices).
Expected Behavior
System is idle with no excessive CPU hogging.
Steps To Reproduce
No response
Relevant log output
Screenshots
No response
Additional data
No response
Type of installation
Docker
Version
v1.31.0