teslamate-org / teslamate

A self-hosted data logger for your Tesla 🚘
https://docs.teslamate.org
MIT License

beam.smp chewing CPU on v1.31.0 on arm #4307

Closed leewillis77 closed 3 weeks ago

leewillis77 commented 3 weeks ago

Is there an existing issue for this?

What happened?

I've upgraded to v1.31.0, and after the upgrade I noticed that there's a beam.smp process chewing up CPU time:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 1041 10000     20   0  192432  64284   4732 S 144.9   3.4   6:13.18 beam.smp

The system is idle currently (car is asleep, and neither teslamate nor grafana are open on any devices).

Expected Behavior

System is idle with no excessive CPU hogging.

Steps To Reproduce

No response

Relevant log output

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 1041 10000     20   0  192432  64284   4732 S 144.9   3.4   6:13.18 beam.smp

Screenshots

No response

Additional data

No response

Type of installation

Docker

Version

v1.31.0

swiffer commented 3 weeks ago

Seems to be caused by RabbitMQ (quick Google search), which is not utilized by teslamate. Could that be it?

leewillis77 commented 3 weeks ago

@swiffer There's nothing else running on the box other than teslamate, and the processes in question look like they're coming from docker:

systemd(1)-+-agetty(455)
           |-agetty(456)
           |-alsactl(331)
           |-avahi-daemon(344)---avahi-daemon(356)
           |-colord(1279)-+-{colord}(1280)
           |              `-{colord}(1282)
           |-containerd(454)-+-{containerd}(458)
           |                 |-{containerd}(459)
           |                 |-{containerd}(460)
           |                 |-{containerd}(461)
           |                 |-{containerd}(462)
           |                 |-{containerd}(480)
           |                 |-{containerd}(481)
           |                 |-{containerd}(482)
           |                 |-{containerd}(498)
           |                 |-{containerd}(812)
           |                 |-{containerd}(813)
           |                 `-{containerd}(814)
           |-containerd-shim(828)-+-tini(889)-+-beam.smp(1041)-+-erl_child_setup(1438)---inet_gethost(1486)---inet_gethost(1487)
           |                      |           |                |-{beam.smp}(1435)
           |                      |           |                |-{beam.smp}(1436)
           |                      |           |                |-{beam.smp}(1437)
           |                      |           |                |-{beam.smp}(1439)
           |                      |           |                |-{beam.smp}(1440)
           |                      |           |                |-{beam.smp}(1441)
           |                      |           |                |-{beam.smp}(1442)
           |                      |           |                |-{beam.smp}(1443)
           |                      |           |                |-{beam.smp}(1444)
           |                      |           |                |-{beam.smp}(1445)
           |                      |           |                |-{beam.smp}(1446)
           |                      |           |                |-{beam.smp}(1447)
           |                      |           |                |-{beam.smp}(1448)
           |                      |           |                |-{beam.smp}(1449)
           |                      |           |                |-{beam.smp}(1450)
           |                      |           |                |-{beam.smp}(1451)
           |                      |           |                |-{beam.smp}(1452)
           |                      |           |                |-{beam.smp}(1453)
           |                      |           |                |-{beam.smp}(1454)
           |                      |           |                |-{beam.smp}(1455)
           |                      |           |                |-{beam.smp}(1456)
           |                      |           |                |-{beam.smp}(1457)
           |                      |           |                `-{beam.smp}(1458)

I do have DISABLE_MQTT=true in my docker compose since I'm not using any MQTT services, if that helps.

swiffer commented 3 weeks ago

OK, RabbitMQ is based on Erlang, which is why it shows up in searches for beam.smp. I wonder if #4296 is causing this.

swiffer commented 3 weeks ago

@leewillis77 can you try setting ERL_MAX_PORTS?

https://elixirforum.com/t/elixir-erlang-docker-containers-ram-usage-on-different-oss-kernels/57251/18

leewillis77 commented 3 weeks ago

Hi;

Thanks for the suggestion. I added the following under environment in my docker-compose.yml:

- ERL_MAX_PORTS=1024

It didn't make any difference - was that the correct way to make the change? E.g.

services:
  teslamate:
    image: teslamate/teslamate:latest
    restart: always
    environment:
      - DATABASE_USER=teslamate
      - DATABASE_NAME=teslamate
      - DATABASE_HOST=database
      - DISABLE_MQTT=true
      - ERL_MAX_PORTS=1024

The thread you linked to seems to be concerned with high memory usage, but the problem I'm experiencing is high CPU, not high memory.

s56oa commented 3 weeks ago

I have the same issue. Running in Docker on a Raspberry Pi. Completely fresh install of Teslamate made today. Version 1.31.0

Screenshot 2024-10-27 at 19:59:52

swiffer commented 3 weeks ago

@leewillis77 yes, you are right, sorry

Could one of you try

ghcr.io/teslamate-org/teslamate/teslamate:pr-4300

That is the latest build before the Elixir/Erlang upgrade.

leewillis77 commented 3 weeks ago

I would if I knew how :)

swiffer commented 3 weeks ago

Just replace the image in the docker compose file, see here:

https://docs.teslamate.org/docs/development#testing

leewillis77 commented 3 weeks ago

I tried that, actually:

$ docker compose pull
[+] Pulling 3/3
 ✘ grafana Error   context canceled                                                                                                                                                          0.5s
 ✘ teslamate Error Head "https://ghcr.io/v2/teslamate-org/teslamate/teslamate/manifests/pr-4300": denied                                                                                     0.5s
 ✘ database Error  context canceled
swiffer commented 3 weeks ago

okay, please try again with

image: ghcr.io/teslamate-org/teslamate:pr-4300

the contrib doc had an issue...
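
Only the image line in docker-compose.yml needs to change; a minimal sketch (everything else stays as in the excerpt posted earlier):

services:
  teslamate:
    # temporarily use the pre-upgrade build from PR #4300 instead of teslamate/teslamate:latest
    image: ghcr.io/teslamate-org/teslamate:pr-4300

Then docker compose pull and docker compose up -d to switch over.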

leewillis77 commented 3 weeks ago

beam.smp CPU usage is minimal (1.3%) with ghcr.io/teslamate-org/teslamate:pr-4300 vs > 100% as originally reported with the latest release.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
14533 10000     20   0  156660  84144   6616 S   1.3   4.4   0:07.23 beam.smp
swiffer commented 3 weeks ago

@leewillis77 - thanks for testing. I guess someone else needs to step in, but it seems to be caused by the Erlang/Elixir update @JakobLichterfeld

For now it should be fine to stay on pr-4300 until a fix is on the horizon.

s56oa commented 3 weeks ago

I can confirm the same with image: ghcr.io/teslamate-org/teslamate:pr-4300 (Version 1.30.2-dev).

beam.smp CPU usage has dropped significantly to around 1.1%.

Screenshot 2024-10-27 at 20:47:45

swiffer commented 3 weeks ago

Out of curiosity - could you check pr-4303 as well, just to make sure whether it's caused by dep updates or Erlang?

leewillis77 commented 3 weeks ago

Out of curiosity - could you check pr-4303 as well, just to make sure whether it's caused by dep updates or Erlang?

beam.smp CPU is high (over 100%) with pr-4303.

leewillis77 commented 3 weeks ago

I'm unfortunately not particularly knowledgeable about the erlang ecosystem, but looking back at the pstree output I posted in https://github.com/teslamate-org/teslamate/issues/4307#issuecomment-2440094006, which references inet_gethost, I wonder if https://github.com/erlang/otp/issues/8961 may be related?

baresel commented 3 weeks ago

Same issue for me; I switched to pr-4300 and it fixed the problem for now.

brianmay commented 3 weeks ago

It sounds like in that linked issue the process is just hanging, not using lots of CPU.

brianmay commented 3 weeks ago

Beam is a VM. You will probably have problems debugging this with system tools; I think you will need beam tools.

One such tool is the erlang observer. https://blog.sequinstream.com/how-we-used-elixirs-observer-to-hunt-down-bottlenecks/

Having said that, it may be fiddly to get this working in a Docker container. Either the observer needs to have X access somehow (e.g. https://github.com/c0b/docker-elixir/issues/19#issuecomment-257637757), or you need to run a local beam that connects to the remote beam somehow (https://stackoverflow.com/questions/42164298/using-the-erlang-observer-app-with-a-remote-elixir-phoenix-server-inside-docker). Either way likely to be fiddly.
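
For the X-access route, a rough sketch of what the compose side could look like (hypothetical and untested; the DISPLAY value and socket path depend on the host setup, and the image also needs the wx/observer applications compiled in, see below):

services:
  teslamate:
    environment:
      # forward the host X display so :observer.start() can open a window from inside the container
      - DISPLAY=${DISPLAY}
    volumes:
      # share the host X11 socket with the container
      - /tmp/.X11-unix:/tmp/.X11-unix

You would probably also need to allow the container to talk to the X server on the host (e.g. xhost +local:).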

Also: It seems that our nix flake doesn't compile in the observer. I tried some things, but so far they have been failures. Later: Oh, wait, this fixes it:

diff --git a/mix.exs b/mix.exs
index d159eefd..afa1c98a 100644
--- a/mix.exs
+++ b/mix.exs
@@ -27,7 +27,7 @@ defmodule TeslaMate.MixProject do
   def application do
     [
       mod: {TeslaMate.Application, []},
-      extra_applications: [:logger, :runtime_tools]
+      extra_applications: [:logger, :runtime_tools, :wx, :observer]
     ]
   end

(solution copied from https://stackoverflow.com/a/76760603)

JakobLichterfeld commented 3 weeks ago

I can reproduce the high CPU usage. Apart from the high CPU usage, the system runs fine with multiple Docker containers on a Raspberry Pi 3B+, and it is currently logging a charge fine.

JakobLichterfeld commented 3 weeks ago

Thanks for your investigation!

Later: Oh, wait, this fixes it:

--> PR: #4311

JakobLichterfeld commented 3 weeks ago

Perhaps we can disable beam's busy waiting: https://stressgrid.com/blog/beam_cpu_usage/
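
A minimal sketch of how that could be tried via docker-compose.yml, assuming the TeslaMate release picks up ERL_FLAGS from the environment (flag names taken from the linked article; untested):

services:
  teslamate:
    environment:
      # disable scheduler busy waiting as described in the stressgrid post
      - "ERL_FLAGS=+sbwt none +sbwtdcpu none +sbwtdio none"

Note this trades a bit of scheduler wake-up latency for lower idle CPU, so it might only reduce the reported usage rather than explain the underlying change.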

brianmay commented 3 weeks ago

My running theory is that somewhere we have a loop or recursion, and as a result of the upgrades we did we accidentally broke the exit condition, so the loop is running forever and never exiting.

Wondering:

  • Does memory usage go up too, or is it only cpu usage?
  • What is the last thing logged?
  • Does teslamate appear to be working correctly regardless, or is it stuck somewhere?

leewillis77 commented 3 weeks ago

@JakobLichterfeld Running with pr-4311, teslamate doesn't work (it's inaccessible on port 4000) and the logs are filled with this:

teslamate-1  | =CRASH REPORT==== 28-Oct-2024::08:34:00.781993 ===
teslamate-1  |   crasher:
teslamate-1  |     initial call: application_master:init/3
teslamate-1  |     pid: <0.2471.0>
teslamate-1  |     registered_name: []
teslamate-1  |     exception exit: {{shutdown,
teslamate-1  |                       {failed_to_start_child,on_load,
teslamate-1  |                        {on_load_function_failed,gl,
teslamate-1  |                         {error,
teslamate-1  |                          {load_failed,
teslamate-1  |                           "Failed to load NIF library /opt/app/lib/wx-2.4.2/priv/erl_gl: 'libGLU.so.1: cannot open shared object file: No such file or directory'"}}}}},
teslamate-1  |                      {kernel,start,[normal,[]]}}
teslamate-1  |       in function  application_master:init/3 (application_master.erl, line 143)
teslamate-1  |     ancestors: [application_controller,<0.10.0>]
teslamate-1  |     message_queue_len: 1
teslamate-1  |     messages: [{'EXIT',<0.2472.0>,normal}]
teslamate-1  |     links: [<0.2470.0>]
teslamate-1  |     dictionary: []
teslamate-1  |     trap_exit: true
teslamate-1  |     status: running
teslamate-1  |     heap_size: 987
teslamate-1  |     stack_size: 29
teslamate-1  |     reductions: 100
teslamate-1  |   neighbours:
teslamate-1  |
teslamate-1  | =INFO REPORT==== 28-Oct-2024::08:34:00.782144 ===
teslamate-1  |     application: kernel
teslamate-1  |     exited: {{shutdown,
teslamate-1  |                  {failed_to_start_child,on_load,
teslamate-1  |                      {on_load_function_failed,gl,
teslamate-1  |                          {error,
teslamate-1  |                              {load_failed,
teslamate-1  |                                  "Failed to load NIF library /opt/app/lib/wx-2.4.2/priv/erl_gl: 'libGLU.so.1: cannot open shared object file: No such file or directory'"}}}}},
teslamate-1  |              {kernel,start,[normal,[]]}}
teslamate-1  |     type: permanent
teslamate-1  |
teslamate-1  | Kernel pid terminated (application_controller) ("{application_start_failure,kernel,{{shutdown,{failed_to_start_child,on_load,{on_load_function_failed,gl,{error,{load_failed,\"Failed to load NIF library /opt/app/lib/wx-2.4.2/priv/erl_gl: 'libGLU.so.1: cannot open shared object file: No such file or directory'\"}}}}},{kernel,start,[normal,[]]}}}")
teslamate-1  |
teslamate-1  | Crash dump is being written to: erl_crash.dump...done
baresel commented 3 weeks ago

Wondering:

  • Does memory usage go up too, or is it only cpu usage?

On my system I only saw CPU going up for beam.smp, directly after starting the teslamate container. This is also the case with pr-4300, but only for a short time, then it goes down. With the official release it stays high all the time.

  • What is the last thing logged?

I only checked the normal logs from the container, which show no error messages.

  • Does teslamate appear to be working correctly regardless, or is it stuck somewhere?

Teslamate seems to be working correctly; only the CPU usage for the process is high.

brianmay commented 3 weeks ago

Is it correctly publishing (and republishing) MQTT values?

(unlikely to be the problem if MQTT is not even configured)

baresel commented 3 weeks ago

I use the ABRP extension, and I saw values published to it in the logs while beam.smp had these high CPU loads.

swiffer commented 3 weeks ago

I don't know if it helps - I'm running v1.31.0 (official image) for a day now without resource spikes (incl. MQTT to Home Assistant) 🤔, but on fairly powerful hardware (i5-14500).

teslamate/teslamate:latest - ~ 24h and 26 min driving

teslamate:pr-4300 - ~24h and 12 min driving

No meaningful differences here (Docker 27.3.1, Ubuntu 24.04.1, amd64). So far only users on RPi are affected, right?

brianmay commented 3 weeks ago

Maybe a problem with Raspbian and latest code? What version are people using? Bookworm? Bullseye? Or Older?

It could also be an ARM-specific issue with the latest beam.

brianmay commented 3 weeks ago

I propose that somebody create a pull request with the change to the Dockerfile reverted:

FROM elixir:1.16.2-otp-26 AS builder

We test that. Assuming that test succeeds, we then test:

FROM elixir:1.17.2-otp-26 AS builder

We should test the following also, but I suspect we can't because this is not a valid option:

FROM elixir:1.16.2-otp-27 AS builder

My suspicion is we will find otp-26 good and otp-27 bad, but it would be good to prove this is indeed the case.

On the other hand, if the problem is the elixir version, not the erlang version, we probably should try 1.17.3 also.

cwanja commented 3 weeks ago

Maybe a problem with Raspbian and latest code? What version are people using? Bookworm? Bullseye? Or Older?

It could also be an ARM-specific issue with the latest beam.

Bullseye here.

Have not updated my Pi in months (do not judge me, ha). TeslaMate is the only thing running on it. Like others, after updating to 1.31.0, beam.smp shot up and chewed through CPU.

swiffer commented 3 weeks ago

Updated my comment with results on amd64: it works without an increase in CPU time here. I guess having some exploration builds as proposed by brianmay is the best way to quickly see what is introducing the issue?

baresel commented 3 weeks ago

Maybe a problem with Raspbian and latest code? What version are people using? Bookworm? Bullseye? Or Older?

It could also be an ARM-specific issue with the latest beam.

I am on Bullseye too.

JakobLichterfeld commented 3 weeks ago

Maybe a problem with Raspbian and latest code? What version are people using? Bookworm? Bullseye? Or Older?

It could also be an ARM-specific issue with the latest beam.

Good point, I hadn't thought of that. armv7l, Raspbian Buster here.

I propose that somebody create a pull request with the change to the Dockerfile reverted

I will be the somebody :-) (it will be a bit tricky to get the CI to build, as we prevent it when there are changes to the Dockerfile)

swiffer commented 3 weeks ago

Could someone affected please try these three?

  • ghcr.io/teslamate-org/teslamate:pr-4314
  • ghcr.io/teslamate-org/teslamate:pr-4315
  • ghcr.io/teslamate-org/teslamate:pr-4316

I chose these three as we know that 1.16.2-otp-26 is working and 1.17.2-otp-27 is not.

@JakobLichterfeld - I've been faster 😝

baresel commented 3 weeks ago

  • ghcr.io/teslamate-org/teslamate:pr-4314 - not affected, CPU load in normal ranges

  • ghcr.io/teslamate-org/teslamate:pr-4315 - not affected, same as 4300 or 4314

  • ghcr.io/teslamate-org/teslamate:pr-4316 - definitely affected, CPU load is jumping between 30% and 90%

I also checked the latest official release again; there the load is jumping between 90% and 140%, which is more than I can see in the 4316 testing release!

swiffer commented 3 weeks ago

OK, so it looks like OTP 27 is causing the issue on ARM.

@baresel could you provide more details about your env and hardware (OS, Docker version, hardware)?

baresel commented 3 weeks ago

OS Version: "Raspbian GNU/Linux 11 (bullseye)", Kernel: Linux 6.1.21-v8+ Docker Version: Docker version 20.10.5+dfsg1, build 55c4c88 Hardware: Raspi 4B, 4 GB RAM

DrMichael commented 3 weeks ago

Same here, 4314, 4315 are fine, 4316 has high load.

OS Version: "Raspbian GNU/Linux 11 (bullseye)", Kernel: Linux 6.1.70-v7+ Docker Version: Docker version 27.3.1, build ce12230 Hardware: Raspi 3B+, 1 GB RAM

plutonn commented 3 weeks ago

Hello, 4314, 4315 are fine, 4316 has high load.

OS Version: "Raspbian GNU/Linux 12 (bookworm) aarch64 ", Kernel: Linux 6.1.21-v8+ Docker Version: Docker version 27.3.1, build ce12230 Hardware: Raspi 4B+, 8 GB RAM Rev 1.4

JakobLichterfeld commented 3 weeks ago

I assume the OTP 27 Docker image was built with different compiler flags. I suggest a patch release downgrading to 1.17.3-otp-26, even though I could not find any negative effect besides the "high" load shown in top on my Raspberry Pi 3B+ (edit: corrected the OTP version)

swiffer commented 3 weeks ago

What I do find interesting (but maybe completely unrelated): RabbitMQ does not support OTP 27, as it has significant performance regressions in their use cases.

https://www.rabbitmq.com/docs/which-erlang#erlang-27-support

However, we seem to be affected on aarch64 only.

swiffer commented 3 weeks ago

@JakobLichterfeld you mean 1.17.3-otp-26 ?

JakobLichterfeld commented 3 weeks ago

what I do find interesting (but maybe completely unrelated) - RabbitMQ is not supporting OTP 27 as it has significant performance regressions in their use cases.

OK, that's a reason to stay with 1.17.3-otp-26 (edit: corrected OTP version)

swiffer commented 3 weeks ago

why not 1.17.3-otp-26 (will update my existing PR).

JakobLichterfeld commented 3 weeks ago

@JakobLichterfeld you mean 1.17.3-otp-26 ?

Yeah, sorry for the typo :-D (edited above to avoid confusion)

why not 1.17.3-otp-26 (will update my existing PR).

Yeah, 1.17.3-otp-26

We will need to update the docs, the CI, the nix flake etc. as well

JakobLichterfeld commented 3 weeks ago

We will need to update the docs, the CI, the nix flake etc. as well

I will open a branch for it.
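
As a purely illustrative sketch of the kind of pin that needs to move (hypothetical excerpt; the actual workflow files in the repo may look different), a CI job using erlef/setup-beam would carry the same downgrade:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: erlef/setup-beam@v1
        with:
          # stay on OTP 26 until the ARM regression is understood
          otp-version: "26.2"
          elixir-version: "1.17.3"

The nix flake and the docs would need the matching version pin.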

brianmay commented 3 weeks ago

I had a theory that maybe the problem was that otp-27 doesn't work correctly with Bullseye. But I think the above tests clearly show Bookworm has the same issue.

Just a detail I found interesting.

brianmay commented 3 weeks ago

https://github.com/erlang/otp/issues/8322

OTP 27 has various optimizations; I wonder if one of these broke things for ARM. https://www.erlang.org/blog/optimizations/