I'll add a comment here.
I have several VMs with Sablier. I notice an increase in its CPU consumption for no apparent reason.
When I restart Sablier, its CPU consumption returns to normal.
Hello @kraoc, do you have more input on that?
Maybe CPU usage graphs, or specific events after which you noticed it?
I will investigate this issue for sure when I have time, but more input to narrow down the issue will help!
I'll try to add more useful debug tips :p (but I need to dig a bit)
My first thought points to a potential leak on first start (e.g. on a VM boot), because after a restart of Sablier its CPU consumption is OK. Dunno if I'm making myself clear :p
To be precise: I'm not talking about the Traefik plugin, but about acouvreur/sablier:latest.
I have the same issue. I am running in Docker Desktop on Ubuntu, with the CPU limit under "Resources" set to 4.
After restarting the container the CPU usage is almost zero. But about 4 minutes later it ramps up to just over 100% (the maximum available is 400%). This happens whether or not I send any requests through Sablier. The usage is coming from the main Sablier container, not the proxy (I use Caddy).
It seems to work fine even when the CPU usage is high. I can't see anything useful in the logs, but let me know if there's anything else I can provide.
I still have the bug too, even with the latest version; always the same behavior: after a restart it's fine. And when I hit one of the endpoints managed by Sablier, the process consumes a lot of CPU. I tried to tweak some Sablier params, but that doesn't really help.
I'd be curious to have more information to try to reproduce the issue. What I'd need would be the following:
- sablier version
- your configuration (..., even with docker)

I've tried to reproduce the issue, but even after multiple calls I cannot go over 0.3% CPU usage.
I'm currently running a version of the program with CPU, memory, and trace profiling.
If I can better reproduce your use cases, I'll be able to better understand why it behaves this way.
Thanks!
Hi @acouvreur,
Same for me. It's still https://github.com/acouvreur/sablier/issues/272#issue-2225340281, but with the latest Sablier version.
Some self-hosted services on a Synology DS918+ with an Intel(R) Celeron(R) CPU J3455 @ 1.50GHz.
I have around 60 containers running, if the docker events flow matters.
(version=1.7.0, branch=HEAD, revision=0f58751390f2439f03c851f76ef9b9a410346b58)
I have config in dynamic-config.toml and some labels in my docker-compose.yaml.
Here are a few extracts. I have 4 endpoints managed by Sablier. I tried playing with the timeout and session time; sometimes the problem doesn't trigger instantly, but often it does.
dynamic-config.toml
...
[http.routers.gethomepage]
rule = "Host(`foo.bar`)"
service = "gethomepage"
entryPoints = ["https"]
middlewares = ["middleflare@docker", "sablier-gethomepage@docker", "authelia@docker", "compress@docker", "security-headers@docker"]
[http.routers.gethomepage.tls]
certresolver = "default"
options = "default"
...
docker-compose.yml
...
sablier:
  image: acouvreur/sablier:1.7.0
  container_name: sablier
  mem_limit: 256m
  cpu_shares: 1024
  networks:
    - traefik
  environment:
    - PROVIDER_NAME=docker
  labels:
    com.centurylinklabs.watchtower.enable: "true"
    deunhealth.restart.on.unhealthy: "true"
    traefik.enable: "true"
    # Group Middleware aur
    traefik.http.middlewares.sablier-aur.plugin.sablier.group: aur
    traefik.http.middlewares.sablier-aur.plugin.sablier.sablierUrl: http://sablier:10000
    traefik.http.middlewares.sablier-aur.plugin.sablier.sessionDuration: 720m
    traefik.http.middlewares.sablier-aur.plugin.sablier.blocking.timeout: 30s
    # Group Middleware gethomepage
    traefik.http.middlewares.sablier-gethomepage.plugin.sablier.group: gethomepage
    traefik.http.middlewares.sablier-gethomepage.plugin.sablier.sablierUrl: http://sablier:10000
    traefik.http.middlewares.sablier-gethomepage.plugin.sablier.sessionDuration: 120m
    traefik.http.middlewares.sablier-gethomepage.plugin.sablier.dynamic.theme: ghost
    traefik.http.middlewares.sablier-gethomepage.plugin.sablier.dynamic.showDetails: "false"
gethomepage:
  image: ghcr.io/gethomepage/homepage
  container_name: gethomepage
  volumes:
    - /volume2/docker/homepage/config:/app/config:rw
  networks:
    - traefik
  depends_on:
    - traefik
  restart: unless-stopped
  labels:
    com.centurylinklabs.watchtower.enable: "true"
    deunhealth.restart.on.unhealthy: "false"
    sablier.enable: true
    sablier.group: gethomepage
    traefik.enable: "true"
    traefik.docker.network: traefik
...
I've done the following experiment:
1. Create a container to be scaled:
docker run -d -l sablier.enable=true --name whoami containous/whoami
2. Run Sablier:
docker run -v '/var/run/docker.sock:/var/run/docker.sock' --cpu-shares=1024 -m 256m --name sablier -p 10000:10000 acouvreur/sablier:1.7.0 start --logging.level=trace
3. Observe stats:
docker stats
4. Generate load:
for i in {1..500}; do
curl "http://localhost:10000/api/strategies/dynamic?names=whoami&session_duration=30s" &
curl "http://localhost:10000/api/strategies/blocking?names=whoami&session_duration=30s" &
curl "http://localhost:10000/api/strategies/blocking?group=default&session_duration=30s" &
curl "http://localhost:10000/api/strategies/dynamic?group=default&session_duration=30s" &
done
This issues 2,000 requests (500 iterations × 4 calls), matching by names and by group, with both strategies.
https://github.com/acouvreur/sablier/assets/22034450/815f1198-ad31-4be6-ba92-85f1d00c3372
Still I cannot reproduce the CPU usage.
Note that I've set the same resource restrictions that you have.
I've done some testing on my setup, and I can only reproduce it when I use a docker socket proxy. Looks like there's another issue open in the same area: https://github.com/acouvreur/sablier/issues/287
OK, I confess I cleaned up my config a little before pasting it in the previous post, and yes, I use a cetusguard proxy too! Good catch!
No doubt it works well in your lab, so I understand I'm facing the docker socket proxy issue too.
I can try with the docker socket proxy. It might be related to Sablier listening for incoming events. Can you share your setup with a docker socket proxy?
cetusguard-sablier:
  container_name: cetusguard-sablier
  image: docker.io/hectorm/cetusguard
  read_only: true
  security_opt:
    - no-new-privileges:true
  volumes:
    - "/var/run/docker.sock:/var/run/docker.sock:ro"
  labels:
    com.centurylinklabs.watchtower.enable: "true"
    deunhealth.restart.on.unhealthy: "true"
  networks:
    - cetusguard-sablier
  environment:
    CETUSGUARD_BACKEND_ADDR: "unix:///var/run/docker.sock"
    CETUSGUARD_FRONTEND_ADDR: "tcp://:2375"
    CETUSGUARD_RULES: |
      ! Monitor events
      GET %API_PREFIX_EVENTS%
      ! Get system information
      GET %API_PREFIX_INFO%
      ! Get data usage information
      GET %API_PREFIX_SYSTEM%
      ! List containers
      GET %API_PREFIX_CONTAINERS%/json
      ! Inspect a container
      GET %API_PREFIX_CONTAINERS%/%CONTAINER_ID_OR_NAME%/json
      ! Start stop container
      POST %API_PREFIX_CONTAINERS%/%CONTAINER_ID_OR_NAME%/start
      POST %API_PREFIX_CONTAINERS%/%CONTAINER_ID_OR_NAME%/stop
      GET %API_PREFIX_CONTAINERS%/%CONTAINER_ID_OR_NAME%/stats
    CETUSGUARD_LOG_LEVEL: "2"
  restart: always
sablier:
  image: acouvreur/sablier:1.7.0
  container_name: sablier
  mem_limit: 256m
  cpu_shares: 1024
  networks:
    - traefik
    - cetusguard-sablier
  environment:
    - PROVIDER_NAME=docker
    - DOCKER_HOST=tcp://cetusguard-sablier:2375
  labels:
    com.centurylinklabs.watchtower.enable: "true"
    deunhealth.restart.on.unhealthy: "true"
    traefik.enable: "true"
    # Group Middleware aur
    traefik.http.middlewares.sablier-aur.plugin.sablier.group: aur
    traefik.http.middlewares.sablier-aur.plugin.sablier.sablierUrl: http://sablier:10000
    traefik.http.middlewares.sablier-aur.plugin.sablier.sessionDuration: 720m
    traefik.http.middlewares.sablier-aur.plugin.sablier.blocking.timeout: 30s
    # Group Middleware gethomepage
    traefik.http.middlewares.sablier-gethomepage.plugin.sablier.group: gethomepage
    traefik.http.middlewares.sablier-gethomepage.plugin.sablier.sablierUrl: http://sablier:10000
    traefik.http.middlewares.sablier-gethomepage.plugin.sablier.sessionDuration: 120m
    traefik.http.middlewares.sablier-gethomepage.plugin.sablier.dynamic.theme: ghost
    traefik.http.middlewares.sablier-gethomepage.plugin.sablier.dynamic.showDetails: "false"
So, I've tried with the docker socket proxy with the following setup:
version: "3.7"
services:
cetusguard-sablier:
container_name: cetusguard-sablier
image: docker.io/hectorm/cetusguard
read_only: true
security_opt:
- no-new-privileges:true
volumes:
- "/var/run/docker.sock:/var/run/docker.sock:ro"
environment:
CETUSGUARD_BACKEND_ADDR: "unix:///var/run/docker.sock"
CETUSGUARD_FRONTEND_ADDR: "tcp://:2375"
CETUSGUARD_RULES: |
! Monitor events
GET %API_PREFIX_EVENTS%
! Get system information
GET %API_PREFIX_INFO%
! Get data usage information
GET %API_PREFIX_SYSTEM%
! List containers
GET %API_PREFIX_CONTAINERS%/json
! Inspect a container
GET %API_PREFIX_CONTAINERS%/%CONTAINER_ID_OR_NAME%/json
! Start stop container
POST %API_PREFIX_CONTAINERS%/%CONTAINER_ID_OR_NAME%/start
POST %API_PREFIX_CONTAINERS%/%CONTAINER_ID_OR_NAME%/stop
GET %API_PREFIX_CONTAINERS%/%CONTAINER_ID_OR_NAME%/stats
CETUSGUARD_LOG_LEVEL: "2"
sablier:
image: acouvreur/sablier:1.7.0
ports:
- 10000:10000
environment:
- PROVIDER_NAME=docker
- DOCKER_HOST=tcp://cetusguard-sablier:2375
whoami:
image: containous/whoami:v1.5.0
labels:
- sablier.enable=true
And still, after a lot of HTTP calls:
d64a6122dad7 cetusguard-sablier 0.00% 31.59MiB / 7.744GiB 0.40% 789kB / 2.1MB 0B / 0B 9
9bd01d6419c6 sablier-whoami-1 0.00% 2.629MiB / 7.744GiB 0.03% 1.31kB / 0B 0B / 0B 4
eb3ee597b947 sablier-sablier-1 0.00% 50.02MiB / 7.744GiB 0.63% 4.49MB / 12MB 455kB / 4.1kB 12
Docker sock config and proxy config are both commented out; uncomment one at a time. Also, docker.sock.raw is for Docker Desktop; use docker.sock otherwise.
name: sablier2
services:
  proxy2:
    restart: always
    image: caddy:2.8.4-with-sablier # built as per sablier docs
    networks:
      - network2
    ports:
      - "8000:80" # whoami
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile:ro
  whoami2:
    restart: "no"
    image: traefik/whoami
    networks:
      - network2
    labels:
      - sablier.enable=true
      - sablier.group=demo
  sablier2:
    restart: always
    image: acouvreur/sablier
    networks:
      - network2
      - sablier-socket-network2
    # volumes:
    #   - '/var/run/docker.sock.raw:/var/run/docker.sock'
    # environment:
    #   - DOCKER_HOST=tcp://socket-proxy-sablier2:2375
  socket-proxy-sablier2:
    image: lscr.io/linuxserver/socket-proxy:latest
    environment:
      - ALLOW_START=1
      - ALLOW_STOP=1
      - ALLOW_RESTARTS=1
      - CONTAINERS=1
      - EVENTS=1
      - INFO=1
      - NETWORKS=1
      - PING=1
      - POST=1
      - SERVICES=1
      - VERSION=1
    volumes:
      - /var/run/docker.sock.raw:/var/run/docker.sock:ro
    restart: always
    read_only: true
    tmpfs:
      - /run
    networks:
      - sablier-socket-network2
networks:
  network2:
    name: sablier-network2
  sablier-socket-network2:
    name: sablier-socket-network2
I can create a Discord server for the project so we can actually debug this together, what do you think?
I need something with voice channels; do you have any suggestions?
OK, thanks @luked34, I can actually reproduce the issue now!
89509d81c1ec sablier-sablier-1 100.17% 11.55MiB / 7.744GiB 0.15% 26.7kB / 18.6kB 889kB / 4.1kB 8
I would have suggested Discord; it might be worth setting one up even if we don't use it for this issue.
I think I found the issue: it seems to happen when the event stream returns a nil channel. In that case the event stream cannot be properly initialized, and the select statement reads nil values indefinitely.
I can make a fix to properly break out of the faulty event stream. However, this means that there is something wrong with your setup, as we can't properly listen to the event stream.
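For illustration, here is a minimal, self-contained Go sketch of that failure mode (not Sablier's actual code): receiving from a closed or never-properly-initialized error channel in a for/select loop returns immediately on every iteration, which is exactly the kind of tight loop that burns a CPU core.

```go
package main

import "fmt"

func main() {
	events := make(chan string) // stream that never produces events
	errs := make(chan error)
	close(errs) // simulate an event stream that failed to initialize

	for i := 0; ; i++ {
		select {
		case ev := <-events:
			fmt.Println("event:", ev)
		case err, ok := <-errs:
			if !ok {
				// Without this check, this branch fires on every
				// iteration with a nil error: a 100% CPU busy loop.
				fmt.Printf("error channel closed, breaking out after %d iterations\n", i)
				return
			}
			fmt.Println("error:", err)
		}
	}
}
```

With the `ok` check the loop exits immediately; without it, the closed-channel case is selected forever and the process pegs a core, which matches the ~100% usage reported above.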
I'll publish a fix in the `beta` branch; please test the fix when it's released, and look for errors in the logs for details on your specific cases.
So basically what happens is: the client is registered without making sure the Docker connection is established. It tries to listen for events, but it gets back closed channels and spins the CPU as if in a tight loop.
It works after a restart because, by that time, the connection can be established.
I've changed this behavior so Sablier will actually exit if the connection cannot be established.
You've all probably hit some kind of race in which the Docker host was unavailable at the time the event stream was registered.
This ended up burning the CPU, for which I've also added a fail-safe.
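As a sketch of that fail-fast behavior, assuming the Docker Go SDK (github.com/docker/docker/client); this illustrates the idea, not necessarily the exact code of the fix:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/docker/docker/client"
)

func main() {
	// client.FromEnv honors DOCKER_HOST, so this also covers the
	// socket-proxy setups discussed in this thread.
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		log.Fatalf("cannot create docker client: %v", err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Verify the daemon is reachable before registering the event
	// stream, and exit instead of spinning if it is not.
	if _, err := cli.Ping(ctx); err != nil {
		// With restart: always, the container runtime retries the
		// start, giving a slow socket proxy time to come up.
		log.Fatalf("docker host unreachable, exiting: %v", err)
	}
	log.Println("docker host reachable; safe to listen for events")
}
```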
The fix is available in v1.7.1-beta.1, let me know if it still happens.
provider:
  name: docker
server:
  port: 10000
  base-path: /
sessions:
  default-duration: 5m
  expiration-interval: 30s
logging:
  level: warn
strategy:
  dynamic:
    show-details-by-default: false
    default-theme: zogg
    default-refresh-frequency: 5s
  blocking:
    default-timeout: 5m
* Endpoint? None... simple Proxmox/Dozzle monitoring...
* Usage: Sablier container + Sablier Traefik plugin.

Maybe important points:

* It seems to consume CPU only after a VM/docker/container start (e.g. in the case of a VM restart).
* I use wollomatic/socket-proxy:1.

So maybe, in a VM-restart scenario, Sablier (which depends on the socket proxy) is started before the socket proxy.
That would explain why, when I restart Sablier manually, it works fine: the proxy is already started...
_Sorry for the late reply :p_
Thanks for the quick fix, it seems to be working.
The log now shows two extra messages:
I've tried adding a depends_on to the compose file, restarting Sablier while leaving the socket proxy running, and changing the socket proxy permissions to allow everything, but the messages still appear.
CPU usage is consistently below 1% now, so I'm happy to ignore the messages :)
Maybe using willfarrell/autoheal, with a setting that marks the Sablier container unhealthy when no socket proxy is available, would do the trick? Relying on autoheal's ability to restart containers? Just a suggestion... :p
I will check again later, but at first glance it's far better! Thank you!
When Sablier cannot connect to the docker host, it should fail and exit.
If you simply add the option `restart: always`, you will have a default kind of autoheal.
Already have `restart: always` :p But all my compose files use inheritance (multi-level extends), so my setup is somewhat unusual, I think ^^
First, thanks for this cool toy 😃

Describe the bug
High CPU usage

Context

Expected behavior
Low CPU usage

Additional context
I manage 2 services in a big docker-compose file with Sablier, and when I run top I can see the Sablier process working my tiny CPU very hard. When I restart it, the process calms down, until some unknown event makes it hammer the CPU again.

Is this a known issue? I can see there is a full rewrite PR in progress; could that help? (I haven't actually dug into the code.)