lloesche / valheim-server-docker

Valheim dedicated gameserver with automatic update, World backup, BepInEx and ValheimPlus mod support
https://hub.docker.com/r/lloesche/valheim-server
Apache License 2.0

High CPU usage #249

Closed jurney closed 3 years ago

jurney commented 3 years ago

valheim_server is consuming most of a recent i7 core when idle with no players. I guessed this was a game issue, but the game forums imply this issue was fixed in early February. Maybe some issue with the config in the docker setup?

jurney commented 3 years ago

[Screenshot: a snip from top on the server with no users connected]

lloesche commented 3 years ago
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
27813 root      20   0   10.2g   3.0g  59784 S  27.6   4.9   2639:10 /opt/valheim/plus/valheim_server.x86_64

This is on an old 2014 Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz.

Maybe share your config and we'll be able to dig into this.

jurney commented 3 years ago

Here's my docker-compose.yaml section... I think fairly vanilla:

    valheim:
      container_name: valheim
      image: lloesche/valheim-server
      volumes:

Here's my inxi output... it's a NUC8 with an i7.

System: Host: spire Kernel: 4.15.0-137-generic x86_64 bits: 64 gcc: 7.5.0 Console: tty 0 Distro: Ubuntu 18.04.5 LTS
Machine: Device: un-determined System: Intel Client Systems product: NUC8i7BEH v: J72992-306 serial: Mobo: Intel model: NUC8BEB v: J72688-306 serial: UEFI: Intel v: BECFL357.86A.0074.2019.0916.1548 date: 09/16/2019
CPU: Quad core Intel Core i7-8559U (-MT-MCP-) arch: Kaby Lake rev.10 cache: 8192 KB flags: (lm nx sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx) bmips: 21599 clock speeds: max: 4500 MHz 1: 625 MHz 2: 627 MHz 3: 639 MHz 4: 639 MHz 5: 691 MHz 6: 711 MHz 7: 677 MHz 8: 602 MHz
Graphics: Card: Intel Device 3ea5 bus-ID: 00:02.0 Display Server: X.org 1.19.6 driver: i915 tty size: 186x59 Advanced Data: N/A for root
Audio: Card Intel Device 9dc8 driver: snd_hda_intel bus-ID: 00:1f.3 Sound: ALSA v: k4.15.0-137-generic
Network: Card-1: Intel Device 9df0 driver: iwlwifi bus-ID: 00:14.3 IF: wlp0s20f3 state: down mac: Card-2: Intel Ethernet Connection (6) I219-V driver: e1000e v: 3.2.6-k bus-ID: 00:1f.6 IF: eno1 state: down mac: Card-3: Aquantia AQC107 NBase-T/IEEE 802.3bz Ethernet Controller [AQtion] driver: atlantic v: 2.0.2.1-kern bus-ID: 07:00.0 IF: enp7s0 state: up speed: 10000 Mbps duplex: full mac:
Drives: HDD Total Size: 1000.2GB (2.9% used) ID-1: /dev/nvme0n1 model: Samsung_SSD_970_EVO_1TB size: 1000.2GB
Partition: ID-1: / size: 916G used: 27G (4%) fs: ext4 dev: /dev/nvme0n1p2
RAID: No RAID devices: /proc/mdstat, md_mod kernel module present
Sensors: System Temperatures: cpu: 46.0C mobo: 27.8C Fan Speeds (in rpm): cpu: N/A
Info: Processes: 326 Uptime: 1 day Memory: 3971.6/32037.2MB Init: systemd runlevel: 5 Gcc sys: N/A Client: Shell (bash 4.4.201) inxi: 2.3.56

Thanks for taking a look, appreciate the work on the container.

lloesche commented 3 years ago

Just checked Passmark: your quad-core CPU has a total benchmark score of 8921 with a single-thread score of 2599, whereas my 6-core Xeon has a total of 7971 and a single-thread score of 1700.

So I'd expect you to see maybe around 20% CPU load.
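
That estimate is just my measured idle load scaled by the ratio of the two single-thread scores. A rough sketch of the arithmetic, assuming idle load scales inversely with single-thread score (an approximation, not something the benchmark guarantees):

```python
def expected_cpu_percent(observed_percent, observed_score, target_score):
    """Scale a CPU load measured on one machine to another machine,
    assuming load is inversely proportional to single-thread score."""
    return observed_percent * observed_score / target_score


# Numbers from this thread: 27.6% on the Xeon (score 1700), i7 score 2599.
estimate = expected_cpu_percent(27.6, 1700, 2599)
print(f"{estimate:.1f}%")
```

This lands a little under 20%, which is where the "around 20%" figure comes from.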

Your config looks fine, nothing out of the ordinary. Maybe try setting the server to public and see if that makes any difference. It might, because private servers seem to use a different networking library (Iron Gate's own) than public servers (Steam's networking library), although I would be very surprised: I just tested and it made zero difference for me.

The Valheim server is a black box, so it's hard to debug. But we can ask the kernel to tell us what the server is doing while idle.

One thing that's fairly low effort is to use strace, either inside the container or on the host. If inside the container, modify your compose file like so:

    cap_add:
      - sys_nice
      - sys_ptrace

The sys_nice should already be there, just add sys_ptrace as a capability.

Then run inside the container:

apt update && apt -y install strace
strace -f -p $(< /var/run/valheim-server.pid) -c

Let it run for, say, 2 minutes and then abort it with Ctrl+C. This will produce statistics that should look something like this:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- -----------------------
 78.85  291.469476         367    792429    366669 futex
 10.37   38.342466        2055     18652           nanosleep
  5.59   20.664232        5799      3563           poll
  3.50   12.920542       10940      1181           epoll_wait
  1.03    3.810021      200527        19         8 restart_syscall
  0.65    2.391130          43     55045           sched_yield
  0.02    0.055841          22      2525      2352 read
  0.00    0.001167          19        60           prctl
  0.00    0.000563          43        13           sendto
  0.00    0.000174          19         9           write
  0.00    0.000171          14        12           stat
  0.00    0.000167          12        13           recvfrom
  0.00    0.000138          11        12           clock_gettime
  0.00    0.000132          44         3           sigaltstack
  0.00    0.000113          56         2           munmap
  0.00    0.000098          49         2           madvise
  0.00    0.000092          30         3           sendmsg
  0.00    0.000087          29         3           mprotect
  0.00    0.000086           7        12           lstat
  0.00    0.000074          74         1           clone
  0.00    0.000069           5        12           geteuid
  0.00    0.000064           3        18           gettid
  0.00    0.000056          56         1           getpid
  0.00    0.000052          17         3           sched_getaffinity
  0.00    0.000035          35         1           getrusage
  0.00    0.000022          22         1           set_robust_list
  0.00    0.000017          17         1           mmap
  0.00    0.000011          11         1           sched_setscheduler
  0.00    0.000010          10         1           sched_get_priority_min
  0.00    0.000009           9         1           sched_get_priority_max
  0.00    0.000000           0         1           lseek
------ ----------- ----------- --------- --------- -----------------------
100.00  369.657115         423    873600    369029 total

Let's see if yours looks much different from what my server produces.
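
To make two such summaries easier to compare, here is a small sketch that parses `strace -c` output into a dictionary. The function name `parse_strace_summary` is hypothetical, and it assumes the column layout shown above (five fields per row, six when the errors column is filled):

```python
def parse_strace_summary(text):
    """Parse `strace -c` summary text into {syscall: (percent_time, calls)}."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 5 fields (no errors column) or 6 (with errors);
        # skip the header, separator, and "total" rows.
        if len(fields) not in (5, 6) or fields[-1] in ("syscall", "total"):
            continue
        try:
            percent, calls = float(fields[0]), int(fields[3])
        except ValueError:
            continue  # separator rows made of dashes end up here
        stats[fields[-1]] = (percent, calls)
    return stats


sample = """\
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- -----------------------
 78.85  291.469476         367    792429    366669 futex
 10.37   38.342466        2055     18652           nanosleep
------ ----------- ----------- --------- --------- -----------------------
100.00  369.657115         423    873600    369029 total
"""
stats = parse_strace_summary(sample)
# Print syscalls ordered by their share of traced time, largest first.
for name, (percent, calls) in sorted(
    stats.items(), key=lambda kv: kv[1][0], reverse=True
):
    print(f"{name:<12} {percent:6.2f}% {calls:>8} calls")
```

Running it over both outputs and comparing the top few entries should show quickly whether one server is spending its time in different syscalls.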

Another thing you could try is running top on the host instead of inside the container. My thought here is: did you maybe unintentionally restrict the container's CPU usage? If you were to give the container access to only, e.g., 25% of a core, then inside the container it would look like the server is using more of that core than it really is.
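
One way to check for such a restriction from inside the container is to read the cgroup CPU quota. This is a sketch that assumes a cgroup v2 host, where the limit lives in `/sys/fs/cgroup/cpu.max`. On cgroup v1 hosts (likely on older kernels) the equivalent values are in `cpu.cfs_quota_us` and `cpu.cfs_period_us`, which this sketch does not read. The function names are my own:

```python
from pathlib import Path


def cpu_limit_from_cpu_max(text):
    """Return the CPU limit in cores from cgroup v2 `cpu.max` content,
    or None when no limit is set (quota is the literal "max")."""
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def read_cpu_limit(path="/sys/fs/cgroup/cpu.max"):
    """Read the limit for the current cgroup; None if the file is missing
    (e.g. on cgroup v1 hosts) or no limit is configured."""
    p = Path(path)
    if not p.exists():
        return None
    return cpu_limit_from_cpu_max(p.read_text())


# "25000 100000" means 25 ms of CPU time per 100 ms period: a quarter core.
print(cpu_limit_from_cpu_max("25000 100000"))
print(read_cpu_limit())
```

A result of `None` means either no limit is set or the cgroup v2 file isn't there.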

And lastly see if your CPU is actually running at full speed (I bet it is, but you never know). Your sensor output shows the CPU at 46C which seems pretty chill for a CPU under load. I'd expect more like 70-90C. My thought here is similar to the one where the container doesn't have full CPU access. If the CPU is running in some sort of power saving / eco mode at like half clock rate then the server process would again look as if it is using more of that core than it really is.

You can check like so:

[lukas@bigmac ~]$ lscpu | grep MHz
CPU MHz:             1200.133
CPU max MHz:         3200.0000
CPU min MHz:         1200.0000

The CPU MHz line shows the current clock speed.

Or depending on your system also:

cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
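
To put a %CPU reading into perspective once you know the clock speed, you can scale it to the maximum clock. A rough sketch, assuming the work a core does scales linearly with clock frequency (it only approximately does); the 50% reading is a made-up example value:

```python
def full_speed_equivalent(percent_cpu, current_mhz, max_mhz):
    """Convert a %CPU reading taken at a reduced clock into the rough
    share of a core it would need at max clock (assumes linear scaling)."""
    return percent_cpu * current_mhz / max_mhz


# Hypothetical reading: 50% of a core while running at the 1200.133 MHz
# current / 3200 MHz max clock from the lscpu output above.
print(f"{full_speed_equivalent(50.0, 1200.133, 3200.0):.1f}%")
```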

If these steps don't produce any results we can dig deeper using bpftrace. It'll get a bit more complex than strace and requires us to have debugfs mounted.

jurney commented 3 years ago

That top was from the host, not the container.

Changing to public didn't have an effect.

lscpu explains it... The chip is running at 500 MHz, so the 40-60% CPU I'm seeing is not that much in the grand scheme. Seems silly it's even doing that much work, but I expect that's on the game, not this container.

Thanks for all the explanation. Very helpful. I saw the godot reference in another issue, but I'll second the request for a tip jar.

lloesche commented 3 years ago

> That top was from the host, not the container.

Oh, then depending on your environment you might want to look into https://docs.docker.com/engine/security/userns-remap/ (user namespace remapping). I guess with a private server it's not really that important, but on a public one it makes sense to use subordinate UIDs/GIDs to map UID 0 of the container to some other UID on the host. I should add something to the README about that.

I'll close the issue in a day or two if nothing else comes up.

jurney commented 3 years ago

Oh, good catch... I didn't notice it was running as root. I'm no deep docker expert, so I expected PUID= and PGID= to set the user and group IDs for the container as that's how it's been configured for all my other containers. I'll check out those docs, thanks.