lloesche / valheim-server-docker

Valheim dedicated gameserver with automatic update, World backup, BepInEx and ValheimPlus mod support
https://hub.docker.com/r/lloesche/valheim-server
Apache License 2.0

Each update grows the utilization of RAM #565

Open manubissanchez opened 1 year ago

manubissanchez commented 1 year ago

Hello,

After each update, a little more RAM is used, and it grows until memory is full (16 GB). I have to reboot for better performance. Is it the same for everyone? What can I do to avoid this?

Thanks!

Stupco commented 1 year ago

Don't think this is a "Question" but an actual issue, just worded poorly?

Also having this issue, though mine seems to happen with every backup.

Attached logs and also RAM usage from my Docker container.

If I leave Valheim running in Docker, it crashes the whole LXC... which is no bueno and does not help a 24/7 server.

As you can see from this screenshot, at around 10:36 (server time): image

The logs are offset by 1hr (due to daylight savings), so at 11:36 it looks like the World backup also starts. The issue is that the RAM increases exponentially with every backup?

_valheim_logs (1).txt

Running in an LXC within Proxmox. I have stopped all other containers, and the RAM increase happens exponentially only when Valheim is running.

Here is an image showing the LXC container crashing after every 30min backup increasing RAM until it crashes: image

Stupco commented 1 year ago

Confirming the issue: if you watch the Portainer stats page during a backup, you can see RAM increase during the backup and then remain at the higher level. This would mean each backup would exponentially add to RAM usage: image

manubissanchez commented 1 year ago

Hello, thanks for your reply. I use Proxmox too and saw exactly the same thing. But since the last update (3 days ago) and UPDATE_CRON="/60 *", the increase in RAM usage is very limited. I'll keep an eye on it anyway.

Edit: maybe the "" cancels updates... I restarted without this environment variable.

Stupco commented 1 year ago

Yeah, I realised that my backups were running every 30 mins, but after reading the docs they should have defaulted to every hour???

Regardless, I've now swapped over to once daily with the CRON variable.

Still, if running backups is for some reason adding RAM usage after every run, then a fix will still be required or we'll still have to be restarting the container every so often...

I also don't know what you mean by your edit? "" cancel updates

manubissanchez commented 1 year ago

I mean, maybe the quotation marks were too much and negated the updates. So the last update of valheim-server-docker does not solve the issue.

The issue is still here: image

skykanin commented 1 year ago

I seem to be running into the same issue when the Docker image does a world save. As you can see in the graph, it causes a slight uptick in CPU and RAM usage during the backup. The CPU utilization goes back to normal, but the memory is never freed unless I restart the Docker container. image

fuse1985 commented 1 year ago

Same for us. Yesterday the server crashed for the first time due to 100% RAM usage. It starts at around 3.8 GB and goes up by about 1 GB per step.

opello commented 1 year ago

Do you know where the world save is landing? If it's in some tmpfs this might make sense.

Otherwise, with such a reproducible problem, it should be pretty simple to track down. Take some snapshots of usage (from one of /proc/<pid>/{stat,status,statm} where status is most approachable) and compare over time.
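A minimal sketch of that snapshot comparison, assuming a Linux host and that the PID of the server process is known. The helper names (parse_status, snapshot, growth) are made up for illustration and the sample values are invented, not taken from anyone's server:

```python
def parse_status(text):
    """Return the memory fields of a /proc/<pid>/status dump, in kB."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        parts = value.split()
        # Memory lines look like "VmRSS:\t  373348 kB"; skip everything else.
        if key.startswith(("Vm", "Rss")) and len(parts) == 2 and parts[1] == "kB":
            fields[key] = int(parts[0])
    return fields


def snapshot(pid):
    """Read and parse /proc/<pid>/status for a live process (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        return parse_status(fh.read())


def growth(before, after):
    """Fields that changed between two snapshots, as signed kB deltas."""
    return {k: after[k] - before[k]
            for k in after if k in before and after[k] != before[k]}


if __name__ == "__main__":
    # Invented sample data; on a real system call snapshot(pid) before and
    # after a backup or update run instead.
    earlier = parse_status("VmRSS:\t 3000000 kB\nVmSwap:\t       0 kB\n")
    later = parse_status("VmRSS:\t 3000000 kB\nVmSwap:\t  120000 kB\n")
    print(growth(earlier, later))
```

If VmRSS stays flat across several backup or update runs while the container graph keeps climbing, the growth is probably not coming from that process's own allocations.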

Zedifuu commented 1 year ago

Thanks for the role since I've talked on this as well but I am also facing the same issue. In the meantime I've mitigated it by essentially just setting up a cron job to restart the container every week. As you can imagine, that's not ideal...

When I have some time I'll exec in and poke about..

lloesche commented 1 year ago

I've tried to replicate the issue but couldn't. Can you see if there's a process that's getting stuck with each backup? Do you have any pre/post backup hooks that might hang?

For memory to leak there has to either be a continuously running process that's allocating more memory than it is freeing, or there have to be new processes occupying memory that are adding up with each backup run.

Looking at the backup script, the external commands it calls are zip and cp for the backup, and find, sort, cut, tail, xargs and rm for the cleanup of old backups.

Steps that would help with debugging the issue, next time it happens:

  1. exec into the container and get a process list. See if there's an excessive number of any of the aforementioned external commands running.
  2. identify how much memory each of the processes running inside the container is occupying. This can be done inside the container or on the host system.
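The two steps above could be combined roughly like this. It is only a sketch: it assumes a procps-style ps is available where you run it (inside the container or on the host), and summarize and live_summary are hypothetical helper names. The sample numbers in the demo are invented:

```python
import subprocess
from collections import defaultdict


def summarize(ps_output):
    """Group `ps -eo rss=,comm=` output by command name.

    Returns {command: (process_count, total_rss_kib)}.
    """
    totals = defaultdict(lambda: [0, 0])
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank or malformed lines
        rss_kib, command = int(parts[0]), parts[1].strip()
        totals[command][0] += 1
        totals[command][1] += rss_kib
    return {cmd: (count, rss) for cmd, (count, rss) in totals.items()}


def live_summary():
    """Run ps and summarize its output (assumes procps ps is installed)."""
    out = subprocess.run(["ps", "-eo", "rss=,comm="],
                         capture_output=True, text=True, check=True)
    return summarize(out.stdout)


if __name__ == "__main__":
    # Invented sample instead of a live call, so the demo runs anywhere.
    sample = "373348 valheim_server.\n  1200 zip\n  1180 zip\n   900 cron\n"
    ranked = sorted(summarize(sample).items(),
                    key=lambda item: item[1][1], reverse=True)
    for command, (count, rss_kib) in ranked:
        print(f"{command:<16} x{count:<3} {rss_kib} KiB")
```

A pile-up of zip, cp or find entries after several backup runs would point at a hung backup step. A single large valheim_server entry would point at the game server itself.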

I also went over the valheim-backup script itself but couldn't see any obvious spot where it would leak memory. As far as I can tell there is no point where it appends any data or otherwise uses a variable that grows in size. But I would appreciate a second pair of eyes. The script is only like 160 loc and fairly straightforward.

fuse1985 commented 1 year ago

We disabled the Valheim backup function and only use the backup function that comes with this Docker image, but it's still using a lot of memory...

manubissanchez commented 1 year ago

Hello :)

After monitoring for 2 days, I can tell that it's the "buff/cache" value that grows, according to the "free" command. The used memory is stable at 3Gi. OS: Ubuntu Server 22.04.1

opello commented 1 year ago

Then you should monitor for growth in /proc/meminfo and see which entry grows. And also review the meminfo section of proc.txt for more detail.

But the conventional wisdom is that if it's in buffers and cache, it's not really "used" because the kernel can give it to another process. See also: free(1) which explains the fields, and then from proc.txt above:

     Buffers: Relatively temporary storage for raw disk blocks
              shouldn't get tremendously large (20MB or so)
      Cached: in-memory cache for files read from the disk (the
              pagecache).  Doesn't include SwapCached

So if there's really a leak and it's only manifesting as buff/cache from free growing, I don't think you've found the indicator.
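For anyone who wants to see which /proc/meminfo entry is growing, here is a small sketch. The function names parse_meminfo and largest_growth are hypothetical, and the two snapshots in the demo are invented numbers. On a real system you would save the contents of /proc/meminfo before and after a few backup or update runs and feed both in:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into {field: value}; most values are in kB."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def largest_growth(before, after, top=5):
    """Return up to `top` (field, delta) pairs that grew, largest first."""
    deltas = [(name, after[name] - before[name])
              for name in after if name in before]
    grown = [d for d in deltas if d[1] > 0]
    return sorted(grown, key=lambda d: d[1], reverse=True)[:top]


if __name__ == "__main__":
    # Invented snapshots for illustration only.
    before = parse_meminfo(
        "MemTotal: 8000000 kB\nMemFree: 3000000 kB\n"
        "Buffers: 20000 kB\nCached: 1000000 kB\n")
    after = parse_meminfo(
        "MemTotal: 8000000 kB\nMemFree: 1500000 kB\n"
        "Buffers: 40000 kB\nCached: 2480000 kB\n")
    for name, delta in largest_growth(before, after):
        print(f"{name}: +{delta} kB")
```

If the list is dominated by Cached and Buffers, that matches the page cache explanation rather than a leak.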

opello commented 1 year ago

@fuse1985 if you look in the running Docker container, and your backups are going to /backups, what does df -h /backups show? (or wherever your backups are going)

fuse1985 commented 1 year ago

> @fuse1985 if you look in the running Docker container, and your backups are going to /backups, what does df -h /backups show? (or wherever your backups are going)

backup

Those are my backup settings btw, so only 1 backup a day and no backups from the game settings:

BACKUPS_CRON=0 3 * * *
BACKUPS_MAX_AGE=3
SERVER_ARGS=-backups 0 -saveinterval 1200

And those 4 backups are about ~150MB each.

opello commented 1 year ago

I was thinking it could be a tmpfs, but doesn't seem to be... Seems like you must also be setting BACKUPS_DIRECTORY if it's not /config/backups though?

manubissanchez commented 1 year ago

I added the "BACKUPS_MAX_COUNT=3" environment variable and now the RAM usage is stable at around 3 GiB. So I conclude that each backup was being kept in cache in some way.

opello commented 1 year ago

If that's the case you should be able to recover the memory using /proc/sys/vm/drop_caches:

       /proc/sys/vm/drop_caches (since Linux 2.6.16)
              Writing to this file causes the kernel to drop clean
              caches, dentries, and inodes from memory, causing that
              memory to become free.  This can be useful for memory
              management testing and performing reproducible filesystem
              benchmarks.  Because writing to this file causes the
              benefits of caching to be lost, it can degrade overall
              system performance.

              To free pagecache, use:

                  echo 1 > /proc/sys/vm/drop_caches

              To free dentries and inodes, use:

                  echo 2 > /proc/sys/vm/drop_caches

              To free pagecache, dentries, and inodes, use:

                  echo 3 > /proc/sys/vm/drop_caches

              Because writing to this file is a nondestructive operation
              and dirty objects are not freeable, the user should run
              [sync(1)](https://man7.org/linux/man-pages/man1/sync.1.html) first.

[1] https://man7.org/linux/man-pages/man5/proc.5.html

jonvel commented 1 year ago

This is the result of running the above that opello suggests (taken from portainer GUI):

Post_dropcache

ie:

# sync
# echo 3 > /proc/sys/vm/drop_caches

Note that this was run on the host system, not the container. Note that the container I'm running is running with the defaults (internal backups every 15 minutes, create worlds-YYYYmmdd-HHMMSS.zip backup every 60 minutes), and running for about 2 days-ish. Note that the game db is roughly about 50MB give or take.

Note also that the performance of the system seemed fine. The system never reported that all that much RAM was being consumed on the host (free never showed more than 7-8 gigs of usage, not bad considering its hosting 2 Valheim services and 3 Minecraft worlds), and the physical storage isn't too much either.

ciphersimian commented 1 year ago

It's completely normal and advantageous for Linux to use memory for caching when there is memory available. If something needed this memory it would be made available to it. Dropping the caches when you aren't doing some kind of testing where you don't want the effects of caching to affect your results is just going to hurt performance for no reason.

opello commented 1 year ago

If dropping the caches recovers the memory thought to be going to a leak, it's safe to do nothing going forward.

jonvel commented 1 year ago

While I agree that it's advantageous for a Linux system to fully utilize any available RAM for buffers/cache, I find it odd that the amount of RAM consumed continues to increase in lockstep with the valheim-updater service in the container, which is triggered every 15 minutes by default. ~Surely, there can't be any appreciable amount of performance gained by the Linux system for caching these .zip files every 15 minutes+?~ (EDIT: removed this line, as I was misreading the cron ENV var I set up; the 15 minutes is how frequently the updater runs, not the backup service.) I'll let the system run for about a week+ and see what the buff/cache looks like after that time period. Based on the relevant sar -r data I've been tracking for a few days, though, the buff/cache size isn't really increasing all that much. I'll also see how long it takes for the system to run out of RAM (Free+Swap) and crash, as that appears to be the case for at least some people in this thread.

Besides, if this was an issue with the host consuming more buff/cache, then this wouldn't show up through the container stats view in portainer. The other containers that I'm running on the host (a couple of Minecraft docker containers) are better-behaved, and don't exhibit the steady, continuous rise in RAM usage as reported by the container in 15 minute increments exactly corresponding with the valheim-updater process, even when actively used. Either way, I'll actively not maintain the server (ie no stop/start of the valheim container) for the next 2 weeks and hopefully remember to report back here what I've found.

opello commented 1 year ago

@jonvel Please also collect some additional data:
docker stats --no-stream --no-trunc <container>

I imagine this is an artifact of an older portainer, prior to v1.20.0 which introduced separating out the cache from the memory usage:

Since docker stats ignores caches, having that along with the other data collected should be helpful in confirming.


> Surely, there can't be any appreciable amount of performance gained by the Linux system for caching these .zip files every 15 minutes+?

On a system with low file I/O, it seems reasonable that any written data that can be kept in cache would be kept, on the off chance it is read again, especially since it can easily be evicted.

> Besides, if this was an issue with the host consuming more buff/cache, then this wouldn't show up through the container stats view in portainer.

Per https://github.com/portainer/portainer/issues/2380 it depends on the version. Based on your graph I expect yours is prior to v1.20.0 since it doesn't have the cache separate on the graph. The drop_caches affecting the graph based on your earlier comment also supports this conclusion.

As for your Minecraft containers, they probably aren't writing out new files but updating existing ones? I guess I'm not real familiar with how Minecraft persists changes at the server.

jonvel commented 1 year ago

@opello - no, running the portainer container from about 2 weeks ago (v2.19.0, but Community Edition, because I'm cheap). However, the interesting parts here are:

CONTAINER ID        NAME              CPU %     MEM USAGE / LIMIT     MEM %     NET I/O          BLOCK I/O         PIDS
3da1bdd5e346c1adf   valheim.service   21.78%    4.627GiB / 30.71GiB   15.07%    8.29GB / 1.7GB   16.4GB / 37.3GB   64

Note that one of these days, I'll do better tuning to restrict how much RAM the container can use, but I don't do this professionally, and this is really only on a whim to learn more about containerization. The Minecraft services were mostly irrelevant, other than to point that I found it odd that so much "RAM" was being consumed. But that's fine. I'll let it sit for a while longer, and see what the stats look like in the future.

Also - I'll edit my response above, but the steps up in usage occur not when a backup is created but when the valheim-updater process runs (once every 15 minutes).

opello commented 1 year ago

@jonvel, sorry for the misread, I was stuck thinking about backups and you'd said it was the updater.

This is starting to make sense after reviewing the valheim-updater code and the stdout log. When it runs it does a few things. Most relevant to this issue, it asks steamcmd.sh to +app_update, but it also passes STEAMCMD_ARGS, which according to the README.md (and the behavior in my log anyway) includes validate:

INFO - Downloading/updating/validating Valheim server from Steam
Redirecting stderr to '/home/valheim/Steam/logs/stderr.txt'
Logging directory: '/home/valheim/Steam/logs'
[  0%] Checking for available updates...
[----] Verifying installation...
Steam Console Client (c) Valve Corporation - version 1694466999
-- type 'quit' to exit --
Loading Steam API...dlmopen steamservice.so failed: steamservice.so: cannot open shared object file: No such file or directory
OK

Connecting anonymously to Steam Public...OK
Waiting for client config...OK
Waiting for user info...OK
 Update state (0x5) verifying install, progress: 0.14 (2097152 / 1515942651)
 Update state (0x5) verifying install, progress: 21.70 (329012463 / 1515942651)
 Update state (0x5) verifying install, progress: 52.05 (789104462 / 1515942651)
 Update state (0x5) verifying install, progress: 84.08 (1274660014 / 1515942651)
Success! App '896660' fully installed.
.d..t...... ./
INFO - Valheim Server is already the latest version

https://developer.valvesoftware.com/wiki/SteamCMD#Downloading_an_App

To also validate the app, add validate to the command.

So, during the update every 15 minutes, steamcmd is validating the download, which means it's reading the ~1.4GiB of the download directory from disk. This makes sense to load into cache and not hit the slower disk if there's available memory. I think it's "not a concern" but if you disagree you should be able to remove validate from STEAMCMD_ARGS.
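As a quick sanity check on that figure, here is the arithmetic from the log above. The byte total comes from the "verifying install" lines and the 15 minute interval from earlier in this thread. That each pass reads the full install from disk or cache is my assumption, not something the log proves:

```python
INSTALL_BYTES = 1_515_942_651  # total from the "verifying install" log lines
UPDATE_INTERVAL_MIN = 15       # updater interval reported in this thread


def gib(n_bytes):
    """Convert a byte count to GiB (2**30 bytes)."""
    return n_bytes / 2**30


passes_per_day = 24 * 60 // UPDATE_INTERVAL_MIN

print(f"{gib(INSTALL_BYTES):.2f} GiB read per validate pass")  # 1.41 GiB
print(f"{passes_per_day} validate passes per day")             # 96
```

So the install directory is about 1.41 GiB, and with validate enabled it is re-read 96 times a day. If there is free memory, it is reasonable for the kernel to keep those files cached.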

I'm not sure why cache isn't separate from memory in your latest portainer version, but the fact that drop_caches makes the number go down clearly shows that caches are included in the value.

DiskCrasher commented 10 months ago

I'm seeing similar issues where my swap is getting maxed out and slowing everything down to the point where the server connection is timing out:

Output from free -h:

              total        used        free      shared  buff/cache   available
Mem:          7.7Gi       2.9Gi       119Mi        22Mi       5.0Gi       4.8Gi
Swap:         2.0Gi       2.0Gi        33Mi

Output from top with SWAP column:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM   SWAP     TIME+ COMMAND
10889 root      20   0 9915528 373348  53976 S 10.60 4.632 1.808g  39:21.68 valheim_server.

In preliminary troubleshooting, this does appear to happen after the update check.