superseriousbusiness / gotosocial

Fast, fun, small ActivityPub server.
https://docs.gotosocial.org
GNU Affero General Public License v3.0

[feature] Track media cache size to help avoid disk space exhaustion #1997

Open hikari-no-yume opened 1 year ago

hikari-no-yume commented 1 year ago

Describe the bug with a clear and concise description of what the bug is.

My VPS just ran out of disk space entirely because the remote media cache got so big. I probably need to adjust the settings to set a reasonable limit on the cache size, or to make the cache be pruned more often, but I was surprised it could get so bad. Could there be some emergency procedure that makes it automatically start clearing stuff if the free disk space goes below 1% or something? That would've saved me.

What's your GoToSocial Version?

0.8.1

GoToSocial Arch

amd64

hikari-no-yume commented 1 year ago

Oh, actually, I guess what I really needed in this case was for GTS to trap “not enough disk space” errors when trying to write stuff, and start evicting things from the cache instead.

I'm not saying GTS has to do this, it's just a feature suggestion to help save servers from spontaneous combustion when misconfigured.

tsmethurst commented 1 year ago

Could there be some emergency procedure that makes it automatically start clearing stuff if the free disk space goes below 1% or something?

Mmm, I see what you mean but I don't think it's really GoToSocial's responsibility to check how much disk space is left on your server...

I was surprised it could get so bad

How bad are we talking? How much disk space ended up getting used? What is your media-remote-cache-days setting currently?

0.8.1

Better to update to a more recent version when you get the chance. The latest 0.10.0 release candidate has some fixes in it for media pruning/caching.

hikari-no-yume commented 1 year ago

How bad are we talking? How much disk space ended up getting used?

I don't know exactly, but the server's / drive is “24.05GB” and 99.9% of drive space was in use (such that new files couldn't be created and gts failed to store incoming messages). I went into the admin interface and told it to clear all remote media older than 0 days, and it rapidly started reducing. It's now one day later and the cache has probably started to fill again, but currently my server is at “56.5% of 24.0GB”, so it was at least 9GB of cached media.

What is your media-remote-cache-days setting currently?

It's media-remote-cache-days: 30, but I don't think this means very much. There isn't any correct value for this, since the amount of space it corresponds to is unpredictable and outside my control. Assuming the daily cleanup job was working (I haven't checked), I must have been unlucky and there was just a lot of remote media in the past 30 days. Is there no setting like "media remote cache size"? Or maybe a size limit on the overall media cache (easier to calculate), exceeding which would trigger emergency pruning? So far as I can tell, there's actually no way I could have avoided the same thing happening again with certainty, short of constantly pruning to zero?

Better to update to a more recent version when you get the chance. The latest 0.10.0 release candidate has some fixes in it for media pruning/caching.

Yeah, I'll try to do so soon. The last time I upgraded was also because of media caching issues (see the description of https://github.com/superseriousbusiness/gotosocial/issues/1713).

tsmethurst commented 1 year ago

It's media-remote-cache-days: 30, but I don't think this means very much.

It means quite a bit! If you set it down to 2 or 3 days, that will save you a lot of disk space.

I do know what you mean though: we currently don't really have a way of checking the overall size of what's being stored within GoToSocial itself. This might be a good thing to bring up in a separate feature request.

hikari-no-yume commented 1 year ago

Hmm, could I change the topic of this request to that? I think that's really what I want: an overall limit on space. Actually, my ideal setup would be something like:

# cache up to 5GB of remote media, prune oldest when full
media-remote-cache-days: 0
media-remote-cache-max-size: 5000000000

Though I don't know if GtS's design would make that practical.

tsmethurst commented 1 year ago

I think our first step would probably be to just give admins an easy way to see how much space is being used currently by storage, via some kind of basic metrics in the admin panel. Setting a hard max size is maybe an extra on top of that, but I'd start with the first thing and we can see from there.

daenney commented 1 year ago

Here's some code that lets you calculate it, at least if you're on 0.10 (which is currently on rc3):

#!/usr/bin/env python3
import argparse
import os
import pathlib
import sys

def get_dir_size(path):
    total = 0
    with os.scandir(path) as it:
        for entry in it:
            if entry.is_file():
                total += entry.stat().st_size
            elif entry.is_dir():
                total += get_dir_size(entry.path)
    return total

def main():
    cli = argparse.ArgumentParser(
        prog="disk-usage",
        description="""Get the size of the remote media cache(s).
        """,
        epilog="Be gay, do crimes. Trans rights!"
    )
    cli.add_argument("storageroot", type=pathlib.Path, help="same value as storage-local-base-path in your GoToSocial configuration")
    cli.add_argument("destination", nargs="?", type=pathlib.Path, help="file to write output to, or stdout if ommitted")
    args = cli.parse_args()

    output = open(args.destination, 'w') if args.destination else sys.stdout
    prefixes = set()

    # Total size of the whole storage root (local media + remote cache combined).
    total_size = get_dir_size(args.storageroot)
    output.write(f"{args.storageroot}: {total_size/(1<<20):,.0f} MB\n")

    # Each line on stdin is a local media file path (from `gotosocial admin
    # media list-local`); dropping the last three path components gives the
    # per-account directory, so we can report how much of the total is local.
    for line in sys.stdin:
        # Skip any log lines
        if "msg=" in line:
            continue
        prefixes.add(os.path.join(os.path.sep, *line.split("/")[:-3]))

    for prefix in prefixes:
        output.write(f"{prefix}: {get_dir_size(prefix)/(1<<20):,.0f} MB\n")

    if output is not sys.stdout:
        output.close()

if __name__ == "__main__":
    main()

You can invoke it like so:

gotosocial --config-path /etc/gotosocial/config.yaml admin media list-local | ./disk-usage.py /data/gotosocial/files

In my case that results in:

/data/gotosocial/files: 4,141 MB
/data/gotosocial/files/XXXXXXXXXXX: 4 MB

So in the end, the remote media cache is about 4137MB for me, or 4.1GB.

mirabilos commented 1 year ago

I solved this for my VM by creating a separate filesystem for GtS storage. This also helps with backups (using --one-file-system): everything is backed up in the same run, but to different restic snapshots (first / and /boot, then /opt/GtS/storage), and with different pruning rules: I keep yearly/monthly/weekly/daily backups for everything else, but only the last two or three daily ones for GtS storage.

Sure, I did have the foresight to do that ahead of time, and converting an existing installation to that setup is kinda hard. (I tried using quotas at first, but that’s just broken currently.)

hikari-no-yume commented 1 year ago

@mirabilos I don't think separate storage for the media cache really solves the problem? It's already in its own directory and I could easily put it in its own filesystem if I wanted to, but if it becomes overfull then GTS is going to stop working, and if I were to delete it then I'd lose my local media.

mirabilos commented 1 year ago

hikari_no_yume dixit:

@mirabilos I don't think separate storage for the media cache really solves the problem? It's already in its own directory and I could easily put it in its own filesystem if I wanted to, but if it becomes overfull then GTS is going to stop working, and if I were to delete it then I'd lose my local media.

It’s not going to stop working, but it is going to stop accepting any new media. (I actually ran into that before 0.8? 0.9? fixed the cleanup.)

If you don’t upload any media yourself until the ENOSPC situation is fixed, you’d not lose anything. Doing a remote media cleanup with days set to 1 or even 0 can do wonders in such a case.

(Of course, “stop uploading new local media” can only work for solo instances really.)

@mirabilos This issue isn't about GTS's media storage in general though, but specifically about splitting local and remote media.

Ah okay.

That might actually be a good idea. I would also love, at the same time, if we could split the media so it has up to 256 top-level subdirectories 00‥FF¹ and the per-account directories are split between these, to keep the directory size and loading times manageable (my storage directory is currently already over 1 MiB in size, and entering it with mc stalls for up to half a minute; serving and adding new media would also benefit from avoiding that slowdown).

(1) like .git/objects/{00‥FF}/

Perhaps local media need not be split up that way and can keep the simple structure, so that we’d have storage/{00‥FF,local}/ as top-level directories…

My current storage only has 4519 actual directory entries; tested by creating a new directory with 4519 subdirectories, that would only be about 70 kB of directory size. Unfortunately, doing the equivalent of a DB vacuum on a directory is not really possible: it can be done by moving the structure to a new directory (unless it’s the filesystem root); e2fsck -fyDC0 can do it, but only offline, it takes a long time, it needs a subsequent e2fsck -fyC0 (without the -D) to fix things up (which takes almost as long), and it’s ext2/3/4 only; reiserfs would work if that were an option; …

But looking at these numbers, supporting 10000 users without a nested structure is easily done.

Hm, maybe just 16 instead of 256 toplevels would already help. So storage/{0‥F,local}/ and for not-local accounts, the hex digit would need to be a hash-of-sorts over the account ID (can be a really tiny one as long as it has good distribution).
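
For illustration, deriving such a shard digit could look something like the following Python sketch (the bucket_for helper and the storage/{0‥F}/<account>/ layout are just the idea sketched above, not anything GtS actually implements):

import hashlib

def bucket_for(account_id: str, buckets: int = 16) -> str:
    """Map an account ID to a stable shard directory name ("0".."f")."""
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    return format(digest[0] % buckets, "x")

# Hypothetical layout: storage/<bucket>/<account_id>/... for remote accounts,
# with storage/local/ kept as-is for local media.
print(bucket_for("01EXAMPLEACCOUNTID"))  # prints a single hex digit

Any small, stable hash works here; the only requirement is that it spreads account IDs evenly across the buckets.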

Oh, sorry, I wrote too much again… //mirabilos

hikari-no-yume commented 1 year ago

(For context, I deleted one of my replies quoted there when I realised it was on this issue rather than https://github.com/superseriousbusiness/gotosocial/issues/1776. Sorry for any confusion.)

hikari-no-yume commented 1 year ago

Hmm, I think I'm going to have to write a cron job to sense when the server's disk space is getting low and panic-delete all remote media. It keeps being a problem.

tsmethurst commented 1 year ago

Have you updated your instance recently? We did some work in 0.11.1 which saves a LOT of disk space (~10GB in my case).

https://github.com/superseriousbusiness/gotosocial/pull/2143

hikari-no-yume commented 1 year ago

I haven't; thanks for letting me know about the improvements! I'm likely to update soon (once #2183 is done).

hikari-no-yume commented 1 year ago

It seems I have the same problem with the newer version I'm using now. I really need to write that cron job.
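
Something along these lines could serve as a starting point for that cron job: it checks free space under the storage root with Python's shutil.disk_usage and, below a threshold, runs a prune command. The prune command here is only a placeholder assumption, not a confirmed GoToSocial CLI invocation; substitute whatever cleanup mechanism your version actually offers (for example the admin-panel "clear remote media older than N days" action used earlier in this thread).

#!/usr/bin/env python3
"""Panic-prune remote media when free disk space runs low (cron sketch)."""
import shutil
import subprocess
import sys

STORAGE_PATH = "/data/gotosocial/files"  # your storage-local-base-path
MIN_FREE_FRACTION = 0.05                 # start pruning below 5% free space

# Placeholder prune command; adjust to whatever your GtS version provides.
PRUNE_CMD = [
    "/usr/local/bin/gotosocial",
    "--config-path", "/etc/gotosocial/config.yaml",
    "admin", "media", "prune", "remote",
]

def main() -> int:
    usage = shutil.disk_usage(STORAGE_PATH)
    free_fraction = usage.free / usage.total
    if free_fraction >= MIN_FREE_FRACTION:
        return 0
    print(f"only {free_fraction:.1%} free under {STORAGE_PATH}, pruning remote media",
          file=sys.stderr)
    return subprocess.run(PRUNE_CMD).returncode

if __name__ == "__main__":
    sys.exit(main())

Run from cron every few minutes (e.g. */15 * * * * /usr/local/bin/gts-disk-watchdog.py), it would at least catch a slow fill-up before the disk is completely full.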

mirabilos commented 1 year ago

hikari_no_yume dixit:

It seems I have the same problem with the newer version I'm using now.

Did you reduce media-remote-cache-days in config.yaml already?

I really need to write that cron job.

Or that, yeah…

Might also be a hit from https://github.com/superseriousbusiness/gotosocial/issues/2146

hikari-no-yume commented 9 months ago

So my VPS kept running out of space, even after setting media-remote-cache-days (edit: that's not true, oops, but I did clear it somewhat regularly). Turns out the other thing eating disk space had been my GoToSocial log file! It got to 5.7GB by the end. I don't think I see a config setting for managing that, so maybe I need an external service to do it for me?

daenney commented 9 months ago

Log file rotation is either managed with something like logrotate, or done automatically by your init-system like in the case of systemd.

You probably already have logrotate on your system, and you can follow a tutorial like this one to configure it: https://www.digitalocean.com/community/tutorials/how-to-manage-logfiles-with-logrotate-on-ubuntu-20-04

The other option is to let GtS log directly to syslog with its syslog integration, as described in the docs: https://docs.gotosocial.org/en/latest/configuration/syslog/

In the case of Docker, you need to configure the logging driver, for example: https://docs.docker.com/config/containers/logging/local/

mirabilos commented 9 months ago

Daenney dixit:

Log file rotation is either managed with something like logrotate

I’m not sure logrotate will (easily) work, as it sits separate from the program that is logging, so it needs to restart GtS every time it rotates the log.

, or done automatically by your init-system like in the case of systemd.

(Or container runtimes, which also tend to catch stdout.)

Or run it under DJB dæmontools (or its clone runit, I suppose) where a separate program (multilog) catches the stdout from GtS and writes it into periodic logfiles.

The other option is to let GtS log directly to syslog with its syslog integration, as described in the docs:

Or that, of course.

bye, //mirabilos

daenney commented 9 months ago

Daenney dixit:

Log file rotation is either managed with something like logrotate

I’m not sure logrotate will (easily) work, as it sits separate from the program that is logging, so it needs to restart GtS every time it rotates the log.

That's why logrotate has copytruncate. It copies the log over and truncates the current file to 0. The logging process can remain unaware of the change as the file/fd it's logging to doesn't change.
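
As a sketch, a minimal logrotate stanza along these lines would do it (the log path is only an example; point it at wherever GtS actually writes its log):

/var/log/gotosocial/gotosocial.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}

copytruncate can lose the few lines written between the copy and the truncate, but in exchange the logging process never needs to be signalled or restarted.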