tetratelabs / wazero

wazero: the zero dependency WebAssembly runtime for Go developers
https://wazero.io

[question] Ever-increasing rate of allocations #2307

Open tsmethurst opened 2 weeks ago

tsmethurst commented 2 weeks ago

Hello hello! Firstly I wanna say a big thank you for wazero: it's enabled a lot of really funky stuff for GoToSocial, such as running ffmpeg as a wasm module, which has been really exciting to implement!

Unfortunately, despite all that wazero brings to GtS, we're having a bit of trouble with one particular issue as we iron out the implementation on the way to our first release that includes this code. Specifically, we've noticed recently in our metrics on a running GoToSocial instance that the rate of allocations seems to increase slowly over time. For example, in the following graph:

[Screenshot from 2024-08-28 14-13-22: allocation rate graph]

Here you can see that over a period of a couple of days, the rate at which objects are allocated seems to increase. This leads to spikier memory usage overall, increased garbage collection, and eventually OOM errors as the OS kills the GtS process.

For comparison, here's a screenshot of the allocations on a previous version of GoToSocial before we started using wazero for ffmpeg.

[image: allocations graph from a previous GoToSocial version, before wazero]

To try to figure out what's happening, I've been looking through our allocs via pprof, and the results have been frustratingly tight-lipped about what the issue might actually be. I also tried running some tests where I repeatedly reprocess images, but I can't seem to reproduce the problem on a testbed.

So I'm wondering, do you all have any idea what might be happening here, or know of any reason why the rate of allocations might be increasing over time in this way? For a bit of context, this is how we currently instantiate wazero modules: https://github.com/superseriousbusiness/gotosocial/blob/main/internal/media/ffmpeg/wasm.go. I.e., we create the runtime once, compile the module once, and then instantiate a module from the compiled module every time we run.
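Roughly, the pattern looks like this. This is a simplified sketch rather than the exact GtS code from the link above; the embedded ffmpeg.wasm path, the loop, and the module config are just illustrative:

```go
package main

import (
	"context"
	_ "embed"
	"log"

	"github.com/tetratelabs/wazero"
	"github.com/tetratelabs/wazero/imports/wasi_snapshot_preview1"
)

// Illustrative embed; the real code loads its own ffmpeg/ffprobe wasm binaries.
//go:embed ffmpeg.wasm
var ffmpegWasm []byte

func main() {
	ctx := context.Background()

	// Create the runtime once for the lifetime of the process.
	rt := wazero.NewRuntime(ctx)
	defer rt.Close(ctx)

	// ffmpeg-style modules need WASI; instantiate it once per runtime.
	wasi_snapshot_preview1.MustInstantiate(ctx, rt)

	// Compile the module once up front.
	compiled, err := rt.CompileModule(ctx, ffmpegWasm)
	if err != nil {
		log.Fatal(err)
	}

	// For every run, instantiate from the compiled module and close the
	// instance when done so its linear memory can be released.
	for i := 0; i < 3; i++ {
		mod, err := rt.InstantiateModule(ctx, compiled,
			wazero.NewModuleConfig().WithName("")) // anonymous instance;
			// a real run would also set args/stdio on the ModuleConfig.
		if err != nil {
			log.Fatal(err)
		}
		// ... call the module's exported functions here ...
		if err := mod.Close(ctx); err != nil {
			log.Fatal(err)
		}
	}
}
```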

Thanks for reading :)

ncruces commented 2 weeks ago

Thanks for the report. Are these Go (managed/garbage collected) memory allocs, or memory usage? I guess what I'm asking is how are these collected?

Also, is there a way to disable this specific usage of wazero, to make sure this is the culprit? Looking at the code, nothing jumps out.

tsmethurst commented 2 weeks ago

Thanks for the reply!

Are these Go (managed/garbage collected) memory allocs, or memory usage? I guess what I'm asking is how are these collected?

These are gathered using OpenTelemetry, exported using OTel's Prometheus exporter, and then graphed in a Grafana instance. The setup for this in GtS is done here: https://github.com/superseriousbusiness/gotosocial/blob/main/internal/metrics/metrics.go. The graph shown just uses a Prometheus rate query over go_memstats_mallocs_total, so nothing fancy is going on there.
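(For reference, go_memstats_mallocs_total as exposed by the standard Prometheus Go collector is the cumulative runtime.MemStats.Mallocs counter, so the panel is effectively plotting allocated-objects-per-second. Something like the following sketch measures the same thing in-process as a cross-check; the 10-second window here is an arbitrary choice, not what the Grafana query uses.)

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	var prev runtime.MemStats
	runtime.ReadMemStats(&prev)

	for {
		time.Sleep(10 * time.Second)

		var cur runtime.MemStats
		runtime.ReadMemStats(&cur)

		// MemStats.Mallocs is the cumulative count of heap objects allocated,
		// which is what go_memstats_mallocs_total exports; this is roughly
		// rate(go_memstats_mallocs_total[...]) computed in-process.
		perSec := float64(cur.Mallocs-prev.Mallocs) / 10
		fmt.Printf("allocs/sec: %.0f\n", perSec)

		prev = cur
	}
}
```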

Also, is there a way to disable this specific usage of wazero, to make sure this is the culprit?

I'll see if I can do some further isolation to test whether this is indeed a wazero problem; it's proving a bit annoying to replicate. My current plan for narrowing things down is to deploy some code that recompiles the ffmpeg and ffprobe modules every 200 uses or so, just to see if maybe the compiled module is holding onto something it shouldn't be.
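Something along these lines; this is only a sketch of the experiment, where the wrapper type and the 200-use threshold are illustrative rather than the actual GtS code:

```go
package ffmpeg

import (
	"context"
	"sync"

	"github.com/tetratelabs/wazero"
)

const recompileEvery = 200 // arbitrary threshold for the experiment

type moduleCache struct {
	mu       sync.Mutex
	rt       wazero.Runtime
	wasm     []byte
	compiled wazero.CompiledModule
	uses     int
}

// get returns a compiled module, recompiling it after recompileEvery uses to
// test whether a long-lived CompiledModule is what accumulates allocations.
func (c *moduleCache) get(ctx context.Context) (wazero.CompiledModule, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if c.compiled == nil || c.uses >= recompileEvery {
		if c.compiled != nil {
			// Release resources held by the old compiled module. NOTE: a real
			// implementation would make sure no in-flight instantiations still
			// reference it before closing.
			if err := c.compiled.Close(ctx); err != nil {
				return nil, err
			}
		}
		compiled, err := c.rt.CompileModule(ctx, c.wasm)
		if err != nil {
			return nil, err
		}
		c.compiled = compiled
		c.uses = 0
	}
	c.uses++
	return c.compiled, nil
}
```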

Looking at the code, nothing jumps out.

Thank you for taking a look! I'm glad that, at first sight at least, we're not doing anything ridiculous :')