nolanlawson opened 3 years ago
If Mastodon is just shelling out to `ffmpeg`, maybe there could be an option to simply prepend all calls with a `nice` value set by the admin.
Surprisingly, I have few such issues on my smallish server (though it's a single-user instance), but I can definitely see `ffmpeg` and `imagemagick` being the most resource-hungry processes in Mastodon.
`nice` would help with the CPU load but not with the RAM usage, which could be a limiting factor too (especially when `imagemagick` is involved).
Being able to throttle ffmpeg/imagemagick processes would be ideal, but it's pretty tricky as they can be spawned from multiple worker classes… and also in response to synchronous requests (typically `/api/v1/media` and the profile endpoints, at the very least). So processing those requests would have to either ignore the throttling (which would limit its usefulness, though if the bulk of your issues come from federation it would still help), or we would have to introduce the ability for these requests to be rejected (which would badly degrade the user experience).
Thanks for the reply. I know `ffmpeg` is invoked as a blocking operation for some user-facing features like media upload. I guess my thought process was that the request could be allowed to hang for a long time while waiting in the queue. And yes, possibly time out with an error. But in my case, it would be preferable for one or two users to get an error than for the whole server to go down. :slightly_smiling_face:
I suppose the ideal solution would be for ffmpeg processing to be an asynchronous background operation? As in, serve a lightly-processed version of the video (or even the raw video) while waiting for the fully-processed version? Facebook apparently does something like this.
The v2 media upload endpoint is asynchronous, but we have not yet dropped support for v1. Other than that, most use of ffmpeg is from the workers, but there is no separation between ffmpeg workers and everything else. The first step would be splitting the media processing work out from the other workers. Once that is done, there are sidekiq middleware gems for rate limiting.
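For illustration, here is a minimal sketch of what that could look like with the third-party sidekiq-throttled gem; the worker class and queue name are hypothetical, since Mastodon does not currently have a dedicated transcoding worker:

```ruby
# Gemfile: gem "sidekiq-throttled"
require "sidekiq"
require "sidekiq/throttled"

# Hypothetical worker split out from the general-purpose workers,
# as described above.
class MediaTranscodeWorker
  include Sidekiq::Job
  include Sidekiq::Throttled::Job

  sidekiq_options queue: "media_transcode"

  # Allow at most two transcodes to run at once across the whole cluster.
  sidekiq_throttle concurrency: { limit: 2 }

  def perform(media_attachment_id)
    # ...locate the attachment and shell out to ffmpeg here...
  end
end
```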
From #20064: Offer flexibility to limit load caused by video transcoding
We should have better control over `ffmpeg` to limit the danger of DoS effects from multiple parallel video uploads.
Recently we experienced high load on our instance (running on a single 2-core machine), which made it impossible to upload media to new posts in the web UI and also slowed things down notably in general.
We saw that there were 5 to 6 ffmpeg processes running in parallel. The entire load phase lasted for more than one hour in the worst case.
[Figure in the original issue: a visualization of the CPU load during that time.]
Unfortunately I was unable to find any means to control these jobs.
I checked the sidekiq dashboard but didn't find anything related. This made me wonder: are video transcoding jobs even managed through sidekiq? The process tree shown by `ps faux` indicates to me that the `ffmpeg` commands are not executed through sidekiq workers, which makes me wonder why. Ideally these would be sidekiq jobs in a separate queue which we could offload to a separate machine.
It occurs to me that a simple concurrency limit configuration for this type of job would make sense, so that the likelihood of fully saturating the CPU is reduced.
Also, limiting the execution time (and then failing the job without retry) would mean that videos that pass the upload size limit (which I found to be 40 MB) but are still too complex to transcode would not block the instance.
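As a sketch of that execution-time limit, one way to give each ffmpeg invocation a hard wall-clock budget in Ruby (the helper name and the 10-minute limit are made up for illustration):

```ruby
require "timeout"

# Hypothetical helper: kill ffmpeg once it exceeds a wall-clock budget,
# so a pathological upload cannot occupy a core indefinitely.
def transcode_with_deadline(ffmpeg_args, seconds: 600)
  pid = Process.spawn("ffmpeg", *ffmpeg_args)
  Timeout.timeout(seconds) { Process.wait(pid) }
  $?.success?
rescue Timeout::Error
  Process.kill("KILL", pid)
  Process.wait(pid) # reap the killed process
  false
end
```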
Lastly, it would be beneficial if we could customize how `ffmpeg` is called. For reasons unknown to me there is code to read an `FFMPEG_BINARY` environment variable here:
https://github.com/mastodon/mastodon/blob/v4.0.0rc2/config/initializers/ffmpeg.rb
From my tests, and from looking at how ffmpeg gets executed, this variable has no effect.
Being able to point this variable to a wrapper script, to apply tools like `cpulimit` and `nice`, seems like a good way to give a lot more power to admins to limit the blast radius.
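A minimal sketch of such a wrapper, assuming `FFMPEG_BINARY` were actually honored when spawning the process (the binary path is illustrative):

```ruby
#!/usr/bin/env ruby
# Hypothetical wrapper script to point FFMPEG_BINARY at.
# Runs the real ffmpeg at the lowest scheduling priority so puma and
# sidekiq keep getting CPU time.
REAL_FFMPEG = "/usr/bin/ffmpeg"

# exec replaces this process, so ffmpeg's exit status and signals
# propagate back to the caller unchanged.
exec("nice", "-n", "19", REAL_FFMPEG, *ARGV)
```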
Mastodon is becoming more important. Let's also reduce its attack surface and resource consumption where reasonably possible.
> I checked the sidekiq dashboard but didn't find anything related. This made me wonder: are video transcoding jobs even managed through sidekiq? The process tree shown by `ps faux` indicates to me that the `ffmpeg` commands are not executed through sidekiq workers, which makes me wonder why. Ideally these would be sidekiq jobs in a separate queue which we could offload to a separate machine.
Transcoding should happen in background jobs, but `ffmpeg` can occasionally be called from `puma` for cheaper tasks you want an immediate result from, like extracting a thumbnail. It's also used synchronously in an older version of the media upload API that is expected to only return after the media file has been processed. We should remove that older endpoint at some point.
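For reference, a rough sketch of the asynchronous v2 flow from a client's perspective (instance host and token are placeholders): the upload returns 202 Accepted while processing continues in the background, and the client polls the attachment until it stops returning 206.

```ruby
require "net/http"
require "json"

# Placeholder instance and credentials.
host  = "mastodon.example"
token = ENV.fetch("MASTODON_TOKEN")

http = Net::HTTP.new(host, 443)
http.use_ssl = true

# v2 upload: returns 202 Accepted immediately; ffmpeg runs in the background.
upload = Net::HTTP::Post.new("/api/v2/media")
upload["Authorization"] = "Bearer #{token}"
upload.set_form([["file", File.open("video.mp4")]], "multipart/form-data")
media_id = JSON.parse(http.request(upload).body)["id"]

# Poll: 206 Partial Content while processing, 200 OK once the file is ready.
loop do
  poll = Net::HTTP::Get.new("/api/v1/media/#{media_id}")
  poll["Authorization"] = "Bearer #{token}"
  break if http.request(poll).code == "200"
  sleep 2
end
```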
Thanks for the explanation @ClearlyClaire! The info here in this issue, which I only learned about after posting mine, indeed helped.
So I have to assume that in our case the v1 endpoint got used. Indeed, deprecating and phasing out that endpoint would be one part of the solution.
I'd be happy if the other suggestions could be evaluated anyway.
AIUI even with the v2 API, it is possible for the server to get swamped.
For context, I run a small instance with a few hundred users on a t3.medium EC2 instance (2 vCPUs, 4GB memory). For whatever reason, if someone tries to upload a bunch of videos at once, it seems to cause the server to become unresponsive until I reboot it. Maybe this person is using the v1 API, but it seems to me that the v2 API would be able to cause this too. (Maybe it's due to the memory usage rather than CPU usage? The server usually runs close to the limit, around 3.5GB/4GB.)
This is not acceptable behaviour.
I just had one ffmpeg process take up pretty much all processing power for a solid 5 minutes on one of my 8-core VMs that are exclusively dedicated to puma. As a result all web requests to that VM became very unresponsive. There definitely needs to be some throttling or a way to move all ffmpeg invocations into workers.
This can be an issue even on larger servers. The solution in #14371 helped me greatly when running multiple instances on the same server, because I was running into this issue frequently. Still, it's probably not useful for most people.
I do think that this can cause a massive problem in the network. When multiple videos/animated GIFs are federated in short intervals (with nefarious intentions or not), it can be an issue.
I completely agree with the suggestions from @nolanlawson and @marians to limit the number of `ffmpeg` transcodes that can run simultaneously, and also to add a timeout after X seconds. It's better to have a 'media not available' notice on a few federated posts (the user clicks through and accesses the remote original media) than to have a server become unresponsive.
Chiming in as well, I've been getting multiple errors (below) as a result of ffmpeg processing on my instance. All the errors correlate to this ffmpeg process basically taking over operations, and in most cases, there's not even any logging after this process starts (until it's killed).
`yunohost bundle[46113]: [46113] ! Terminating timed out worker (worker failed to check in within 60 seconds): 1361544`
Maybe we can consider utilizing an external server for video transcoding. I have an article detailing my recent work related to this topic.
Adding my voice to this issue. Running a small instance on 2 cores and 4 GB of RAM (t3a.medium on AWS), whenever I upload videos the CPU jumps to 100% usage, rendering the instance unusable and even sometimes crashing it so hard I have to restart all processes. Not fun.
ffmpeg is already parallel, so it makes little sense to run so many ffmpeg processes; we really need to be able to throttle their number. Otherwise a lot of memory is spent, and all these threads only step on each other in the caches, so in the end it's slower than using just one or two ffmpeg processes.
Like PeerTube, we want to choose the number of ffmpeg processes launched. If we were able to add a prefix to the ffmpeg command, we could use e.g.
`parallel --fg --semaphorename toot-ffmpeg -j 8`
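Combined with the `FFMPEG_BINARY` wrapper idea above, that prefix could enforce a machine-wide cap without touching Mastodon itself; a hypothetical Ruby version of such a wrapper (the path and the limit of 2 are illustrative):

```ruby
#!/usr/bin/env ruby
# Hypothetical FFMPEG_BINARY wrapper: GNU parallel's counting semaphore
# ("toot-ffmpeg") blocks until fewer than two transcodes are running on
# this machine, then runs the real ffmpeg in the foreground at low priority.
exec("parallel", "--fg", "--semaphorename", "toot-ffmpeg", "-j", "2",
     "nice", "-n", "19", "/usr/bin/ffmpeg", *ARGV)
```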
Pitch
Running a small Mastodon instance on a single server, where Puma, Sidekiq, streaming, Postgres, and everything else all live together, `ffmpeg` is a huge liability. Just running two or three `ffmpeg` processes simultaneously, on my particular server, is enough to consume 2 vCPUs and a lot of the available memory. If too many `ffmpeg` processes are running at once, the server can become unresponsive and I need to restart it.

I would like a server setting to limit the number of `ffmpeg` processes (and possibly `imagemagick` processes) that can run simultaneously. For instance, on my server, I might set this to 1 or 2.

The downside is that users would have to wait longer when they're uploading videos, or when videos are federated from other instances. But this seems like a small price to pay so that the server doesn't go down entirely. :slightly_smiling_face:
Motivation
Server admins running small instances, where they can't easily isolate `ffmpeg` to its own machine where it doesn't impact the app server, would benefit a lot from being able to keep `ffmpeg` processes under control.