nolanlawson opened 3 years ago
If Mastodon is just shelling out to `ffmpeg`, maybe there could be an option to simply prepend all calls with a `nice` value set by the admin.
Surprisingly, I have few such issues on my smallish server (though it's a single-user instance), but I can definitely see `ffmpeg` and `imagemagick` being the most resource-hungry processes in Mastodon.
`nice` would help with the CPU load but not with the RAM usage, which could be a limiting factor too (especially when `imagemagick` is involved).
Being able to throttle ffmpeg/imagemagick processes would be ideal, but it's pretty tricky as they can be spawned from multiple worker classes… and also in response to synchronous requests (typically `/api/v1/media` and the profile endpoints, at the very least). So processing those requests would have to either ignore the throttling (which would limit its usefulness, though if the bulk of your issues come from federation it would still help), or we would have to introduce the ability for these requests to be rejected (which would badly degrade the user experience).
Thanks for the reply. I know `ffmpeg` is invoked as a blocking operation for some user-facing features like media upload. I guess my thought process was that the request could be allowed to hang for a long time while waiting in the queue. And yes, possibly time out with an error. But in my case, it would be preferable for one or two users to get an error than for the whole server to go down. :slightly_smiling_face:
I suppose the ideal solution would be for ffmpeg processing to be an asynchronous background operation? As in, serve a lightly-processed version of the video (or even the raw video) while waiting for the fully-processed version? Facebook apparently does something like this.
The v2 media upload endpoint is asynchronous, but we have not yet dropped support for v1. Other than that, most use of ffmpeg is from the workers, but there is no separation between ffmpeg workers and everything else. The first step would be splitting the media processing work out from the other workers. Once that is done, there are sidekiq middleware gems for rate limiting.
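For illustration, here is a minimal sketch of what that could look like with the third-party sidekiq-throttled gem; the worker class and queue name are hypothetical, since Mastodon does not currently have a dedicated transcoding worker:

```ruby
# Gemfile: gem "sidekiq-throttled"
require "sidekiq"
require "sidekiq/throttled"

# Hypothetical worker split out from the general-purpose workers,
# as described above.
class MediaTranscodeWorker
  include Sidekiq::Job
  include Sidekiq::Throttled::Job

  sidekiq_options queue: "media_transcode"

  # Allow at most two transcodes to run at once across the whole cluster.
  sidekiq_throttle concurrency: { limit: 2 }

  def perform(media_attachment_id)
    # ...locate the attachment and shell out to ffmpeg here...
  end
end
```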
From #20064: Offer flexibility to limit load caused by video transcoding
We should have better control over `ffmpeg` to limit the danger of DoS effects from multiple parallel video uploads.
Recently we experienced high load on our instance (running on a single 2-core machine), which made it impossible to upload media to new posts in the web UI and also slowed things down notably in general.
We saw that there were 5 to 6 ffmpeg processes running in parallel. The entire load phase lasted for more than one hour in the worst case.
[Figure in the original issue: a visualization of the CPU load during that time.]
Unfortunately I was unable to find any means to control these jobs.
I checked the sidekiq dashboard but didn't find anything related. This made me wonder: are video transcoding jobs even managed through sidekiq? The process tree shown by `ps faux` indicates to me that the `ffmpeg` commands are not executed through sidekiq workers, which makes me wonder why. Ideally these would be sidekiq jobs in a separate queue which we could offload to a separate machine.
It occurs to me that a simple concurrency limit configuration for this type of job would make sense, so that the likelihood of fully saturating the CPU is reduced.
Also, limiting the execution time (and then failing the job without retry) would mean that videos that pass the upload size limit (which I found to be 40 MB) but are still too complex to transcode would not block the instance.
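As a sketch of that execution-time limit, one way to give each ffmpeg invocation a hard wall-clock budget in Ruby (the helper name and the 10-minute limit are made up for illustration):

```ruby
require "timeout"

# Hypothetical helper: kill ffmpeg once it exceeds a wall-clock budget,
# so a pathological upload cannot occupy a core indefinitely.
def transcode_with_deadline(ffmpeg_args, seconds: 600)
  pid = Process.spawn("ffmpeg", *ffmpeg_args)
  Timeout.timeout(seconds) { Process.wait(pid) }
  $?.success?
rescue Timeout::Error
  Process.kill("KILL", pid)
  Process.wait(pid) # reap the killed process
  false
end
```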
Lastly, it would be beneficial if we could customize how `ffmpeg` is called. For reasons unknown to me there is code to read an `FFMPEG_BINARY` environment variable here:
https://github.com/mastodon/mastodon/blob/v4.0.0rc2/config/initializers/ffmpeg.rb
From my tests, and from looking at how ffmpeg gets executed, this variable has no effect.
Being able to point this variable to a wrapper script, to apply tools like `cpulimit` and `nice`, seems like a good way to give a lot more power to admins to limit the blast radius.
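A minimal sketch of such a wrapper, assuming `FFMPEG_BINARY` were actually honored when spawning the process (the binary path is illustrative):

```ruby
#!/usr/bin/env ruby
# Hypothetical wrapper script to point FFMPEG_BINARY at.
# Runs the real ffmpeg at the lowest scheduling priority so puma and
# sidekiq keep getting CPU time.
REAL_FFMPEG = "/usr/bin/ffmpeg"

# exec replaces this process, so ffmpeg's exit status and signals
# propagate back to the caller unchanged.
exec("nice", "-n", "19", REAL_FFMPEG, *ARGV)
```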
Mastodon is becoming more important. Let's also reduce its attack surface and resource consumption where reasonably possible.
> I checked the sidekiq dashboard but didn't find anything related. This made me wonder: are video transcoding jobs even managed through sidekiq? The process tree shown by `ps faux` indicates to me that the `ffmpeg` commands are not executed through sidekiq workers, which makes me wonder why. Ideally these would be sidekiq jobs in a separate queue which we could offload to a separate machine.
Transcoding should happen in background jobs, but `ffmpeg` can occasionally be called from `puma` for cheaper tasks you want an immediate result from, like extracting a thumbnail. It's also used synchronously in an older version of the media upload API that is expected to only return after the media file has been processed. We should remove that older endpoint at some point.
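For reference, a rough sketch of the asynchronous v2 flow from a client's perspective (instance host and token are placeholders): the upload returns 202 Accepted while processing continues in the background, and the client polls the attachment until it stops returning 206.

```ruby
require "net/http"
require "json"

# Placeholder instance and credentials.
host  = "mastodon.example"
token = ENV.fetch("MASTODON_TOKEN")

http = Net::HTTP.new(host, 443)
http.use_ssl = true

# v2 upload: returns 202 Accepted immediately; ffmpeg runs in the background.
upload = Net::HTTP::Post.new("/api/v2/media")
upload["Authorization"] = "Bearer #{token}"
upload.set_form([["file", File.open("video.mp4")]], "multipart/form-data")
media_id = JSON.parse(http.request(upload).body)["id"]

# Poll: 206 Partial Content while processing, 200 OK once the file is ready.
loop do
  poll = Net::HTTP::Get.new("/api/v1/media/#{media_id}")
  poll["Authorization"] = "Bearer #{token}"
  break if http.request(poll).code == "200"
  sleep 2
end
```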
Thanks for the explanation @ClearlyClaire! The info here in this issue, which I only learned about after posting mine, indeed helped.
So I have to assume that in our case the v1 endpoint got used. Indeed, deprecating and phasing out that endpoint would be one part of the solution.
I'd be happy if the other suggestions could be evaluated anyway.
AIUI even with the v2 API, it is possible for the server to get swamped.
For context, I run a small instance with a few hundred users on a t3.medium EC2 instance (2 vCPUs, 4GB memory). For whatever reason, if someone tries to upload a bunch of videos at once, it seems to cause the server to become unresponsive until I reboot it. Maybe this person is using the v1 API, but it seems to me that the v2 API would be able to cause this too. (Maybe it's due to the memory usage rather than CPU usage? The server usually runs close to the limit, around 3.5GB/4GB.)
This is not acceptable behaviour.
I just had one ffmpeg process take up pretty much all processing power for a solid 5 minutes on one of my 8-core VMs that are exclusively dedicated to puma. As a result all web requests to that VM became very unresponsive. There definitely needs to be some throttling or a way to move all ffmpeg invocations into workers.
This can be an issue even on larger servers. The solution in #14371 helped me greatly when running multiple instances on the same server, because I was running into this issue frequently. Still, it's probably not useful for most people.
I do think that this can cause a massive problem in the network. When multiple videos/animated GIFs are federated in short intervals (with nefarious intentions or not), it can be an issue.
I completely agree with the suggestions from @nolanlawson and @marians to limit the number of `ffmpeg` transcodes that can run simultaneously, and also to add a timeout after X seconds. It's better to have a 'media not available' notice on a few federated posts (the user clicks through and accesses the remote original media) than to have a server become unresponsive.
Chiming in as well, I've been getting multiple errors (below) as a result of ffmpeg processing on my instance. All the errors correlate to this ffmpeg process basically taking over operations, and in most cases, there's not even any logging after this process starts (until it's killed).
`yunohost bundle[46113]: [46113] ! Terminating timed out worker (worker failed to check in within 60 seconds): 1361544`
Maybe we can consider utilizing an external server for video transcoding. I have an article detailing my recent work related to this topic.
Adding my voice to this issue. Running a small instance on 2 cores and 4 GB of RAM (t3a.medium on AWS), whenever I upload videos the CPU jumps to 100% usage, rendering the instance unusable and even sometimes crashing it so hard I have to restart all processes. Not fun.
ffmpeg is already parallel, so it makes little sense to run so many ffmpeg processes; we really need to be able to throttle their number. Otherwise a lot of memory is spent, and all these threads only step on each other in the caches, so in the end it's slower than using just one or two ffmpeg processes.
Like PeerTube, we want to choose the number of ffmpeg processes launched. If we were able to add a prefix to the ffmpeg command, we could use e.g.
`parallel --fg --semaphorename toot-ffmpeg -j 8`
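Combined with the `FFMPEG_BINARY` wrapper idea above, that prefix could enforce a machine-wide cap without touching Mastodon itself; a hypothetical Ruby version of such a wrapper (the path and the limit of 2 are illustrative):

```ruby
#!/usr/bin/env ruby
# Hypothetical FFMPEG_BINARY wrapper: GNU parallel's counting semaphore
# ("toot-ffmpeg") blocks until fewer than two transcodes are running on
# this machine, then runs the real ffmpeg in the foreground at low priority.
exec("parallel", "--fg", "--semaphorename", "toot-ffmpeg", "-j", "2",
     "nice", "-n", "19", "/usr/bin/ffmpeg", *ARGV)
```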
Pitch
Running a small Mastodon instance on a single server, where Puma, Sidekiq, streaming, Postgres, and everything else all live together, `ffmpeg` is a huge liability. Just running two or three `ffmpeg` processes simultaneously, on my particular server, is enough to consume 2 vCPUs and a lot of the available memory. If too many `ffmpeg` processes are running at once, the server can become unresponsive and I need to restart it.

I would like a server setting to limit the number of `ffmpeg` processes (and possibly `imagemagick` processes) that can run simultaneously. For instance, on my server, I might set this to 1 or 2.

The downside is that users would have to wait longer when they're uploading videos, or when videos are federated from other instances. But this seems like a small price to pay so that the server doesn't go down entirely. :slightly_smiling_face:
Motivation
Server admins running small instances, where they can't easily isolate `ffmpeg` to its own machine where it doesn't impact the app server, would benefit a lot from being able to keep `ffmpeg` processes under control.