remotion-dev / remotion

🎥 Make videos programmatically with React
https://remotion.dev

Renderer not utilizing CPU cores even with concurrency 100% #4300

Open tzvc opened 2 months ago

tzvc commented 2 months ago

Hey there,

I'm trying to optimize my rendering service for speed by running it on a beefy VPS (48 cores, 350 GB RAM). The problem is that the renderer does not seem to utilize the available resources. Even worse, the render becomes slower as I increase the concurrency in the renderMedia() call.

Here's what my resource consumption looks like when running a render with concurrency: 1:

Screenshot 2024-09-13 at 12 34 46

Most cores are sitting idle and the render is slow. As expected.

Now if I bump to concurrency: "100%", I'd expect the render to spread across all 48 cores and the RAM. But instead, this is what I get:

Screenshot 2024-09-13 at 13 06 49

Again, most of the cores are sitting idle, even more than with concurrency: 1, leading to an even slower render.

What is weird is that if I run the same code (with concurrency: "100%") directly on my local machine (MacBook Air M2), the renderer utilizes my 8 cores and the render is fast, as expected:

Screenshot 2024-09-13 at 13 12 40

I'm using the latest version of Remotion (4.0.211).

Here's the config I pass to renderMedia()

    const videoBuffer = await renderMedia({
      dumpBrowserLogs: true,
      logLevel: "verbose",
      composition,

      // offthreadVideoCacheSizeInBytes: 6 * 1024 * 1024 * 1024, // 6 GB
      chromiumOptions: {
        // headless: false,
        enableMultiProcessOnLinux: true,
        disableWebSecurity: true,
        ignoreCertificateErrors: true,
      },

      // concurrency: "100%",
      concurrency: 1,
      timeoutInMilliseconds: 1000 * 60 * 10, // 10 minutes
      serveUrl: process.env.REMOTION_BUNDLE_URL as string,
      codec: "h264",
    });

Here's my Dockerfile

FROM node:20-bookworm
# Install Chrome dependencies
RUN apt-get update
RUN apt-get install -y \
  libnss3 \
  libdbus-1-3 \
  libatk1.0-0 \
  libgbm-dev \
  libasound2 \
  libxrandr2 \
  libxkbcommon-dev \
  libxfixes3 \
  libxcomposite1 \
  libxdamage1 \
  libatk-bridge2.0-0 \
  libcups2

RUN apt-get install -y wget unzip fontconfig

# Download and install all Google Fonts
RUN wget https://github.com/google/fonts/archive/main.zip -P /tmp \
    && unzip /tmp/main.zip -d /tmp \
    && mkdir -p /usr/share/fonts/truetype/google-fonts/ \
    && find /tmp/fonts-main/ -name "*.ttf" -exec install -m644 {} /usr/share/fonts/truetype/google-fonts/ \; \
    && fc-cache -f -v

# ENV PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium

# Create and change to the app directory.
WORKDIR /usr/src/app

# Copy application dependency manifests to the container image.
# Copying this separately prevents re-running npm install on every code change.
COPY package.json ./

# Install production dependencies.
RUN npm install

# Install Chrome
RUN npx remotion browser ensure

# Copy local code to the container image.
COPY . ./

# to use external env variables
RUN rm .env 

RUN npm run build

RUN rm -rf src

# Run the web service on container startup.
CMD [ "npm", "start" ]

Here's my system specs:

Screenshot 2024-09-13 at 12 46 24

Has anyone encountered a similar issue? What am I missing here?

JonnyBurger commented 2 months ago

All the concurrency option does is set the number of Chrome tabs that are open at the same time. Chrome then manages the resources itself, and it all goes through one single instance of the Chrome DevTools Protocol. We don't control or fully understand its complex resource management either.

Empirically, we often see that setting a high concurrency value leads to diminishing or even worse results, probably for the reason above.

One thing that you might consider trying is to do multiple renders with separate browser instances, by using the openBrowser() API and passing the instance to selectComposition() and renderMedia().

Then you have two separate Chrome instances and also two separate Chrome DevTools Protocol channels, which are used for communication. If you find out more, we'd be happy to incorporate your experiences into the docs.

tzvc commented 2 months ago

I understand that Chrome manages its own internal resources, but wouldn't this behavior be the same on all systems? I find it odd that the same Chromium binary uses up all the available resources on my local machine but not in Docker on my server.

From my testing, there also seems to be an inverse correlation between concurrency and render speed: if I set concurrency to 2, the render time is 2x longer than with concurrency 1, as if each new tab linearly reduced the resources allocated per tab.

In my effort to tell Chrome to use more resources, here's what I've tried so far:

- Made sure my container had no resource limits -> ok
- Ran in headful mode with a virtual display to hopefully hint Chrome to use more resources -> same results, only 4% CPU usage
- Set CPU priority on all chrome-headless processes: `renice -n -20 -p $(pgrep chrome-headless)` -> same results, only 4% CPU usage

What I found though is that if I start multiple renders in parallel, they do seem to add up the resources they use (see below). I might try splitting the rendering into X renderMedia() calls and merging the resulting parts at the end.

Screenshot 2024-09-13 at 16 10 49
tzvc commented 2 months ago

Continued my investigation; I wanted to rule out Docker preventing Chrome from accessing the resources. I tried running on a bare-metal Debian bookworm server: same issue, only a small % of the system resources is used.

Has anyone managed to get multi-core performance using the latest version of Remotion?

JonnyBurger commented 2 months ago

@tzvc Building your own distributed renderer is challenging and not recommended for most.

Maybe you are rendering an OffthreadVideo with an expensive embedded video? Extracting the frames from a video is a process that cannot be well parallelized.

We could verify this theory by checking whether you get better utilization when you render something else (like images).

I found some threading options in FFmpeg that we need to explore. If this is the bottleneck, we should try tweaking these params https://stackoverflow.com/a/74309843/986552

tzvc commented 2 months ago

@JonnyBurger My production compositions are heavy on OffthreadVideos. But I also tried rendering compositions with only simple images, and even though the rendering is faster, I observe the same behavior: only a few % of the system resources are used.

This is a snapshot of my system when rendering a composition comprised of images and text (concurrency set to 100% on a 32 core system):

Screenshot 2024-09-18 at 17 44 19

Render logs on startup:

 renderMedia()  Free memory: 267359170560 Estimated usage parallel encoding 2073600000
 renderMedia()  Using concurrency: 57
 renderMedia()  delayRender() timeout: 600000
 renderMedia()  Codec supports parallel rendering: true
 renderMedia()  Parallel encoding is enabled.
 renderMedia()  Rendering frames 0-3885
 prespawnFfmpeg()  Generated FFMPEG command:
 prespawnFfmpeg()  -r,30 -f,image2pipe -s,1080x1920 -vcodec,mjpeg -i,- -c:v,libx264 -pix_fmt,yuv420p -crf,18 -y /tmp/react-motion-render4NRiCJ/pre-encode.mp4
Created directory for temporary files /tmp/remotion-v4-0-211-assetscpuxfsvmh6
 compositor  Starting Rust process. Max video cache size: 20000MB, max concurrency = 57
 openBrowser()  Opening browser: gl = undefined, executable = /home/contact_theochampion/krs/node_modules/.remotion/chrome-headless-shell/linux64/chrome-headless-shell-linux64/chrome-headless-shell, enableMultiProcessOnLinux = true
 chrome  policy_logger.cc:145: :components/policy/core/common/config_dir_policy_loader.cc(118) Skipping mandatory platform policies because no policy file was found at: /etc/opt/chrome_for_testing/policies/managed
 chrome  sandbox_linux.cc:418: InitializeSandbox() called with multiple threads in process gpu-process.
tzvc commented 2 months ago

For reference, rendering the same image-and-text-only composition on my 8-core MacBook yields better performance than my 32-core server (previous message): 60 average fps vs 30 fps.

JonnyBurger commented 2 months ago

This is somewhat understandable to me. Not every part can be parallelized! Sometimes the overhead of orchestrating many threads is bigger than the benefit.

If you run a Node.js server on 64 cores, 63 of them will do exactly nothing by default. Remotion is also a Node.js program, albeit one with some ways of using multiple cores.

tzvc commented 2 months ago

I get that the main Node.js process is single-threaded, but the load on this main process should be fairly light if its only role is to orchestrate the other processes (Chrome tabs and the compositor) responsible for the heavy lifting, right?

I'm digging into the renderer code now; I think I understand better how everything works (very cool btw!) and what the bottlenecks could be.

From my testing, here are the bottlenecks I could identify:

Chrome resource allocation: When increasing concurrency (effectively the number of concurrent tabs), Chrome does not seem to blindly allocate more compute/memory to the new tabs. In fact, the performance gain from adding concurrency seems to decrease logarithmically.

This is an example showing render time for a 1-minute composition of only 1 image at multiple concurrency levels on a 64-core CPU:

Potential solution:

Since each browser instance seems to have a cap on its resource allocation, as @JonnyBurger suggested, a simple solution is to multiply the number of concurrent browser instances: simply chunk the composition into X ranges of frames, render each chunk on a separate browser instance concurrently (using openBrowser), then stitch the resulting videos back together using FFmpeg.

With this method, I was able to get more out of my system resources. Here are the results for the same 1-minute video, with concurrency set to 4 for each browser instance (same 64-core CPU):

Again, we see a logarithmic decay in performance as the number of instances increases, so there is still room for improvement.

My guess is that at this level of concurrency, the bottleneck is somewhere else, which leads me to:

Frame extraction for OffthreadVideo: Every time a frame containing an OffthreadVideo is rendered, the compositor extracts the frame at the specified time and returns it. Even though extracting a frame is fairly fast, this could become a bottleneck when lots of pages request frames concurrently, especially when the requested frames are not sequential.

Potential solution: Optimistically extract and cache frames in batches: when a frame is requested at time X of a video, we could extract that frame, return it, and optimistically extract and cache the next 10 in a single operation, as they are likely to be requested later. Or even, if the system allows it, an option to pre-cache all the frames that will be required for the composition, in batches.
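The batched-prefetch idea could be modeled like this (purely hypothetical; this is not Remotion's compositor, just a sketch of the caching policy, with the actual frame extraction stubbed out behind an Extractor function):

```typescript
type Frame = Uint8Array;
type Extractor = (src: string, times: number[]) => Promise<Frame[]>;

// Cache keyed by "src@time". On a miss, extract the requested frame plus the
// next `batchSize - 1` frames in one pass and cache them all, betting that
// sequential frames will be requested soon after.
export class PrefetchingFrameCache {
  private cache = new Map<string, Frame>();

  constructor(
    private extract: Extractor,
    private frameDuration: number, // seconds per frame, e.g. 1 / 30
    private batchSize = 10
  ) {}

  async getFrame(src: string, time: number): Promise<Frame> {
    const key = `${src}@${time.toFixed(4)}`;
    const hit = this.cache.get(key);
    if (hit) return hit;

    // Miss: extract this frame and the following ones in a single operation.
    const times = Array.from(
      { length: this.batchSize },
      (_, i) => time + i * this.frameDuration
    );
    const frames = await this.extract(src, times);
    times.forEach((t, i) => this.cache.set(`${src}@${t.toFixed(4)}`, frames[i]));
    return this.cache.get(key)!;
  }
}
```

With a 30 fps video and batchSize 10, ten sequential getFrame() calls would hit the extractor only once.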

@JonnyBurger what do you think about this? I'd like to experiment with optimistic caching of the OffthreadVideo frames. Can you think of a way I could hack something together to test this theory without having to touch the renderer's code? Is there a way to start the OffthreadVideo server externally?

tzvc commented 1 month ago

Here's the result of testing combined tab concurrency and instance concurrency when rendering OffthreadVideos:

What's interesting though is that, when rendering compositions with no OffthreadVideos, there is no performance degradation when running multiple instances: whether I render 1 or 8 videos at the same time, the FPS stays stable. This would indicate that the bottleneck for OffthreadVideo is the compositor not being able to serve frames fast enough.

When I look at the running processes I see only 1 compositor process, even if I run 10 renderFrames() calls in parallel. @JonnyBurger is that by design? Is there a way to start multiple compositor processes?

loretoparisi commented 1 month ago

> here's the result of testing combining tab concurrency and instance concurrency when rendering offthread videos:
>
> What's interesting though is that, when rendering composition with no OffthreadVideos there is no performance degradation when running multiple instances: whether I render 1 or 8 videos at the same time, the FPS stay stable. This would indicate that the bottleneck for OffthreadVideo is the compositor, not being able to serve frames fast enough.
>
> When I look at the processes running I see only 1 compositor process, even if I run 10 renderFrames() in parallel? @JonnyBurger is that by design? Is there a way to start multiple compositor processes?

@tzvc I'm looking at the same, using

import type { ChromiumOptions } from "@remotion/renderer";

const chromiumOptions: ChromiumOptions = {
    disableWebSecurity: true,
    enableMultiProcessOnLinux: true,
    gl: "angle",
    userAgent: RENDERER_USER_AGENT,
};

with

import os from "node:os";

const availableCpus = Math.min(os.cpus().length, 4);
export const OFFTHREAD_CACHE_SIZE_IN_BYTES = 2 * 1024 * 1024 * 1024; // 2 GB

How did you test frame parallelization, and what was the batch size? Thanks!

JonnyBurger commented 1 month ago

I tried the FFmpeg threading options I was talking about, but I could not see that they significantly changed the outcome.

I'm open to refactoring the concurrency system in November to allow specifying the tabs + browser instances instead of just tabs if you say this works, although it doesn't sound conclusive.

> This would indicate that the bottleneck for OffthreadVideo is the compositor, not being able to serve frames fast enough.

The chart looks realistic. Extracting frames from a video is a linear process: a frame can only be extracted after the previous frame has been extracted. Hence the possibilities for multithreading are limited.

Remotion will open the video multiple times if frames are requested that are more than 15 seconds apart, because then a single stream would not suffice.

I think opening multiple identical video streams with little time difference will lead to a lot of duplicate work.

I can't think of an obvious solution for this.
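For illustration, the stream-reuse heuristic described above could be modeled like this (illustrative only; the real logic lives in the Rust compositor, and only the 15-second threshold comes from the comment above — the tie-breaking rule is an assumption):

```typescript
interface OpenStream {
  src: string;
  positionInSeconds: number; // where the decoder currently is
}

const THRESHOLD_SECONDS = 15;

// Decide whether an existing decoder stream can serve a frame request, or
// whether a new stream must be opened. A decoder can only move forward, and
// seeking ahead by decoding is only cheap within a small window.
export const pickStream = (
  streams: OpenStream[],
  src: string,
  requestedTime: number
): OpenStream | null => {
  const candidates = streams.filter(
    (s) =>
      s.src === src &&
      requestedTime >= s.positionInSeconds &&
      requestedTime - s.positionInSeconds <= THRESHOLD_SECONDS
  );
  if (candidates.length === 0) return null; // open a new stream instead
  // Prefer the stream closest behind the requested time (least decoding work).
  return candidates.reduce((best, s) =>
    s.positionInSeconds > best.positionInSeconds ? s : best
  );
};
```

Under this model, many near-identical requests with small time offsets all fall inside one window, which is why opening duplicate streams for them would mean duplicate decoding work.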

loretoparisi commented 1 month ago

@JonnyBurger is it possible to test by our side as well? Thank you very much! 👍🏾

samohovets commented 1 month ago

@tzvc thanks for the cool research!

I experimented with renting a beefy VPS months ago and got stuck on the same problem as you: I noticed it's just Chrome throttling the tabs. I decided not to investigate further, since it seemed like a hard problem to spend a lot of time on (figuring out how to get around Chrome's limitations).

We're heavily using Three.js in our compositions, so I had to find the perfect concurrency to render fast without overloading the GPU. The results of my experiments were close to yours: sometimes decreasing concurrency helped a lot, but not always.

Your solution with multiple browsers is quite interesting; I think it could work. It feels like we'll never know whether it's going to work until we build a basic prototype for it.