mltframework / mlt

MLT Multimedia Framework
https://www.mltframework.org
GNU Lesser General Public License v2.1

Scaling is needlessly slow #862

Closed: Tjoppen closed this 6 months ago

Tjoppen commented 1 year ago

When decoding and scaling a 4k 10-bit ProRes file to 1k 8-bit, melt is roughly 4x as slow as ffmpeg doing the equivalent operation. On a 72-core machine (Intel(R) Xeon(R) Platinum 8223CL CPU @ 3.00GHz), 100 seconds of material takes ffmpeg 21 seconds whereas melt takes 92 seconds.

sudo sync; sudo bash -c "echo 1 > /proc/sys/vm/drop_caches" ; time ffmpeg -thread_type slice -threads 72 -i ~/samples/dpel.aswf.io/dpel_4k__ProRes_422HQ__24__443387.mov -vframes 2400 -vcodec rawvideo -pix_fmt yuv422p -s 1024x428 -acodec pcm_s16le -f avi -y /dev/null
real    0m21.224s
user    5m35.865s
sys     0m11.093s

compared to

sudo sync; sudo bash -c "echo 1 > /proc/sys/vm/drop_caches" ; time melt -producer avformat:/home/ubuntu/samples/dpel.aswf.io/dpel_4k__ProRes_422HQ__24__443387.mov in=1 out=2400 threads=72 thread_type=slice -consumer avformat:/dev/null f=avi vcodec=rawvideo pix_fmt=yuv422p acodec=pcm_s16le f=avi width=1024 height=428 frame_rate_num=24 frame_rate_den=1
real    1m32.490s
user    8m31.038s
sys     0m21.417s

In short, we are piping 8-bit 1024x428 YUV422 + 16-bit PCM in AVI to other parts of our system. Digging a bit into the avformat module, I find multiple problems.

Vertical slices in sliced_h_pix_fmt_conv_proc()

In producer_avformat.c, pixel format conversion from yuv422p10le to yuv422p is done. sliced_h_pix_fmt_conv_proc() uses vertical slices for some reason, which is horrible for cache locality: each slice strides across every row of the image instead of walking memory linearly. Switching to horizontal slices ought to improve performance and should not be too hard. Another option is to use sws_scale_frame() like filter_swscale does.
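
To illustrate, here is a minimal sketch of the horizontal-band approach for the yuv422p10le -> yuv422p case, assuming one sws context per worker thread (mirroring the per-slice contexts the current code already allocates); convert_band() is a hypothetical helper, not MLT code. Since 4:2:2 has no vertical chroma subsampling, every plane starts at row y0 of the band, and each thread walks memory linearly instead of striding across the whole image the way a vertical slice does:

#include <libswscale/swscale.h>
#include <libavutil/frame.h>

/* Hypothetical sketch: convert rows [y0, y0+h) of a 4:2:2 frame with a
 * per-thread context. No scaling is done, so bands are independent.
 * ctx must have been created for the band size, e.g. with
 * sws_getContext(width, h, AV_PIX_FMT_YUV422P10LE,
 *                width, h, AV_PIX_FMT_YUV422P, ...). */
static void convert_band(struct SwsContext *ctx,
                         const AVFrame *src, AVFrame *dst, int y0, int h)
{
    const uint8_t *src_data[3];
    uint8_t *dst_data[3];
    for (int i = 0; i < 3; i++) {
        /* 4:2:2 planes all have full vertical resolution, so offsetting
         * by y0 rows works for luma and chroma alike. */
        src_data[i] = src->data[i] + y0 * src->linesize[i];
        dst_data[i] = dst->data[i] + y0 * dst->linesize[i];
    }
    sws_scale(ctx, src_data, src->linesize, 0, h,
              dst_data, dst->linesize);
}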

MAX_THREADS in filter_swscale.c

MAX_THREADS is 6 in filter_swscale.c. I tried bumping this to 72, but that halved the throughput. Still, explicit limits like this are a code smell.

Needlessly creating/destroying sws contexts

This is common to producer_avformat and filter_swscale. Creating a context is an expensive operation and should only be done when necessary. mlt_set_luma_transfer() falls into this category as well.

Pixel format conversion and scaling as separate operations

Yet more performance could be gained if producer_avformat did pixel format conversion and scaling as a single operation, producing conformed frames directly rather than relying on filter_swscale. This is perhaps not always desired, so an option for it might be nice.
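
For reference, a single sws context can already do both in one pass. A generic libswscale sketch, not MLT code; the 4096x2160 source size is an assumption matching the numbers used elsewhere in this thread:

#include <libswscale/swscale.h>

/* One context that converts 10-bit to 8-bit and scales 4k down to 1k
 * in a single sws_scale() pass. */
struct SwsContext *ctx = sws_getContext(
    4096, 2160, AV_PIX_FMT_YUV422P10LE, /* source */
    1024,  428, AV_PIX_FMT_YUV422P,     /* destination */
    SWS_BILINEAR, NULL, NULL, NULL);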

ddennedy commented 1 year ago

Vertical slices in sliced_h_pix_fmt_conv_proc()

This was contributed long before sws_scale_frame() became available (it only appeared in the most recent major version), and vertical slices were chosen as the way to handle interlaced video. You can submit a patch to add sws_scale_frame(), but you cannot yet completely remove the old code because that would not be backwards compatible; MLT does not only work with the latest version. The addition of sws_scale_frame() to the filter was purely incremental, but you cannot take away a feature in the producer.

MAX_THREADS in filter_swscale.c

Smells fine to me based on my testing. Going higher rarely gave better performance, and when combined with MLT's frame-threading it became even worse. You can submit a PR to make it a property, but it must default to my choice, and good luck setting it. You can also submit a different default value along with the benchmark times and the command line used; then I will evaluate it as well. However, I will not be basing it on 72-core machines, as that is not what the majority of Shotcut and Kdenlive users are running.
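
For illustration, exposing the cap as a property could look something like the following inside filter_swscale.c. The property name "threads" and the surrounding scope are assumptions, not existing code; the current constant stays as the default:

/* Hypothetical sketch: let a filter property override the thread cap,
 * falling back to the existing MAX_THREADS default when unset. */
int threads = mlt_properties_get_int(MLT_FILTER_PROPERTIES(filter), "threads");
if (threads <= 0)
    threads = MAX_THREADS;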

You forgot to set sws_flags in your ffmpeg command line. ffmpeg defaults to inaccurate colors, and that was reported as a Shotcut bug. See mlt_get_sws_flags() in src/modules/avformat/common.c. You can submit a PR for this as well but, again, it is not so easy to set. One idea is to add a property on the producer named set.swscale.flags. The producer treats "set." specially and propagates such properties onto the frames it creates, minus the "set." prefix. So the avformat producer can read it as set.swscale.flags while the swscale filter reads it as swscale.flags. The thread count for sws_scale_frame() could be handled similarly.
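
As a hypothetical example of that propagation mechanism: if such a property were added, setting it could look like this, with the numeric value 4 standing in for SWS_BICUBIC (both the property and its value format are assumptions, not an existing option):

melt -producer avformat:clip.mov set.swscale.flags=4 ...

The producer would then copy swscale.flags onto every frame it creates, where the swscale filter could read it without any direct plumbing between the two modules.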

Tjoppen commented 1 year ago

vertical was chosen as the way to handle interlace video

I have a hard time seeing why this should be necessary. producer_avformat is only doing pixel format conversion (what sws calls unscaling), not actual scaling. As long as the vertical slice resolutions are even, even interlaced 4:2:0 should pass through fine. I'll look into using sws_scale_frame() when LIBSWSCALE_VERSION_MAJOR >= 6, since sws should do The Right Thing. I'll also look into reusing sws contexts, since allocating them accounts for a non-trivial share of the total user time according to callgrind.
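
A sketch of what that version gate could look like; convert_legacy() is a hypothetical stand-in for the existing per-slice path, and the premise is simply that sws_scale_frame() (added in libswscale 6, FFmpeg 5.0) is preferable where available:

#include <libswscale/swscale.h>
#include <libswscale/version.h>
#include <libavutil/frame.h>

static int convert(struct SwsContext *ctx, AVFrame *dst, const AVFrame *src)
{
#if LIBSWSCALE_VERSION_MAJOR >= 6
    /* Whole-frame API; libswscale handles slice bookkeeping itself. */
    return sws_scale_frame(ctx, dst, src);
#else
    /* Older libswscale: keep the existing sws_scale()-based path. */
    return convert_legacy(ctx, dst, src);
#endif
}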

However, I will not be basing it upon 72 core machines as that is not what the majority of Shotcut and Kdenlive users are running.

This is just a matter of time. But yes, the frame threading is probably enough even for our use case. This potentially ties into thread pool work I'd like to do inside ffmpeg. If done right, we can get the best of both worlds: low latency and high throughput, meaning fast scrubbing and fast rendering. For that to work, melt and ffmpeg would need to use the same thread pool.

As for options, I've raised the issue of the options system in the mlt framework being a mess before. For now it's not a hindrance, luckily.

Thoughts on doing the format conversion + scaling in one go?

ddennedy commented 1 year ago

Thoughts on doing the format conversion + scaling in one go?

That breaks the order of things within the loader producer and loader.ini such as source-cropping and deinterlace coming before scaling.

Tjoppen commented 1 year ago

I suspected as much. We'll see how the other ideas pan out once I manage to get them tested.

Tjoppen commented 1 year ago

Worth noting is that the current behavior violates the contract of sws_scale():

/**
 * Scale the image slice in srcSlice and put the resulting scaled
 * slice in the image in dst. A slice is a sequence of consecutive
 * rows in an image.
 *
 * Slices have to be provided in sequential order, either in
 * top-bottom or bottom-top order. If slices are provided in
 * non-sequential order the behavior of the function is undefined.
 *

edit: actually, the allocation of an individual sws context for each slice saves the code from violating this, since each context only ever sees a single slice

ddennedy commented 1 year ago

We can certainly convert the avformat producer to sws_scale_frame but not by removing the old code - ifdef it. Also, you can consolidate pixel format and scaling for your needs if you do not need those particular loader.ini filters, as a build or runtime option. I understand sws_scale_frame handles interlace now even for scaling. And if you do ever need source cropping or deinterlace, you can use the new filtergraph property on the producer.

Tjoppen commented 1 year ago

I understand sws_scale_frame handles interlace now even for scaling

Actually looking at vf_scale.c it appears it doesn't. But it's not overly difficult to make the necessary preparations.

Tjoppen commented 1 year ago

The filtergraph approach looks like it has potential for our use case. But the output looks very, very wrong. I suspect pixel format conversion may be applied twice:

melt -producer avformat:$HOME/samples/dpel.aswf.io/dpel_4k__ProRes_422HQ__24__443387.mov in=1 out=2400 threads=72 thread_type=slice "filtergraph=scale=1024x428,format=yuv422p" -consumer avformat:/dev/null f=avi vcodec=rawvideo acodec=pcm_s16le f=avi width=1024 height=428 frame_rate_num=24 frame_rate_den=1

[screenshot mpv-shot0002: visibly corrupted output]

This is with the latest commit (28a771928d).

ddennedy commented 1 year ago

filtergraph does not take the place of pixel format conversion. It is an additional thing, and I did not add it. I have not used it, and I was not suggesting it as the means to combine scaling with pixel format conversion. Perhaps a small code change could check whether pixel format conversion is required and skip it, or it could be skipped explicitly via another property.

Tjoppen commented 1 year ago

That doesn't sound right. As it currently stands in master, it's a miracle it doesn't crash. Surely if the filtergraph produces a pixel format that MLT supports, then that should be enough, no?

Tjoppen commented 1 year ago

See PR #865

ddennedy commented 1 year ago

I tested converting the avformat producer to sws_getCachedContext(), and it is actually a very tiny bit slower. I tested twice with a non-short clip (Big Buck Bunny). I believe all of the context member comparisons offset the savings in heap usage.

--- src/modules/avformat/producer_avformat.c
+++ src/modules/avformat/producer_avformat.c
@@ -1,5 +1,5 @@
 /*
  * producer_avformat.c -- avformat producer
- * Copyright (C) 2003-2022 Meltytech, LLC
+ * Copyright (C) 2003-2023 Meltytech, LLC
  *
  * This library is free software; you can redistribute it and/or
@@ -87,4 +87,5 @@ struct producer_avformat_s
    AVCodecContext *audio_codec[ MAX_AUDIO_STREAMS ];
    AVCodecContext *video_codec;
+   struct SwsContext *sws_context;
    AVFrame *video_frame;
    AVFrame *audio_frame;
@@ -769,9 +770,7 @@ static int get_basic_info( producer_avformat self, mlt_profile profile, const ch
        if ( pix_fmt != AV_PIX_FMT_NONE ) {
            // Verify that we can convert this to one of our image formats.
-           struct SwsContext *context = sws_getContext( codec_params->width, codec_params->height, pix_fmt,
+           self->sws_context = sws_getCachedContext(self->sws_context, codec_params->width, codec_params->height, pix_fmt,
                codec_params->width, codec_params->height, pick_pix_fmt( pix_fmt ), SWS_BILINEAR, NULL, NULL, NULL);
-           if ( context )
-           {
-               sws_freeContext( context );
+           if (self->sws_context) {
                mlt_image_format format = pick_image_format(pix_fmt, self->full_range, mlt_image_yuv422);
                mlt_properties_set_int( properties, "format", format );
@@ -1510,5 +1509,5 @@ static int convert_image( producer_avformat self, AVFrame *frame, uint8_t *buffe
        // avformat with no filters and explicitly requested.
        int flags = mlt_get_sws_flags(width, height, src_pix_fmt, width, height, AV_PIX_FMT_YUV420P);
-       struct SwsContext *context = sws_getContext(width, height, src_pix_fmt,
+       self->sws_context = sws_getCachedContext(self->sws_context, width, height, src_pix_fmt,
            width, height, AV_PIX_FMT_YUV420P, flags, NULL, NULL, NULL);

@@ -1521,14 +1520,13 @@ static int convert_image( producer_avformat self, AVFrame *frame, uint8_t *buffe
        out_stride[1] = width >> 1;
        out_stride[2] = width >> 1;
-       if ( !mlt_set_luma_transfer( context, self->yuv_colorspace, profile->colorspace, self->full_range, dst_full_range ) )
+       if ( !mlt_set_luma_transfer(self->sws_context, self->yuv_colorspace, profile->colorspace, self->full_range, dst_full_range ) )
            result = profile->colorspace;
-       sws_scale( context, (const uint8_t* const*) frame->data, frame->linesize, 0, height,
+       sws_scale(self->sws_context, (const uint8_t* const*) frame->data, frame->linesize, 0, height,
            out_data, out_stride);
-       sws_freeContext( context );
    }
    else if ( *format == mlt_image_rgb )
    {
        int flags = mlt_get_sws_flags(width, height, src_pix_fmt, width, height, AV_PIX_FMT_RGB24);
-       struct SwsContext *context = sws_getContext( width, height, src_pix_fmt,
+       self->sws_context = sws_getCachedContext(self->sws_context, width, height, src_pix_fmt,
            width, height, AV_PIX_FMT_RGB24, flags, NULL, NULL, NULL);
        uint8_t *out_data[4];
@@ -1536,13 +1534,12 @@ static int convert_image( producer_avformat self, AVFrame *frame, uint8_t *buffe
        av_image_fill_arrays(out_data, out_stride, buffer, AV_PIX_FMT_RGB24, width, height, IMAGE_ALIGN);
        // libswscale wants the RGB colorspace to be SWS_CS_DEFAULT, which is = SWS_CS_ITU601.
-       mlt_set_luma_transfer( context, self->yuv_colorspace, 601, self->full_range, 1 );
-       sws_scale( context, (const uint8_t* const*) frame->data, frame->linesize, 0, height,
+       mlt_set_luma_transfer(self->sws_context, self->yuv_colorspace, 601, self->full_range, 1);
+       sws_scale(self->sws_context, (const uint8_t* const*) frame->data, frame->linesize, 0, height,
            out_data, out_stride);
-       sws_freeContext( context );
    }
    else if ( *format == mlt_image_rgba )
    {
        int flags = mlt_get_sws_flags(width, height, src_pix_fmt, width, height, AV_PIX_FMT_RGBA);
-       struct SwsContext *context = sws_getContext( width, height, src_pix_fmt,
+       self->sws_context = sws_getCachedContext(self->sws_context, width, height, src_pix_fmt,
            width, height, AV_PIX_FMT_RGBA, flags, NULL, NULL, NULL);
        uint8_t *out_data[4];
@@ -1550,8 +1547,7 @@ static int convert_image( producer_avformat self, AVFrame *frame, uint8_t *buffe
        av_image_fill_arrays(out_data, out_stride, buffer, AV_PIX_FMT_RGBA, width, height, IMAGE_ALIGN);
        // libswscale wants the RGB colorspace to be SWS_CS_DEFAULT, which is = SWS_CS_ITU601.
-       mlt_set_luma_transfer( context, self->yuv_colorspace, 601, self->full_range, 1 );
-       sws_scale( context, (const uint8_t* const*) frame->data, frame->linesize, 0, height,
+       mlt_set_luma_transfer(self->sws_context, self->yuv_colorspace, 601, self->full_range, 1);
+       sws_scale(self->sws_context, (const uint8_t* const*) frame->data, frame->linesize, 0, height,
            out_data, out_stride);
-       sws_freeContext( context );
    }
    else
@@ -3193,4 +3189,5 @@ static void producer_avformat_close( producer_avformat self )
    av_frame_free( &self->video_frame );
    av_frame_free( &self->audio_frame );
+   sws_freeContext(self->sws_context);

 #if USE_HWACCEL
Tjoppen commented 1 year ago

What I would suggest is keeping the contexts around, not repeatedly getting and freeing them; basically do the same thing as libavfilter/vf_scale.c. It's also possible to do the same thing as sws_scale_frame() does with the older API and then let lavfi handle the threading. That would also cut down on the amount of code on your end :)

On avoiding memcpy()s: you said in #865 that mlt and ffmpeg organize their buffers differently. But they actually appear very similar, except that mlt frames lack padding. Bringing them in line seems worthwhile and would involve keeping track of strides. Our test machine uses DDR4-2666, and with 4k60 yuv422, where each memcpy() reads and writes a roughly 17.7 MB frame, there's time for at most 19 memcpy()s per frame.
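
To make the stride point concrete, here is a sketch of the per-plane repacking that stride-aware mlt frames would make unnecessary. Today a padded AVFrame plane has to be repacked row by row into a tightly packed buffer, whereas matching layouts would let the buffer be handed over as-is; repack_plane() is a hypothetical helper, not code from either project:

#include <stdint.h>
#include <string.h>

/* Hypothetical helper: repack a plane whose rows lie src_stride bytes
 * apart into a destination whose rows lie dst_stride bytes apart. */
static void repack_plane(uint8_t *dst, int dst_stride,
                         const uint8_t *src, int src_stride,
                         int bytes_per_row, int rows)
{
    for (int r = 0; r < rows; r++)
        memcpy(dst + (size_t)r * dst_stride,
               src + (size_t)r * src_stride, bytes_per_row);
}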

edit: I did a quick test with the program below compiled with -O3, and the number comes out even lower, at 7 memcpy()s per frame

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SZ (4096*2160*2) // one 4k yuv422 frame

int main(void) {
  void *a = malloc(SZ), *b = malloc(SZ);
  // Ping-pong 10000 frame-sized copies between two buffers to measure
  // sustained memcpy() throughput.
  for (int x = 0; x < 10000; x++) {
    if (x & 1) memcpy(a, b, SZ);
    else memcpy(b, a, SZ);
  }
  return 0;
}
$ gcc memcpytest.c -O3 -o memcpytest && time ./memcpytest 

real    0m22.588s
user    0m22.567s
sys     0m0.020s
$ octave <<< "10000/22.588/60"
ans = 7.3785
Tjoppen commented 1 year ago

Digging more into this, the scaling becomes far less bad when using 1024x429, which is possible with 4:2:2. convert_image() also becomes needlessly expensive due to the conversion from BT.709 to BT.601. Those are problems on our end.

The patch posted above by @ddennedy won't help when "converting" yuv422p to yuv422p, since each call to sliced_h_pix_fmt_conv_proc() still gets its own context. But! It turns out sws_init_context() does a bunch of useless work, because it always calls initFilter() even when no actual scaling is performed, such as yuv422p -> yuv422p or yuv422p10 -> yuv422p. This overhead is ~0.75% of all cycles according to callgrind, but it sits in an effectively serial path, which with 72 threads turns out to be quite significant. For the test file in question with a warm disk cache, the median wall time of three runs before and after is 14.164 vs 13.163 seconds for 100 seconds of essence. In practical terms this means 8% more responsive scrubbing in kdenlive.
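
As a back-of-envelope check on why a sub-percent cycle count matters: if that 0.75% of the work is fully serial and the remaining 99.25% scaled perfectly across 72 threads, Amdahl's law puts the serial part at 0.0075 / (0.0075 + 0.9925/72) ≈ 35% of wall time. The real parallel path is far from perfectly scalable, so the measured gain is smaller, but it shows how a section that looks negligible in total cycles can dominate at high thread counts.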

TL;DR: a patch to libswscale will greatly lessen this particular problem. I'll see what I can do.

Tjoppen commented 1 year ago

If the yuv422p -> yuv422p conversion can be avoided, then there's potential to shave off another second: 12.160s, meaning 16% faster.

Tjoppen commented 1 year ago

I pushed a patch for the initFilter() slowness to FFmpeg just now as a678b0c252

ddennedy commented 1 year ago

If the yuv422p -> yuv422p can be avoided

There is no yuv422p pixel format in MLT; the closest is the non-planar mlt_image_yuv422.

Tjoppen commented 1 year ago

Yeah, I may actually have meant whatever the corresponding non-planar format in libavcodec is called.