nyanmisaka opened this issue 2 weeks ago
On Apple M1 Max, test with 4K input:
ffmpeg -noautorotate -i file:"input.mp4" -codec:v:0 libx264 -preset veryfast -crf 23 -maxrate 9871252 -bufsize 19742504 -x264opts:0 subme=0:me_range=4:rc_lookahead=10:me=dia:no_chroma_me:8x8dct=0:partitions=none -force_key_frames:0 "expr:gte(t,n_forced*3)" -sc_threshold:v:0 0 -vf "setparams=color_primaries=bt2020:color_trc=arib-std-b67:colorspace=bt2020nc,tonemapx=tonemap=reinhard:desat=0:peak=100:t=bt709:m=bt709:p=bt709:format=nv12" -codec:a:0 copy -copyts -avoid_negative_ts disabled test.mp4
Result:
tonemapx: frame= 2965 fps= 43 q=-1.0 Lsize= 106469kB time=00:01:38.58 bitrate=8847.4kbits/s speed= 1.43x
For comparison, zscale + simd-optimized tonemap:
ffmpeg -noautorotate -i file:"input.mp4" -map_metadata -1 -map_chapters -1 -threads 0 -map 0:0 -map 0:1 -map -0:s -codec:v:0 libx264 -preset veryfast -crf 23 -maxrate 9871252 -bufsize 19742504 -x264opts:0 subme=0:me_range=4:rc_lookahead=10:me=dia:no_chroma_me:8x8dct=0:partitions=none -force_key_frames:0 "expr:gte(t,n_forced*3)" -sc_threshold:v:0 0 -vf "setparams=color_primaries=bt2020:color_trc=arib-std-b67:colorspace=bt2020nc,format=yuv420p,zscale=t=linear:npl=100,format=gbrpf32le,zscale=p=bt709,tonemap=tonemap=reinhard:desat=0:peak=100,zscale=t=bt709:m=bt709,format=yuv420p" -codec:a:0 copy -copyts -avoid_negative_ts disabled test.mp4
Result:
frame= 2965 fps= 44 q=-1.0 Lsize= 105488kB time=00:01:38.58 bitrate=8766.0kbits/s speed=1.47x
Edit: my initial test used ffmpeg's format filter for the conversion; when using this filter's internal LUT instead, the speed is comparable when actually encoding.
How do they perform without the software decoder and encoder interfering?
ffmpeg -f lavfi -i nullsrc=s=3840x2160,format=p010 -vf ... -f null -
This LUT tonemap filter currently only accepts p010/p016 input. For yuv420p10, ffmpeg will automatically insert a converter.
Also on M1 Max:
Command:
./ffmpeg -t 90 -f lavfi -i nullsrc=s=3840x2160,format=p010 -vf "setparams=color_primaries=bt2020:color_trc=arib-std-b67:colorspace=bt2020nc,format=yuv420p,zscale=t=linear:npl=100,format=gbrpf32le,zscale=p=bt709,tonemap=tonemap=reinhard:desat=0:peak=100,zscale=t=bt709:m=bt709,format=yuv420p" -f null -
Result:
frame= 2250 fps= 76 q=-0.0 Lsize=N/A time=00:01:29.96 bitrate=N/A speed=3.05x
Command:
./ffmpeg -t 90 -f lavfi -i nullsrc=s=3840x2160,format=p010 -vf "setparams=color_primaries=bt2020:color_trc=arib-std-b67:colorspace=bt2020nc,tonemapx=tonemap=reinhard:desat=0:peak=100:t=bt709:m=bt709:p=bt709:format=nv12" -f null -
Result:
frame= 2250 fps= 66 q=-0.0 Lsize=N/A time=00:01:29.96 bitrate=N/A speed=2.62x
So the perf difference is larger when we are only running the filter chain, but it is hard to see such a difference in real-world scenarios due to the heavy compute load from the decoder and encoder.
Can adjusting -filter_threads improve performance on your end? It works for me on some processors.
To reflect performance on an entry-level platform, I also ran it on an RK3588 (4xA76+4xA55). Somehow it favors tonemapx over zscale+tonemap simd, although even tonemapx can't keep up with real-time speed at 4K.
zscale+tonemap simd:
./ffmpeg -filter_threads 12 -t 90 -f lavfi -i nullsrc=s=3840x2160,format=yuv420p10 -vf "setparams=color_primaries=bt2020:color_trc=arib-std-b67:colorspace=bt2020nc,format=yuv420p,zscale=t=linear:npl=100,format=gbrpf32le,zscale=p=bt709,tonemap=tonemap=reinhard:desat=0:peak=100,zscale=t=bt709:m=bt709,format=yuv420p" -f null -
frame= 2250 fps=4.2 q=-0.0 size=N/A time=00:01:29.96 bitrate=N/A speed=0.169x
tonemapx:
./ffmpeg -filter_threads 12 -t 90 -f lavfi -i nullsrc=s=3840x2160,format=p010 -vf "setparams=color_primaries=bt2020:color_trc=arib-std-b67:colorspace=bt2020nc,tonemapx=tonemap=reinhard:desat=0:peak=100:t=bt709:m=bt709:p=bt709:format=nv12" -f null -
frame= 2250 fps= 17 q=-0.0 Lsize=N/A time=00:01:29.96 bitrate=N/A speed=0.66x
Can adjusting -filter_threads improve performance on your end? It works for me on some processors.
It improves perf for both on M1 Max, but zscale+tonemap simd is still marginally faster:
./ffmpeg -filter_threads 16 -t 90 -f lavfi -i nullsrc=s=3840x2160,format=p010 -vf "setparams=color_primaries=bt2020:color_trc=arib-std-b67:colorspace=bt2020nc,format=yuv420p,zscale=t=linear:npl=100,format=gbrpf32le,zscale=p=bt709,tonemap=tonemap=none:desat=0:peak=100,zscale=t=bt709:m=bt709,format=yuv420p" -f null -
frame= 2250 fps= 96 q=-0.0 Lsize=N/A time=00:01:29.96 bitrate=N/A speed=3.83x
./ffmpeg -filter_threads 16 -t 90 -f lavfi -i nullsrc=s=3840x2160,format=p010 -vf "setparams=color_primaries=bt2020:color_trc=arib-std-b67:colorspace=bt2020nc,tonemapx=tonemap=none:desat=0:peak=100:t=bt709:m=bt709:p=bt709:format=nv12" -f null -
frame= 2250 fps= 91 q=-0.0 Lsize=N/A time=00:01:29.96 bitrate=N/A speed=3.66x
To reflect performance on an entry-level platform, I also ran it on an RK3588 (4xA76+4xA55). Somehow it favors tonemapx over zscale+tonemap simd, although even tonemapx can't keep up with real-time speed at 4K.
This is per-CPU behavior. If a CPU core's SIMD compute is very slow or its resources are limited (e.g. bandwidth constrained), it can hardly beat a static LUT, especially under heavy thread pressure. On such platforms, GPU-accelerated tonemapping is highly recommended.
The need for software tone mapping mainly arises from trickplay and keyframe-only extraction, which work on only a limited range of hardware. We run trickplay with only one thread by default, and under low thread pressure, SIMD tone mapping generally performs faster than scalar LUT tone mapping for the low-cost reinhard algorithm.
Makes sense. It would be perfect if we could combine the advantages of both. With an intrinsics-optimized LUT, various tonemap methods and dovi reshaping can be accelerated. Even mid-range processors could handle one or two 4k24fps transcodes w/ sw tonemap enabled.
zimg already has it, but they have not yet implemented tonemapping. We can try to contribute code or borrow some from it.
With an intrinsics-optimized LUT, various tonemap methods and dovi reshaping can be accelerated.
The actual acceleration probably isn't in the table lookup itself, but in the post-processing: multiplying by the scale, adding the offset, and clamping to the valid range. The lookup may be slightly faster by loading a full 128-bit packed vector of 4 floats instead of four separate 32-bit floats, but that gain should be smaller than fusing the multiply and add into one operation.
Also, I'm a little concerned about doing the DOVI LUT due to its high-dimensional nature and the need for multiple tables for a full reshaping. You need to query each pivot and use multiple inputs to get the reshaped output. Such a table would be huge and slow to look up, and probably wouldn't fit into many CPUs' L2 cache. If I were going to do it, I would probably just do the SIMD matrix operations instead of implementing a LUT.
Do you know how to convert the int16 color values into the 0.0-1.0 range we are expecting during dovi reshaping? My current "naive" implementation does not produce correct results:
inline static float dot(const float* x, const float* y, int len)
{
    int i;
    float result = 0;
    for (i = 0; i < len; i++) {
        result += x[i] * y[i];
    }
    return result;
}

inline static float reshape_poly(float s, float* coeffs) {
    return (coeffs[2] * s + coeffs[1]) * s + coeffs[0];
}
static float reshape_mmr(const float* sig, const float* coeffs, const struct ReshapeData *comp, int pivot_index) {
    int min_order = 3, max_order = 1;
    int order = (int)coeffs[3];
    int j;
    float s = coeffs[0];
    // First-order terms: the three channels plus their cross products.
    float sigX[7] = {sig[0], sig[1], sig[2],
                     sig[0] * sig[1], sig[0] * sig[2], sig[1] * sig[2],
                     sig[0] * sig[1] * sig[2]};
    min_order = FFMIN(min_order, comp->mmr_order[pivot_index]);
    max_order = FFMAX(max_order, comp->mmr_order[pivot_index]);
    s += dot(comp->mmr_coeffs[pivot_index][0], sigX, 7);
    if (max_order >= 2 && (min_order >= 2 || order >= 2)) {
        // Second-order terms are the element-wise squares of sigX; the
        // cross products (sigX[3..6]) must be squared too, not sigX[0..3].
        float sigX2[7];
        for (j = 0; j < 7; j++)
            sigX2[j] = sigX[j] * sigX[j];
        s += dot(comp->mmr_coeffs[pivot_index][1], sigX2, 7);
        if (max_order == 3 && (min_order == 3 || order >= 3)) {
            // Third-order terms: element-wise cubes of sigX.
            float sigX3[7];
            for (j = 0; j < 7; j++)
                sigX3[j] = sigX2[j] * sigX[j];
            s += dot(comp->mmr_coeffs[pivot_index][2], sigX3, 7);
        }
    }
    return s;
}
inline static void ycc2rgb(float* dest, float y, float cb, float cr, const double nonlinear[3][3], const float* ycc2rgb_offset)
{
    float offset1 = ycc2rgb_offset[0] * (float)nonlinear[0][0] + ycc2rgb_offset[1] * (float)nonlinear[0][1] + ycc2rgb_offset[2] * (float)nonlinear[0][2];
    float offset2 = ycc2rgb_offset[0] * (float)nonlinear[1][0] + ycc2rgb_offset[1] * (float)nonlinear[1][1] + ycc2rgb_offset[2] * (float)nonlinear[1][2];
    float offset3 = ycc2rgb_offset[0] * (float)nonlinear[2][0] + ycc2rgb_offset[1] * (float)nonlinear[2][1] + ycc2rgb_offset[2] * (float)nonlinear[2][2];
    dest[0] = (y * (float)nonlinear[0][0] + cb * (float)nonlinear[0][1] + cr * (float)nonlinear[0][2]) - offset1;
    dest[1] = (y * (float)nonlinear[1][0] + cb * (float)nonlinear[1][1] + cr * (float)nonlinear[1][2]) - offset2;
    dest[2] = (y * (float)nonlinear[2][0] + cb * (float)nonlinear[2][1] + cr * (float)nonlinear[2][2]) - offset3;
}

// This implementation does not do the costly linearization and de-linearization for performance reasons.
// The output color accuracy will be affected by this.
inline static void lms2rgb(float* dest, float l, float m, float s, const double linear[3][3])
{
    double lms2rgb_matrix[3][3];
    ff_matrix_mul_3x3(lms2rgb_matrix, dovi_lms2rgb_matrix, linear);
    dest[0] = l * (float)lms2rgb_matrix[0][0] + m * (float)lms2rgb_matrix[0][1] + s * (float)lms2rgb_matrix[0][2];
    dest[1] = l * (float)lms2rgb_matrix[1][0] + m * (float)lms2rgb_matrix[1][1] + s * (float)lms2rgb_matrix[1][2];
    dest[2] = l * (float)lms2rgb_matrix[2][0] + m * (float)lms2rgb_matrix[2][1] + s * (float)lms2rgb_matrix[2][2];
}
#define CLAMP(a, b, c) (FFMIN(FFMAX((a), (b)), (c)))
static void reshape_dovi_yuv(float* dest, float* src, TonemapxContext *ctx)
{
    int i, k;
    float s;
    float coeffs[4] = {0, 0, 0, 0};
    float sig_arr[3] = {CLAMP((src[0] - 2048.0f) / 28672.0f, 0.0f, 1.0f),
                        CLAMP((src[1] - 2048.0f) / 28672.0f, 0.0f, 1.0f),
                        CLAMP((src[2] - 2048.0f) / 28672.0f, 0.0f, 1.0f)};
    for (i = 0; i < 3; i++) {
        const struct ReshapeData *comp = &ctx->dovi->comp[i];
        int pivot = 0;
        s = sig_arr[i];
        // Select the highest pivot segment the signal falls into. With
        // num_pivots pivots there are num_pivots - 1 segments, so segment k
        // is valid when num_pivots >= k + 2.
        for (k = 7; k > 0; k--) {
            if (comp->num_pivots >= k + 2 && s >= comp->pivots[k]) {
                pivot = k;
                break;
            }
        }
        switch (comp->method[pivot]) {
        case 0: // polynomial
            coeffs[3] = 0.0f; // order=0 signals polynomial
            for (k = 0; k < 3; k++)
                coeffs[k] = comp->poly_coeffs[pivot][k];
            s = reshape_poly(s, coeffs);
            break;
        case 1: // MMR
            coeffs[0] = comp->mmr_constant[pivot];
            coeffs[1] = (float)(2 * i);
            coeffs[3] = (float)comp->mmr_order[pivot];
            s = reshape_mmr(sig_arr, coeffs, comp, pivot);
            break;
        }
        sig_arr[i] = CLAMP(s, comp->pivots[0], comp->pivots[comp->num_pivots - 1]);
    }
    // Write back all three channels, not just the first.
    for (i = 0; i < 3; i++)
        dest[i] = sig_arr[i];
}
Can you push your patch to this branch, and let's see what is going on? The signed -> unorm normalization is usually done by hardware on GPGPU, and I guess the difference in how it's calculated affects the accuracy.
Pushed the WIP branch: https://github.com/jellyfin/jellyfin-ffmpeg/pull/404
This is hardcoded to use dovi reshaping, so you will need a dovi input.
[Test only] Test perf of tonemapx filter
Test only, do not merge.