nyanmisaka opened this issue 2 weeks ago
On Apple M1 Max, test with 4K input:
ffmpeg -noautorotate -i file:"input.mp4" -codec:v:0 libx264 -preset veryfast -crf 23 -maxrate 9871252 -bufsize 19742504 -x264opts:0 subme=0:me_range=4:rc_lookahead=10:me=dia:no_chroma_me:8x8dct=0:partitions=none -force_key_frames:0 "expr:gte(t,n_forced*3)" -sc_threshold:v:0 0 -vf "setparams=color_primaries=bt2020:color_trc=arib-std-b67:colorspace=bt2020nc,tonemapx=tonemap=reinhard:desat=0:peak=100:t=bt709:m=bt709:p=bt709:format=nv12" -codec:a:0 copy -copyts -avoid_negative_ts disabled test.mp4
Result:
tonemapx: frame= 2965 fps= 43 q=-1.0 Lsize= 106469kB time=00:01:38.58 bitrate=8847.4kbits/s speed= 1.43x
For comparison, zscale + simd-optimized tonemap:
ffmpeg -noautorotate -i file:"input.mp4" -map_metadata -1 -map_chapters -1 -threads 0 -map 0:0 -map 0:1 -map -0:s -codec:v:0 libx264 -preset veryfast -crf 23 -maxrate 9871252 -bufsize 19742504 -x264opts:0 subme=0:me_range=4:rc_lookahead=10:me=dia:no_chroma_me:8x8dct=0:partitions=none -force_key_frames:0 "expr:gte(t,n_forced*3)" -sc_threshold:v:0 0 -vf "setparams=color_primaries=bt2020:color_trc=arib-std-b67:colorspace=bt2020nc,format=yuv420p,zscale=t=linear:npl=100,format=gbrpf32le,zscale=p=bt709,tonemap=tonemap=reinhard:desat=0:peak=100,zscale=t=bt709:m=bt709,format=yuv420p" -codec:a:0 copy -copyts -avoid_negative_ts disabled test.mp4
Result:
frame= 2965 fps= 44 q=-1.0 Lsize= 105488kB time=00:01:38.58 bitrate=8766.0kbits/s speed=1.47x
Edit: my initial test used ffmpeg's format filter for the conversion; when using this filter's internal LUT instead, the speed is comparable when actually encoding.
How do they perform without the software decoder and encoder interfering?
ffmpeg -f lavfi -i nullsrc=s=3840x2160,format=p010 -vf ... -f null -
This LUT tonemap filter currently only accepts p010/p016 input. For yuv420p10, ffmpeg will automatically insert a converter.
Also on M1 Max:
Command:
./ffmpeg -t 90 -f lavfi -i nullsrc=s=3840x2160,format=p010 -vf "setparams=color_primaries=bt2020:color_trc=arib-std-b67:colorspace=bt2020nc,format=yuv420p,zscale=t=linear:npl=100,format=gbrpf32le,zscale=p=bt709,tonemap=tonemap=reinhard:desat=0:peak=100,zscale=t=bt709:m=bt709,format=yuv420p" -f null -
Result:
frame= 2250 fps= 76 q=-0.0 Lsize=N/A time=00:01:29.96 bitrate=N/A speed=3.05x
Command:
./ffmpeg -t 90 -f lavfi -i nullsrc=s=3840x2160,format=p010 -vf "setparams=color_primaries=bt2020:color_trc=arib-std-b67:colorspace=bt2020nc,tonemapx=tonemap=reinhard:desat=0:peak=100:t=bt709:m=bt709:p=bt709:format=nv12" -f null -
Result:
frame= 2250 fps= 66 q=-0.0 Lsize=N/A time=00:01:29.96 bitrate=N/A speed=2.62x
So the perf difference is larger when we are only running the filter chain, but it is hard to see such a difference in real-world scenarios due to the heavy compute load from the decoder and encoder.
Can adjusting -filter_threads improve performance on your end? It works for me on some processors.
To reflect performance on an entry-level platform, I also ran it on an RK3588 (4xA76+4xA55). Somehow it favors tonemapx over zscale+tonemap simd, although even tonemapx can't keep up with real-time speed at 4K.
zscale+tonemap simd:
./ffmpeg -filter_threads 12 -t 90 -f lavfi -i nullsrc=s=3840x2160,format=yuv420p10 -vf "setparams=color_primaries=bt2020:color_trc=arib-std-b67:colorspace=bt2020nc,format=yuv420p,zscale=t=linear:npl=100,format=gbrpf32le,zscale=p=bt709,tonemap=tonemap=reinhard:desat=0:peak=100,zscale=t=bt709:m=bt709,format=yuv420p" -f null -
frame= 2250 fps=4.2 q=-0.0 size=N/A time=00:01:29.96 bitrate=N/A speed=0.169x
tonemapx:
./ffmpeg -filter_threads 12 -t 90 -f lavfi -i nullsrc=s=3840x2160,format=p010 -vf "setparams=color_primaries=bt2020:color_trc=arib-std-b67:colorspace=bt2020nc,tonemapx=tonemap=reinhard:desat=0:peak=100:t=bt709:m=bt709:p=bt709:format=nv12" -f null -
frame= 2250 fps= 17 q=-0.0 Lsize=N/A time=00:01:29.96 bitrate=N/A speed=0.66x
Can adjusting -filter_threads improve performance on your end? It works for me on some processors.
It improves perf for both on M1 Max, but zscale+tonemap simd is still marginally faster:
./ffmpeg -filter_threads 16 -t 90 -f lavfi -i nullsrc=s=3840x2160,format=p010 -vf "setparams=color_primaries=bt2020:color_trc=arib-std-b67:colorspace=bt2020nc,format=yuv420p,zscale=t=linear:npl=100,format=gbrpf32le,zscale=p=bt709,tonemap=tonemap=none:desat=0:peak=100,zscale=t=bt709:m=bt709,format=yuv420p" -f null -
frame= 2250 fps= 96 q=-0.0 Lsize=N/A time=00:01:29.96 bitrate=N/A speed=3.83x
./ffmpeg -filter_threads 16 -t 90 -f lavfi -i nullsrc=s=3840x2160,format=p010 -vf "setparams=color_primaries=bt2020:color_trc=arib-std-b67:colorspace=bt2020nc,tonemapx=tonemap=none:desat=0:peak=100:t=bt709:m=bt709:p=bt709:format=nv12" -f null -
frame= 2250 fps= 91 q=-0.0 Lsize=N/A time=00:01:29.96 bitrate=N/A speed=3.66x
To reflect performance on an entry-level platform, I also ran it on an RK3588 (4xA76+4xA55). Somehow it favors tonemapx over zscale+tonemap simd, although even tonemapx can't keep up with real-time speed at 4K.
This is per-CPU behavior. If a CPU core's SIMD compute is very slow or its resources are limited (e.g. bandwidth constrained), it can hardly beat a static LUT, especially under heavy thread pressure. On such platforms, GPU-accelerated tonemapping is highly recommended.
The need for software tone mapping mainly arises from trickplay and keyframe-only extraction, which work on only a limited range of hardware. We run trickplay with only one thread by default, and under low thread pressure, SIMD tone mapping generally performs faster than scalar LUT tone mapping for the low-cost reinhard algorithm.
Makes sense. It would be perfect if we could combine the advantages of both. With an intrinsics-optimized LUT, various tonemap methods and dovi reshaping can be accelerated. Even mid-range processors could handle one or two 4k24fps transcodes w/ sw tonemap enabled.
zimg already has it, but they have not yet implemented tonemapping. We can try to contribute code or borrow some from it.
With an intrinsics-optimized LUT, various tonemap methods and dovi reshaping can be accelerated.
The actual acceleration probably isn't in the table lookup itself, but in the post-processing: multiplying by the scale, adding the offset, and clamping to the valid range. The lookup may be slightly faster by loading a full 128-bit packed vector of 4 floats instead of four separate 32-bit floats, but that gain should be smaller than fusing the multiply and add into one operation.
Also, I'm a little concerned about doing the DOVI LUT due to its high-dimensional nature and the need for multiple tables for a full reshaping. You need to query each pivot and use multiple inputs to get the reshaped output. Such a table would be huge and slow to look up, and probably wouldn't fit into many CPUs' L2 cache. If I were going to do it, I would probably just do the SIMD matrix operations instead of implementing a LUT.
Do you know how to convert the int16 color values into the 0.0-1.0 range we are expecting during dovi reshaping? My current "naive" implementation does not produce correct results:
inline static float dot(const float* x, const float* y, int len)
{
    int i;
    float result = 0;
    for (i = 0; i < len; i++) {
        result += x[i] * y[i];
    }
    return result;
}

inline static float reshape_poly(float s, float* coeffs) {
    return (coeffs[2] * s + coeffs[1]) * s + coeffs[0];
}
static float reshape_mmr(const float* sig, const float* coeffs, const struct ReshapeData *comp, int pivot_index) {
    int min_order = 3, max_order = 1;
    int order = (int)coeffs[3];
    int j;
    float s = coeffs[0];
    // First-order terms: the three channels plus their cross products.
    float sigX[7] = {sig[0], sig[1], sig[2],
                     sig[0] * sig[1], sig[0] * sig[2], sig[1] * sig[2],
                     sig[0] * sig[1] * sig[2]};
    min_order = FFMIN(min_order, comp->mmr_order[pivot_index]);
    max_order = FFMAX(max_order, comp->mmr_order[pivot_index]);
    s += dot(comp->mmr_coeffs[pivot_index][0], sigX, 7);
    if (max_order >= 2 && (min_order >= 2 || order >= 2)) {
        // Second-order terms are the element-wise squares of sigX; the
        // cross products (sigX[3..6]) must be squared too, not sigX[0..3].
        float sigX2[7];
        for (j = 0; j < 7; j++)
            sigX2[j] = sigX[j] * sigX[j];
        s += dot(comp->mmr_coeffs[pivot_index][1], sigX2, 7);
        if (max_order == 3 && (min_order == 3 || order >= 3)) {
            // Third-order terms: element-wise cubes of sigX.
            float sigX3[7];
            for (j = 0; j < 7; j++)
                sigX3[j] = sigX2[j] * sigX[j];
            s += dot(comp->mmr_coeffs[pivot_index][2], sigX3, 7);
        }
    }
    return s;
}
inline static void ycc2rgb(float* dest, float y, float cb, float cr, const double nonlinear[3][3], const float* ycc2rgb_offset)
{
    float offset1 = ycc2rgb_offset[0] * (float)nonlinear[0][0] + ycc2rgb_offset[1] * (float)nonlinear[0][1] + ycc2rgb_offset[2] * (float)nonlinear[0][2];
    float offset2 = ycc2rgb_offset[0] * (float)nonlinear[1][0] + ycc2rgb_offset[1] * (float)nonlinear[1][1] + ycc2rgb_offset[2] * (float)nonlinear[1][2];
    float offset3 = ycc2rgb_offset[0] * (float)nonlinear[2][0] + ycc2rgb_offset[1] * (float)nonlinear[2][1] + ycc2rgb_offset[2] * (float)nonlinear[2][2];
    dest[0] = (y * (float)nonlinear[0][0] + cb * (float)nonlinear[0][1] + cr * (float)nonlinear[0][2]) - offset1;
    dest[1] = (y * (float)nonlinear[1][0] + cb * (float)nonlinear[1][1] + cr * (float)nonlinear[1][2]) - offset2;
    dest[2] = (y * (float)nonlinear[2][0] + cb * (float)nonlinear[2][1] + cr * (float)nonlinear[2][2]) - offset3;
}

// This implementation does not do the costly linearization and de-linearization for performance reasons.
// The output color accuracy will be affected by this.
inline static void lms2rgb(float* dest, float l, float m, float s, const double linear[3][3])
{
    double lms2rgb_matrix[3][3];
    ff_matrix_mul_3x3(lms2rgb_matrix, dovi_lms2rgb_matrix, linear);
    dest[0] = l * (float)lms2rgb_matrix[0][0] + m * (float)lms2rgb_matrix[0][1] + s * (float)lms2rgb_matrix[0][2];
    dest[1] = l * (float)lms2rgb_matrix[1][0] + m * (float)lms2rgb_matrix[1][1] + s * (float)lms2rgb_matrix[1][2];
    dest[2] = l * (float)lms2rgb_matrix[2][0] + m * (float)lms2rgb_matrix[2][1] + s * (float)lms2rgb_matrix[2][2];
}
#define CLAMP(a, b, c) (FFMIN(FFMAX((a), (b)), (c)))
static void reshape_dovi_yuv(float* dest, float* src, TonemapxContext *ctx)
{
    int i, k;
    float s;
    float coeffs[4] = {0, 0, 0, 0};
    float sig_arr[3] = {CLAMP((src[0] - 2048.0f) / 28672.0f, 0.0f, 1.0f),
                        CLAMP((src[1] - 2048.0f) / 28672.0f, 0.0f, 1.0f),
                        CLAMP((src[2] - 2048.0f) / 28672.0f, 0.0f, 1.0f)};
    for (i = 0; i < 3; i++) {
        const struct ReshapeData *comp = &ctx->dovi->comp[i];
        int pivot = 0;
        s = sig_arr[i];
        // Select the highest pivot segment the signal falls into. With
        // num_pivots pivots there are num_pivots - 1 segments, so segment k
        // is valid when num_pivots >= k + 2.
        for (k = 7; k > 0; k--) {
            if (comp->num_pivots >= k + 2 && s >= comp->pivots[k]) {
                pivot = k;
                break;
            }
        }
        switch (comp->method[pivot]) {
        case 0: // polynomial
            coeffs[3] = 0.0f; // order=0 signals polynomial
            for (k = 0; k < 3; k++)
                coeffs[k] = comp->poly_coeffs[pivot][k];
            s = reshape_poly(s, coeffs);
            break;
        case 1: // MMR
            coeffs[0] = comp->mmr_constant[pivot];
            coeffs[1] = (float)(2 * i);
            coeffs[3] = (float)comp->mmr_order[pivot];
            s = reshape_mmr(sig_arr, coeffs, comp, pivot);
            break;
        }
        sig_arr[i] = CLAMP(s, comp->pivots[0], comp->pivots[comp->num_pivots - 1]);
    }
    // Write back all three channels, not just the first.
    for (i = 0; i < 3; i++)
        dest[i] = sig_arr[i];
}
Can you push your patch to this branch, and let's see what is going on? The signed -> unorm normalization is usually done by hardware on GPGPU, and I guess the difference in how it's calculated affects the accuracy.
Pushed the WIP branch: https://github.com/jellyfin/jellyfin-ffmpeg/pull/404
This is hardcoded to use dovi reshaping, so you will need a dovi input.
[Test only] Test perf of tonemapx filter
Test only, do not merge.