jellyfin / jellyfin-ffmpeg

FFmpeg for Jellyfin
https://jellyfin.org

avfilter/tonemap: add simd implementation for sse and neon #401

Closed gnattu closed 3 months ago

gnattu commented 3 months ago

Currently only reinhard, linear and none have a SIMD implementation; all other methods fall back to the scalar implementation.

Reinhard is the preferred method on CPU because it is fast and produces subjectively satisfactory output, as the results tend to look brighter.

Test results with 4K HEVC 10-bit HLG input, encoding with libx264 veryfast and the reinhard method:

Apple M1 Max:

tonemap.neon: 44 fps, tonemap.c: 35 fps

Intel Core i9-12900:

tonemap.sse: 40 fps, tonemap.c: 32 fps

Both resulted in ~25% perf gain.

gnattu commented 3 months ago

An AVX implementation was also attempted, but there was no measurable perf gain. I dropped that draft to simplify the logic.

nyanmisaka commented 3 months ago

I have a draft of an improved sw tonemap filter, but it doesn't have intrinsics/assembly support yet. If you're interested you can test it out and see how it performs now vs zscale+tonemap combo.

gnattu commented 3 months ago

> I have a draft of an improved sw tonemap filter, but it doesn't have intrinsics/assembly support yet. If you're interested you can test it out and see how it performs now vs zscale+tonemap combo.

zscale does color space conversion and linearization very fast because it already uses a SIMD-optimized LUT, so a scalar filter can hardly beat it. What we can do with that draft is implement dovi reshaping and use it for dovi inputs. We could even implement only the reshaping part, pipe its output into zscale for linearization, and then tonemap with this filter.

The dovi reshaping part has a lot of SIMD optimization opportunities because it involves many matrix operations. Computing powers of floats is also time-consuming, which means a SIMD-optimized LUT is a must on CPU. This is also why BT2390 is not an easy task on CPU.

nyanmisaka commented 3 months ago

What ffmpeg command did you use to test zscale+tonemap? It seems difficult to add LUT support for dovi reshaping, and libplacebo doesn't do it either.

gnattu commented 3 months ago

> What ffmpeg command did you use to test zscale+tonemap?

Full command:

/path/to/ffmpeg -noautorotate -i file:"/path/to/input.mp4" -codec:v:0 libx264 -preset veryfast -crf 23 -maxrate 9871252 -bufsize 19742504 -x264opts:0 subme=0:me_range=4:rc_lookahead=10:me=dia:no_chroma_me:8x8dct=0:partitions=none -force_key_frames:0 "expr:gte(t,n_forced*3)" -sc_threshold:v:0 0 -vf "setparams=color_primaries=bt2020:color_trc=smpte2084:colorspace=bt2020nc,format=yuv420p,zscale=t=linear:npl=100,format=gbrpf32le,zscale=p=bt709,tonemap=tonemap=reinhard:desat=0:peak=100,zscale=t=bt709:m=bt709,format=yuv420p" -codec:a:0 copy -copyts -avoid_negative_ts disabled test.mp4

On some processor and input video combinations, you need to reduce -threads to a lower number like 1 to see the actual perf improvement from the SIMD optimization introduced in this PR. My guess is that high thread pressure lowers the cache hit rate enough to hide the gain.

> It seems difficult to add LUT support for dovi reshaping, and libplacebo doesn't do it either.

That is also fine. This PR does not add a LUT either; it just computes multiple pixels at the same time with SIMD, which is why reinhard is used. We can do the same with dovi reshaping.
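As a rough illustration of "multiple pixels at the same time", here is a simplified sketch in C. It is not the code from this PR. It assumes planar float data, a Reinhard curve of the form `sig / (sig + param) * (peak + param) / peak`, and hypothetical names (`reinhard_scalar`, `reinhard_row`). The real filter may differ, for instance in which signal the curve is applied to. Reinhard suits this approach because it needs only adds, multiplies and divides, all of which have direct SSE equivalents.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Scalar reference: simplified Reinhard curve for one sample. */
static float reinhard_scalar(float sig, float param, float peak)
{
    return sig / (sig + param) * (peak + param) / peak;
}

/* Apply the curve to n samples, four at a time where SSE is available. */
static void reinhard_row(float *dst, const float *src, size_t n,
                         float param, float peak)
{
    size_t i = 0;
    assert(peak > 0.0f && param > 0.0f);
#if defined(__SSE__)
    {
        /* Loop-invariant values are broadcast to all four lanes once. */
        const __m128 vparam = _mm_set1_ps(param);
        const __m128 vscale = _mm_set1_ps((peak + param) / peak);
        for (; i + 4 <= n; i += 4) {
            __m128 s = _mm_loadu_ps(src + i);
            __m128 r = _mm_div_ps(s, _mm_add_ps(s, vparam));
            _mm_storeu_ps(dst + i, _mm_mul_ps(r, vscale));
        }
    }
#endif
    /* Scalar tail, and the whole row on targets without SSE. */
    for (; i < n; i++)
        dst[i] = reinhard_scalar(src[i], param, peak);
}
```

The vector path precomputes `(peak + param) / peak`, so its results can differ from the scalar path by a rounding error in the last bits. A NEON version would follow the same shape with the corresponding intrinsics.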

gnattu commented 3 months ago

Closed in favor of #407