jellyfin / jellyfin-ffmpeg

FFmpeg for Jellyfin
https://jellyfin.org

Question about `scale_cuda` interpolation #491

Open mertalev opened 5 hours ago

mertalev commented 5 hours ago

I'm interested in the `interp_algo` options in upstream FFmpeg, particularly bicubic and lanczos interpolation. I find that the aliasing from bilinear is noticeable regardless of the quality of the other settings, especially with more extreme scaling, like for thumbnails. I think this would be nice to have as an option in Jellyfin as well - there are a lot of people who'd be happy to use a bit more GPU power to get a higher quality result.

This patch removes the `interp_algo` option, keeping only bilinear. It seems it was written a few years ago, and the upstream `scale_cuda` implementation has since gained format conversion with this and this commit.

  1. Is there a particular motivation for using this patch instead of the current upstream `scale_cuda` implementation?
     a. Pixel format conversion: this seems to be addressed now, but maybe there's a conversion that's supported in the patch but not upstream?
     b. Performance: the patch uses a LUT while upstream does not. I ran some performance tests below, and my environment sees no speedup compared to `interp_algo=bilinear`. I only tested p010 -> p010, though.
  2. If there is none, would it be acceptable to remove these changes? I found that I only needed to keep `dither_matrix.h` from this patch for it to build. The default `interp_algo` could also be patched to bilinear to preserve the current behavior.
  3. If the patch should be kept, would it be acceptable to add the interpolation options to it? I played around with this and could possibly make a PR for it, but I don't have a background in graphics processing, so I might make mistakes.

For testing, I built jellyfin-ffmpeg with the `scale_cuda` patch stripped down to just `dither_matrix.h` and ran permutations of the following command with this video (downloaded in 1080p, 4K, and 8K) on an RTX 4090: `ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i reptiles.webm -vf scale_cuda=-2:4319:interp_algo=lanczos -f null -`, where `4319` and `lanczos` vary by run depending on target resolution and interpolation. I use 4319 to keep the resolution more or less the same while forcing it to actually interpolate.

For the current jellyfin-ffmpeg, I ran the same command but without `interp_algo`.
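For reference, the permutation matrix I ran can be sketched as a loop. The source file names here are assumptions (I downloaded the video at each resolution separately); the script echoes the commands rather than executing them, so the matrix is easy to review:

```shell
#!/bin/sh
# Sketch of the benchmark matrix: 3 source resolutions x 3 interpolators.
# File names are assumed; commands are echoed, not executed.
cmds=""
for src in reptiles-1080p.webm reptiles-4k.webm reptiles-8k.webm; do
  for algo in bilinear bicubic lanczos; do
    cmd="ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i $src \
-vf scale_cuda=-2:4319:interp_algo=$algo -f null -"
    cmds="$cmds$cmd
"
    echo "$cmd"
  done
done
```

The target height would also vary per source resolution (e.g. 4319 only for the 8K same-resolution runs); I've kept it fixed here just to show the loop shape.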

|                     | 8K -> 8K             | 8K -> 1440p          | 4K -> 4K            | 4K -> 1440p         | 1080p -> 1080p*       |
|---------------------|----------------------|----------------------|---------------------|---------------------|-----------------------|
| Bilinear (Jellyfin) | fps=130 speed=2.18x  | fps=131 speed=2.18x  | fps=513 speed=8.56x | fps=519 speed=8.66x | fps=2000 speed=33.38x |
| Bilinear (upstream) | fps=131 speed=2.19x  | fps=131 speed=2.19x  | fps=533 speed=8.9x  | fps=530 speed=8.84x | fps=2043 speed=34.08x |
| Bicubic (upstream)  | fps=133 speed=2.22x  | fps=131 speed=2.18x  | fps=533 speed=8.9x  | fps=519 speed=8.66x | fps=2016 speed=33.65x |
| Lanczos (upstream)  | fps=130 speed=2.16x  | fps=133 speed=2.22x  | fps=515 speed=8.59x | fps=515 speed=8.59x | fps=2043 speed=34.08x |

*I got run-to-run variation on this, so I ran it six times for each and averaged the results. I ran the others only twice because the second run was always within 1-2 fps, sometimes exactly the same.
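Averaging the repeated runs amounts to pulling the `fps=` value out of each final progress line and taking the mean. A minimal sketch (the numbers below are illustrative, not my actual measurements):

```shell
#!/bin/sh
# Average the fps readout across repeated runs. The sample lines below
# are illustrative stand-ins for ffmpeg's final progress output.
runs="fps=1998 speed=33.3x
fps=2004 speed=33.4x
fps=2001 speed=33.4x"
# Split on '=' and space: field 2 is the fps number.
avg=$(printf '%s\n' "$runs" | awk -F'[= ]' '{sum += $2; n++} END {printf "%.1f", sum / n}')
echo "$avg"
```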

The results confuse me a bit, since I expected bicubic and lanczos to be slower. I hear a bit of coil whine with lanczos, but the FPS is similar across interpolators, and utilization is 95%+ for Video Decode and 3D (3D occasionally dipping to around 60%). It seems decoding is generally the bottleneck in my case. I assume this is because the GPU is overkill and there would be a measurable difference on other GPUs. If anything, though, upstream bilinear seems very slightly faster than the current implementation.

gnattu commented 3 hours ago

Your GPU is just too fast for such tasks; you won't see any meaningful perf difference.