I'm interested in the `interp_algo` options in upstream FFmpeg, particularly bicubic and lanczos interpolation. I find that the aliasing from bilinear is noticeable regardless of the quality of the other settings, especially with more extreme scaling, like for thumbnails. I think this would be nice to have as an option in Jellyfin as well - there are a lot of people who'd be happy to use a bit more GPU power to get a higher-quality result.
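For context on why the interpolators differ (this is not from the patch, just the standard kernel definitions): bilinear interpolation uses a 2-tap triangle kernel, while Lanczos is a windowed sinc with wider support and negative lobes, which is what suppresses aliasing when downscaling. A minimal sketch:

```python
import math

def lanczos(x, a=3):
    """Lanczos kernel: sinc(x) * sinc(x/a) for |x| < a, else 0.
    Using the normalized sinc, this works out to
    a * sin(pi*x) * sin(pi*x/a) / (pi*x)^2."""
    if x == 0:
        return 1.0
    if abs(x) >= a:
        return 0.0
    px = math.pi * x
    return a * math.sin(px) * math.sin(px / a) / (px * px)

def bilinear(x):
    """Triangle (tent) kernel used by bilinear interpolation."""
    return max(0.0, 1.0 - abs(x))

# Weights for a sample point halfway between source pixels:
# bilinear touches 2 taps, lanczos (a=3) touches 6, including negative lobes.
lanczos_taps = [lanczos(0.5 - i, a=3) for i in range(-2, 4)]
bilinear_taps = [bilinear(0.5 - i) for i in range(0, 2)]
```

The negative lobes (e.g. `lanczos(1.5) < 0`) are what give Lanczos its sharpening/anti-aliasing character compared to the strictly non-negative triangle kernel.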
This patch removes the `interp_algo` option, keeping only bilinear. It seems it was written a few years ago, and the upstream `scale_cuda` implementation has since added format conversion with this and this commit.
Is there a particular motivation for using this patch instead of the current upstream `scale_cuda` implementation?
a. Pixel format conversion: this seems to be addressed now, but maybe there's a conversion that's supported in the patch but not upstream?
b. Performance: the patch uses a LUT while upstream does not. I ran some performance tests below, and my environment doesn't see a speedup compared to `interp_algo=bilinear`. I only tested p010 -> p010, though.
If there is none, would it be acceptable to remove these changes? I found that I only needed to keep `dither_matrix.h` from this patch for it to build. The default `interp_algo` could also be patched to bilinear to preserve the current behavior.
If the patch should be kept, would it be acceptable to add the interpolation options to it? I played around with this and could possibly make a PR for it, but I don't have a background in graphics processing, so I might make mistakes.
For testing, I built jellyfin-ffmpeg with the `scale_cuda` patch stripped down to just `dither_matrix.h` and ran permutations of the following command with this video (downloaded in 1080p, 4K and 8K) on an RTX 4090:
`ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i reptiles.webm -vf scale_cuda=-2:4319:interp_algo=lanczos -f null -`, where `4319` and `lanczos` vary by run depending on target resolution and interpolation. I use 4319 here to keep more or less the same resolution while still forcing it to actually interpolate.
For the current jellyfin-ffmpeg, I ran the same command but without `interp_algo`.
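The permutations can be driven by a small script. This is a sketch of my own, not the exact commands I ran: the file names are placeholders for the downloaded copies, and it prints each command rather than executing it.

```shell
#!/bin/sh
# Build the benchmark command for one input/height/interpolator combination.
build_cmd() {
  # $1 = input file, $2 = target height, $3 = interp_algo value
  printf 'ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i %s -vf scale_cuda=-2:%s:interp_algo=%s -f null -' \
    "$1" "$2" "$3"
}

# Placeholder file names; 4319 stands in for the per-resolution target height.
for src in reptiles-1080p.webm reptiles-4k.webm reptiles-8k.webm; do
  for algo in bilinear bicubic lanczos; do
    # Dry run: print the command. Drop the echo (or pipe to sh) to execute.
    echo "$(build_cmd "$src" 4319 "$algo")"
  done
done
```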
| | 8K -> 8K | 8K -> 1440p | 4K -> 4K | 4K -> 1440p | 1080p -> 1080p* |
|---|---|---|---|---|---|
| Bilinear (Jellyfin) | fps=130 speed=2.18x | fps=131 speed=2.18x | fps=513 speed=8.56x | fps=519 speed=8.66x | fps=2000 speed=33.38x |
| Bilinear (upstream) | fps=131 speed=2.19x | fps=131 speed=2.19x | fps=533 speed=8.9x | fps=530 speed=8.84x | fps=2043 speed=34.08x |
| Bicubic (upstream) | fps=133 speed=2.22x | fps=131 speed=2.18x | fps=533 speed=8.9x | fps=519 speed=8.66x | fps=2016 speed=33.65x |
| Lanczos (upstream) | fps=130 speed=2.16x | fps=133 speed=2.22x | fps=515 speed=8.59x | fps=515 speed=8.59x | fps=2043 speed=34.08x |
*I saw run-to-run variation here, so I ran it six times for each and averaged the results. The others I ran only twice, because the second run was always within 1-2 fps, sometimes identical.
The results confuse me a bit, since I expected bicubic and lanczos to be slower. I hear a bit of coil whine with lanczos, but the FPS is similar across interpolators, and utilization is 95%+ for both Video Decode and 3D (3D occasionally dipping to around 60%). Decoding seems to be the bottleneck in my case; I assume that's because this GPU is overkill and there would be a measurable difference on other GPUs. If anything, upstream bilinear seems very slightly faster than the current patch.