I'm interested in the `interp_algo` options in upstream FFmpeg, particularly bicubic and lanczos interpolation. I find that the aliasing from bilinear is noticeable regardless of the quality of the other settings, especially with more extreme scaling, like for thumbnails. I think this would be nice to have as an option in Jellyfin as well - there are a lot of people who'd be happy to use a bit more GPU power to get a higher-quality result.
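For context on why the interpolators differ (this is not from the patch, just the standard kernel definitions): bilinear interpolation uses a 2-tap triangle kernel, while Lanczos is a windowed sinc with wider support and negative lobes, which is what suppresses aliasing when downscaling. A minimal sketch:

```python
import math

def lanczos(x, a=3):
    """Lanczos kernel: sinc(x) * sinc(x/a) for |x| < a, else 0.
    Using the normalized sinc, this works out to
    a * sin(pi*x) * sin(pi*x/a) / (pi*x)^2."""
    if x == 0:
        return 1.0
    if abs(x) >= a:
        return 0.0
    px = math.pi * x
    return a * math.sin(px) * math.sin(px / a) / (px * px)

def bilinear(x):
    """Triangle (tent) kernel used by bilinear interpolation."""
    return max(0.0, 1.0 - abs(x))

# Weights for a sample point halfway between source pixels:
# bilinear touches 2 taps, lanczos (a=3) touches 6, including negative lobes.
lanczos_taps = [lanczos(0.5 - i, a=3) for i in range(-2, 4)]
bilinear_taps = [bilinear(0.5 - i) for i in range(0, 2)]
```

The negative lobes (e.g. `lanczos(1.5) < 0`) are what give Lanczos its sharpening/anti-aliasing character compared to the strictly non-negative triangle kernel.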
This patch removes the `interp_algo` option, keeping only bilinear. It seems it was written a few years ago, and the upstream `scale_cuda` implementation has since added format conversion with this and this commit.
Is there a particular motivation for using this patch instead of the current upstream `scale_cuda` implementation?
a. Pixel format conversion: this seems to be addressed now, but maybe there's a conversion that's supported in the patch but not upstream?
b. Performance: the patch uses a LUT while upstream does not. I ran some performance tests below, and my environment doesn't see a speedup compared to `interp_algo=bilinear`. I only tested p010 -> p010, though.
If there is none, would it be acceptable to remove these changes? I found that I only needed to keep `dither_matrix.h` from this patch for it to build. The default `interp_algo` could also be patched to bilinear to preserve the current behavior.
If the patch should be kept, would it be acceptable to add the interpolation options to it? I played around with this and could possibly make a PR for it, but I don't have a background in graphics processing, so I might make mistakes.
For testing, I built jellyfin-ffmpeg with the `scale_cuda` patch stripped down to just `dither_matrix.h` and ran permutations of the following command with this video (downloaded in 1080p, 4K and 8K) on an RTX 4090:
`ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i reptiles.webm -vf scale_cuda=-2:4319:interp_algo=lanczos -f null -`, where `4319` and `lanczos` vary by run depending on target resolution and interpolation. I use 4319 here to keep more or less the same resolution while still forcing it to actually interpolate.
For the current jellyfin-ffmpeg, I ran the same command but without `interp_algo`.
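The permutations can be driven by a small script. This is a sketch of my own, not the exact commands I ran: the file names are placeholders for the downloaded copies, and it prints each command rather than executing it.

```shell
#!/bin/sh
# Build the benchmark command for one input/height/interpolator combination.
build_cmd() {
  # $1 = input file, $2 = target height, $3 = interp_algo value
  printf 'ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i %s -vf scale_cuda=-2:%s:interp_algo=%s -f null -' \
    "$1" "$2" "$3"
}

# Placeholder file names; 4319 stands in for the per-resolution target height.
for src in reptiles-1080p.webm reptiles-4k.webm reptiles-8k.webm; do
  for algo in bilinear bicubic lanczos; do
    # Dry run: print the command. Drop the echo (or pipe to sh) to execute.
    echo "$(build_cmd "$src" 4319 "$algo")"
  done
done
```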
| | 8K -> 8K | 8K -> 1440p | 4K -> 4K | 4K -> 1440p | 1080p -> 1080p* |
|---|---|---|---|---|---|
| Bilinear (Jellyfin) | fps=130 speed=2.18x | fps=131 speed=2.18x | fps=513 speed=8.56x | fps=519 speed=8.66x | fps=2000 speed=33.38x |
| Bilinear (upstream) | fps=131 speed=2.19x | fps=131 speed=2.19x | fps=533 speed=8.9x | fps=530 speed=8.84x | fps=2043 speed=34.08x |
| Bicubic (upstream) | fps=133 speed=2.22x | fps=131 speed=2.18x | fps=533 speed=8.9x | fps=519 speed=8.66x | fps=2016 speed=33.65x |
| Lanczos (upstream) | fps=130 speed=2.16x | fps=133 speed=2.22x | fps=515 speed=8.59x | fps=515 speed=8.59x | fps=2043 speed=34.08x |
*I saw run-to-run variation here, so I ran it six times for each and averaged the results. The others I ran only twice, because the second run was always within 1-2 fps, sometimes identical.
The results confuse me a bit, since I expected bicubic and lanczos to be slower. I hear a bit of coil whine with lanczos, but the FPS is similar across interpolators, and utilization is 95%+ for both Video Decode and 3D (3D occasionally dipping to around 60%). Decoding seems to be the bottleneck in my case; I assume that's because this GPU is overkill and there would be a measurable difference on other GPUs. If anything, upstream bilinear seems very slightly faster than the current patch.