Request: OpenCL NLMeans

quamt commented 3 months ago

Hello @rigaya,

As always, I'm inspired by your dedication to enhancing QSVEnc and am grateful for the opportunity to contribute ideas that might further enrich this incredible tool.

Today, I want to discuss a particular situation that is significant to many of us: improving denoise filters for anime content. Anime, with its unique visual characteristics, requires a careful approach to noise reduction that maintains its sharp edges and vibrant colours.

Integrating FFmpeg's Non-local Means (NLM) denoise filter, especially its OpenCL implementation, could be highly beneficial. NLM is widely recognized for its effectiveness in managing stylized visuals in anime and cartoons. It reduces noise while preserving the integrity of line art and flat colour areas, which are fundamental elements of anime's aesthetic appeal.

While filters like --vpp-convolution3d and --vpp-knn are useful for noise reduction, they may not be as effective as NLM in preserving the details in animated content. For example, --vpp-convolution3d is great for reducing spatial and temporal noise in live-action content, but it can sometimes blur the distinct lines and colours in anime. Similarly, --vpp-knn is good at reducing strong noise, but it may not always be able to distinguish between noise and the unique artistic elements that define anime.

NLM could complement the current tools by offering an approach tailored to anime and comparable content, helping QSVEnc users achieve better denoising results while preserving the distinctive visual features of anime, even in the presence of noise.

Adding a new filter like NLM requires careful consideration of various technical and performance aspects. However, due to anime's increasing popularity and the particular challenges involved in processing this type of content, incorporating NLM could greatly enhance QSVEnc's appeal and usefulness for a broader range of users.

Thank you for your time in considering this proposal. Your work is truly inspiring and continues to drive the community forward. I am eager to hear your thoughts on this matter.

For further reference on the OpenCL implementation, you might find this FFmpeg documentation useful: https://ffmpeg.org/doxygen/trunk/opencl_8c.html

rigaya commented 3 months ago

I think I'll be able to add an nlmeans filter to QSVEnc.

However, please note that it will take time, as I need to implement it in a different way and also try to optimize it, since I think nlmeans is rather slow with a naive implementation.

quamt commented 3 months ago

Thank you for considering integrating the NLMeans filter into QSVEnc. Your willingness to explore new ideas, even with challenges like this one, truly showcases your dedication to enhancing QSVEnc.

Adding such a feature might take time, especially ensuring it meets your high standards. Please know there's no rush; your careful and systematic approach is one of the reasons the community values your work so much.

Again, thank you for your incredible support and for continually pushing the boundaries of what QSVEnc can do. I appreciate your efforts.

rigaya commented 2 months ago

It's been a while, but I have added the nlmeans filter (--vpp-nlmeans) in QSVEnc 7.63.

While nlmeans itself is a relatively simple algorithm, implementing this filter took a long time because of the large computational cost of the algorithm and the need for optimization. Complicated optimizations easily led to bugs, and it was quite difficult to track them down and fix them.

The actual implementation is based on the method described in the link, and differs from the one in ffmpeg.

This is because I cannot port the implementation directly from ffmpeg. Also, the nlmeans implementation in ffmpeg is far from the basic algorithm, perhaps due to a different optimization method, so honestly I could not understand what is going on there.

Therefore, please note that the implementation is not entirely compatible with ffmpeg's, and the σ and h parameters for denoising strength have a different range from those of ffmpeg. Also, the default patch size and search size of nlmeans in ffmpeg are rather large (and therefore slow) for use with the hw encoder, so I have used slightly smaller sizes as the defaults (patch 7->5, search 15->11). Of course, these can be changed via parameters.
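For readers unfamiliar with what the patch size, search size, σ and h control, here is a minimal, deliberately naive sketch of non-local means on a single image plane. It is not the QSVEnc or ffmpeg implementation. The function name `nlmeans_naive`, the default values, the border clamping and the weight formula `exp(-max(d - 2σ², 0) / h²)` (one common textbook formulation) are all assumptions made for illustration, so the parameter ranges will not match either tool.

```python
import math

def nlmeans_naive(img, patch=5, search=11, sigma=0.05, h=0.1):
    """Naive NLM on a 2-D list of floats (one plane, values roughly in [0, 1]).

    Illustrative only: parameter meanings follow a common textbook
    formulation, not necessarily the ranges used by QSVEnc or ffmpeg.
    """
    height, width = len(img), len(img[0])
    pr, sr = patch // 2, search // 2

    def px(y, x):
        # Clamp coordinates so patches near the border stay inside the image.
        return img[min(max(y, 0), height - 1)][min(max(x, 0), width - 1)]

    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            weight_sum = 0.0
            # Visit every candidate pixel inside the search window.
            for dy in range(-sr, sr + 1):
                for dx in range(-sr, sr + 1):
                    # Mean squared difference between the patch around (y, x)
                    # and the patch around the candidate (y + dy, x + dx).
                    dist = 0.0
                    for py in range(-pr, pr + 1):
                        for qx in range(-pr, pr + 1):
                            d = px(y + py, x + qx) - px(y + dy + py, x + dx + qx)
                            dist += d * d
                    dist /= patch * patch
                    # Similar patches get a weight near 1, dissimilar ones near 0.
                    w = math.exp(-max(dist - 2.0 * sigma * sigma, 0.0) / (h * h))
                    acc += w * px(y + dy, x + dx)
                    weight_sum += w
            # The zero-offset candidate always has weight 1, so weight_sum >= 1.
            out[y][x] = acc / weight_sum
    return out
```

Calling `nlmeans_naive(plane, patch=3, search=7)` on a small test plane is enough to see the effect. The four nested offset loops per output pixel are what make an unoptimized version slow, which is why smaller patch and search defaults help so much.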

Still, to be honest, an iGPU would be too slow for this filter. It is intended more for use on dGPUs like the A770, A750 and A380. For example, at 1080p with default parameters, the A380 ran at around 37 fps, but the UHD770 ran at only 8 fps.
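To put these numbers in perspective, here is a rough back-of-the-envelope estimate of how much work a naive nlmeans would do. It assumes a cost of search² × patch² pixel differences per output pixel, a single 1920x1080 plane, and no optimization at all. The helper name `naive_patch_ops` is hypothetical, and the real optimized filter will not perform exactly this many operations.

```python
def naive_patch_ops(patch, search):
    """Pixel differences per output pixel for a naive NLM (assumed cost model)."""
    return (search * search) * (patch * patch)

ffmpeg_default = naive_patch_ops(patch=7, search=15)  # 225 * 49 = 11025
qsvenc_default = naive_patch_ops(patch=5, search=11)  # 121 * 25 = 3025
ratio = ffmpeg_default / qsvenc_default

print(f"naive cost ratio, ffmpeg defaults vs QSVEnc defaults: {ratio:.2f}x")

# Assumption: "1080p" means one 1920x1080 plane; chroma planes are ignored.
pixels_1080p = 1920 * 1080
for name, fps in (("A380", 37), ("UHD770", 8)):
    ops_per_sec = pixels_1080p * fps * qsvenc_default
    print(f"{name}: about {ops_per_sec / 1e9:.1f} G pixel differences per second")
```

Under this simple model the smaller defaults cut the per-pixel work by a factor of roughly 3.6, and even then the naive count at the reported frame rates is in the tens to hundreds of billions of pixel differences per second.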

[Image: qsvencc_nlmeans_20240428]

quamt commented 2 months ago

Thank you for adding the nlmeans filter in QSVEnc 7.63! Given the computational complexities involved, I really appreciate the effort and time it must have taken to implement this. It's great that you optimized it for hardware-accelerated encoding, even if it required intricate debugging.

I understand your challenges in optimizing the algorithm while keeping the computational cost manageable. Adjusting the default parameters for patch and search size sounds like a good move, considering the performance limitations with integrated GPUs.

I'm eager to try out this filter and see how it performs on dGPUs like the A770 and A750. I will explore the new filter and see how the tuning affects the output.

Thanks again for your hard work!

quamt commented 2 months ago

I ran some test files, and the results are fantastic. On my A770 setup, I'm getting around 50 fps with fp16=none and all other settings at their defaults, which is excellent performance considering the algorithm's complexity.

Your hard work and attention to detail in optimizing the filter are much appreciated. Thank you. I'm looking forward to using nlmeans in my future projects.

rigaya commented 2 months ago

Thank you for testing vpp-nlmeans. It's nice to hear that the implementation also works fine on the A770 and that the results are as expected!

In terms of performance, the A380 runs at around 20 fps with fp16=none, so the A770 is about 2.5x faster than the A380. That seems quite understandable, as the memory bandwidth of the A770 is theoretically 3x that of the A380 (vpp-nlmeans is memory bandwidth bound).
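As a quick check of that reasoning, the arithmetic can be written out as below. The frame rates are the fp16=none figures reported in this thread, and the 3x bandwidth ratio is taken as stated rather than measured; the variable names are only for illustration.

```python
a770_fps = 50.0        # reported above with fp16=none, other settings default
a380_fps = 20.0        # reported above with fp16=none
bandwidth_ratio = 3.0  # "theoretically x3" as stated in the thread, not measured

speedup = a770_fps / a380_fps
scaling_efficiency = speedup / bandwidth_ratio

print(f"measured speedup: {speedup:.2f}x")
print(f"fraction of the theoretical bandwidth ratio: {scaling_efficiency:.0%}")
```

A speedup that lands at a bit over 80% of the bandwidth ratio is consistent with a filter whose throughput is limited mainly by memory bandwidth rather than compute.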

I'll close the issue as the requested feature has been added.

rigaya / QSVEnc

Request: OpenCL NLMeans #191