Support hardware-accelerated decoding and tone-mapping

mertalev commented 11 months ago

This is a tracking issue for adding hardware decoding and tone-mapping support for transcoding.

What

Hardware-accelerated decoding loads videos onto an acceleration device, which decodes them using its built-in support for certain codecs and formats. This differs from software decoding, where videos are instead loaded and decoded by a program running on the CPU.

Hardware-accelerated tone-mapping is similarly performed within the acceleration device, but takes place after decoding.

Why

Hardware decoding is good for a number of reasons.

It's faster
- Accelerated decoding is naturally faster by virtue of its dedicated hardware optimization
- By keeping data in the acceleration device, it avoids starvation from the CPU not serving decoded data quickly enough
- Decoded video can be directly used by the acceleration device without needing to do a relatively expensive CPU->GPU transfer
- It avoids contention in cases where the CPU is concurrently doing other intensive work
It reduces CPU load
- Since the CPU doesn't decode the video, the load incurred by decoding is minimal
- Particularly on lower-end devices, the acceleration device can be drastically faster than the CPU, so software decoding has to work the CPU heavily just to keep up with the device's encoding speed

Concerns

Source videos come in many different forms, and it's tricky to know in advance whether the device can decode a given video (at least in JavaScript)
Different APIs may expose different tone-mapping options and modes, so supporting the current settings in each API may require more effort
Hardware tone-mapping is essentially a prerequisite for hardware decoding
- Since hardware decoding loads videos to the device, software tone-mapping would require a GPU->CPU transfer after decoding followed by another CPU->GPU transfer after tone-mapping, the overhead of which defeats the point of acceleration
While less relevant in recent years, hardware decoding can in some cases have lower quality than software decoding
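
The transfer argument above can be made concrete by counting CPU/GPU copies for a few pipeline layouts. This is only an illustrative sketch: the stage names and the `count_transfers` helper are hypothetical (not part of Immich or FFmpeg), and it assumes every change of location between consecutive stages costs exactly one copy.

```python
from typing import List, Tuple

# Each stage is (name, location), where location is "cpu" or "gpu".
# These labels are illustrative assumptions, not FFmpeg or Immich terms.
Stage = Tuple[str, str]


def count_transfers(stages: List[Stage]) -> int:
    """Count copies between CPU and GPU memory across consecutive stages."""
    return sum(1 for (_, a), (_, b) in zip(stages, stages[1:]) if a != b)


# Hardware decode with software tone-mapping: GPU -> CPU -> GPU.
hw_decode_sw_tonemap = [("decode", "gpu"), ("tonemap", "cpu"), ("encode", "gpu")]
# Software decode and tone-mapping, hardware encode: one upload before encoding.
sw_decode_sw_tonemap = [("decode", "cpu"), ("tonemap", "cpu"), ("encode", "gpu")]
# Everything on the device: no copies between stages.
all_hw = [("decode", "gpu"), ("tonemap", "gpu"), ("encode", "gpu")]

for label, pipeline in [
    ("hw decode + sw tonemap", hw_decode_sw_tonemap),
    ("sw decode + sw tonemap", sw_decode_sw_tonemap),
    ("all hardware", all_hw),
]:
    print(label, count_transfers(pipeline))
```

Under these assumptions, mixing hardware decoding with software tone-mapping is the only layout that needs two copies per frame, which is why hardware tone-mapping is treated as a prerequisite.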

Tasks

[x] Ensure it uses software decoding if the user's hardware can't decode or tone-map a video
[ ] Ensure that in cases of incompatibility, it still uses accelerated encoding rather than falling back entirely to software
[x] Ensure that current tone-mapping options are available for each API where possible
[x] Support Quick Sync
[x] Support NVENC
[ ] Support VAAPI
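
The first two tasks describe a fallback order, which might be sketched roughly as follows. The mode labels, `plan_attempts`, and `transcode_with_fallback` are hypothetical names for illustration and do not reflect how Immich implements this. Since it is hard to know in advance whether a device can decode a given video, the sketch simply tries each mode in turn and keeps the first that succeeds.

```python
from typing import Callable, List, Optional

# Hypothetical mode labels, ordered from most to least accelerated.
HW_DECODE_HW_ENCODE = "hw-decode+hw-encode"
SW_DECODE_HW_ENCODE = "sw-decode+hw-encode"
SW_ONLY = "software-only"


def plan_attempts(hw_decode_ok: bool, hw_tonemap_ok: bool, needs_tonemap: bool) -> List[str]:
    """Return the modes to try, in order of preference."""
    attempts: List[str] = []
    # Hardware decoding only makes sense if tone-mapping (when needed)
    # can also stay on the device.
    if hw_decode_ok and (hw_tonemap_ok or not needs_tonemap):
        attempts.append(HW_DECODE_HW_ENCODE)
    # Keep accelerated encoding even when decoding falls back to software.
    attempts.append(SW_DECODE_HW_ENCODE)
    # Last resort: no acceleration at all.
    attempts.append(SW_ONLY)
    return attempts


def transcode_with_fallback(attempts: List[str], run: Callable[[str], bool]) -> Optional[str]:
    """Try each mode with the caller-supplied runner; return the first that succeeds."""
    for mode in attempts:
        if run(mode):
            return mode
    return None
```

Here `run` stands in for whatever actually launches the transcode and reports success, so a failed hardware-decode attempt drops to software decoding with accelerated encoding before giving up on acceleration entirely.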

cyfdecyf commented 11 months ago

Just adding my two cents about tone-mapping using Intel Quick Sync.

Calling ffmpeg with -vf "vpp_qsv=tonemap=1" enables hardware-accelerated tone-mapping with QSV, but it's only available when FFmpeg is compiled with oneVPL enabled. (When running FFmpeg's ./configure, just replace --enable-libmfx with --enable-libvpl; the Intel oneVPL library is needed, of course.) Given oneVPL's dispatching behavior, I'm not sure whether this would work with Intel processors older than Tiger Lake.

mertalev commented 11 months ago

The version of FFmpeg we use is built with oneVPL, so that shouldn't be an issue, but it does seem like this wouldn't work if it dispatches to Media SDK. The Jellyfin docs mention that the main advantage of QSV's tone-mapping is lower power consumption, but that OpenCL otherwise has wider hardware compatibility and is more customizable. Maybe that's the direction to go in that case.

yodatak commented 11 months ago

Will this also apply to thumbnail generation? It could be good for big photo libraries.

mertalev commented 11 months ago

No, it wouldn't have an effect on images. But for live/motion photos, the video portion of these would benefit.

rishid commented 7 months ago

I was curious why Immich, even with hardware transcoding enabled, was basically maxing out my 16 CPU cores with only 1 transcode job running. It is also only using 15% of my GPU render capability.

I went ahead and played with some of the ffmpeg options. Most of this is known but just adding my findings here:

Here is a sample Immich ffmpeg call when using Intel QSV:

ffmpeg -init_hw_device qsv=hw -filter_hw_device hw -i upload/upload/4ef.../...c2f.MOV -y -c:v hevc_qsv -c:a aac -movflags faststart -fps_mode passthrough -map 0:0 -map 0:1 -bf 7 -refs 5 -g 256 -v verbose -vf zscale=t=linear:npl=100,tonemap=hable:desat=0,zscale=p=bt709:t=bt709:m=bt709:range=pc,format=nv12,hwupload=extra_hw_frames=64,scale_qsv=1080:-1 -preset 7 -global_quality 23 upload/encoded-video/4ef.../...c7d.mp4

As stated in the original post, we aren't using hardware decoding. By enabling it, I see about a 5% reduction in CPU load. I also get a 5% improvement by setting the preset to fast.

I am not super familiar with ffmpeg, but the remainder of the extra CPU load is coming from the filters. Is there a reason we need to do tone-mapping and all the zscale options? If I trim it down to the following, I get about a 75% reduction in CPU load:

/usr/lib/jellyfin-ffmpeg/ffmpeg -init_hw_device qsv=hw -filter_hw_device hw -c:v hevc_qsv -i /config/test.MOV -y -c:v hevc_qsv -c:a aac -movflags faststart -fps_mode passthrough -map 0:0 -map 0:1 -bf 7 -refs 5 -g 256 -v verbose -vf format=nv12,hwupload=extra_hw_frames=64,scale_qsv=1080:-1 -preset fast -global_quality 23 /config/test_OUT.mp4
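
For anyone wanting to reproduce comparisons like these, one rough approach is to measure the total CPU time each command consumes. The sketch below is an assumption about how that could be done, not how the numbers above were obtained: `child_cpu_seconds` and `reduction_percent` are hypothetical helpers, the measurement relies on `os.times()` child counters (meaningful on Unix-like systems only), and it reports total CPU seconds rather than the instantaneous load shown by tools like top.

```python
import os
import subprocess
from typing import List


def child_cpu_seconds(cmd: List[str]) -> float:
    """Run cmd to completion and return the CPU time (user + system) it consumed.

    Uses the children_* fields of os.times(), which accumulate CPU time of
    child processes that have been waited for. Assumes a Unix-like system.
    """
    before = os.times()
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    after = os.times()
    return (
        (after.children_user - before.children_user)
        + (after.children_system - before.children_system)
    )


def reduction_percent(baseline: float, candidate: float) -> float:
    """Percentage by which candidate is lower than baseline."""
    return 100.0 * (baseline - candidate) / baseline
```

Calling `child_cpu_seconds` once with the full filter chain and once with the trimmed one, then passing both results to `reduction_percent`, gives a figure comparable across runs on the same machine and input file.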

alextran1502 commented 7 months ago

@rishid Thumbnail generation is still using the CPU. If you don't have machine learning set up to use the GPU, it will also use the CPU.

rishid commented 7 months ago

Sure, understood, but specifically the single parent ffmpeg process, which is doing the video transcoding for encoded-videos, is showing CPU usage of ~800% on my machine.

alextran1502 commented 7 months ago

Unsure, perhaps the passthrough configuration is not right?

rishid commented 7 months ago

I completely forgot there are a lot of config knobs for Video Transcoder settings available in Immich - I think all my observations can be controlled already.

mertalev commented 3 months ago

For Quick Sync, I got VPP tone-mapping working, but OpenCL doesn't work (something about not being able to allocate memory to the OpenCL device) and Vulkan is almost thrice as slow because it doesn't support zero-copy like it does for CUDA. VPP doesn't have the tone-mapping settings we use for other backends, but it is also the fastest option and tailored specifically for Intel devices. I can use that for QSV and let VAAPI use OpenCL (once I figure out how to get it to work).
