iEvgeny / cctv-viewer

CCTV Viewer - viewer and mounter of video streams.
GNU General Public License v3.0

Hardware accelerated decoding #9

Open · routerino opened this issue 3 years ago

routerino commented 3 years ago

Hi,

Just testing the program now. Is there an option to specify quicksync or VAAPI based decoding? Will the program do that automatically?

iEvgeny commented 3 years ago

Hi! No. Hardware acceleration support is not yet implemented.

erkexzcx commented 3 years ago

Are there any plans to implement hardware acceleration? From my understanding, specifying -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 is enough for the ffmpeg options in my case. However, when I specify these options, the CCTV Viewer application automatically removes the -hwaccel_device /dev/dri/renderD128 option, making it unusable.
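For context, these flags can be sanity-checked with standalone ffmpeg, independently of CCTV Viewer; a minimal sketch, assuming a reachable RTSP URL (the address below is a placeholder):

```
# Decode the stream with VA-API on the given render node and discard the output;
# low CPU usage here indicates the hardware path works outside the application.
ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 \
       -i rtsp://camera.example/stream -f null -
```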

iEvgeny commented 3 years ago

Hardware acceleration is not currently implemented. There is no point in trying to pass FFmpeg parameters via the command line or any other way. And yes, I have plans to implement hardware acceleration, but it's not a priority right now.

cybermaus commented 3 years ago

So, while I understand this is a huge undertaking and not something to expect soon, I did some comparing to see the benefit of this: running the same four 1280x720 15 fps H264 streams in both omxplayer and cctv-viewer on a RasPi 4 with 4 GB.

omxplayer: 16% CPU total (4% per process), 66 °C core temp
cctv-viewer: 260% CPU total, 85 °C core temp (temp warning flashing, and I suspect it is throttling)

I understand (well, suspect; I do not really know) that building full FFmpeg hardware support is not easy, but the benefit is rather significant.

In the interim, would it be possible to get some kind of script API call, passing the URL and screen coordinates, and if the script is not found or returns non-zero, continue with native FFmpeg processing?


Mind you, for now I am simply using cctv-viewer with the 640x480 substreams, and it runs fine with 4 of those. The recorder is still recording at full resolution, so this is quite acceptable for the monitoring station. I guess only when enlarging a stream would it be nice if it swapped to the main resolution temporarily.

Eventually I want to use a RasPi 3 however (those mini-HDMI ports are a pain), but I do not have one around for testing. I am hoping it too can carry 4 sub-streams OK.


PS: command used: `omxplayer --win 0,0,959,539 --avdict 'rtsp_transport:tcp' <url>`

iEvgeny commented 3 years ago

Hi! I think support for hardware acceleration will appear with the porting of the application to Qt 6.2. But in any case it will be a compromise solution. I expect that different platforms support a different number of simultaneously hardware-decoded video streams. In my case each preset contains up to 16 streams... I don't think every platform can decode such a set of data, and this case will need to be handled correctly somehow.

erkexzcx commented 3 years ago

Hi. Would it be possible to implement basic hardware acceleration? Something that is disabled by default and can then be enabled on a per-camera basis using command-line arguments (flags)? I am currently running this app as a 4x4 matrix on an Intel NUC, and the NUC gets pretty hot. Hardware acceleration would be greatly appreciated!

Even experimental hardware acceleration would be super awesome!

erkexzcx commented 2 years ago

Hi. Any plans for hardware acceleration, even the basic one?

MarcoRavich commented 1 year ago

+1 for hardware-accelerated decoding

It would be really interesting to squeeze out the full HW capabilities to obtain better performance (and, above all, lower latency)!

We suggest checking @rigaya's repos to see what is achievable: https://github.com/rigaya

Meanwhile you could try the `-hwaccel auto` FFmpeg parameter.

Last but not least, you can also find other ideas/approaches to (Linux) HW video decoding on this wiki page: https://github.com/opencv/opencv/wiki/Video-IO-hardware-acceleration

Hope that inspires!

iEvgeny commented 11 months ago

Hi all! I'm pleased to report that the built-in player has been radically redesigned: hardware-accelerated video decoding is now available in experimental mode, along with Zero-copy rendering for X11 desktops.

Hardware acceleration is controlled using the corresponding FFmpeg options:

1) `-hwaccel [method]`
2) `-hwaccel_output [backend]`

These options can be set globally in the application settings and for each viewport.

I recommend starting with the single option `-hwaccel`. In this case decoding will be performed in hardware, but rendering will still copy frames to system memory. The list of decoding methods available on your system can be obtained with the command `$ ffmpeg -hwaccels`.
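For illustration, the output on a typical Linux FFmpeg build looks something like this (the exact set of methods depends on how your FFmpeg was compiled):

```
$ ffmpeg -hwaccels
Hardware acceleration methods:
vdpau
cuda
vaapi
qsv
drm
opencl
vulkan
```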

Currently only the vaapi method is fully tested. For all the others you will probably need to install some additional libraries, and something is guaranteed not to work; I suspect it will be drm.

Please note that hardware decoding by its nature has hard restrictions. For example, my hardware does not support the Baseline profile for the h264 decoder. In such a case you will not see any message; decoding will silently continue on the CPU. An indication will be added in the future.

In general, hardware decoding with frame copying to system memory saves RAM. However, in some cases CPU load may even increase due to copying large amounts of data. In my case, with a large number of viewports, I see a significant saving of all resources.

However, the full potential of hardware decoding is revealed only with Zero-copy rendering!

Use the `-hwaccel_output` option in conjunction with the `-hwaccel` option, like this: `-hwaccel vaapi -hwaccel_output glx`. This combination activates hardware-accelerated video decoding with VA-API and Zero-copy rendering for X11 desktops.

SPECIAL NOTE:

1) Zero-copy is currently only implemented for X11-based systems (glx backend).
2) Zero-copy may also require additional packages to be installed. For example, in my case with Intel GPUs I need to install the intel-media-va-driver-non-free package (see the sketch after this list).
3) Actually, `-hwaccel_output` is not an FFmpeg option but a CCTV Viewer option. Do not look for information about it in the FFmpeg help. It is done this way for convenience and uniformity; it may be renamed or replaced by another mechanism in the future.
4) If possible, use a package from the PPA rather than SNAP. I have done my best, but due to container isolation, hardware-accelerated video decoding or Zero-copy rendering in SNAP may not work properly in some specific cases.
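As a rough guide, the prerequisites on a Debian/Ubuntu-based system with an Intel GPU might be installed like this (a sketch: the PPA short name and the cctv-viewer package name are assumptions based on the links in this thread; adjust for your distribution and GPU vendor):

```
# VA-API driver plus the vainfo diagnostic tool (it should list supported decode profiles)
sudo apt install intel-media-va-driver-non-free vainfo
vainfo

# PPA build of CCTV Viewer instead of the SNAP package (PPA name assumed from the Launchpad URL)
sudo add-apt-repository ppa:ievgeny/cctv-viewer
sudo apt update && sudo apt install cctv-viewer
```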

P.S. Due to the deep redesign of the built-in player, various regressions are possible. Please report them in separate threads.

MarcoRavich commented 11 months ago

...where can we download the 0.1.9 PPA? (We don't have/need snap on our Mint 21.2 NUC installation.)

iEvgeny commented 11 months ago

https://launchpad.net/~ievgeny/+archive/ubuntu/cctv-viewer

P.S. Ignore the version in the package name. I'll fix it soon.

MarcoRavich commented 11 months ago

OK, here's our 1st feedback on displaying 6 x 2 Mbps CBR RTSP streams (960x540, h264, 30 fps) from our PTZ cameras.

VAAPI HW decoding and Zero-copy work correctly - lowering CPU usage from 30% to ~10% and raising GPU usage from 0 to ~15% - but only after completely disabling the Xfce window manager's (Xfwm4, in our case) graphical acceleration: it does not work correctly with either the software compositor (Compositing: it blanks the second-to-last stream) or the hardware one (Compton: it produces display "errors" alternately on each stream) enabled.
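For anyone reproducing this, Xfwm4's built-in compositing can be toggled from the command line roughly as follows (assuming the xfconf-query tool is available; the property path is the standard Xfwm4 one, but verify it on your system):

```
# Disable Xfwm4's compositor (set the value back to true to re-enable it)
xfconf-query -c xfwm4 -p /general/use_compositing -s false
```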

Hope that helps.

Note: do you plan to enable QSV too?

iEvgeny commented 11 months ago

Note: do you plan to enable QSV too?

All methods enumerated in `$ ffmpeg -hwaccels` should be supported. You may only need to install some driver packages for your platform. Additional work will only be required to implement Zero-copy for each method. "Polishing" the existing functionality is the priority at the moment.

All new features related to hardware acceleration of video decoding will be reported in this thread.

Could you please specify what hardware platform you have?

P.S. By the way, try the `-fflags nobuffer -flags low_delay` options to reduce latency. Now all FFmpeg options are correctly passed to the corresponding subsystems.
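Putting the options from this thread together, a combined per-viewport option string might look like the line below (illustrative only; `-hwaccel_output` is the CCTV Viewer-specific option described above):

```
-hwaccel vaapi -hwaccel_output glx -fflags nobuffer -flags low_delay
```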

MarcoRavich commented 11 months ago

All methods enumerated in `$ ffmpeg -hwaccels` should be supported. You may only need to install some driver packages for your platform.

We've installed the intel-media-va-driver-non-free drivers but, of course, we can add the whole Intel oneAPI/VPL if needed.

Additional work will only be required to implement Zero-copy for each method. "Polishing" the existing functionality is the priority at the moment.

We'll soon test each method and report whether it works correctly. Just a question: since FFmpeg accepts "auto" for the `-hwaccel` option, does it work in CCTV Viewer too?

All new features related to hardware acceleration of video decoding will be reported in this thread.

We'll stay tuned.

Could you please specify what hardware platform you have?

As mentioned in this other issue, we use an Intel NUC NUC5i5MYHE that relies on an i5-5300U with embedded HD Graphics 5500.

P.S. By the way, try the `-fflags nobuffer -flags low_delay` options to reduce latency. Now all FFmpeg options are correctly passed to the corresponding subsystems.

OK, we'll test - and report - later.

iEvgeny commented 11 months ago

Just a question: since FFmpeg accepts "auto" for the `-hwaccel` option, does it work in CCTV Viewer too?

"auto" is not currently supported. But it doesn't do any miracles.

Strictly speaking, FFmpeg options fall into two categories: those implemented by the FFmpeg libraries (libavformat, libavcodec, etc.) and those implemented by the FFmpeg utilities (ffmpeg, ffplay, ffprobe). The former are passed to the libraries as-is; the latter must be implemented by CCTV Viewer, and their implementation may differ or be missing.

As for QSV, it seems to require specifying a compatible codec: https://trac.ffmpeg.org/wiki/Hardware/QuickSync#Decode-only. This option, and consequently the feature, is not yet available in CCTV Viewer. Looks like it's time to create a Wiki section...
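For reference, the decode-only pattern from that FFmpeg wiki page looks roughly like this when run with the standalone ffmpeg tool (a sketch; the input URL is a placeholder, and the explicit `-c:v h264_qsv` codec selection is exactly what CCTV Viewer does not expose yet):

```
# QSV-accelerated decode of an H.264 stream, discarding the decoded frames
ffmpeg -hwaccel qsv -c:v h264_qsv -i rtsp://camera.example/stream -f null -
```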

MarcoRavich commented 11 months ago

OK, quick second feedback: it works (with `-fflags nobuffer -flags low_delay` too), but only when choosing vaapi.

Every other acceleration method - at least on our configuration - has no impact on the CPU load.

Later we'll report on latencies with and without optimization. (Note: do you prefer pics or videos?)

iEvgeny commented 11 months ago

`-fflags nobuffer -flags low_delay` - these options are not related to hardware acceleration and can be tested independently.

Text information is enough, but supplementing it with graphic material will only make it better.

MarcoRavich commented 11 months ago

1st set of latency tests

Same hw/sw config (Intel NUC NUC5i5MYHE / Mint 21.2 XFCE / latest cctv) and RTSP streams (6 x 960x540 / h264 / 30fps @ 2Mbps - CBR). Note that latency has been tested on the 1st stream (fullscreened on a Philips 190S8FB/00 monitor) only.

We ran the (Russian) Sekundomer online stopwatch on a Xiaomi Redmi Note 9S and captured both outputs using a Canon EOS 1200D.

Both the PTZ and DSLR cameras have been manually configured to a 1/250 shutter speed.

RESULTS:

Let us know if you need more comprehensive tests (and the corresponding images).

iEvgeny commented 11 months ago

`-hwaccel_output glx` without `-hwaccel vaapi` makes no sense; it will just be software decoding. The `-fflags nobuffer` option looks irrelevant if you look at the FFmpeg source code.

I don't have much hope for latency reduction with hardware decoding, but it makes sense to test the following keysets:

* `-flags low_delay`
* `-hwaccel vaapi -flags low_delay`
* `-hwaccel vaapi -hwaccel_output glx`
* `-hwaccel vaapi -hwaccel_output glx -flags low_delay`

MarcoRavich commented 11 months ago

I don't have much hope for latency reduction with hardware decoding, but it makes sense to test the following keysets:

Of course, the target of HW decoding is offloading CPU work (= less energy consumption/heat), NOT latency.

* `-flags low_delay`

* `-hwaccel vaapi -flags low_delay`

* `-hwaccel vaapi -hwaccel_output glx`

* `-hwaccel vaapi -hwaccel_output glx -flags low_delay`

OK, later we'll test and report. Just a question: do we need to test latencies for all 6 streams (together, of course), or is it the same by default?

Last but not least, in this Stack Overflow reply @teocci suggests some other interesting FFmpeg parameters to test: How to minimize the delay in a live streaming with ffmpeg

iEvgeny commented 11 months ago

It makes sense to test only one stream, so that the multithreaded environment does not distort the result by competing for system resources.

Last but not least, in this Stack Overflow reply @teocci suggests some other interesting FFmpeg parameters to test: How to minimize the delay in a live streaming with ffmpeg

Thanks for info.

MarcoRavich commented 11 months ago

OK, after many tests - we also tried switching to a low-latency kernel - we cannot obtain any latency lower than 150 ms (no parameters) on our rig. It seems this is a "physical" limit.

Thanks for your active support.