intel / libva

Libva is an implementation for VA-API (Video Acceleration API)
http://intel.github.io/libva/

VA-API hardware decoding is slower than software decoding on Intel Celeron N4000 #734

Closed Talkless closed 1 year ago

Talkless commented 1 year ago

We have a very small Chinese mini-PC with an Intel Celeron N4000.

I've installed Debian 12 in it, with VA-API:

$ vainfo 
libva info: VA-API version 1.17.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_17
libva info: va_openDriver() returns 0
vainfo: VA-API version: 1.17 (libva 2.12.0)
vainfo: Driver version: Intel iHD driver for Intel(R) Gen Graphics - 23.1.1 ()
vainfo: Supported profile and entrypoints
      VAProfileNone                   : VAEntrypointVideoProc
      VAProfileNone                   : VAEntrypointStats
      VAProfileMPEG2Simple            : VAEntrypointVLD
      VAProfileMPEG2Main              : VAEntrypointVLD
      VAProfileH264Main               : VAEntrypointVLD
      VAProfileH264Main               : VAEntrypointEncSlice
      VAProfileH264Main               : VAEntrypointFEI
      VAProfileH264Main               : VAEntrypointEncSliceLP
      VAProfileH264High               : VAEntrypointVLD
      VAProfileH264High               : VAEntrypointEncSlice
      VAProfileH264High               : VAEntrypointFEI
      VAProfileH264High               : VAEntrypointEncSliceLP
      VAProfileVC1Simple              : VAEntrypointVLD
      VAProfileVC1Main                : VAEntrypointVLD
      VAProfileVC1Advanced            : VAEntrypointVLD
      VAProfileJPEGBaseline           : VAEntrypointVLD
      VAProfileJPEGBaseline           : VAEntrypointEncPicture
      VAProfileH264ConstrainedBaseline: VAEntrypointVLD
      VAProfileH264ConstrainedBaseline: VAEntrypointEncSlice
      VAProfileH264ConstrainedBaseline: VAEntrypointFEI
      VAProfileH264ConstrainedBaseline: VAEntrypointEncSliceLP
      VAProfileVP8Version0_3          : VAEntrypointVLD
      VAProfileVP8Version0_3          : VAEntrypointEncSlice
      VAProfileHEVCMain               : VAEntrypointVLD
      VAProfileHEVCMain               : VAEntrypointEncSlice
      VAProfileHEVCMain               : VAEntrypointFEI
      VAProfileHEVCMain10             : VAEntrypointVLD
      VAProfileHEVCMain10             : VAEntrypointEncSlice
      VAProfileVP9Profile0            : VAEntrypointVLD
      VAProfileVP9Profile2            : VAEntrypointVLD

I'm using GStreamer 1.22.1 with the vah264dec element, but on this machine I get only about 16 fps for 720p (it works fine on other Celerons), while with the avdec_h264 software decoder element (FFmpeg) I get the full 25 fps.

intel_gpu_top does show that "Video" usage is non-zero with vah264dec, and zero with software decoding, so I assume it does in principle work..?

GStreamer logs while playing video with vah264dec:

0:00:13.954733102 16454 0x561480aa7c00 WARN            videodecoder gstvideodecoder.c:3668:gst_video_decoder_clip_and_push_buf:<vah264dec0> Dropping frame due to QoS. start:0:00:12.719919487 deadline:0:00:12.719919487 earliest_time:0:00:13.347737097
0:00:13.955002097 16454 0x561480aa7c00 WARN            videodecoder gstvideodecoder.c:3668:gst_video_decoder_clip_and_push_buf:<vah264dec0> Dropping frame due to QoS. start:0:00:12.759917944 deadline:0:00:12.759917944 earliest_time:0:00:13.347737097
0:00:13.961621624 16454 0x561480aa7c00 WARN            videodecoder gstvideodecoder.c:3668:gst_video_decoder_clip_and_push_buf:<vah264dec0> Dropping frame due to QoS. start:0:00:12.799916413 deadline:0:00:12.799916413 earliest_time:0:00:13.347737097
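As a rough aid for reading these warnings, the sketch below parses one "Dropping frame due to QoS" line and reports the gap between `earliest_time` and `deadline`. Treating that gap as "how far behind the decoder is" is an interpretation, not something the log states; the function names `to_ns` and `lateness_ms` are made up for this example.

```python
import re
from typing import Optional

# Matches the three timestamps in the warning text shown in the log above.
QOS_RE = re.compile(
    r"Dropping frame due to QoS\. "
    r"start:(?P<start>\S+) deadline:(?P<deadline>\S+) earliest_time:(?P<earliest>\S+)"
)


def to_ns(ts: str) -> int:
    """Convert an H:MM:SS.nnnnnnnnn timestamp to integer nanoseconds."""
    hours, minutes, seconds = ts.split(":")
    whole, frac = seconds.split(".")
    total_s = (int(hours) * 60 + int(minutes)) * 60 + int(whole)
    return total_s * 1_000_000_000 + int(frac.ljust(9, "0")[:9])


def lateness_ms(line: str) -> Optional[float]:
    """Return earliest_time - deadline in milliseconds, or None if no match."""
    m = QOS_RE.search(line)
    if m is None:
        return None
    return (to_ns(m["earliest"]) - to_ns(m["deadline"])) / 1e6


# First warning from the log above, shortened to the message part.
sample = (
    "<vah264dec0> Dropping frame due to QoS. start:0:00:12.719919487 "
    "deadline:0:00:12.719919487 earliest_time:0:00:13.347737097"
)
print(f"{lateness_ms(sample):.1f} ms behind")
```

For the three lines above this gives gaps of several hundred milliseconds, which is many frame intervals at 25 fps (successive `start` values are about 40 ms apart).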

I'm not really sure if I should report this issue here or to GStreamer though, so sorry if misjudged, though it seemed as if something's wrong with VA driver.

XinfengZhang commented 1 year ago

How about the media engine usage from intel_gpu_top? And what's the whole gst command line?

Talkless commented 1 year ago

This is what I see in intel_gpu_top: [image]

Where Viewer is our Qt application with GStreamer playback.

GST pipeline:

rtspsrc location=rtsp://... protocols=tcp latency=100 buffer-mode=slave ! queue max-size-buffers=0 ! rtph264depay ! h264parse ! vah264dec compliance=3 ! glupload ! glcolorconvert ! qmlglsink

Same issue with Dropping frame due to QoS if I use it via gst-launch and glimagesink in terminal.

Talkless commented 1 year ago

Looks like there is a similar performance issue on another computer with a Celeron J4125.

It renders 720p at about 18-20 fps (the original stream is 25 fps), and 1080p at only ~9 fps, while the software decoder can handle 1080p at the full 25 fps.

It has Debian 11 though, I can try installing 12.

Talkless commented 1 year ago

N4500 works fine if I boot Debian 11 by forcing GPU detection with i915.force_probe=4e55.

J3060 and I believe J1900 worked fine too.

Talkless commented 1 year ago

I've upgraded the J4125 machine to Debian Sid, and now it handles TWO video streams at 1080p at 25 fps.

I'll try to upgrade N4000 to Sid too.

Talkless commented 1 year ago

Just upgraded the N4000 to Sid too.

vainfo:

$ vainfo
libva info: VA-API version 1.19.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_18
libva info: va_openDriver() returns 0
vainfo: VA-API version: 1.19 (libva 2.12.0)
vainfo: Driver version: Intel iHD driver for Intel(R) Gen Graphics - 23.2.3 ()
vainfo: Supported profile and entrypoints
      VAProfileNone                   : VAEntrypointVideoProc
      VAProfileNone                   : VAEntrypointStats
      VAProfileMPEG2Simple            : VAEntrypointVLD
      VAProfileMPEG2Main              : VAEntrypointVLD
      VAProfileH264Main               : VAEntrypointVLD
      VAProfileH264Main               : VAEntrypointEncSlice
      VAProfileH264Main               : VAEntrypointFEI
      VAProfileH264Main               : VAEntrypointEncSliceLP
      VAProfileH264High               : VAEntrypointVLD
      VAProfileH264High               : VAEntrypointEncSlice
      VAProfileH264High               : VAEntrypointFEI
      VAProfileH264High               : VAEntrypointEncSliceLP
      VAProfileVC1Simple              : VAEntrypointVLD
      VAProfileVC1Main                : VAEntrypointVLD
      VAProfileVC1Advanced            : VAEntrypointVLD
      VAProfileJPEGBaseline           : VAEntrypointVLD
      VAProfileJPEGBaseline           : VAEntrypointEncPicture
      VAProfileH264ConstrainedBaseline: VAEntrypointVLD
      VAProfileH264ConstrainedBaseline: VAEntrypointEncSlice
      VAProfileH264ConstrainedBaseline: VAEntrypointFEI
      VAProfileH264ConstrainedBaseline: VAEntrypointEncSliceLP
      VAProfileVP8Version0_3          : VAEntrypointVLD
      VAProfileVP8Version0_3          : VAEntrypointEncSlice
      VAProfileHEVCMain               : VAEntrypointVLD
      VAProfileHEVCMain               : VAEntrypointEncSlice
      VAProfileHEVCMain               : VAEntrypointFEI
      VAProfileHEVCMain10             : VAEntrypointVLD
      VAProfileHEVCMain10             : VAEntrypointEncSlice
      VAProfileVP9Profile0            : VAEntrypointVLD
      VAProfileVP9Profile2            : VAEntrypointVLD

Sadly, the upgrade didn't help. The N4000 still manages only about 16-17 fps at 720p and 9 fps at 1080p.
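To put those frame rates in terms of time per frame, here is a small arithmetic sketch. The 25 fps target and the observed rates come from this thread; 16.5 fps is simply the midpoint of the reported 16-17 fps range, and `frame_time_ms` is a name chosen for this example.

```python
def frame_time_ms(fps: float) -> float:
    """Average time available (or spent) per frame at a given frame rate."""
    return 1000.0 / fps


budget = frame_time_ms(25)  # the stream's native rate

# Observed rates on the N4000; 16.5 is an assumed midpoint of "16-17 fps".
observed = {"720p": 16.5, "1080p": 9.0}

for label, fps in observed.items():
    spent = frame_time_ms(fps)
    print(f"{label}: {spent:.1f} ms per frame, budget {budget:.1f} ms")
```

Whatever the bottleneck is, each 1080p frame takes well over twice the 40 ms budget to get through the pipeline.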

XinfengZhang commented 1 year ago

From intel_gpu_top, the video utilization is 2.43%, so the video engine is almost idle. That means it is not a decode issue; it may be caused by something else. AFAIK, it can decode multiple sessions. My guess is that it is related to glcolorconvert. @xhaihao, could you help check the command line? I suspect it is not a suitable one.

xhaihao commented 1 year ago

@Talkless There should be a data copy between vah264dec and glupload. Could you check which caps are used? You can specify video/x-raw(memory:DMABuf) if you want to avoid the data copy.
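For a sense of scale, the sketch below estimates how many raw bytes per second such a copy would have to move. It assumes the decoder outputs NV12 (12 bits per pixel), which the thread does not confirm, and it only counts bytes; it says nothing about how expensive each byte is to read back from video memory.

```python
def raw_bytes_per_second(width: int, height: int, fps: int,
                         bits_per_pixel: int = 12) -> int:
    """Bytes of raw video per second; 12 bits/pixel is an NV12 assumption."""
    return width * height * bits_per_pixel // 8 * fps


for label, (w, h) in {"720p": (1280, 720), "1080p": (1920, 1080)}.items():
    rate = raw_bytes_per_second(w, h, fps=25)
    print(f"{label}: {rate / 1e6:.2f} MB/s of raw frames at 25 fps")
```

The function name and the default pixel depth are illustrative choices, not values taken from GStreamer or the driver.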

Talkless commented 1 year ago

~If it's data copy issue, why it disappears for J4125 if I upgrade to Debian Sid while using same my own built GStreamer 1.22.1 binaries (I don't use distribution GStreamer packages)?~

~My hypothesis is that newer va-api drivers fixed it (I'm using non-free variants in Debian, such as i965-va-driver-shaders and intel-media-va-driver-non-free).~

I'll try to fiddle with caps and will try to render pipeline visualization to see what it's doing though, thanks for the hints.

EDIT: I take back what I said about the J4125 working on Sid. I just upgraded from 12 to Sid again and the performance is not fixed. Not sure why I was so sure it was working OK. Sorry, I need to do more research.

Talkless commented 1 year ago

Now that's a discovery for me:

[image]

Even though vah264dec and glupload both support DMABuf, it is not used by default; video/x-raw is used instead. So I guess that on a fast enough system I simply did not notice the copying penalty, and you're right. I just need to specify the caps correctly, because so far I have failed to make it work...

Talkless commented 1 year ago

If I explicitly use the "slow" version, like this: ... vah264dec ! video/x-raw ! glimagesink, it works as before. But if I specify video/x-raw(memory:DMABuf) instead, it fails with a rather irrelevant error message, "failed delayed linking some pad of GstQTDemux named qtdemux0 to some pad of GstH264Parse named h264parse0", using this test pipeline:

$ ./gst-launch-1.0  curlhttpsrc location="https://ia800201.us.archive.org/12/items/BigBuckBunny_328/BigBuckBunny_512kb.mp4" ! qtdemux! h264parse ! queue ! vah264dec ! "video/x-raw(memory:DMABuf)" ! glimagesink 
Setting pipeline to PAUSED ...
Pipeline is PREROLLING ...
Got context from element 'sink': gst.gl.GLDisplay=context, gst.gl.GLDisplay=(GstGLDisplay)"\(GstGLDisplayX11\)\ gldisplayx11-0";
Got context from element 'vah264dec0': gst.va.display.handle=context, gst-display=(GstObject)"\(GstVaDisplayDrm\)\ vadisplaydrm1", description=(string)"Intel\ iHD\ driver\ for\ Intel\(R\)\ Gen\ Graphics\ -\ 22.2.1\ \(\)", path=(string)/dev/dri/renderD128;
ERROR: from element /GstPipeline:pipeline0/GstCurlHttpSrc:curlhttpsrc0: Internal data stream error.
Additional debug info:
../src/libs/gst/base/gstbasesrc.c(3132): gst_base_src_loop (): /GstPipeline:pipeline0/GstCurlHttpSrc:curlhttpsrc0:
streaming stopped, reason not-linked (-1)
ERROR: pipeline doesn't want to preroll.
WARNING: from element /GstPipeline:pipeline0/GstQTDemux:qtdemux0: Delayed linking failed.
Additional debug info:
gst/parse/grammar.y(853): gst_parse_no_more_pads (): /GstPipeline:pipeline0/GstQTDemux:qtdemux0:
failed delayed linking some pad of GstQTDemux named qtdemux0 to some pad of GstH264Parse named h264parse0
Setting pipeline to NULL ...
Freeing pipeline ...

Maybe I need to enable DMABuf support somehow on my systems..?

Talkless commented 1 year ago

So the issue was that GStreamer does not support DMABuf with GLX. It works with EGL if I set the GST_GL_PLATFORM=egl environment variable!

For example, using:

env GST_GL_PLATFORM=egl ./gst-launch-1.0 filesrc location=bbb_sunflower_1080p_30fps_normal.mp4 ! qtdemux! h264parse ! queue ! vah264dec ! glimagesink

I get flawless playback on the N4000 with just ~6% CPU, whereas if I go back to GLX I get ~100% CPU with lots of dropped frames. video/x-raw(memory:DMABuf) is ignored with GLX.
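When the player is started from a script rather than a shell, the same fix amounts to setting `GST_GL_PLATFORM=egl` in the child process environment. The sketch below builds the environment and argument list for the pipeline quoted above; the helper names `egl_env` and `launch_argv` are hypothetical, and the launcher path and file name are the ones from this comment.

```python
import os
import shlex


def egl_env(base=None):
    """Copy an environment mapping and force GStreamer's GL platform to EGL."""
    env = dict(os.environ if base is None else base)
    env["GST_GL_PLATFORM"] = "egl"
    return env


def launch_argv(video_path, launcher="./gst-launch-1.0"):
    """Build an argv list for the test pipeline used in this thread."""
    pipeline = (
        f"filesrc location={shlex.quote(video_path)} ! qtdemux ! h264parse "
        "! queue ! vah264dec ! glimagesink"
    )
    return [launcher] + shlex.split(pipeline)


argv = launch_argv("bbb_sunflower_1080p_30fps_normal.mp4")
print(" ".join(argv))

# To actually run it (requires GStreamer with the va and gl plugins installed):
#   import subprocess
#   subprocess.run(argv, env=egl_env(), check=True)
```

Passing the pipeline as a list avoids shell quoting issues with `!`, and copying the environment keeps the change local to the child process.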

Closing as invalid.

XinfengZhang commented 1 year ago

Cool, how about the intel_gpu_top result? I suppose you can now play back several bitstreams simultaneously.

Talkless commented 1 year ago

@XinfengZhang yeah it could play 4 streams. Well not flawlessly (with some rare stuttering), maybe 3 x 1080p would be more realistic/practical. Pretty good result for such a tiny "hdmi stick" pc like this:

[image]

intel_gpu_top with four players:

[image]

XinfengZhang commented 1 year ago

Yes, the video engine still has room to decode more streams, but the render engine is fully utilized.

Talkless commented 1 year ago

@XinfengZhang It might work better without the whole GNOME desktop, in Wayland, etc. But still, great stuff. Thanks!