Performance - sample_multi_transcode Significantly Faster for Same Operation

arbishop commented 6 years ago

While evaluating the transcoding performance of an Atom E3950, I found a huge discrepancy between sample_multi_transcode and gstreamer.

I ran scaled tests (several transcodes at once) with sample_multi_transcode and they performed well (faster than real time). Unfortunately, gstreamer did not perform nearly as well when ramped up. Below is the simplest example I could think of; the gstreamer run takes about 223% of the sample_multi_transcode transcode time (42.6 s versus 19.1 s elapsed).

I'm not sure where to go from here, perhaps someone has advice on how to make gstreamer-media-SDK a viable option.

One thing that I noticed is that gstreamer-media-SDK doesn't seem to call the "JoinSessions" API while sample_multi_transcode does. I'm not sure if architecturally gstreamer doesn't need to. Another is that gstreamer doesn't seem to utilize as much of the GPU.

It's also quite possible that I'm doing something wrong with the pipeline. Are the results below expected or surprising?

Intel: sample_multi_transcode

$> sample_multi_transcode -par versus.par

-i::mpeg2 /tmp/transcode-test/knsd1080i.es -o::sink -async 4 -u 7 -deinterlace::BOB -w 1280 -h 720  -join
-o::h264 /tmp/transcode-test/versus.h264 -i::source -async 4 -w 1280 -h 720 -b 2000 -u 7 -join

libva info: VA-API version 0.99.0
libva info: va_getDriverName() returns 0
libva info: User requested driver 'iHD'
libva info: Trying to open /usr/lib/iHD_drv_video.so
libva info: Found init function __vaDriverInit_0_32
libva info: va_openDriver() returns 0
Multi Transcoding Sample Version 8.1.24.0

Par file is: versus.par

MFX HARDWARE Session 0 API ver 1.24 parameters: 
Input  video: MPG2
Output video: To child session

MFX HARDWARE Session 1 API ver 1.24 parameters: 
Input  video: From parent session
Output video: AVC 

Pipeline surfaces number (EncPool): 7
Pipeline surfaces number (DecPool): 7
Session 0 was joined with other sessions
Session 1 was joined with other sessions

Transcoding started
..................................
Transcoding finished

Common transcoding time is 19.0431 sec
-------------------------------------------------------------------------------
*** session 0 PASSED (MFX_ERR_NONE) 19.0208 sec, 3315 frames
-i::mpeg2 /tmp/transcode-test/knsd1080i.es -o::sink -async 4 -u 7 -deinterlace::BOB -w 1280 -h 720 -join 

*** session 1 PASSED (MFX_ERR_NONE) 19.0428 sec, 3315 frames
-o::h264 /tmp/transcode-test/versus.h264 -i::source -async 4 -w 1280 -h 720 -b 2000 -u 7 -join 

-------------------------------------------------------------------------------

The test PASSED
user: 6.589000s, sys: 5.086000s, elapsed: 19.115402s, CPU: 61.1%, GPU: 47.7%

GStreamer: gst-launch-1.0

gst-launch-1.0 -e \
    filesrc location=$DIR/knsd1080i.es \
    ! mpegvideoparse \
    ! mfxmpeg2dec async-depth=4 \
    ! mfxvpp width=1280 height=720 async-depth=4 deinterlace-mode=1 \
    ! mfxh264enc rate-control=1 async-depth=4 preset=7 bitrate=2000 \
    ! filesink location=$DIR/versus-gst.h264

Setting pipeline to PAUSED ...
libva info: VA-API version 0.99.0
libva info: va_getDriverName() returns 0
libva info: User requested driver 'iHD'
libva info: Trying to open /usr/lib/iHD_drv_video.so
libva info: Found init function __vaDriverInit_0_32
libva info: va_openDriver() returns 0
Pipeline is PREROLLING ...
Got context from element 'mfxench264-0': gst.mfx.Aggregator=context, gst.mfx.Aggregator=(GstMfxTaskAggregator)NULL;
Pipeline is PREROLLED ...
Setting pipeline to PLAYING ...
New clock: GstSystemClock
Got EOS from element "pipeline0".
Execution ended after 0:00:42.336294715
Setting pipeline to PAUSED ...
Setting pipeline to READY ...
Setting pipeline to NULL ...
Freeing pipeline ...
user: 7.047000s, sys: 4.932000s, elapsed: 42.625926s, CPU: 28.1%, GPU: 16.8%

Results

Intel user: 6.589000s, sys: 5.086000s, elapsed: 19.115402s, CPU: 61.1%, GPU: 47.7%
GST user: 7.047000s, sys: 4.932000s, elapsed: 42.625926s, CPU: 28.1%, GPU: 16.8%
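
For reference, the delta quoted above can be reproduced from the two summary lines. This is a minimal sketch in Python; `parse_summary` is a hypothetical helper (not part of any tool used here) and assumes the `key: value` layout shown in the output above.

```python
import re

def parse_summary(line):
    """Parse 'user: 6.589000s, ..., CPU: 61.1%, GPU: 47.7%' into a dict of floats."""
    fields = {}
    for key, value in re.findall(r"(\w+):\s*([\d.]+)", line):
        fields[key] = float(value)
    return fields

intel = parse_summary("user: 6.589000s, sys: 5.086000s, elapsed: 19.115402s, CPU: 61.1%, GPU: 47.7%")
gst = parse_summary("user: 7.047000s, sys: 4.932000s, elapsed: 42.625926s, CPU: 28.1%, GPU: 16.8%")

# Wall-clock ratio between the two runs
ratio = gst["elapsed"] / intel["elapsed"]

# Total CPU time (user + sys) consumed by each run
intel_cpu = intel["user"] + intel["sys"]
gst_cpu = gst["user"] + gst["sys"]

print(f"elapsed ratio: {ratio:.2f}x")
print(f"CPU seconds: intel={intel_cpu:.3f}, gst={gst_cpu:.3f}")
```

The CPU seconds come out nearly equal for both runs while the elapsed time more than doubles. That may suggest the gstreamer pipeline spends more time waiting rather than doing extra work, though these numbers alone do not prove it.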

ishmael1985 commented 6 years ago

Theoretically, both gst-mfx and sample_multi_transcode should perform the same. gst-mfx uses joined sessions as well, but only among MFX sessions created within a single GStreamer pipeline instance. There are also IO differences to consider: sample_multi_transcode reads fixed-size chunks, while filesrc and the parser elements read data in variable-size chunks, which affects performance. You could try manually invoking multiple instances of the encode pipeline you used and check the GPU utilization and time. I think the performance difference would be smaller, even though in that case gst-mfx does not use join operations.

arbishop commented 6 years ago

I'm still not clear on whether gst-mfx uses the JoinSessions API; it seems like you said above that it both does and doesn't.

I don't know if joining has anything to do with the performance delta; it's the only item I had to comment on.

I had tried making the gst-mfx filesrc blocksize massive. Perhaps there is a fixed size property to be played with.

Before coming here for advice, I had run multiple pipelines as well as a single monolithic pipeline for 6->12 video transcodes. The GPU appeared to peak around 66% whereas sample_multi_transcode was nearing 95%. The case above was purposefully simplified to gain traction here. The performance delta doesn't go away when scaled.

Are there any known performance hot spots you could point me to, in case I find time for a gst-mfx deep dive?

ishmael1985 commented 6 years ago

Can you try a pipeline similar to this:

gst-launch-1.0 filesrc location=/path/to/trailer_1080p.mov ! qtdemux ! h264parse ! tee name=t \
! queue ! mfxdecode ! mfxh264enc ! fakesink sync=false t. \
! queue ! mfxdecode ! mfxh264enc ! fakesink sync=false t. \
! queue ! mfxdecode ! mfxh264enc ! fakesink sync=false t. \
! queue ! mfxdecode ! mfxh264enc ! fakesink sync=false t. \
! queue ! mfxdecode ! mfxh264enc ! fakesink sync=false t. \
! queue ! mfxdecode ! mfxh264enc ! fakesink sync=false t. \
! queue ! mfxdecode ! mfxh264enc ! fakesink sync=false t. \
! queue ! mfxdecode ! mfxh264enc ! fakesink sync=false t. \
! queue ! mfxdecode ! mfxh264enc ! fakesink sync=false t. \
! queue ! mfxdecode ! mfxh264enc ! fakesink sync=false t. \
! queue ! mfxdecode ! mfxh264enc ! fakesink sync=false t. \
! queue ! mfxdecode ! mfxh264enc ! fakesink sync=false

In this pipeline, all MFX sessions are joined. If you run separate GStreamer MFX pipelines, the sessions will of course not be joined.

I tried it on a Kaby Lake Ubuntu 16.04 system with kernel 4.16 installed via ukuu, and I do get 100% VIDEO usage with metrics monitor (VIDEO_E usage reports 0%, which is weird). The encode finishes in around 24 seconds for the 12 transcoding instances of a 30 second 1080p H264 file.
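
The twelve-branch command above is repetitive, so it may be easier to generate it when trying other branch counts. A minimal sketch, assuming Python is at hand; `build_tee_pipeline` is a hypothetical helper, and the element names are simply the ones used in the command above.

```python
def build_tee_pipeline(location, branches=12):
    """Return a gst-launch-1.0 pipeline description with N decode+encode branches on one tee."""
    head = f"filesrc location={location} ! qtdemux ! h264parse ! tee name=t"
    branch = "! queue ! mfxdecode ! mfxh264enc ! fakesink sync=false"
    # Every branch after the first is attached to the tee again via "t."
    return head + " " + " t. ".join([branch] * branches)

pipeline = build_tee_pipeline("/path/to/trailer_1080p.mov", branches=12)
print("gst-launch-1.0 " + pipeline)
```

Because all branches live in one pipeline description, this keeps the single-pipeline layout that the joined-session behavior described above relies on.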

ishmael1985 commented 6 years ago

Btw, you get faster transcode performance by building the plugins from my repo - https://github.com/ishmael1985/gstreamer-media-SDK

ishmael1985 commented 6 years ago

Hmm, intel_gpu_time tells a different story for the above pipeline:

user: 9.142866s, sys: 3.532654s, elapsed: 24.946005s, CPU: 50.8%, GPU: 76.4%

ishmael1985 commented 6 years ago

Here are a few more pipelines demonstrating the variable GPU utilization. This one runs two HEVC transcodes from a single tee:

$> sudo intel_gpu_time gst-launch-1.0 filesrc location=/path/to/Videos/trailer_1080p.mov ! qtdemux ! h264parse ! tee name=t ! queue ! mfxdecode ! mfxhevcenc ! fakesink sync=false t. ! queue ! mfxdecode ! mfxhevcenc ! fakesink sync=false
libva info: VA-API version 1.1.0
libva info: va_getDriverName() returns 0
libva info: User requested driver 'iHD'
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_1
libva info: va_openDriver() returns 0
Setting pipeline to PAUSED ...
Pipeline is PREROLLING ...
Got context from element 'mfxench265-1': gst.mfx.Aggregator=context, gst.mfx.Aggregator=(GstMfxTaskAggregator)"\(GstMfxTaskAggregator\)\ mfxtaskaggregator0";
libva info: VA-API version 1.1.0
libva info: va_getDriverName() returns 0
libva info: User requested driver 'iHD'
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_1
libva info: va_openDriver() returns 0
Pipeline is PREROLLED ...
Setting pipeline to PLAYING ...
New clock: GstSystemClock
Got EOS from element "pipeline0".
Execution ended after 0:00:14.718858253
Setting pipeline to PAUSED ...
Setting pipeline to READY ...
Setting pipeline to NULL ...
Freeing pipeline ...
user: 1.538721s, sys: 0.566249s, elapsed: 14.794303s, CPU: 14.2%, GPU: 99.5%

GPU utilization is 99.5%. But if I replace mfxhevcenc with mfxh264enc, it goes down to around 75%:

$> sudo intel_gpu_time gst-launch-1.0 filesrc location=/path/to/trailer_1080p.mov ! qtdemux ! h264parse ! tee name=t ! queue ! mfxdecode ! mfxh264enc ! fakesink sync=false t. ! queue ! mfxdecode ! mfxh264enc ! fakesink sync=false
libva info: VA-API version 1.1.0
libva info: va_getDriverName() returns 0
libva info: User requested driver 'iHD'
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_1
libva info: va_openDriver() returns 0
Setting pipeline to PAUSED ...
Pipeline is PREROLLING ...
Got context from element 'mfxench264-1': gst.mfx.Aggregator=context, gst.mfx.Aggregator=(GstMfxTaskAggregator)"\(GstMfxTaskAggregator\)\ mfxtaskaggregator0";
libva info: VA-API version 1.1.0
libva info: va_getDriverName() returns 0
libva info: User requested driver 'iHD'
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_1
libva info: va_openDriver() returns 0
Pipeline is PREROLLED ...
Setting pipeline to PLAYING ...
New clock: GstSystemClock
Got EOS from element "pipeline0".
Execution ended after 0:00:04.107443319
Setting pipeline to PAUSED ...
Setting pipeline to READY ...
Setting pipeline to NULL ...
Freeing pipeline ...
user: 1.572034s, sys: 0.595156s, elapsed: 4.196368s, CPU: 51.6%, GPU: 75.3%

I used the latest open-source Media SDK with the latest Intel media driver and libva, in conjunction with kernel 4.16. There are a few variables here that affect GPU utilization, but gst-mfx should certainly be able to achieve full GPU utilization under the right circumstances. What those circumstances are, we'll have to explore some more.

ishmael1985 commented 6 years ago

Another demonstration of variable GPU utilization, this time through using presets in mfxh264enc:

$> sudo intel_gpu_time gst-launch-1.0 filesrc location=/path/to/trailer_1080p.mov ! qtdemux ! h264parse ! tee name=t ! queue ! mfxdecode ! mfxh264enc preset=veryfast ! fakesink sync=false t. ! queue ! mfxdecode ! mfxh264enc preset=veryfast ! fakesink sync=false
libva info: VA-API version 1.1.0
libva info: va_getDriverName() returns 0
libva info: User requested driver 'iHD'
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_1
libva info: va_openDriver() returns 0
Setting pipeline to PAUSED ...
Pipeline is PREROLLING ...
Got context from element 'mfxench264-1': gst.mfx.Aggregator=context, gst.mfx.Aggregator=(GstMfxTaskAggregator)"\(GstMfxTaskAggregator\)\ mfxtaskaggregator0";
libva info: VA-API version 1.1.0
libva info: va_getDriverName() returns 0
libva info: User requested driver 'iHD'
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_1
libva info: va_openDriver() returns 0
Pipeline is PREROLLED ...
Setting pipeline to PLAYING ...
New clock: GstSystemClock
Got EOS from element "pipeline0".
Execution ended after 0:00:04.074954585
Setting pipeline to PAUSED ...
Setting pipeline to READY ...
Setting pipeline to NULL ...
Freeing pipeline ...
user: 1.573834s, sys: 0.571114s, elapsed: 4.161151s, CPU: 51.5%, GPU: 55.6%

GPU utilization is on the low side at around 55% when setting preset=veryfast (TU=7 or speed). But if preset=veryslow (TU=1 or quality), GPU utilization goes up to around 89%.

$> sudo intel_gpu_time gst-launch-1.0 filesrc location=/path/to/trailer_1080p.mov ! qtdemux ! h264parse ! tee name=t ! queue ! mfxdecode ! mfxh264enc preset=veryslow ! fakesink sync=false t. ! queue ! mfxdecode ! mfxh264enc preset=veryslow ! fakesink sync=false
libva info: VA-API version 1.1.0
libva info: va_getDriverName() returns 0
libva info: User requested driver 'iHD'
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_1
libva info: va_openDriver() returns 0
Setting pipeline to PAUSED ...
Pipeline is PREROLLING ...
Got context from element 'mfxench264-1': gst.mfx.Aggregator=context, gst.mfx.Aggregator=(GstMfxTaskAggregator)"\(GstMfxTaskAggregator\)\ mfxtaskaggregator0";
libva info: VA-API version 1.1.0
libva info: va_getDriverName() returns 0
libva info: User requested driver 'iHD'
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_1
libva info: va_openDriver() returns 0
Pipeline is PREROLLED ...
Setting pipeline to PLAYING ...
New clock: GstSystemClock
Got EOS from element "pipeline0".
Execution ended after 0:00:04.922317834
Setting pipeline to PAUSED ...
Setting pipeline to READY ...
Setting pipeline to NULL ...
Freeing pipeline ...
user: 1.560836s, sys: 0.665900s, elapsed: 5.011171s, CPU: 44.4%, GPU: 89.4%

I think you need to experiment a bit more with the settings in order to maximize the performance / quality tradeoff for your desired use case.
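
To put the three H264 runs above side by side, here is a small Python sketch. The numbers are copied from the intel_gpu_time summary lines; the `unset` label for the run without a preset argument is only a name chosen for this illustration.

```python
# preset -> (elapsed seconds, GPU %) from the intel_gpu_time lines above
runs = {
    "unset": (4.196368, 75.3),
    "veryfast": (4.161151, 55.6),
    "veryslow": (5.011171, 89.4),
}

baseline = runs["veryfast"][0]
for preset, (elapsed, gpu) in runs.items():
    slowdown = elapsed / baseline
    print(f"{preset:9s} elapsed={elapsed:.2f}s  x{slowdown:.2f} vs veryfast  GPU={gpu:.1f}%")

# How much longer veryslow takes relative to veryfast
slowdown_veryslow = runs["veryslow"][0] / baseline
```

On this clip, veryslow takes roughly 1.2 times as long as veryfast while GPU utilization rises by more than 30 percentage points, so GPU utilization on its own looks like a poor proxy for transcode time.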

arbishop commented 6 years ago

I posted a single comparison to highlight a significant performance delta. I just thought the GPU utilization was odd, it's not the metric I'm tuning for. I'm consuming live streams so only the transcoding time is important.

Intel user: 6.589000s, sys: 5.086000s, elapsed: 19.115402s, CPU: 61.1%, GPU: 47.7%
GST user: 7.047000s, sys: 4.932000s, elapsed: 42.625926s, CPU: 28.1%, GPU: 16.8%
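
Since live streams are the target, throughput in frames per second may be a more direct yardstick than GPU utilization. This is a rough Python sketch under stated assumptions: the 3315-frame count comes from the sample_multi_transcode output and is assumed to apply to the gst-launch run as well (gst-launch does not report it), and `STREAM_FPS` is a hypothetical live-stream frame rate, not a value from these tests.

```python
FRAMES = 3315  # reported by sample_multi_transcode; ASSUMED identical for the gst-launch run

def throughput_fps(frames, seconds):
    """Average frames processed per second of wall-clock time."""
    return frames / seconds

def realtime_streams(fps, stream_fps):
    """Whole number of live streams at stream_fps that a given throughput could keep up with."""
    return int(fps // stream_fps)

intel_fps = throughput_fps(FRAMES, 19.0431)     # "Common transcoding time"
gst_fps = throughput_fps(FRAMES, 42.336294715)  # "Execution ended after"

STREAM_FPS = 30.0  # HYPOTHETICAL frame rate of one live stream

print(f"sample_multi_transcode: {intel_fps:.1f} fps, "
      f"{realtime_streams(intel_fps, STREAM_FPS)} stream(s) at {STREAM_FPS:g} fps")
print(f"gst-launch-1.0:         {gst_fps:.1f} fps, "
      f"{realtime_streams(gst_fps, STREAM_FPS)} stream(s) at {STREAM_FPS:g} fps")
```

This is only a sanity check derived from a single-transcode run. Throughput with several concurrent transcodes can differ, so it is not a capacity measurement.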

I would expect GPU utilization to be a function of input codec (mpeg2 < avc < hevc) and codec parameters (Q, lookahead, preset). I have only done tests with preset=7 since time is more important than quality for me. I would be surprised if lower presets didn't always achieve higher GPU utilization. I'm also not doubting that the MFX API can achieve close to 100% GPU utilization in some scenarios.

I experimented massively prior to raising this ticket. I ran tests using sample_multi_transcode, scaled up, and then switched over to gst-mfx, where I found I couldn't make timing any longer.

I can look into using your branch. Can you elaborate on why yours would be faster than master here?

ishmael1985 commented 6 years ago

The gst-mfx plugins in my branch (https://github.com/ishmael1985/gstreamer-media-SDK) are faster than current master here because of the optimized allocation and sharing of surfaces between MFX tasks. For example, with the current master, a transcoding pipeline (MFX decode + MFX VPP + MFX encode) would allocate 3 surface pools, incurring overallocation of surfaces, but with my branch it would only take 2 surface pools. For this reason, the performance difference could be noticeable when your application invokes multiple MFX transcoding pipelines. Also, as you may have noticed, for H264 the transcode times are not much different between the speed and quality presets. You should experiment a little more with the plugins and then decide on the parameters that meet your target requirements.

ishmael1985 commented 6 years ago

@arbishop , can we close this? I hope that you are satisfied with the answers for now. Btw, some encoding optimizations have been introduced in my master branch, you may want to try it out - https://github.com/ishmael1985/gstreamer-media-SDK

arbishop commented 6 years ago

You can close this as "won't fix" if you'd like. This ticket wasn't really opened to ask a question but to raise an issue with performance.

If your branch can solve this, or perhaps bring it more in line with sample_multi_transcode (or perhaps ffmpeg?), then I would recommend opening a pull request with your patchset here.

It's unlikely I will circle back to this in the near term to provide more analytics or commits myself. However, that time will come eventually.

ishmael1985 commented 6 years ago

I guess we'll leave this as is for now; I will keep you posted on updates.

intel / gstreamer-media-SDK

Performance - sample_multi_transcode Significantly Faster for Same Operation #97
