intel / libvpl

Intel® Video Processing Library (Intel® VPL) API, dispatcher, and examples
https://intel.github.io/libvpl/
MIT License

Avoid repeated initialization of d3d9 or d3d11 devices to reduce mfxinit call time #137

Closed xelement closed 1 month ago

xelement commented 1 month ago

Issue

Calling MFXInit takes too long. In our production environment we have observed single MFXInit calls taking more than ten seconds.

I reviewed the libvpl code carefully and added a lot of test code. The following areas are time-consuming:

1. Repeated reinitialization of d3d devices

Every call to MFX::MFXLibraryIterator::Init reinitializes the d3d device. This is expensive, especially for d3d9, where even a single device creation is slow, and it is the main contributor to the overall cost. The whole MFXInit flow appears to have a lot of room for optimization, but reworking all of it would be quite complex. Adjusting only the d3d initialization first already reduces the MFXInit call time greatly.

2. MFXVideoCORE_QueryPlatform

Without access to the source code, it is unclear why this function is slow. Perhaps the call can be skipped: since only one libmfxhw32/64.dll is found, if hardware encoding and decoding are not supported, won't the final MFXInit call fail anyway?

Why do we need to optimize the time consumption of mfxInit

  1. In many RTC scenarios, slow MFX initialization delays the whole processing pipeline. For example, in a 1-to-1 video call that uses MFX for encoding and decoding, both parties have to wait a long time before they can see each other.

  2. Slow MFX initialization is hard to work around and requires a lot of code to accommodate. For example, we have to run initialization on a separate thread, and any other module that depends on MFX being initialized has to wait for that thread to finish.
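The threading workaround described in point 2 can be sketched roughly as follows. This is only an illustration: `SlowInit` is a hypothetical stand-in for the slow initialization call and `AsyncInit` is an assumed helper name; neither comes from libvpl.

```cpp
#include <chrono>
#include <future>
#include <thread>

// Hypothetical stand-in for a slow MFXInit-style call; returns 0 on success.
inline int SlowInit(std::chrono::milliseconds delay) {
    std::this_thread::sleep_for(delay);
    return 0;
}

// Hypothetical helper: starts initialization on a background thread and lets
// dependent modules block until the result is available.
class AsyncInit {
public:
    explicit AsyncInit(std::chrono::milliseconds delay)
        : result_(std::async(std::launch::async, SlowInit, delay).share()) {}

    // Blocks until initialization has finished, then returns its status.
    // Can be called repeatedly because the future is shared.
    int Wait() const { return result_.get(); }

    // Non-blocking check of whether initialization has completed.
    bool Ready() const {
        return result_.wait_for(std::chrono::seconds(0)) ==
               std::future_status::ready;
    }

private:
    std::shared_future<int> result_;
};
```

Every module that depends on the session has to go through `Wait()`, which is exactly the extra coordination code the point above complains about.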

Solution

1. All library iterator processing uses a single 'DXVA2Device'

How:

step 1: Optimize the DXVA2Device class
  1.1 Add two protected members, 'D3D9Device m_d3d9Device' and 'DXGI1Device m_dxgi1Device', for the 'InitD3D9' and 'InitDXGI1' functions
  1.2 For d3d9, dynamically call 'Direct3DCreate9' only once, while m_pD3D9 is not yet initialized
  1.3 For dxgi, do the same as for d3d9

step 2: Since MFXLibraryIterator needs a 'DXVA2Device' instance to do some of its work, pass one in as a constructor parameter

step 3: In function "MFXInitEx", add a variable 'DXVA2Device dxvaDevice', which is used by the 'libIterator' object and passed to 'MFX::SelectImplementationType' as a parameter

step 4: This optimization applies only to the legacy MFXInit path. To avoid affecting the VPL functions, I added a function 'SelectImplementationType2' with a C-style name that is used only by 'mfx_dispatcher_vpl_msdk.cpp'.
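The idea behind steps 1 to 3 can be illustrated with a minimal sketch. It does not use the real dispatcher classes: `SharedDevice`, `LibraryIterator`, and `InitAllCandidates` are hypothetical stand-ins for 'DXVA2Device', 'MFXLibraryIterator', and 'MFXInitEx', and a counter replaces the expensive Direct3D call.

```cpp
#include <cstddef>

// Hypothetical stand-in for DXVA2Device. It counts how many times the
// expensive creation step actually runs.
class SharedDevice {
public:
    // Performs the costly creation on first use only; later calls reuse it.
    bool EnsureInit() {
        if (!initialized_) {
            ++create_count_;  // stands in for Direct3DCreate9 / DXGI setup
            initialized_ = true;
        }
        return initialized_;
    }

    std::size_t CreateCount() const { return create_count_; }

private:
    bool initialized_ = false;
    std::size_t create_count_ = 0;
};

// Hypothetical iterator that borrows a device through its constructor
// instead of creating its own on every Init call.
class LibraryIterator {
public:
    explicit LibraryIterator(SharedDevice& device) : device_(device) {}
    bool Init() { return device_.EnsureInit(); }

private:
    SharedDevice& device_;
};

// Hypothetical MFXInitEx-style routine: one device instance is declared up
// front and shared by every iterator created while probing candidates.
inline std::size_t InitAllCandidates(std::size_t candidates) {
    SharedDevice device;
    for (std::size_t i = 0; i < candidates; ++i) {
        LibraryIterator it(device);
        it.Init();
    }
    return device.CreateCount();
}
```

With this arrangement, probing four candidates triggers one device creation rather than four, which is the saving the steps above aim for.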

2. New VPL interface also has this problem

We do not use the new VPL interface. To avoid affecting VPL functionality, this optimization does not touch the VPL interface. Once we have migrated to the VPL interface as a whole, we will port the modifications to the VPL interface code.

How Tested

1. Local test

I wrote a test case that calls MFXInit with different combinations of IMPL and VIA and measures the time spent in each call. I ran the old and the optimized versions multiple times on my device. Below are the comparison results from two randomly chosen runs; sts is the status returned by MFXInit, and cost is the time spent in the MFXInit call. The optimized version significantly reduces the time spent in MFXInit.

My device: CPU: Intel(R) Core(TM) i9-14900HX, Intel driver version: 31.0.101.4577
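A timing harness of this kind might look like the sketch below. It is not the author's test case: `TimedResult`, `TimeCall`, and `RunMatrix` are hypothetical names, and the `init` callable is a placeholder where a real test would invoke the initialization function with the chosen implementation and acceleration settings.

```cpp
#include <chrono>
#include <functional>
#include <string>
#include <vector>

// One measurement: which combination ran, what it returned, how long it took.
struct TimedResult {
    std::string label;
    int sts;
    double cost_ms;
};

// Runs one initialization callable and records its status and wall-clock time.
inline TimedResult TimeCall(const std::string& label,
                            const std::function<int()>& init) {
    const auto start = std::chrono::steady_clock::now();
    const int sts = init();
    const auto end = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> cost = end - start;
    return TimedResult{label, sts, cost.count()};
}

// Times every (impl, via) combination. The init callable is a placeholder
// for the real initialization call under test.
inline std::vector<TimedResult> RunMatrix(
    const std::vector<std::string>& impls,
    const std::vector<std::string>& vias,
    const std::function<int(const std::string&, const std::string&)>& init) {
    std::vector<TimedResult> results;
    for (const auto& impl : impls) {
        for (const auto& via : vias) {
            results.push_back(TimeCall(
                impl + "|" + via, [&] { return init(impl, via); }));
        }
    }
    return results;
}
```

Using `std::chrono::steady_clock` keeps the measurement unaffected by system clock adjustments during a run.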

1.1 Optimized version: run 1 (image), run 2 (image)

1.2 Old version: run 1 (image), run 2 (image)

2. The effectiveness of our app verified in the production environment

This modification has been validated in production through our app, where it has been running for over 4 months on more than 10 million devices. It runs well there, and we have received no reports of large-scale problems.

mav-intel commented 1 month ago

Hi @xelement, we do not plan to accept changes to deprecated APIs. Of course, you are welcome to use them in your own implementation.

jonrecker commented 1 month ago

As noted in the programming guide, MFXInit() and MFXInitEx() have been deprecated, and applications should instead switch to MFXLoad()/MFXCreateSession() for initialization. For fastest initialization, MFXCreateSession() also provides a "low latency" option which skips the device capabilities query. An example of enabling this path is the "-f" option in the vpl-timing sample. This is also the default option in the legacy sample tools (i.e. -dispatcher:fullsearch vs. -dispatcher:lowLatency switches). Session creation time with the low latency option is usually in the tens of milliseconds.

xelement commented 1 month ago

Thank you for your replies. There are several reasons why we are cautious about upgrading to libvpl: in past years, every mediasdk update brought many crash reports. That said, libvpl is well designed, and we will try to upgrade to it in the next few months.