microsoft / DirectX-Graphics-Samples

This repo contains the DirectX Graphics samples that demonstrate how to build graphics intensive applications on Windows.
MIT License

Direct3D 12 Motion Estimation use case (resources preparation onboard - pixelshift) #863

Open DTL2020 opened 3 months ago

DTL2020 commented 3 months ago

Some time ago I made use of the DX12 ME API for simple motion-vector search (on two uploaded frames). It is now an option in the mvtools open-source motion estimation library for Avisynth. The whole implementation (resource initialization and per-frame-pair processing) is in the file https://github.com/DTL2020/mvtools/blob/mvtools-pfmod/Sources/MVAnalyse.cpp .

After some practical use for denoising, I found that the subsequent on-CPU processing is slow enough that the hardware accelerator is typically significantly underloaded and has plenty of free capacity (judging by the Hardware Encoder performance counter).

I now have an idea for making motion estimation more stable on typical noise-damaged sources, and I have tested it with an on-CPU ME implementation. For a block-based ME engine the idea is: feed the engine a sequence of pixel-shifted frames covering a small area around the 'current' block position in the block tessellation grid (e.g. ±1 pixel diagonal shifts, and up to ±4 pixels for an 8x8 block size, i.e. half the block size), then, after receiving the array of motion vectors from the ME engine, apply some averaging (such as the mean/median/mode of the MVs) to compute a more stable motion vector for the block.

I am not experienced in DX12 programming, and I do not know the best currently available way in the DX12 API to shift a 2D texture resource on the accelerator itself (i.e. crop a rectangle from a slightly padded frame/resource/texture already uploaded to the adapter), so as to avoid uploading many very slightly shifted copies of the current and reference frames when building the work queue for the DX12 Motion Estimator engine.

Is a simple 'crop rectangle' available for DX12 resources (and valid to use with a DX12 Motion Estimator object)? Then I could upload the current and reference frames once (padded by about +8 pixels max in width and height) and simply set a cropping rectangle for the required ±N pixel shift of the 'working frame' area within the bigger 2D texture resource before inserting each ME task into the total work queue for the ME engine. That looks like the fastest and cheapest way, with no actual data copy anywhere (neither host->accelerator nor inside the accelerator's RAM).

The next idea is to allocate the required number of new 'working resources' on the DX12 hardware accelerator (from 5 to 17, depending on the number of tested positions around the block center), crop-copy the single uploaded resource into all of these 'working resources', and provide these 'working' textures as current and reference frames to the ME object. But this requires real physical copy operations inside the accelerator's RAM and also wastes onboard RAM: current and reference UHD 4K frames in 17 shifted copies take about 500 MB, and with several threads running (up to the number of CPU cores available on the host) this can easily exhaust 4-8 GB GPU boards, while some RAM may also be needed for other tasks. So it is not the best idea.

The worst case is to do the offset shifting on the host CPU, allocate the required resources on the GPU, and upload all shifted frames the same way we upload current and reference frames today. That creates a lot of bus traffic and may cost a lot of CPU time (or DMA from host RAM to the GPU?). Maybe it is possible to set something like a 'crop rectangle' when creating the DMA transfer from the padded texture in host memory to the GPU resource?
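One possibility along these lines: keep a single padded copy of the frame in an upload-heap buffer and let the GPU select the shifted window during the copy itself, since ID3D12GraphicsCommandList::CopyTextureRegion() accepts a D3D12_BOX source region even when the source is a buffer described by a placed footprint. A hedged sketch, assuming the padded Y plane is already laid out in an upload buffer with a 256-byte-aligned row pitch; all variable names (uploadBuf, copyList, nPad, etc.) are illustrative, and error handling is omitted:

```cpp
// Sketch only: copy a (nWidth x nHeight) window, shifted by (dx, dy) pixels,
// out of a single padded frame living in an upload-heap buffer, directly into
// the default-heap texture used as ME input. No extra host uploads per shift.
D3D12_TEXTURE_COPY_LOCATION src = {};
src.pResource = uploadBuf.Get();                       // single padded upload copy
src.Type = D3D12_TEXTURE_COPY_TYPE_PLACED_FOOTPRINT;
src.PlacedFootprint.Offset = 0;
src.PlacedFootprint.Footprint.Format = DXGI_FORMAT_R8_UNORM; // Y plane of NV12 (assumption)
src.PlacedFootprint.Footprint.Width  = nWidth + 2 * nPad;    // padded dimensions
src.PlacedFootprint.Footprint.Height = nHeight + 2 * nPad;
src.PlacedFootprint.Footprint.Depth  = 1;
src.PlacedFootprint.Footprint.RowPitch = rowPitch;     // 256-byte aligned

// Subresource 0 of an NV12 texture is the Y plane.
D3D12_TEXTURE_COPY_LOCATION dst =
    CD3DX12_TEXTURE_COPY_LOCATION(spCurrentResource.Get(), 0);

// The shift is expressed purely as a source box over the padded footprint.
D3D12_BOX box = { UINT(nPad + dx), UINT(nPad + dy), 0,
                  UINT(nPad + dx + nWidth), UINT(nPad + dy + nHeight), 1 };
copyList->CopyTextureRegion(&dst, 0, 0, 0, &src, &box);
```

The NV12 UV plane (subresource 1) would need a second, analogous copy with halved coordinates and DXGI_FORMAT_R8G8_UNORM; whether the driver accepts this exact arrangement for ME input textures would need to be verified.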

Or maybe nicer ways exist to provide slightly shifted copies of the current and reference texture resources to the DX12 ME object for analysis? The total number of copies of each resource ranges from 5 (center block plus 4 diagonal ±1-shifted positions) to 17 (center block plus 16 shifted positions from ±1 to ±4 pixels diagonally).

Currently the current and reference texture resources sent to the ME object are created as:

```cpp
D3D12_RESOURCE_DESC textureDesc = {};
textureDesc.MipLevels = 1;
textureDesc.Format = DXGI_FORMAT_NV12;
textureDesc.Width = nWidth;
textureDesc.Height = nHeight;
textureDesc.Flags = D3D12_RESOURCE_FLAG_NONE;
textureDesc.DepthOrArraySize = 1;
textureDesc.SampleDesc.Count = 1;
textureDesc.SampleDesc.Quality = 0;
textureDesc.Dimension = D3D12_RESOURCE_DIMENSION_TEXTURE2D;

HRESULT res_current_texture = m_D3D12device->CreateCommittedResource(
    &CD3DX12_HEAP_PROPERTIES(D3D12_HEAP_TYPE_DEFAULT),
    D3D12_HEAP_FLAG_NONE,
    &textureDesc,
    D3D12_RESOURCE_STATE_COMMON,
    nullptr,
    IID_PPV_ARGS(&spCurrentResource));
```

If a crop-copy of texture resources is required, the needed D3D12 method looks like ID3D12GraphicsCommandList::CopyTextureRegion(). But for performance reasons, is it possible to 'interleave' the graphics command list with the video encode command list (using resource barriers where possible/required?), so that only 2 'parallel' command queues are formed for motion estimation of the 'center' position and the several shifted positions, like:

Center of block:
GraphicsCommandList->CopyTextureRegion(crop-center into the Motion Estimator input texture resource)
ResourceBarrier?
VideoEncodeCommandList->EstimateMotion(spVideoMotionEstimator.Get(), &outputArgsEM, &inputArgsEM);
ResourceBarrier?

Top left offset:
GraphicsCommandList->CopyTextureRegion(crop-topleft into the Motion Estimator input texture resource)
ResourceBarrier?
VideoEncodeCommandList->EstimateMotion(spVideoMotionEstimator.Get(), &outputArgsEM, &inputArgsEM);
ResourceBarrier?

Top right offset:
GraphicsCommandList->CopyTextureRegion(crop-topright into the Motion Estimator input texture resource)
ResourceBarrier?
VideoEncodeCommandList->EstimateMotion(spVideoMotionEstimator.Get(), &outputArgsEM, &inputArgsEM);
ResourceBarrier?

Bottom left offset:
GraphicsCommandList->CopyTextureRegion(crop-bottomleft into the Motion Estimator input texture resource)
ResourceBarrier?
VideoEncodeCommandList->EstimateMotion(spVideoMotionEstimator.Get(), &outputArgsEM, &inputArgsEM);
ResourceBarrier?

Bottom right offset:
GraphicsCommandList->CopyTextureRegion(crop-bottomright into the Motion Estimator input texture resource)
ResourceBarrier?
VideoEncodeCommandList->EstimateMotion(spVideoMotionEstimator.Get(), &outputArgsEM, &inputArgsEM);
ResourceBarrier?

GraphicsCommandList->Close()
VideoEncodeCommandList->Close()
Execute, wait for completion.
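On the 'ResourceBarrier?' questions: resource barriers only order work within a single queue's timeline, so for a copy/graphics queue and a video encode queue to ping-pong over the same ME input texture, the usual pattern is a shared ID3D12Fence with GPU-side Signal/Wait between the two queues; the CPU then blocks only once, at the end of the whole batch. A hedged sketch under that assumption (all names illustrative, error handling omitted; the command allocators are not reset mid-batch, so the lists can be re-recorded while earlier work is still in flight):

```cpp
// Sketch: numShifts shifted ME runs using one reusable input texture,
// two queues, and GPU-side fence waits; the host blocks only at the end.
UINT64 fenceValue = 0;
for (int i = 0; i < numShifts; ++i) {
    // 1) Crop-copy shift i into the shared ME input texture on the copy queue.
    copyList->Reset(copyAlloc.Get(), nullptr);
    copyList->CopyTextureRegion(&dst, 0, 0, 0, &src, &cropBoxes[i]);
    copyList->Close();
    copyQueue->ExecuteCommandLists(1, CommandListCast(copyList.GetAddressOf()));
    copyQueue->Signal(fence.Get(), ++fenceValue);      // copy i finished

    // 2) The video encode queue waits on the GPU, then runs motion estimation.
    videoQueue->Wait(fence.Get(), fenceValue);
    videoList->Reset(videoAlloc.Get());
    videoList->EstimateMotion(spVideoMotionEstimator.Get(),
                              &outputArgsEM[i], &inputArgsEM);
    videoList->Close();
    videoQueue->ExecuteCommandLists(1, CommandListCast(videoList.GetAddressOf()));
    videoQueue->Signal(fence.Get(), ++fenceValue);     // ME run i finished

    // 3) The next crop-copy must not overwrite the input texture before
    //    ME run i has read it.
    copyQueue->Wait(fence.Get(), fenceValue);
}
// Single CPU-side wait at the end of the whole batch.
fence->SetEventOnCompletion(fenceValue, fenceEvent);
WaitForSingleObject(fenceEvent, INFINITE);
```

Note that each EstimateMotion call here writes to its own output arguments (and motion vector heap), since the runs all describe the same block grid; whether a dedicated copy queue or the graphics queue is better for the crop-copies would need profiling.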

The idea is to use the same input resources for the Motion Estimator (to reuse RAM on the accelerator), record all required command lists first, and wait for completion only once. Or is such a long sequence only possible by executing the graphics copy command list for each crop-copy and then calling MotionEstimation (reset-fill-close-execute-wait) each time?

The large-RAM variant is to aggregate all CopyTextureRegion crop-copies into separate texture resources in a single GraphicsCommandList (close-execute-wait), and then aggregate all MotionEstimation commands in the VideoEncodeCommandList (close-execute-wait). But this may be slower because of the 2 sequential wait operations?

Or maybe there is some way to create a cropped 'texture resource view' over the single uploaded padded texture? Then we could pass these 'resource views' as input to the Motion Estimator object, avoiding both the copy operations and the large RAM usage for many texture copies.