Windows.UI.Composition Win32 Samples

Accessing screen capture image bytes with MapFlags.DoNotWait for better CPU usage #115

Closed NickThissen closed 1 year ago

NickThissen commented 1 year ago

Hello,

I am trying to get access to the byte data of the screen capture textures from the WPF Screen Capture sample, similar to #78. I am running the screen capture in a continuous loop, receiving every frame (144 times per second, since my monitor refresh rate is 144 Hz). My goal is to send these 144 frames of "video" (image data) over the network. I use NDI to send the frames, and all I need is a memory address where the data is stored and the CPU can access it.

I have previously solved this problem with the help of some people here. The approach is: every time a new frame arrives, copy it into a CPU-accessible staging texture, map it so the CPU can read the data, and hand the resulting pointer to the sender.

Some sample code:

Create the staging texture:

        // Create a staging texture matching the captured texture, readable by the CPU
        private Texture2D CreateStagingCopy(Texture2D texture)
        {
            return new Texture2D(d3dDevice, new Texture2DDescription
            {
                Width = texture.Description.Width,
                Height = texture.Description.Height,
                MipLevels = 1,
                ArraySize = 1,
                Format = texture.Description.Format,
                Usage = ResourceUsage.Staging,
                SampleDescription = new SampleDescription(1, 0),
                BindFlags = BindFlags.None,
                CpuAccessFlags = CpuAccessFlags.Read,
                OptionFlags = ResourceOptionFlags.None,
            });
        }

Mapping the subresource and sending the data:

        private void SendFrameData_v1(Texture2D texture, Direct3D11CaptureFrame frame)
        {
            // Create a CPU-accessible staging texture and copy the captured frame to it
            var copy = CreateStagingCopy(texture);
            d3dDevice.ImmediateContext.CopyResource(texture, copy);

            // Map the resource using 'MapFlags.None' -> this call waits until it is completed and the data is accessible
            // This takes up the majority of the time and CPU usage
            var dataBox = d3dDevice.ImmediateContext.MapSubresource(copy, 0, MapMode.Read, MapFlags.None);

            // Send the data over the network
            var time = frame.SystemRelativeTime.Ticks;
            var capturedFrame = new CapturedFrame(d3dDevice.ImmediateContext, dataBox.DataPointer, copy, time);
            _sender.SendData(capturedFrame);

            // Note: UnmapSubresource is called after the frame has been sent, not shown here.
        }
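The deferred unmap itself is simple; in simplified form it is something like this (the real CapturedFrame/sender code is not shown in this issue, so the member names below are just illustrative):

        // Illustrative sketch only: the real sender/CapturedFrame code is omitted from
        // this issue, and 'StagingTexture' is a made-up property name.
        private void OnFrameSent(CapturedFrame sentFrame)
        {
            // Unmap on the same immediate context that mapped the staging copy,
            // then release the copy once the network send no longer needs the pointer.
            d3dDevice.ImmediateContext.UnmapSubresource(sentFrame.StagingTexture, 0);
            sentFrame.StagingTexture.Dispose();
        }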

I realize that creating a new staging texture for every frame is probably unnecessary, but I measured no performance difference when I reused one.
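For reference, reusing a single cached staging texture would look roughly like this (untested sketch; _stagingCache is just an illustrative field name, and the previous frame has to be unmapped before the copy is reused):

        // Untested sketch: cache one staging texture and recreate it only when the
        // source size or format changes. '_stagingCache' is an illustrative field name.
        private Texture2D _stagingCache;

        private Texture2D GetOrCreateStagingCopy(Texture2D texture)
        {
            if (_stagingCache == null ||
                _stagingCache.Description.Width != texture.Description.Width ||
                _stagingCache.Description.Height != texture.Description.Height ||
                _stagingCache.Description.Format != texture.Description.Format)
            {
                _stagingCache?.Dispose();
                _stagingCache = CreateStagingCopy(texture);
            }
            return _stagingCache;
        }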

This works great, but I believe it is still not optimal because of MapFlags.None. My understanding is that this flag makes the call block until the GPU has finished the copy and the data is readable by the CPU. While it does not take a massive amount of time, it still causes an unnecessary delay: the CPU is "busy" (doing nothing) while it waits for the data to become available. This keeps the CPU usage high, and other applications start slowing down because of it.

My goal is to achieve the same performance (144 Hz sending) but with lower CPU usage. I believe the key is to use MapFlags.DoNotWait instead. This makes the MapSubresource call return immediately, so the CPU is not left waiting. However, the data isn't necessarily available at that point, so I have to retrieve it some time later.

What I came up with is some kind of (probably terrible) buffering system. The logic would be as follows:

        // Keep a cycling list of staging textures
        private List<Texture2D> _buffer = new List<Texture2D>();

        private void SendFrameData_v2(Texture2D texture, Direct3D11CaptureFrame frame)
        {
            // Create a CPU-accessible staging copy
            var copy = CreateStagingCopy(texture);

            // Store it for sending later (once GPU is finished)
            _buffer.Add(copy);

            // Copy the captured frame to this staging texture
            d3dDevice.ImmediateContext.CopyResource(texture, copy);

            // Find the most recent frame that has data available (GPU has finished copying)
            // Loop over all buffers backwards, MapSubresource and find the first one that was finished
            var sendIndex = -1;
            IntPtr dataPointer = IntPtr.Zero;
            for (var i = _buffer.Count - 1; i >= 0; i--)
            {
                // Map the resource with MapFlags.DoNotWait -> this call returns immediately and I check if the dataBox is empty or not
                var dataBox = d3dDevice.ImmediateContext.MapSubresource(_buffer[i], 0, MapMode.Read, MapFlags.DoNotWait);
                if (!dataBox.IsEmpty)
                {
                    // Found the latest staging texture in the buffer -> this is the one I'll send
                    sendIndex = i;
                    dataPointer = dataBox.DataPointer;
                    break;
                }
            }

            if (sendIndex >= 0)
            {
                // This staging texture in the buffer had data available, now I can send it
                var sendTexture = _buffer[sendIndex];
                var time = frame.SystemRelativeTime.Ticks;
                var capturedFrame = new CapturedFrame(d3dDevice.ImmediateContext, dataPointer, sendTexture, time);
                _sender.SendData(capturedFrame);

                // I no longer need this texture or any older ones in the buffer,
                // so dispose them all and remove them from the list
                for (var i = 0; i <= sendIndex; i++)
                {
                    _buffer[i].Dispose();
                }
                _buffer.RemoveRange(0, sendIndex + 1);
            }

            // Note: in this case I do NOT call UnmapSubresource anywhere!! If I do, I get no data.
        }

While this works decently well, I don't see a big improvement in CPU usage yet. A bigger problem is that I am no longer calling UnmapSubresource on any of these textures. The moment I try to call it (anywhere), I get an 'access denied' error. However, it seems to run fine without UnmapSubresource at all, and I don't see any memory build-up.

I'm sure I am still doing something wrong and this can be optimized better. Does anyone have any tips on how I can achieve it?

One potentially important note: I am omitting the "_sender" code here, but it essentially keeps a queue of frames and sends them from a background thread at the desired frame rate (the thread blocks for the appropriate amount of time between sends to hit that rate).
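In rough terms the sender loop does something like this (heavily simplified sketch; the NDI calls and the real CapturedFrame handling are omitted, and SendOverNetwork is just a placeholder name):

        // Heavily simplified sketch of the sender thread described above.
        // Uses System.Collections.Concurrent, System.Diagnostics and System.Threading.
        // 'SendOverNetwork' is a placeholder for the actual NDI send call.
        private readonly BlockingCollection<CapturedFrame> _sendQueue = new BlockingCollection<CapturedFrame>();

        private void SenderLoop(TimeSpan frameInterval)
        {
            var stopwatch = Stopwatch.StartNew();
            foreach (var frame in _sendQueue.GetConsumingEnumerable())
            {
                SendOverNetwork(frame);

                // Block for whatever time is left in this frame slot so the
                // output stays at the desired sending frame rate.
                var remaining = frameInterval - stopwatch.Elapsed;
                if (remaining > TimeSpan.Zero)
                    Thread.Sleep(remaining);
                stopwatch.Restart();
            }
        }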

NickThissen commented 1 year ago

One thing I forgot: I realize the more usual way to use a buffer is to keep just two staging textures and cycle between them. One is being used in the background by the GPU to copy into, while the other should be "ready to go" so I can access its data. Each iteration (when a frame arrives) I swap them.

However, I did not manage to make this work, because I don't see how I can guarantee that the texture will be ready by the time the next frame arrives. And if it isn't ready, I have no data at all for that frame. I experimented with 3 or even 4 textures in the buffer, but in the end I decided that keeping a list of arbitrary size (while always removing the unused ones) was the better choice...
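For completeness, the two-texture swap I tried looked roughly like this (sketch from memory, field names are illustrative; it assumes the sender is done with the pointer before SendData returns, and it falls over exactly when the DoNotWait map on the 'read' texture comes back empty):

        // Sketch of the two-texture swap (illustrative field names). Frame N is copied
        // into _writeStaging while frame N-1 is read back from _readStaging.
        private Texture2D _writeStaging;
        private Texture2D _readStaging;

        private void SendFrameData_Swap(Texture2D texture, Direct3D11CaptureFrame frame)
        {
            if (_writeStaging == null) _writeStaging = CreateStagingCopy(texture);
            if (_readStaging == null) _readStaging = CreateStagingCopy(texture);

            // Queue the copy of the current frame into the 'write' staging texture
            d3dDevice.ImmediateContext.CopyResource(texture, _writeStaging);

            // Try to read back the previous frame from the 'read' staging texture
            // (on the very first frame this texture contains no real data yet)
            var dataBox = d3dDevice.ImmediateContext.MapSubresource(_readStaging, 0, MapMode.Read, MapFlags.DoNotWait);
            if (!dataBox.IsEmpty)
            {
                var time = frame.SystemRelativeTime.Ticks;
                _sender.SendData(new CapturedFrame(d3dDevice.ImmediateContext, dataBox.DataPointer, _readStaging, time));

                // Unmapping right away assumes the sender has copied the bytes it
                // needs before SendData returns
                d3dDevice.ImmediateContext.UnmapSubresource(_readStaging, 0);
            }
            // If the map came back empty, the previous frame is simply dropped

            // Swap roles for the next iteration
            var tmp = _writeStaging;
            _writeStaging = _readStaging;
            _readStaging = tmp;
        }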

robmikh commented 1 year ago

This is more of a D3D11 usage question, you'll get better answers from that community. The DirectX folks have set up a Discord server you can join: https://devblogs.microsoft.com/directx/hello-discord/

As for your question, I would use a collection of staging textures with two concurrent queues while moving the encoder/sender to some other thread. I'll refer to the two queues as the 'free' queue and the 'busy' queue. When you receive a frame, pull a staging texture out of the free queue and copy the frame into it. Then put it on the busy queue.

Later, the encoder/sender thread will pull off of the busy queue, map the texture, and then send the bytes across the wire. When it's done, it'll put that texture back onto the free queue.
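A rough, untested sketch of that shape, reusing the types from your snippets above (the names here, including SendBytes, are just for illustration):

        // Rough, untested sketch of the free/busy queue idea (illustrative names only).
        // Uses System.Collections.Concurrent. Note that the D3D11 immediate context is
        // not thread-safe, so the CopyResource and Map/Unmap calls on the two threads
        // need to be synchronized (e.g. with a lock) in a real implementation.
        private readonly ConcurrentQueue<Texture2D> _freeQueue = new ConcurrentQueue<Texture2D>();
        private readonly BlockingCollection<(Texture2D Staging, long Time)> _busyQueue =
            new BlockingCollection<(Texture2D Staging, long Time)>();

        // Capture thread: runs for every received frame.
        private void OnFrameArrived(Texture2D texture, Direct3D11CaptureFrame frame)
        {
            if (!_freeQueue.TryDequeue(out var staging))
            {
                staging = CreateStagingCopy(texture);   // grow the pool on demand
            }

            d3dDevice.ImmediateContext.CopyResource(texture, staging);
            _busyQueue.Add((staging, frame.SystemRelativeTime.Ticks));
        }

        // Encoder/sender thread.
        private void SenderLoop()
        {
            foreach (var (staging, time) in _busyQueue.GetConsumingEnumerable())
            {
                // Blocking here (MapFlags.None) is fine because this is a dedicated
                // thread; the capture thread is never stalled by the map.
                var dataBox = d3dDevice.ImmediateContext.MapSubresource(staging, 0, MapMode.Read, MapFlags.None);

                // Send the bytes across the wire here (actual send call omitted),
                // then unmap and recycle the staging texture.
                SendBytes(dataBox.DataPointer, time);   // placeholder for the real send
                d3dDevice.ImmediateContext.UnmapSubresource(staging, 0);

                _freeQueue.Enqueue(staging);
            }
        }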

But this will all be contingent on what scenario you're after, and how you encode and send data. I highly recommend identifying metrics you can measure and profiling your application.

I'm going to close this issue, as it's outside the scope of these samples. Good luck!

GF-Huang commented 7 months ago

Hi guys, any progress?