robmikh / SimpleRecorder

A simple screen recorder using both the Windows.Graphics.Capture and Windows.Media.Transcoding APIs.
MIT License

Obtain raw pixels / bitmap for each frame? #9

Closed: NickThissen closed this issue 3 years ago

NickThissen commented 4 years ago

Hi Rob,

I just found this recorder sample app and thought it was a promising solution for what I'm trying to do. I want to be able to "record" my application and stream the recorded frames over NDI to Tricaster hardware. All I need for that is a bitmap (or just raw pixels, I guess) of each frame of the recording, in real time (near 60 FPS if possible).

This sample is able to record a video of my app at 60 FPS easily, but I'm struggling to adapt it to my use case, where I need the individual frames (as bitmaps / pixels) in real time.

Would it be possible to grab each frame in real time and somehow get the pixels in some format that I can use? I am not interested in storing a video, or even an image file; I just need the pixels so I can stream them over NDI.

robmikh commented 4 years ago

Sorry about the delay... for some reason I'm no longer watching any of my repos? :-/

The bottleneck in your case is probably transferring the bits from video memory to system memory. Video encoding can typically be done by dedicated silicon on the GPU, so the data doesn't have to travel across the bus each frame.

I'm guessing you're using the D3D11 Map API to get the bits into system memory?

How does NDI work? Do you have to respond to some callback, or can you just submit the frames to it with a timestamp? If it's the former, you'll have to build something similar to the CaptureFrameWait. If it's the latter, I'd drive it off of the frame pool's callback. Even better if there's no thread affinity, so you can use Direct3D11CaptureFramePool::CreateFreeThreaded to avoid marshaling to another thread. From there you should call CopyResource into a staging buffer that is marked for CPU read access, and then Map that staging buffer and pull the bits out to give to NDI.
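Roughly, the free-threaded setup looks something like this (a minimal sketch; I'm assuming an IDirect3DDevice `device` and a GraphicsCaptureItem `item` like the ones this sample already creates):

    var framePool = Direct3D11CaptureFramePool.CreateFreeThreaded(
        device,                                    // IDirect3DDevice
        DirectXPixelFormat.B8G8R8A8UIntNormalized, // BGRA8
        2,                                         // number of buffers
        item.Size);                                // size of the capture item
    // FrameArrived is raised on a threadpool thread, so there's no
    // marshaling back to a dispatcher thread.
    framePool.FrameArrived += OnFrameArrived;
    var session = framePool.CreateCaptureSession(item);
    session.StartCapture();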

NickThissen commented 4 years ago

Hi Rob.

After some days of research and trial and error I arrived at a solution that seems workable (and remarkably similar to your suggestions), but it's still not great and noticeably below 60 FPS. If you have any suggestions on where I could still improve, I would be very interested. I know very little about this stuff, and it only recently became clear to me that I can't just "take pixels from the GPU": there is a large overhead in getting them into CPU memory.

Anyway, I came to one solution which was quite slow, and then managed to improve it quite a bit.

First of all, I am indeed using the OnFrameArrived handler and grabbing each frame, getting the data into CPU memory and sending it to NDI. I understand now that I need to copy the Texture2D into a staging copy that is accessible by the CPU, and I use CopyResource and MapSubresource for this, which gives me a DataStream object. Originally, I was reading this stream into an array of bytes, which turned out to be my pixels in (probably luckily) the correct format.

Then I send those pixels to NDI. In the NDI part, all I did was create a BitmapImage from the pixel data, marshal a block of memory, copy the pixel data back into that memory using Bitmap.CopyPixels, and then send the pointer to that memory to the NDI library.

I realized this was adding unnecessary overhead: first I read the pixels from a stream, then put them into a Bitmap, and finally copy the pixel data back into memory. Instead of doing all of this, I now simply send the DataStream.DataPointer to NDI, which skips two heavy steps: reading the stream and creating a bitmap from those pixels.

Unfortunately I did not manage to make it work by directly sending the data pointer to the NDI library. There was no error, but the application simply stopped. Probably some access violation... Instead, I now use Buffer.MemoryCopy to copy the data at the pointer into my new memory location. That works! I can see the graphics being streamed over NDI. But like I said, additional performance increase would be very welcome!

Below are some relevant snippets of the code, simplified here and there but hopefully still clear.

First I capture every frame and create a Texture2D "bitmap":

    private void OnFrameArrived(Direct3D11CaptureFramePool sender, object args)
    {
        using (var frame = sender.TryGetNextFrame())
        {
            using (var backBuffer = swapChain.GetBackBuffer<Texture2D>(0))
            using (var bitmap = Direct3D11Helper.CreateSharpDXTexture2D(frame.Surface))
            {
                // Copy the captured frame into the swap chain's back buffer (preview)...
                d3dDevice.ImmediateContext.CopyResource(bitmap, backBuffer);
                // ...and hand the frame texture off for CPU readback.
                GetBitmap(bitmap);
            }
        }
    }

In GetBitmap I handle this Texture2D and obtain the data pointer. There is also some left-over (commented-out) code where I read the stream into a byte array. Once I have the data pointer (or, previously, the byte array), I invoke an event that triggers the NDI library:

    private void GetBitmap(Texture2D texture)
    {
        var sw = new Stopwatch();
        sw.Start();

        // Create a staging copy that the CPU is allowed to read
        var copy = new Texture2D(d3dDevice, new Texture2DDescription
        {
            Width = texture.Description.Width,
            Height = texture.Description.Height,
            MipLevels = 1,
            ArraySize = 1,
            Format = texture.Description.Format,
            Usage = ResourceUsage.Staging,
            SampleDescription = new SampleDescription(1, 0),
            BindFlags = BindFlags.None,
            CpuAccessFlags = CpuAccessFlags.Read,
            OptionFlags = ResourceOptionFlags.None
        });

        // Copy GPU texture -> staging texture
        d3dDevice.ImmediateContext.CopyResource(texture, copy);

        // Map the staging texture so the CPU can read the pixels.
        // (This assumes the mapped rows have no padding, i.e. the row
        // pitch equals Width * 4; that happens to hold here.)
        d3dDevice.ImmediateContext.MapSubresource(copy, 0, 0, MapMode.Read, MapFlags.None,
            out DataStream stream);

        // Read bytes into memory (no longer necessary)
        //var bytes = Utilities.ReadStream(stream);

        // stream.DataPointer is only valid until UnmapSubresource below,
        // so any consumer of this event must copy the data before returning.
        BitmapCaptured?.Invoke(this, new CaptureEventArgs(stream.DataPointer, copy.Description.Width, copy.Description.Height));

        d3dDevice.ImmediateContext.UnmapSubresource(copy, 0);
        copy.Dispose();

        sw.Stop();
        Debug.WriteLine($"SW: {sw.ElapsedMilliseconds} ms");
    }
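As an aside, I can already see one wasteful thing above: I create a brand new staging texture for every frame. A variant I'm considering that caches and reuses it (untested sketch, assuming the capture size never changes):

    private Texture2D stagingTexture; // cached across frames

    private Texture2D GetOrCreateStagingTexture(Texture2D source)
    {
        if (stagingTexture == null)
        {
            // Texture2DDescription is a struct, so this copies the source
            // description; we only patch the staging-related fields.
            var desc = source.Description;
            desc.Usage = ResourceUsage.Staging;
            desc.BindFlags = BindFlags.None;
            desc.CpuAccessFlags = CpuAccessFlags.Read;
            desc.OptionFlags = ResourceOptionFlags.None;
            desc.MipLevels = 1;
            desc.ArraySize = 1;
            desc.SampleDescription = new SampleDescription(1, 0);
            stagingTexture = new Texture2D(d3dDevice, desc);
        }
        return stagingTexture;
    }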

Finally, in NDI (heavily simplified), this is what's going on:

    // Buffer.MemoryCopy takes raw pointers, so this needs an unsafe context
    public unsafe void SendData(IntPtr ptr, int width, int height)
    {
        int xres = width;
        int yres = height;

        // Calculate buffer size
        stride = (xres * 32/*BGRA bpp*/ + 7) / 8;
        bufferSize = yres * stride;

        // Allocate some memory for a video buffer.
        // This bufferPtr is sent in the video frame below.
        // (Note: it must eventually be freed with Marshal.FreeHGlobal once
        // NDI is done with the frame, or this leaks one buffer per frame.)
        IntPtr bufferPtr = Marshal.AllocHGlobal(bufferSize);

        // We are going to create a progressive frame at 60Hz.
        // Note the "p_data" parameter, which points to the memory allocated above.
        NDIlib.video_frame_v2_t videoFrame = new NDIlib.video_frame_v2_t()
        {
            // Resolution
            xres = xres,
            yres = yres,
            // Use BGRA video
            FourCC = NDIlib.FourCC_type_e.FourCC_type_BGRA,
            // The frame rate
            frame_rate_N = frNum,
            frame_rate_D = frDen,
            // The aspect ratio
            picture_aspect_ratio = aspectRatio,
            // Timecode
            timecode = NDIlib.send_timecode_synthesize,
            // This is a progressive frame
            frame_format_type = NDIlib.frame_format_type_e.frame_format_type_progressive,
            // The video memory used for this frame
            p_data = bufferPtr,
            // The line to line stride of this image
            line_stride_in_bytes = stride,
            // No metadata
            p_metadata = IntPtr.Zero,
            // Only valid on received frames
            timestamp = 0
        };

        // Copy the data.
        // If I point NDI directly to the original data "ptr", the app crashes
        // (probably because the mapped pointer is invalid once GetBitmap unmaps it)
        Buffer.MemoryCopy(ptr.ToPointer(), bufferPtr.ToPointer(), bufferSize, bufferSize);

        // Add it to the output queue
        AddFrame(videoFrame);
    }

In my measurements, I found that CopyResource and MapSubresource were taking up most of the time, probably around 15-20 ms in total. Then the final copying in NDI (Buffer.MemoryCopy) takes another 4 ms. Some additional overhead here and there brings it to about 25-35 ms per frame, so around 30-40 FPS, whereas I am really hoping for 60...

This is at 1080p resolution; reducing that would probably help, but I need to support 1080p at the very minimum.

Sorry for the quite long comment; it's maybe not so relevant anymore to this particular repo. But I feel like I've made a lot of progress, and it's frustrating to be stuck just short of good enough!

If you have any comments on how I can improve further I would be very interested. Do you think doing this in C# is perhaps causing additional overhead, and could I do it faster in C++ or something? I know nothing about that, but I'm willing to learn if necessary.

Thanks!!

NickThissen commented 4 years ago

Sorry, one more question. I notice that the quality of the captured screen is not perfect. There seems to be some compression going on out of the box. I don't think it's caused by NDI, as I can see the same compression on the "preview" surface, and the compression is not there when I use other NDI methods.

Is there any way to avoid the compression, hopefully without hurting performance even more? Since my bottleneck seems to be the GPU-to-CPU transfer and not the capture itself, I am hoping that reducing the compression is not going to cost me performance.

There are quite a few settings in the screen capture setup, but changing even a single one of them causes an error saying I "made a call that is invalid" due to incorrect parameters. Is any change possible, or not at all?

robmikh commented 4 years ago

A couple things:

Is there a reason you need to transmit the raw frames like this? Usually when people are trying to send information like this, they encode the data in some way and then send it across the wire. Typically this is done with H.264 or H.265, since there's a lot of hardware support for them. I myself have created a "duct tape and paperclips" streaming application that transmitted raw frames with my own protocol, and I couldn't really get past ~25 fps, although in that case it was a debugging tool that needed the raw frames.
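If you do go the encoding route, the Windows.Media APIs this sample already uses for recording can do the heavy lifting; a minimal sketch of an H.264 profile (types from Windows.Media.MediaProperties):

    // A 1080p H.264 (MP4) encoding profile; on most hardware the encoder
    // runs on the GPU, so the raw frames never have to cross the bus.
    var profile = MediaEncodingProfile.CreateMp4(VideoEncodingQuality.HD1080p);
    profile.Video.Bitrate = 18_000_000;       // tune for your network
    profile.Video.FrameRate.Numerator = 60;   // target 60 fps
    profile.Video.FrameRate.Denominator = 1;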

As for the parameters on setting up a capture, there's not a whole lot to configure. The pixel format can either be BGRA8 or FP16 (the latter is used for HDR, although we currently do no boosting of SDR content). Other than that you have the number of buffers and the size of those buffers.

robmikh commented 4 years ago

I guess one thing that I didn't mention is that if you don't have ordering problems, or if you're able to mitigate those ordering problems using the provided timestamps, you could use more than one buffer in your frame pool, although using more than 2 or 3 is definitely overkill.
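In code, that's just the numberOfBuffers argument when you create or recreate the pool; a sketch:

    // Two buffers let the DWM render a new frame while you're still
    // holding on to the previous one.
    framePool.Recreate(
        device,
        DirectXPixelFormat.B8G8R8A8UIntNormalized,
        2,          // numberOfBuffers
        item.Size);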

NickThissen commented 4 years ago

Thanks! Really helpful. I need some time to digest some of these things, like I said I don't know much about this :)


Thanks again!

NickThissen commented 4 years ago

Quick update: some further optimization based on your suggestions got me to between 4 and 10 ms per frame, at least for the code I could time. So I think this is good enough; it should be more than enough for 60 fps (16 ms per frame). However, the feed via NDI still doesn't look as smooth as 60 fps... maybe I am hitting some NDI limit now after all. It is a lot better though! Will test more soon. I also still need to test it on an updated Windows 10 build where the transparency bug is fixed; hopefully that doesn't hurt performance. Thanks!

robmikh commented 4 years ago

Quick note: By "other side" I mean whoever is going to receive the frames you are transmitting via NDI, not NDI itself. In other words, the PC on the other end of "the wire" (network).

I'm glad there was an improvement! I'm guessing the smooth-ness issue is caused by uneven frame timing on the other side. If you can use the timestamps provided by the capture API, you may be able to correct the frame timings. I'm not sure how NDI works or how it tries to deal with variable network latency, or if it doesn't and that's up to the caller.
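For example, something like this on the capture side (a sketch; I'm assuming NDI's timecode field uses 100 ns units, which is what TimeSpan.Ticks gives you):

    using (var frame = sender.TryGetNextFrame())
    {
        // SystemRelativeTime tells you when this frame was rendered.
        // TimeSpan.Ticks are 100 ns units.
        long captureTicks = frame.SystemRelativeTime.Ticks;
        // ...send captureTicks along with the pixels so the receiving side
        // can pace the frames by when they were rendered, not received.
    }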

Finally, the comment I made about additional buffers was not about the swap chain, but about the Direct3D11CaptureFramePool. Controlling the lifetime of the Direct3D11CaptureFrame does impact the rendering the DWM does: if there are no buffers available, the DWM will not render a new frame for you. So if you only have one buffer in your frame pool and you're holding onto the frame when the DWM tries to draw the next one, it will skip until a buffer becomes available at a later v-blank. Having multiple buffers available mitigates this problem, but it also means you'll need to consider some side effects. Usually this comes up if you have many threads and process the frames in parallel without guarantees of which processing will finish first (which is why I made the comment about the timestamps).

Hopefully this helps!

robmikh commented 4 years ago

Sorry, I overlooked your comment about NDI providing its own timestamps. I'm guessing those timestamps are generated when you submit the frame for transmission? I would try not to use those timestamps (if possible), as they will be affected by your own processing and won't reflect when those frames were rendered. This is probably contributing to a frame pacing issue.

NickThissen commented 4 years ago

Thanks again for the help Rob. I made some further improvements and can now send the frames with only 1-2 ms of lag. The output often seems nearly identical to the "real" screen, although sometimes there seems to be a bit of a hiccup. I still think your comment about the timings is another improvement I can make; I can see where to set the frame time for each frame in NDI, but somehow it is not making any difference. Maybe I'm just using it wrong, I'll check further. To get the frame time from the captured frame I use frame.SystemRelativeTime, that is correct, right?

Then I wanted to bring the quality topic up again. I have checked everything, but I do still see a small difference in quality. Below is a screenshot of two sets of two black bars. The bottom two bars are rendered directly on my screen in a transparent WPF window (this is the quality I am looking for). The top two bars are the output of the captured frame. There is a noticeable difference in quality: for one, the top bars have a slightly different color, and there seems to be some JPEG-like compression going on. If this is the best it's going to get, I think it's OK, the difference is small. But if there is any setting I can play with to change it, I'd be happy to test it out!

[screenshot: NDI-vs-WPF]

robmikh commented 4 years ago

Hey, sorry for taking so long to get back to this. Unfortunately, I can't really see the differences here, and to compound the problem, this image is itself encoded. If you could save the raw pixels coming from the capture API, I could take a look at those. But as-is, I would suggest something may be going on in other parts of the pipeline, unrelated to the capture API.