iimachines / webrtc-dotnet-core

Simple .NET Core wrapper of webrtc native

Latency is oddly high #7

Closed SebastianKunz closed 4 years ago

SebastianKunz commented 5 years ago

Hello again :). I recently managed to capture my desktop and encode it with H264 on my GPU (GeForce GTX 1060). I capture the frames at 4K resolution and max fps (basically in a while(true) loop; this should be fine, because the encoder will drop unnecessary frames). Before encoding the frames, I scale them down to full HD (1920x1080). I am streaming from my C# application to the browser and experience about 100ms of delay. I was wondering where that comes from, so I looked at what could introduce latency.

  1. Capturing (I am using SharpDX to capture the screen and send the GPU texture to the encoder)
  2. Encoding (GeForce GTX 1060, this should be fast. As a comparison I tried ffmpeg and 1500 frames were encoded in about 2.8 seconds. I tested a few more scenarios, mainly different encoder settings, but in no case did one frame take longer than 5ms to encode.)
  3. Sending the frames (can be ignored, because I am not leaving the local network)
  4. Decoding

There is also one more component that might introduce latency. In VLC you can specify a buffer size. I looked around, trying to find something like that for the HTML5 video tag, but without any success. Maybe you know more about it?

I measured that capturing takes about 50ms and encoding another 30ms. This seems a little high... Any ideas on reducing latency further?
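
For reference, my per-stage numbers come from simple Stopwatch instrumentation, roughly like this (a sketch only; SendVideoFrame and LocalVideoFrameProcessed are the library calls, the Stopwatch plumbing is just illustration and plugs into the capture loop shown further down):

// Hypothetical timing sketch, not the actual measurement code:
// wrap each stage in a Stopwatch and log the elapsed milliseconds.
var captureWatch = new System.Diagnostics.Stopwatch();
var encodeWatch = new System.Diagnostics.Stopwatch();

// Subscribe once; fires when the encoder is done with the frame.
_videoTrack.LocalVideoFrameProcessed += (pc, trackId, pixels, isEncoded) =>
{
    encodeWatch.Stop();
    Console.WriteLine($"capture {captureWatch.ElapsedMilliseconds} ms, encode {encodeWatch.ElapsedMilliseconds} ms");
};

// Inside the capture loop:
captureWatch.Restart();
// ... TryAcquireNextFrame + copy + downscale ...
captureWatch.Stop();

encodeWatch.Restart();
_videoTrack.SendVideoFrame(stagingTexture.NativePointer, 0,
    stagingTexture.Description.Width, stagingTexture.Description.Height,
    VideoFrameFormat.GpuTextureD3D11);
// encodeWatch is stopped in the callback above.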

ziriax commented 5 years ago

That is very strange; we are using this library in our own project and achieve very low latency (no more than 32ms).

How do you capture? Using DXGI desktop duplication?

If you run the WebRTC web demo in this repository, and then click and drag the mouse in the Chrome browser window, does that have high latency?

SebastianKunz commented 5 years ago

The webrtc-web-demo works fine. To be honest it's difficult to figure out what exactly causes the latency, because webrtc is so nested and async. I am using TryAcquireNextFrame, which calls the Desktop Duplication API. It's hard to believe that the encoding takes that long. However, there is also aggressive GPU consumption, at about 50% on the dedicated encoder unit. With ffmpeg it was significantly lower, at about 5%. Are you experiencing similar GPU usage?

This is the code I am using to capture the desktop, scale it down to full HD and then send it to the encoder.

public Task StartCapture()
{
    _isStreaming = true;

    var factory = new Factory1();

    // Get the first adapter (GPU) and create a D3D11 device on it
    var adapter = factory.GetAdapter1(0);
    var device = new SharpDX.Direct3D11.Device(adapter);

    // Get the first output (monitor) of the adapter
    var output = adapter.GetOutput(0);
    var output1 = output.QueryInterface<Output1>();

    // Width/height of the desktop to capture
    int width = output.Description.DesktopBounds.Right;
    int height = output.Description.DesktopBounds.Bottom;

    // Half-resolution staging texture; this is what gets handed to the encoder
    var textureDesc = new Texture2DDescription
    {
        CpuAccessFlags = CpuAccessFlags.Read,
        BindFlags = BindFlags.None,
        Format = Format.B8G8R8A8_UNorm,
        Width = width / 2,
        Height = height / 2,
        OptionFlags = ResourceOptionFlags.None,
        MipLevels = 1,
        ArraySize = 1,
        SampleDescription = { Count = 1, Quality = 0 },
        Usage = ResourceUsage.Staging
    };

    // Full-resolution render target with mip maps; mip level 1 is the downscaled frame
    var smallerTextureDesc = new Texture2DDescription
    {
        CpuAccessFlags = CpuAccessFlags.None,
        BindFlags = BindFlags.RenderTarget | BindFlags.ShaderResource,
        Format = Format.B8G8R8A8_UNorm,
        Width = width,
        Height = height,
        OptionFlags = ResourceOptionFlags.GenerateMipMaps,
        MipLevels = 4,
        ArraySize = 1,
        SampleDescription = { Count = 1, Quality = 0 },
        Usage = ResourceUsage.Default
    };

    var smallerTexture = new Texture2D(device, smallerTextureDesc);
    var stagingTexture = new Texture2D(device, textureDesc);
    var smallerTextureView = new ShaderResourceView(device, smallerTexture);

    return Task.Factory.StartNew(() =>
    {
        // Duplicate the output
        _duplicatedOutput = output1.DuplicateOutput(device);

        while (_isStreaming)
        {
            // Try to acquire the next duplicated frame within the given timeout (ms)
            var res = _duplicatedOutput.TryAcquireNextFrame(5, out _, out var screenResource);

            if (res.Failure)
            {
                // No new frame within the timeout; try again
                continue;
            }

            // Copy the captured frame into mip level 0 of the full-resolution texture
            using (var screenTexture2D = screenResource.QueryInterface<Texture2D>())
            {
                device.ImmediateContext.CopySubresourceRegion(screenTexture2D, 0, null, smallerTexture, 0);
            }

            // Generate mip maps, then copy mip level 1 (half resolution) into the staging texture
            device.ImmediateContext.GenerateMips(smallerTextureView);
            device.ImmediateContext.CopySubresourceRegion(smallerTexture, 1, null, stagingTexture, 0);

            // Hand the downscaled frame to the encoder as a GPU texture
            _videoTrack.SendVideoFrame(stagingTexture.NativePointer, 0, stagingTexture.Description.Width, stagingTexture.Description.Height, VideoFrameFormat.GpuTextureD3D11);

            screenResource.Dispose();
        }

        _duplicatedOutput.Dispose();
    });
}

Then on LocalVideoFrameProcessed I release the frame.

_videoTrack.LocalVideoFrameProcessed += (pc, trackId, pixels, isEncoded) =>
{
  _duplicatedOutput.ReleaseFrame();
};
ziriax commented 5 years ago

Hard to say, I am not a D3D11 expert myself really.

Are you using the D3D11 multi threaded features?

Could you test without creating a task?

You are generating 4 MIP levels, but I guess you only need 2?
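
Something like this is what I mean, as a sketch against your snippet above (only mip level 1 is ever read back, so two levels should be enough):

// Sketch of the mip-level change, based on the capture code above:
// only mip 1 (half resolution) is copied to the staging texture,
// so generating levels beyond that is wasted work.
var smallerTextureDesc = new Texture2DDescription
{
    CpuAccessFlags = CpuAccessFlags.None,
    BindFlags = BindFlags.RenderTarget | BindFlags.ShaderResource,
    Format = Format.B8G8R8A8_UNorm,
    Width = width,
    Height = height,
    OptionFlags = ResourceOptionFlags.GenerateMipMaps,
    MipLevels = 2,   // was 4; level 0 = full size, level 1 = half size
    ArraySize = 1,
    SampleDescription = { Count = 1, Quality = 0 },
    Usage = ResourceUsage.Default
};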

SebastianKunz commented 5 years ago

Ok, so I investigated the problem a bit more. It seems my time measurements were incorrect. Capturing takes about 30ms and encoding 5ms (depending on the motion level). These numbers make a lot more sense. However, they still don't add up to the 80-100ms of latency I am dealing with.

I never worked with D3D11 before, so I have no idea. I guess I am not using the multi-threaded features (all the code where I use SharpDX is posted above). Removing the Task.Run does not improve the latency.

To be honest, I don't really know what I'm doing regarding SharpDX. I based my approach on this stackoverflow post to scale down the image.

What are you streaming? I'm guessing that you are encoding a less complex scene than I am, and that's why you are experiencing lower latency.

How do you measure latency? As of now I am streaming a stopwatch running in the browser and comparing it to the received stream (taking a screenshot to get the exact same timestamp).

Does the browser introduce enough latency to make the math add up? It has to decode the frames and display them. Is there any buffering? Did you mess with any browser options?

Thank you very much. You've been a lot of help so far!

ziriax commented 5 years ago

What kind of latency do you want to measure? The time between an input device event on the capturer and the frame being displayed in the browser? Or the time between a rendered frame being captured and being displayed in the browser? Also, what clock are you using in the browser? Are you sure the clocks in the browser and on the capturer are the same and have no millisecond drift? I haven't tried any of that.

Typically when I measure end-to-end latency, I display the server-side rendered image in a window, and then display the decoded image in a web browser window on the same monitor. I make sure each frame has a timestamp in it (you could also use a simple bar with millisecond markers). Then I use a high-speed camera to film the monitor. I never did that with this project; I should do it to get an idea of the end-to-end latency. This of course excludes network latency; to include that, I use two monitors (ideally the same type).
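
For a CPU-side frame, burning the time into the pixels can be as simple as the sketch below (using System.Drawing purely as an illustration; for a GPU texture you would render the text with Direct2D/DirectWrite instead):

using System;
using System.Drawing;

static class FrameStamp
{
    // Burn a wall-clock timestamp into a frame so end-to-end latency can be
    // read off a photo (or screenshot) showing both the sender and receiver.
    public static void Stamp(Bitmap frame)
    {
        using (var g = Graphics.FromImage(frame))
        using (var font = new Font("Consolas", 32))
        {
            var text = DateTime.Now.ToString("HH:mm:ss.fff");
            g.FillRectangle(Brushes.Black, 10, 10, 420, 50);
            g.DrawString(text, font, Brushes.White, 15, 15);
        }
    }
}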

SebastianKunz commented 5 years ago

I am streaming my desktop with this website open. I receive that stream in the web browser, opened on my second monitor. Then I use the screen capture tool Lightshot to freeze the screen, so that the clocks on both monitors are paused. I then subtract the receiver time from the original time. This is what it looks like.

When I move the window, for example, I want to measure the time it takes for it to also move on the receiver end.

So if I understand you correctly, you send a frame with a clock in it, i.e. I can see the timestamp just by looking at the stream, right? So when generating frames you manipulate them to add the timer, correct?

SebastianKunz commented 5 years ago

Oh and I am aware of jitter. That might also have an impact.

ziriax commented 5 years ago

Yes, I add a visual timestamp to the frames.

Nice trick with Lightshot, but that no longer works if you want to measure the latency between different systems. For this scenario, though, it will work.

I am trying to add this stuff to my web demo, to measure the latency.

SebastianKunz commented 5 years ago

So I tested your web demo at 4K resolution (for this I just changed VideoFrameWidth and VideoFrameHeight). It turns out that 4K causes problems. The received stream is no longer smooth at 60 fps; it drops down and bounces around 10fps. Here is a screenshot of chrome://webrtc-internals. I didn't measure it, but I'm assuming that rendering and encoding take too long to sustain 60fps. To achieve 60fps, a new frame needs to be displayed every 16.66ms. This means that on the streamer side you have 16.66ms minus decoding time minus network latency to generate and encode the frame. We are only able to generate 10 fps, meaning we need about 100ms to capture and encode a frame (let's ignore decoding and network traversal for now). As you know, I'm inexperienced with DirectX 11, so I can't tell how long rendering roughly takes, but it should not take too long. I compared those numbers with ffmpeg again.

I captured the screen at 4K resolution with this command: ffmpeg -f gdigrab -video_size 3840x2160 -i desktop -c:v h264_nvenc -preset llhq 4k.mp4. Then I scaled it down using this command: ffmpeg -i .\4k.mp4 -vf scale=1920:1080 -c:v h264_nvenc -preset llhq fullhd.mp4

I encoded the 4K video using the following command: ffmpeg -i .\4k.mp4 -c:v h264_nvenc -preset llhq benchmark.mp4 -benchmark. Results: 742 frames in 5.927s = 123fps = 8ms / frame

Then I encoded the full HD video using the following command: ffmpeg -i .\fullhd.mp4 -c:v h264_nvenc -preset llhq benchmark.mp4 -benchmark. Results: 724 frames in 1.946s = 375fps = 2.66ms / frame

So it turns out that 4K encoding is about 3 times slower than full HD encoding.

What I want to say with this is that we should be able to stream 4K @ 60fps, but we can't; the encoding takes too long. I tried to figure out what could cause this and found the following in PeerConnection::SendData in PeerConnection.cpp:

if (format >= VideoFrameFormat::CpuTexture)
{
    buffer = new rtc::RefCountedObject<webrtc::NativeVideoBuffer>(
        video_track_id, format, width, height, static_cast<const void*>(pixels), this);
}
else
{
    auto yuvBuffer = webrtc::I420Buffer::Create(width, height);

    const auto convertToYUV = getYuvConverter(format);

    convertToYUV(pixels, stride,
        yuvBuffer->MutableDataY(), yuvBuffer->StrideY(),
        yuvBuffer->MutableDataU(), yuvBuffer->StrideU(),
        yuvBuffer->MutableDataV(), yuvBuffer->StrideV(),
        width,
        height);

    buffer = yuvBuffer;
}

const auto yuvFrame = webrtc::VideoFrame::Builder()
    .set_video_frame_buffer(buffer)
    .set_rotation(webrtc::kVideoRotation_0)
    .set_timestamp_us(clock->TimeInMicroseconds())
    .build();

source->OnFrame(yuvFrame);
So in the best case the desktop frames only leave GPU memory once, when they are copied so that they can be packetized for transfer over the network. However, it turns out that this is not the case with the current state of the code. When the YUV buffer is created, the frame is copied into system memory. Then somewhere in OnFrame the frame gets passed to the encoder; that's another copy, this time from system memory back to GPU memory. Then when the encoding is done, the frame needs to be copied to system memory again. So that's two more copies than needed. When we are talking about 4K resolution, that's a lot of bytes. This is most likely a bottleneck.

So how do we fix this?

To be honest, I don't understand webrtc well enough yet, so I don't know if there is a way to avoid copying the frame into a YUV buffer in system memory. Theoretically we should be able to just pass a pointer to that texture and not do any copying at all, but I don't know if webrtc supports that. Do you know if we can just pass a pointer to the frame in GPU memory through webrtc, or are we forced to copy?

Oh, one more thing I noticed; I don't know if you are aware of this. You are rendering your scene at 60 fps, but as of now the encoder is hardcoded to encode at 30 fps. So you could save some resources there.

ziriax commented 5 years ago

Oh, I thought you captured at 4K but downscaled to full HD, and then did the encoding... Yes, I expect 4K to be about 4 times slower to encode. I guess it also depends on the GPU; recent RTX cards have more powerful NVENC chips.

Nothing is transferred to CPU memory when you use the GPU video frame format; the YuvBuffer code will not be reached at all. Otherwise that would be a serious regression bug.
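
In other words, from the C# side the format argument alone decides which branch of SendData runs. A quick sketch, using the call from your capture code:

// GPU path: only the D3D11 texture pointer crosses the interop boundary;
// the frame never leaves GPU memory and the I420/YUV conversion branch
// in PeerConnection::SendData is skipped entirely.
_videoTrack.SendVideoFrame(stagingTexture.NativePointer, 0,
    stagingTexture.Description.Width, stagingTexture.Description.Height,
    VideoFrameFormat.GpuTextureD3D11);

// CPU path (the other branch): passing raw pixels with one of the CPU
// VideoFrameFormat values makes SendData copy and convert them into an
// I420Buffer in system memory before they reach the encoder.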

But yes, the current code from NVidia does copy the 4K texture to another internal texture; that step is not needed for low-latency encoding. I should rewrite that NVidia code...

I'm curious to see how the web demo behaves with 4K video, but it is certainly not a scenario we need ourselves.

I also would like to experiment with NVENC sliceMode, as far as I understand the encoder is then able to provide encoded packets on the fly, but I don't think WebRTC can deal with this (it would overlap the encoding and the transmission of the data)

SebastianKunz commented 5 years ago

Oh, I'm sorry. I do capture at 4K and scale it down to full HD; however, I wanted to know whether I can also stream at 4K resolution and how my application deals with that. But I tested the webrtc-dotnet-web-demo at 4K resolution and there I only receive 10 fps (hence the screenshot).

I'm a total dummy, I'm sorry. I was desperately looking for a reason why the encoding takes so long and totally missed the >= in the if statement. Of course, you are totally right. I will investigate optimizing the encoding process further. Thank you for sharing the information about NVENC sliceMode; I will have a look at that. However, I still can't explain the high latency that is also present in the webrtc-dotnet-web-demo example. It has to be something with the encoding.

ziriax commented 5 years ago

Why would you say you're a dummy? Nobody even attempting to work with this WebRTC native technology can be called a "dummy" :-) This is crazy complicated stuff.

I will see how the bouncing balls web demo behaves on my GTX 1060 PC, I don't expect much from it, real-time low-latency streaming of 4K feels crazy ;-) But it is certainly interesting to benchmark it.

My bad, the >= statement is really a big hack, I shouldn't have done that. A basic rule is to use a switch/case for that, but I wrote all this code in a rather short time frame, so you can expect more sloppiness.

SebastianKunz commented 5 years ago

Sure, there is some room for improvement in the codebase, but I'm very happy to have a codebase I can expand on in the first place. I think it's impressive that you managed to get webrtc working; it's very frustrating to get into. Are you open to pull requests?

Have you looked at Moonlight? They are able to stream 4K at 60fps with zero latency (local network). This is something I am aiming for. I know that it is possible, so I am very eager to achieve similar results.

ziriax commented 5 years ago

Now this is interesting, I'm testing the web-demo in 4K, and get 60 FPS with about one frame latency (at least visually, not measured) when dragging the ball in Chrome...

This is not in a LAN, but on a single PC. But as the bandwidth is 5M, this shouldn't be a problem I guess.

I am running a Release build.

Please note that due to a gigantic memory leak, you have to restart the server every time (this needs to be fixed asap)

Of course PR's are welcome, that is the main reason why this is open source :)

ziriax commented 5 years ago

Moonlight looks great, but it seems to be NVidia only, also on the client side.

If you really aim for zero latency on a LAN and don't need a browser (it seems Moonlight requires a browser extension to be installed, or at least Shadowplay), then systems like NewTek's NDI or JPEG-XS will also work, I guess.

SebastianKunz commented 5 years ago

Yes, Moonlight is Nvidia only. Moonlight works fine for zero latency, but it's not supported by Nvidia. The creators of Moonlight probably reverse engineered Nvidia's GameStream protocol and built the client around it. This means that when Nvidia updates the protocol, Moonlight breaks. It's more of a proof of concept and less an actual solution.

ziriax commented 5 years ago

I've pushed new code to master. The web demo now renders in 4K, and a server-side preview window is shown with a time ruler that allows measuring latency. I should have used a different branch, since it might not be stable...

On my local system, I get about 1/60th latency between the video frame being displayed on the server window and in the browser.

I'm going to test how this behaves in a LAN environment now.

Could you test this on a single local PC too? I'm curious to see if you experience high latency on a single PC too (in release build)

SebastianKunz commented 5 years ago

That's great! I pulled the changes and here are my results: I get about 7-11 "marks" of latency. I don't know what unit these are in; maybe you can clarify. Screenshot. I also tried it on my local network. The framerate is roughly the same as in my local-PC setup, at 10fps. Here is a screenshot of the stats on the remote machine (in the local network).

So it turns out something is preventing the server side from reaching 30fps. I'm guessing it's the encoding part. I have some new ideas that I want to try out and will report back once I'm done.

ziriax commented 5 years ago

One tick = 1/60th of a second

Are you sure you are running a release build?

I get 10 FPS in debug, but 60 FPS in release.

The encoding of 4K only takes about 8ms on my system

What kind of system do you have?

SebastianKunz commented 5 years ago

Woah! This is crazy! I never ran my app in release mode. The difference is huuuuge! I do get 60 fps in release mode. Here is a screenshot. I'm blown away. My "screen-sharing" application also runs at 60 fps now. Amazing!

I do have one last question though. The encoder options are currently set through the default encoder settings, where a framerate of 30fps is specified. However, I do receive 60 fps. How is this possible? I would expect the encoder to drop frames in this case. Or is there some mechanism that regulates the encoding framerate dynamically?

ziriax commented 5 years ago

Woohoo!

Yes, in C++ land, the difference between debug and release is insane.

In a newer version, I plan to allow debug .NET builds to use the release WebRTC code... because .NET people don't expect debug builds to be so slow. Also, when this becomes a NuGet package, this won't be an issue anymore.

Does this happen in your code, or also in the web demo?

I think the FPS is just a hint; if your application sends 60 frames per second to the encoder, it might increase the FPS.

So you should try to send no more than 30 FPS yourself.

If you are doing DXGI desktop capture, then I guess you are sending 60 frames per second?

If the web demo also sends 60 FPS when you pass 30, then that is a bug :-)
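
If you want to pace your own capture loop, a tiny helper like the sketch below is enough (hypothetical, not part of the library):

using System.Diagnostics;

// Hypothetical helper: skip sending a frame unless at least
// 1/targetFps seconds have elapsed since the last one was sent.
sealed class FramePacer
{
    private readonly Stopwatch _clock = Stopwatch.StartNew();
    private readonly double _minIntervalMs;
    private double _lastSentMs = double.NegativeInfinity;

    public FramePacer(double targetFps) => _minIntervalMs = 1000.0 / targetFps;

    public bool ShouldSend()
    {
        var now = _clock.Elapsed.TotalMilliseconds;
        if (now - _lastSentMs < _minIntervalMs)
            return false;
        _lastSentMs = now;
        return true;
    }
}

// Usage inside the capture loop, with a 30 FPS target:
//   var pacer = new FramePacer(30);
//   ...
//   if (pacer.ShouldSend())
//       _videoTrack.SendVideoFrame(...);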

SebastianKunz commented 5 years ago

I am more of a C person than a C++ or .NET one, but I did not expect such an impact. As of now I am sending as many frames as possible, so I sometimes go beyond 60 fps. Limiting that is something I have not implemented yet, but will. The web demo behaves correctly: when I set TargetFramerate to 60, I receive 60 in the browser, and when I set it to 30 it only sends 30. So that's working.

All in all, good job! Thank you for helping me out. I still have 80ms of delay with 4K @ 60fps, but that's most likely on me; the way I capture frames is very shady. I think I can fix that.

Have a great weekend!

ziriax commented 5 years ago

80ms is an eternity in GPU land, so clearly something different must be going on in your case. Maybe the code to generate mip-maps is slow, I am not sure...

I would do the following:

SebastianKunz commented 5 years ago

I am back again :D. I completely rewrote the desktop capturer using this project as a reference. I have a capture thread and an encoder thread. The capture thread gets the frame using the Desktop Duplication API and then copies it into one of the intermediate buffers. It then gets sent to the encoder over the webrtc pipeline (videosink->sendFrame()). The encoder then copies the texture into an internal buffer. In general this works very well. Nevertheless, there is something going on that I cannot find the source of: sometimes the application just freezes silently. It simply stops working; neither the capture thread continues nor the encoder. The general capture-encode flow looks like the following:

1. Capture the frame and copy it to an available intermediate buffer
2. Send the latest frame to the video sink

The capture loop ends here and starts again at 1.

The encoder runs on a separate thread; it receives the frame over the webrtc pipeline, encodes it and then hands it over to webrtc again (I am using the encoder that comes with this project).

When I put an rtc::Thread::SleepMs(15) right after step 2, it works just fine. My assumption was that the encoder dies whenever it gets a frame before it has finished encoding the previous one. I rewrote some code, but that didn't seem to fix the problem. I also changed the internal encoderOutputDelay from 1 to 3, which didn't help. When I run the capturer and encoder in lockstep instead, so I capture a frame, encode it and then capture the next frame, it works fine. However, I then cannot achieve a decent framerate, and latency is also worse. Any pointers on how to fix or approach the problem further?
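
For reference, the handoff between my two threads is essentially a single-slot "latest frame" exchange, roughly like this simplified sketch (not my actual code):

using System.Threading;

// Simplified sketch of the capture -> encode handoff: the capturer publishes
// the newest frame into a single slot, the encoder always takes whatever is
// newest, and older frames are simply dropped. The real code must of course
// also guard the shared D3D11 context (see below).
sealed class LatestFrameSlot<T> where T : class
{
    private T _latest;

    // Capture thread: publish the newest frame; the previous occupant is
    // returned so its buffer can be recycled.
    public T Publish(T frame) => Interlocked.Exchange(ref _latest, frame);

    // Encoder thread: take the newest frame, or null if nothing new arrived.
    public T Take() => Interlocked.Exchange(ref _latest, null);
}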

ziriax commented 5 years ago

At first sight you are encountering multi-threading issues. I guess you don't do anything to make the Direct3D calls atomic?

Look at https://github.com/WonderMediaProductions/webrtc-dotnet-core/blob/f5030cd21b72841b1a3a38bf939559d2b3127747/webrtc-dotnet-graphics/VideoRenderer.cs#L177

You must also perform the same D3D11 thread locking around the DXGI capture.

Not sure if this will solve your issues
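
Roughly what I mean, as a sketch with SharpDX (the exact type and member names here are an assumption on my part; the VideoRenderer.cs line above shows the call the library actually uses):

// Sketch only: enable D3D multithread protection on the device and take
// the same lock around the DXGI capture work, so the native encoder thread
// cannot use the context at the same time. The SharpDX Multithread wrapper
// and its members are assumed here; check VideoRenderer.cs for the real call.
var multithread = device.ImmediateContext.QueryInterface<SharpDX.Direct3D10.Multithread>();
multithread.SetMultithreadProtected(true);

multithread.Enter();
try
{
    // TryAcquireNextFrame + CopySubresourceRegion + GenerateMips + SendVideoFrame
}
finally
{
    multithread.Leave();
}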

SebastianKunz commented 5 years ago

Yes, exactly. I thought about this and tried to solve the problem with an std::mutex. Of course that doesn't make any sense, because the operations are done on the GPU and not the CPU 🤦‍♂. I will give this a shot and report back. Thank you for the hint!

ziriax commented 5 years ago

Yes, it's best to read this in depth