microsoft / MixedReality-WebRTC

MixedReality-WebRTC is a collection of components to help mixed reality app developers integrate audio and video real-time communication into their application and improve their collaborative experience
https://microsoft.github.io/MixedReality-WebRTC/
MIT License

Additional GPU-Accelerated Video Codec Support? #405

Open · alterscape opened this issue 4 years ago

alterscape commented 4 years ago

Describe the problem My team is working to add support for encoding video frames generated within Unity with a hardware-accelerated video codec, without unnecessary CPU<->GPU copies (AsyncGPUReadback). The use-case is streaming RGBD data acquired with an Azure Kinect DK unit. This feature would require solving at least two related design challenges, but I'm opening this ticket as a starting point for discussion.

First challenge: Registering additional video codecs with WebRTC and making them available for SDP negotiation. This looks like it's handled inside the low-level WebRTC implementation from Google, so we need to study that codebase and see how codec name strings are resolved to implementations, and how to register new implementations.

Second challenge: Using the codec with Unity and DirectX. We need a mechanism to pass encoded frames as a byte array into WebRTC on the sending end, and submit bytes received on the wire to our native plugin on the receiving end. The current API surface operates in terms of video frames, which is extremely convenient but makes it hard to implement a solution that encodes frames on the GPU without first copying raw pixels to the CPU, and vice-versa. It looks like the concept of a VideoFrame is deeply integrated into Google's WebRTC code so this may not even be possible without a fairly significant fork of that codebase, but I'd be happy to hear otherwise.

Describe the solution you'd like We're unsure how to best achieve this, but if this is a use-case that is of interest, we'd be happy to contribute whatever we come up with upstream. If you have pointers where to inject this functionality with minimal disruption, I'd be happy to base my work on that.

I think our ideal interface on the transmitting side looks something like video_track.SubmitFrame(NativeArray<byte> encoded_frame) or similar.

Implementing our own rtc::VideoSink almost looks like the right answer, but it doesn't seem to address our need to submit already-encoded bytes (i.e., avoiding "download from GPU with AsyncGPUReadback, upload again to an on-GPU encoder, then download the resulting bytes and transmit those").
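For reference, my (possibly wrong) reading is that a sink in Google's implementation only ever receives decoded frames; a minimal sketch, assuming upstream libwebrtc's rtc::VideoSinkInterface (the class name here is just an example):

```cpp
// Sketch only: a sink receives decoded webrtc::VideoFrame objects, not
// encoded bytes, so by the time OnFrame() runs the codec pipeline has
// already executed.
#include "api/video/video_frame.h"
#include "api/video/video_sink_interface.h"

class FrameInspector : public rtc::VideoSinkInterface<webrtc::VideoFrame> {
 public:
  void OnFrame(const webrtc::VideoFrame& frame) override {
    // The frame buffer here holds raw (or kNative) pixels, never an
    // encoded bitstream.
    last_width_ = frame.width();
    last_height_ = frame.height();
  }

 private:
  int last_width_ = 0;
  int last_height_ = 0;
};
```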

Describe alternatives you've considered We're not able to reach performance targets with VP8/VP9, so we need to look into alternatives. We're also noticing some latency that might be attributable to AsyncGPUReadback, and we'd like to minimize that.

Additional context N/A

djee-ms commented 4 years ago

Hi @alterscape,

We already have some insight into all of this, and there is some work by the Media Foundation team on the Chromium video capture code (which unfortunately we don't use yet, but hope to use once WinRTC makes it available to us) to avoid copies and pass frames directly in VRAM from the capturer to the hardware encoder.

First challenge: Registering additional video codecs with WebRTC and making them available for SDP negotiation.

This can be done in Google's implementation by registering a custom video codec factory, which injects your codec into the mix. See how the WebRTC UWP SDK creates a custom video codec factory and injects it into the peer connection when it is created.

However, WebRTC cannot handle arbitrary codecs out of the box; it normally requires that the codec have a defined RTP transport. For each supported codec there is a corresponding RFC describing how to transport it over (S)RTP; see for example the H.264 payload format RFC (RFC 6184). It is my (limited) understanding that you will not be able to inject a codec for which Google did not already write an RTP implementation, but I might be wrong. This is also likely to break compatibility with other implementations, which is only a problem if you do not control the client apps on both peers.
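For illustration, a rough, untested sketch of such a factory is below; the exact interface varies between libwebrtc milestones, and MyD3D11H264Encoder is a placeholder for whatever webrtc::VideoEncoder implementation wraps your hardware encoder:

```cpp
// Sketch only: a custom encoder factory that advertises H.264 and hands out
// a hypothetical GPU-backed encoder. Signatures depend on the libwebrtc
// milestone you build against.
#include <memory>
#include <vector>

#include "api/audio_codecs/builtin_audio_decoder_factory.h"
#include "api/audio_codecs/builtin_audio_encoder_factory.h"
#include "api/create_peerconnection_factory.h"
#include "api/video_codecs/builtin_video_decoder_factory.h"
#include "api/video_codecs/sdp_video_format.h"
#include "api/video_codecs/video_encoder.h"
#include "api/video_codecs/video_encoder_factory.h"

class GpuVideoEncoderFactory : public webrtc::VideoEncoderFactory {
 public:
  std::vector<webrtc::SdpVideoFormat> GetSupportedFormats() const override {
    // The codec name must be one the RTP layer already knows how to packetize.
    return {webrtc::SdpVideoFormat(
        "H264",
        {{"profile-level-id", "42e01f"}, {"packetization-mode", "1"}})};
  }

  // Required by older milestones; newer ones dropped this method.
  CodecInfo QueryVideoEncoder(
      const webrtc::SdpVideoFormat& /*format*/) const override {
    CodecInfo info;
    info.is_hardware_accelerated = true;
    info.has_internal_source = false;
    return info;
  }

  std::unique_ptr<webrtc::VideoEncoder> CreateVideoEncoder(
      const webrtc::SdpVideoFormat& format) override {
    // MyD3D11H264Encoder is hypothetical: a webrtc::VideoEncoder that unwraps
    // kNative frame buffers and drives the hardware encoder.
    return std::make_unique<MyD3D11H264Encoder>(format);
  }
};

// Inject the factory when creating the peer connection factory.
rtc::scoped_refptr<webrtc::PeerConnectionFactoryInterface>
CreateFactoryWithGpuEncoder(rtc::Thread* network_thread,
                            rtc::Thread* worker_thread,
                            rtc::Thread* signaling_thread) {
  return webrtc::CreatePeerConnectionFactory(
      network_thread, worker_thread, signaling_thread,
      /*default_adm=*/nullptr, webrtc::CreateBuiltinAudioEncoderFactory(),
      webrtc::CreateBuiltinAudioDecoderFactory(),
      std::make_unique<GpuVideoEncoderFactory>(),
      webrtc::CreateBuiltinVideoDecoderFactory(),
      /*audio_mixer=*/nullptr, /*audio_processing=*/nullptr);
}
```

The important constraint, per the point above, is that the SdpVideoFormat name advertised by the factory must correspond to a codec for which an RTP packetization already exists.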

Second challenge: Using the codec with Unity and DirectX. We need a mechanism to pass encoded frames as a byte array into WebRTC on the sending end, and submit bytes received on the wire to our native plugin on the receiving end. The current API surface operates in terms of video frames, which is extremely convenient but makes it hard to implement a solution that encodes frames on the GPU without first copying raw pixels to the CPU, and vice-versa.

There are 2 separate issues here:

  1. The performance implication of copying frames from VRAM to CPU after capture, then re-uploading them from CPU to VRAM for hardware encoding, before finally copying back to CPU again for network sending. The Media Foundation team is looking at fixing that for Chromium. The way this can be done is to use the special video frame type kNative. In short, you generate a kNative frame buffer containing only the D3D handle to the VRAM surface, and no pixel data, then pass that "frame" from the capturer to the encoder, which can use it directly (see the sketch after this list). The only issue is to make sure your custom capturer and your custom encoder are used together; any other capturer or encoder won't understand what that kNative frame is, and nothing will work.

    A special enum value 'kNative' is provided for external clients to implement their own frame buffer representations, e.g. as textures. The external client can produce such native frame buffers from custom video sources, and then cast it back to the correct subclass in custom video sinks. The purpose of this is to improve performance by providing an optimized path without intermediate conversions.

  2. "...and submit bytes received on the wire to our native plugin on the receiving end."

    It is unlikely to ever be possible to transmit arbitrary bytes. You have to use a supported codec, and on the receiving side you will get your frame back in CPU memory. You could argue there is a similar decoder-to-renderer optimization to be made there, and to some extent it may be possible using the same kNative trick, but I am less confident it can be done on that side (I never really looked into it).
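As an illustration of point 1, here is a rough, untested sketch of what such a kNative buffer and the capturer-side wrapping could look like with D3D11; the class and function names are placeholders, and the exact VideoFrameBuffer interface varies between milestones:

```cpp
// Sketch only: a kNative frame buffer carrying a D3D11 texture instead of
// pixel data. Only a matching custom encoder that knows this concrete type
// can consume it.
#include <d3d11.h>
#include <wrl/client.h>

#include "api/video/video_frame.h"
#include "api/video/video_frame_buffer.h"
#include "rtc_base/ref_counted_object.h"

class D3D11TextureBuffer : public webrtc::VideoFrameBuffer {
 public:
  D3D11TextureBuffer(Microsoft::WRL::ComPtr<ID3D11Texture2D> texture,
                     int width, int height)
      : texture_(std::move(texture)), width_(width), height_(height) {}

  Type type() const override { return Type::kNative; }
  int width() const override { return width_; }
  int height() const override { return height_; }

  // Called if some component insists on CPU pixels; a real implementation
  // would either do a (slow) GPU readback here or fail loudly.
  rtc::scoped_refptr<webrtc::I420BufferInterface> ToI420() override {
    return nullptr;
  }

  ID3D11Texture2D* texture() const { return texture_.Get(); }

 private:
  Microsoft::WRL::ComPtr<ID3D11Texture2D> texture_;
  const int width_;
  const int height_;
};

// The capturer wraps the buffer into a frame and pushes it down the usual
// path; only the custom encoder unwraps it again.
webrtc::VideoFrame WrapTexture(
    Microsoft::WRL::ComPtr<ID3D11Texture2D> texture, int width, int height,
    int64_t timestamp_us) {
  rtc::scoped_refptr<webrtc::VideoFrameBuffer> buffer(
      new rtc::RefCountedObject<D3D11TextureBuffer>(std::move(texture),
                                                    width, height));
  return webrtc::VideoFrame::Builder()
      .set_video_frame_buffer(buffer)
      .set_timestamp_us(timestamp_us)
      .build();
}
```

On the encoder side, the idea would be to check that the incoming frame's buffer type is kNative and cast it back to the concrete class to reach the texture.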

Related, note that @eanders-MS was writing a native renderer for Unity to optimize performance of async GPU readback, if that's of any help. I think this work is not underway anymore, but the basic implementation is there at https://github.com/AltspaceVR/MixedReality-WebRTC/tree/eanders/native-rendering.

djee-ms commented 4 years ago

Just to manage expectations: although I did look at all of this when investigating potential performance optimizations for HoloLens 2, we never wrote a single line of code for it. This is a very large chunk of work given our resourcing, so it is unlikely we'd be able to do it ourselves while also managing this project. Of course, we welcome contributions! 😉

alterscape commented 4 years ago

Thank you for the extensive write-up of what you've looked at so far! I'll share this with my team and see what we can do.

eanders-ms commented 4 years ago

@djee-ms, I'm glad you mentioned the native renderer. Work on this is still underway. It has been taken over by @rygo6-MS, who is working out the last issues with running on Android.

rygo6-MS commented 4 years ago

#586 Native Video Render is currently open for testing.