w3c / mediacapture-transform

MediaStreamTrack Insertable Media Processing using Streams
https://w3c.github.io/mediacapture-transform/

A review #38

Closed jan-ivar closed 3 years ago

jan-ivar commented 3 years ago

This is a review in response to a Call for Review expiring today. I was going to open issues as requested, only to find they mostly exist already — many predating the CFR even — but may be hard to find. So I'm writing this overview instead.

Outstanding design issues I see as blockers:

  1. Don’t expose raw realtime media on main thread, since there's a precedent that says this is bad https://github.com/w3c/mediacapture-transform/issues/23#issuecomment-842585817
  2. Limit to solving video for now, or explain why AudioWorklet is insufficient, to not reinvent it https://github.com/w3c/mediacapture-transform/issues/29
  3. Lack of a threading model https://github.com/w3c/mediacapture-transform/issues/27
  4. Lack of algorithms https://github.com/w3c/mediacapture-transform/issues/36
  5. Controlling channel mechanism is unclear https://github.com/w3c/mediacapture-transform/issues/24
  6. Unclear memory I/O design given that a VideoFrame may be in GPU memory https://github.com/w3c/mediacapture-transform/issues/34 (Funny hat: GPU→CPU→GPU?)
  7. Can one JS document hose a browser's camera capturing ability? https://github.com/w3c/mediacapture-transform/issues/37

The other issue is that there seem to be massive missing areas in WebCodecs that we rely on:

  1. VideoFrame seems optimized for WebCodecs, not funny hats, re copies — Is there room to question immutable VideoFrames vs e.g. lockable mutable ones when frames are not in GPU?
  2. Pixel formats seem underspecified, https://github.com/w3c/webcodecs/issues/165 and https://github.com/w3c/webcodecs/issues/200 vs. e.g. earlier discarded ideas.

There also appears to be a lack of documentation and working samples, to the point where it's hard to see how any web developer could use this.

This makes it hard to see the big picture or validate whether the overall design addresses the listed use cases (a list that, judging by the examples given, does not appear to require interfacing with WebCodecs — is that an omission?)

chcunningham commented 3 years ago

Answering just the questions for WebCodecs. @sandersdan @dalecurtis

  • Unclear memory I/O design given that a VideoFrame may be in GPU memory #34 (Funny hat: GPU→CPU→GPU?)

VideoFrame is (will be) a CanvasImageSource. GPU→CPU copies are not required to do funny hats (e.g. use canvas drawImage() or WebGL texImage2D()).

Similar to ImageBitmap, WebCodecs does not describe the precise location of VideoFrame backing memory (GPU, CPU, shared, ...). This should not be required to define its observable behavior.
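As a sketch of this copy-free path (the helper name and hat placement below are illustrative, not from any spec), a "funny hat" step can treat the VideoFrame purely as a CanvasImageSource:

```javascript
// Sketch: apply a "funny hat" by drawing the VideoFrame like any other
// CanvasImageSource. No explicit GPU→CPU readback appears in script; the
// user agent chooses the upload/convert path inside drawImage().
function drawFunnyHat(ctx, frame, hatImage) {
  ctx.drawImage(frame, 0, 0);                              // camera frame
  ctx.drawImage(hatImage, frame.displayWidth / 2 - 32, 0); // overlay on top
}

// In a browser this would run per frame, roughly:
//   const canvas = new OffscreenCanvas(w, h);
//   const ctx = canvas.getContext('2d');
//   drawFunnyHat(ctx, videoFrame, hat);
//   const out = new VideoFrame(canvas, { timestamp: videoFrame.timestamp });
```

The 2D context is passed in rather than created inside the helper, so the same logic could target a WebGL-backed path instead.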

  • VideoFrame seems optimized for WebCodecs, not funny hats, re copies — Is there room to question immutable VideoFrames vs e.g. lockable mutable ones when frames are not in GPU?

Please see above.

  • Pixel formats seem underspecified, w3c/webcodecs#165 and w3c/webcodecs#200 vs. e.g. earlier discarded ideas.

The proposed use of FourCC-style pixel formats is widely used across existing codec APIs (e.g. FFmpeg). See: https://www.fourcc.org/yuv.php. We will soon provide more precise definitions for each format, as described in the linked issues. We will additionally extend the list to include other formats (e.g. RGBA, NV12, ...).
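For illustration, the plane layout implied by a few of those FourCC-style names can be tabulated. The format spellings below (I420, NV12, RGBA) are assumptions based on the linked issues; the final registry may differ:

```javascript
// Illustrative only: plane counts implied by a few FourCC-style formats.
const PLANE_COUNT = {
  I420: 3, // planar Y, U, V (4:2:0 chroma subsampling)
  NV12: 2, // Y plane + interleaved UV plane
  RGBA: 1, // single packed plane
};

function planeCount(format) {
  if (!(format in PLANE_COUNT)) {
    throw new RangeError(`unknown pixel format: ${format}`);
  }
  return PLANE_COUNT[format];
}
```

A precise per-format definition would also have to pin down plane strides, chroma siting, and byte order, which is exactly what the linked issues ask for.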

youennf commented 3 years ago

Following my email on public-webrtc, before diving into API specifics, I think we should be discussing the following foundational topics:

  • We have MediaStreamTrack data that will often be backed by GPU data
  • We have WebGPU API that will allow processing GPU data
  • Is this API able to do WebGPU processing on MediaStreamTrack data in an optimal manner, ideally no conversion/no memory copies? If not, do we have a good idea on how to add that support at a later stage?

youennf commented 3 years ago

We will additionally extend the list to include other formats (e.g. RGBA, NV12, ...)

Agreed this is an important topic; media pipelines try hard not to do any conversion. https://github.com/w3c/mediacapture-extensions/issues/13 might somehow help. I am unclear whether implicit conversions (like painting a YUV frame on a canvas) are a good thing or not.

dalecurtis commented 3 years ago

I am unclear whether implicit conversions (like painting a YUV frame on a canvas) are a good thing or not.

I agree we want to avoid implicit conversions in the course of codec I/O (encode/decode) and transfer (postMessage, transferable streams) operations for WebCodecs. However, draw operations should allow implicit conversion so the user agent can choose the most efficient platform specific rendering path.

E.g., drawImage(VideoFrame) works the same way it does for drawImage(HTMLVideoElement). In Chrome we will defer the conversion until compositing or may even just hand off the YUV data as an overlay to the platform.

guidou commented 3 years ago
  • Don’t expose raw realtime media on main thread, since there's a precedent that says this is bad #23 (comment)

This has been replied to in #23. It seems that we have agreement that processing on main thread should not be impossible. We still do not have consensus on how hard it should be (no one has replied to my proposal to have it off by default and require an explicit constructor parameter to enable it). There are valid use cases for using the main thread: for example, easier migration from canvas capture, and any application where little processing occurs and a Worker would add no benefit, only extra resource consumption and a more complicated application.
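For the off-main-thread case, a minimal sketch of the worker hand-off looks like the following, assuming the MediaStreamTrackProcessor / MediaStreamTrackGenerator shapes in the current draft (`sendStreamsToWorker` is an illustrative helper, not a spec API):

```javascript
// Sketch: move the media processing to a worker by transferring the streams.
// On the main thread (per the current draft) this would be driven by:
//   const processor = new MediaStreamTrackProcessor({ track });
//   const generator = new MediaStreamTrackGenerator({ kind: 'video' });
//   sendStreamsToWorker(worker, processor.readable, generator.writable);

function sendStreamsToWorker(worker, readable, writable) {
  // Transferable streams: ownership moves to the worker, so no raw media
  // is touched on the main thread afterwards.
  worker.postMessage({ readable, writable }, [readable, writable]);
}
```

Inside the worker, a TransformStream would then pipe `readable` through the processing step into `writable`.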

  • Limit to solving video for now, or explain why AudioWorklet is insufficient, to not reinvent it #29

This has been responded to in several messages of #29, but here's a summary:

  • Lack of a threading model #27

It's not clear to me why this would be needed. I asked for further clarification in #27.

  • Lack of algorithms #36

The algorithms are not really needed, since the spec is defined in terms of the results the APIs should produce rather than the steps required to produce them. However, since the Streams API recommends using them, I'll prepare a PR to add them (without any change in semantics).

  • Controlling channel mechanism is unclear #24

Can you elaborate on what is unclear about it? I replied to #24 a while ago and haven't received any extra feedback since then.

  • Unclear memory I/O design given that a VideoFrame may be in GPU memory #34 (Funny hat: GPU→CPU→GPU?)

Already replied to by Chris in https://github.com/w3c/mediacapture-transform/issues/38#issuecomment-842808300 Still unclear?

  • Can one JS document hose a browser's camera capturing ability? #37

Not by default, but yes: if the document reads enough frames and does not close them, the camera capturer can suspend producing frames.
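The well-behaved pattern is for the processing step to close every frame it consumes, returning its buffer to the capturer's pool. A hedged sketch using a TransformStream (`makeClosingTransform` and the `process` callback are illustrative, not from the spec):

```javascript
// Sketch: a processing step that never retains camera frames. Each incoming
// frame is closed as soon as the processed value is enqueued, even if the
// processing callback throws.
function makeClosingTransform(process) {
  return new TransformStream({
    transform(frame, controller) {
      try {
        controller.enqueue(process(frame));
      } finally {
        frame.close(); // always release the capturer's buffer
      }
    },
  });
}
```

In a browser this would sit between `processor.readable` and `generator.writable`, with `process()` producing a new VideoFrame (e.g. from a canvas).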

youennf commented 3 years ago

It seems that we have agreement that processing on main thread should not be impossible.

I would phrase it this way:

  • There is no consensus on a goal to provide an API that allows processing on main thread.
  • There is consensus on not having a goal to forbid main thread processing.

It's not clear to me why this would be needed. I asked for further clarification in #27.

I added some comments in the thread.

The algorithms are not really needed since the spec is defined in terms of the results the APIs

The current spec would be really hard to implement in an interoperable manner. For instance, some JS-exposed behaviours are not actually specified. Describing these algorithms will allow reaching interoperability.

Can you elaborate on what is unclear about it? I replied to #24 a while ago and haven't received any extra feedback since then.

I added some comments in the thread.

Already replied to by Chris in #38 (comment) Still unclear?

@aboba might clarify things here. My understanding is as follows:

  • Can one JS document hose a browser's camera capturing ability? #37

Not by default, but yes: if the document reads enough frames and does not close them, the camera capturer can suspend producing frames.

The general question is how much we can reduce this risk and whether APIs could be designed with this in mind. For instance, the current API exposes a pattern that seems nice: when enqueueing a frame in a MediaStreamTrackGenerator, the frame gets consumed. I would hope we can continue improving the design to further reduce these risks.

guidou commented 3 years ago

It seems that we have agreement that processing on main thread should not be impossible.

I would phrase it this way:

  • There is no consensus on a goal to provide an API that allows processing on main thread.
  • There is consensus on not having a goal to forbid main thread processing.

I agree with this phrasing.

The current spec would be really hard to implement in an interoperable manner.

Are you referring to the streams algorithms or something else specifically?

For instance, some JS-exposed behaviours are not actually specified.

Can you file specific bugs about those behaviors?

Describing these algorithms will allow reaching interoperability.

I'll prepare a PR to include those algorithms. Tracked on #36.

  • We have MediaStreamTrack data that will often be backed by GPU data
  • We have WebGPU API that will allow processing GPU data
  • Is this API able to do WebGPU processing on MediaStreamTrack data in an optimal manner, ideally no conversion/no memory copies? If not, do we have a good idea on how to add that support at a later stage?

I think we should continue this discussion on a separate bug, perhaps on the WebCodecs spec if we agree that WebCodec's VideoFrame is the right API to expose video frames coming from a MediaStreamTrack.
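For concreteness, the WebGPU path being discussed would look roughly like this, assuming the WebGPU draft's `importExternalTexture()`, which accepts a `VideoFrame` source (`importFrameAsTexture` is an illustrative wrapper, and whether a copy actually happens is up to the implementation):

```javascript
// Sketch: hand a camera VideoFrame to WebGPU without an explicit copy in
// script. The resulting GPUExternalTexture is only valid while the frame is.
function importFrameAsTexture(device, frame) {
  return device.importExternalTexture({ source: frame });
}

// Per-frame use would look roughly like:
//   const texture = importFrameAsTexture(device, videoFrame);
//   // ...bind `texture` in a bind group, dispatch a compute/render pass...
//   videoFrame.close(); // the external texture expires with the frame
```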

  • Can one JS document hose a browser's camera capturing ability? #37

Not by default, but yes: if the document reads enough frames and does not close them, the camera capturer can suspend producing frames.

The general question is how much we can reduce this risk and whether APIs could be designed with this in mind. For instance, the current API exposes a pattern that seems nice: when enqueueing a frame in a MediaStreamTrackGenerator, the frame gets consumed. I would hope we can continue improving the design to further reduce these risks.

I agree that the goal should be risk reduction since completely eliminating this possibility would negate the goal of optimal WebGPU processing.

guidou commented 3 years ago
  1. Don’t expose raw realtime media on main thread, since there's a precedent that says this is bad #23 (comment) We have shown that there are valid use cases for using the API on the main thread, and I believe we have agreed that using the main thread is discouraged, but forbidding access on the main thread is not an explicit goal. Either way, we can continue the discussion in #23.

  2. Limit to solving video for now, or explain why AudioWorklet is insufficient, to not reinvent it #29 Already explained, and the spec now mentions some use cases for which the proposed API is a better fit. We can continue the discussion in #29.

  3. Lack of a threading model #27 Closed. No threading model required unless a more specific need is raised.

  4. Lack of algorithms #36 Fixed.

  5. Controlling channel mechanism is unclear #24 The mechanism has been removed.

  6. Unclear memory I/O design given that a VideoFrame may be in GPU memory #34 (Funny hat: GPU→CPU→GPU?) Already answered. This would be a WebCodecs issue anyway.

  7. Can one JS document hose a browser's camera capturing ability? #37

Already answered in #37.

The other issue is that there seem to be massive missing areas in WebCodecs that we rely on:

  1. VideoFrame seems optimized for WebCodecs, not funny hats, re copies — Is there room to question immutable VideoFrames vs e.g. lockable mutable ones when frames are not in GPU?
  2. Pixel formats seem underspecified, w3c/webcodecs#165 and w3c/webcodecs#200 vs. e.g. earlier discarded ideas.

Already answered.

There also appears to be a lack of documentation and working samples, to the point where it's hard to see how any web developer could use this.

Some working samples: https://webrtc.github.io/samples/src/content/insertable-streams/audio-processing/ https://webrtc.github.io/samples/src/content/insertable-streams/video-processing/

This makes it hard to see the big picture or validate whether the overall design addresses the listed use cases (a list that, judging by the examples given, does not appear to require interfacing with WebCodecs — is that an omission?)

Not clear to me what this means. Can you file a specific issue with more details about what you mean?

Closing this since only a handful of issues raised by this review remain open and it's easier to continue the discussion in their respective github issues.