w3c / mediacapture-extensions

Extensions to Media Capture and Streams by the WebRTC Working Group
https://w3c.github.io/mediacapture-extensions/

Investigate the possibility to transfer MediaStreamTrack #16

Closed youennf closed 3 years ago

youennf commented 3 years ago

This might be useful in the case identified in https://github.com/w3c/mediacapture-screen-share/issues/158. If we go with media capture insertable streams, JavaScript could potentially shim such a postMessage by getting access to individual frames and sending them through postMessage to recreate a MediaStreamTrack. Implementing transfer in the user agent could make it easier for developers and potentially more efficient.

eladalon1983 commented 3 years ago

@youennf, as per your request, some of the challenges Chrome would have in implementing this:

zenhack commented 3 years ago

For my use case (https://github.com/w3c/webrtc-extensions/issues/64#issuecomment-764922669), just handling mic & camera would itself be useful -- really I just need a parent iframe to be able to mediate access to these from a child, though it would be nice to be able to supply apps with other sources at the users' discretion; with other resources apps can request, the platform lets users supply any API-compatible object (which can be supplied by Sandstorm itself, or by other apps). It would be nice to have that property extend to browser-provided objects too. But perhaps if the general case is a much bigger engineering effort, we should settle for just mic & camera in the short term.

I suppose my need may theoretically apply to other things that generate browser permission prompts as well, though for stuff like location data it's not prohibitive to just proxy, which we obviously can't do with mic & camera; I'm not sure if there's anything else where we'd really need the browser's help.

jan-ivar commented 3 years ago

Chrome does not currently allow cross-origin MediaStreams, and does not implement the concept of tainting. @guidou is a good person to consult about this.

To triage discussion, this is https://github.com/w3c/mediacapture-main/issues/529.

I am worried that for the general case, expensive additional IPCs would be required to send frames from the place where they are produced (e.g. Canvas) to the new owner of the MediaStreamTrack.

I don't think there's a general case. Camera capture is already over IPC in most implementations, so transferring a MediaStreamTrack would create a second consumer of the original source, much like calling getUserMedia again, not "additional IPC" from the first consumer. Net zero performance cost. Ditto screen-capture AFAIK. Canvas may be different (depending on where its rendering pipeline lives).

What's true though is the MediaStreamTrack API abstracts lots of sources for use in sinks, and support would impact all:

[image: slide showing MediaStreamTrack sources and sinks]

Transferring a MediaStreamTrack to a worker has come up before as a way to perhaps simplify some of the raw media access APIs under discussion. cc @padenot

I think the main challenge is figuring out concurrent access to the underlying source, since tracks can be cloned (normally, when an object is transferred, it's a hand-off, but track.clone() sort of defeats that in many ways). So raw media access may be concurrent.

youennf commented 3 years ago

transferring a MediaStreamTrack would create a second consumer of the original source,

We would need to decide whether the transferred MediaStreamTrack would get ended if its origin document is being destroyed. Transferred Readable/WritableStreams work this way for instance.

Net zero performance cost. Ditto screen-capture AFAIK. Canvas may be different (depending on where its rendering pipeline lives).

Agreed. Some 'transfer' operations for some MediaStreamTrack might have a very limited cost. But others might have a greater cost.

jan-ivar commented 3 years ago

transferring a MediaStreamTrack would create a second consumer of the original source,

We would need to decide whether the transferred MediaStreamTrack would get ended if its origin document is being destroyed.

Yes. On one end, there's precedent with same-origin popups (which can be handed MediaStreamTracks today by the opener), where lifespan is tied to the original document. This works in most browsers, modulo some recent regressions.

That behavior may be A) accidental, and B) driven by the fact that all the privacy indicators are still in the original document.

On the other hand, being able to temporarily hand off MediaStreamTracks across same-origin navigation (via say a service worker), might solve the lobby problem, without requiring organizing apps entirely around the history API (it's why you get no haircheck in WebEx on Safari, and double-prompted in Firefox, though in the latter we're working on ideas to fix this differently).

That said, extending camera/mic capture past tab close I think would be scary and surprising. 😨

Transferred Readable/WritableStreams work this way for instance.

Those in contrast seem designed specifically for cross-realm piping, so they inherently need to work that way.

But if I

postMessage(track.clone())

...then I think that's different. It's more like passing a resource handle around.

eladalon1983 commented 3 years ago

I don't think there's a general case. Camera capture is already over IPC in most implementations, so transferring a MediaStreamTrack would create a second consumer of the original source, much like calling getUserMedia again, not "additional IPC" from the first consumer. Net zero performance cost. Ditto screen-capture AFAIK. Canvas may be different (depending on where its rendering pipeline lives).

It seems to me like (1) mic/camera and (2) screen-capture are special, in that the real source there is the browser itself. These contrast with cases where the application is the source of the data; the examples I can cite include (a) Canvas, (b) PeerConnection, (c) Breakout Box, (d) Web Audio and (e) HTMLMediaElement. Four out of these five are in the slide you've referenced. I am not sure if this list is exhaustive or not.

By "general" I mean to say {1, 2, a, b, c, d, e, ...}. Of these, we agree that 1 and 2 can be done without an additional IPC. But all other cases would, AFAICT, require additional IPCs.

What is suggested? To support transferability of only 1 and 2, and make transferring fail for the other cases? Or to support all cases, but have it be cheap in some cases and expensive in others?

jan-ivar commented 3 years ago

That seems about right.

and make transferring fail for the other cases?

Well, there are potentially same-process use cases like posting to a worker that would still work for all.

For getTabMedia, my preference would be to assign it to an RTCRtpSender directly, instead of postMessaging to a different iframe just to do the same thing.

youennf commented 3 years ago

where the application is the source of the data; the examples I can cite include (a) Canvas, (b) PeerConnection, (c) Breakout Box, (d) Web Audio and (e) HTMLMediaElement.

For most of these existing cases, there needs to be a good way to share audio and video across processes anyway. For example, PeerConnection implementations may do HW encoding/decoding out-of-process. Ditto for web audio rendering. Ditto for compositing. This is not to say that the cost of transfer will be null, just that it might not be as bad as we fear.

What is suggested? To support transferability of only 1 and 2, and make transferring fail for the other cases? Or to support all cases, but have it be cheap in some cases and expensive in others?

I would first define what we want out of MediaStreamTrack transfer, and how we want to define it. For the what, there is a desire to postMessage tracks across iframes, as well as to DedicatedWorkers for processing. For the how, it seems we can agree that their lifetime is bound to the context in which they are created.

I would hope we can add support for these cases in a reasonably cheap way, and have it almost free in a few known cases (in-process transfer for instance).

eladalon1983 commented 3 years ago

References to MediaStreamTrack "types" in this comment will refer to the enumeration from a previous comment.

and make transferring fail for the other cases?

Well, there are potentially same-process use cases like posting to a worker that would still work for all.

There are probably simple cases, but I think a definition will be necessary for all cases. It is not clear to me how the suggestion discussed here intends to handle cross-origin transfer of MediaStreamTracks of types a-e.

If the definition that ends up being chosen specifies that transferability only applies to MediaStreamTracks of type 1-2, and that attempting to transfer tracks of types a-e would fail, then I might be able to incorporate implementing this into my work plan. (My work plan currently centers on cropping MediaStreamTracks of type-2.) But if the specification that is chosen allows the transfer of MediaStreamTracks of any arbitrary type, then the required engineering effort would be quite substantial, and I am not aware of anyone working on Chromium that is currently interested in the engineering investment that would be required for implementation.

there needs to be a good way to share audio and video across processes anyway

Transfer between different types of processes comes at different engineering and CPU costs. It is not immediately obvious to me that Chromium has an easy-to-implement, efficient way to transfer video frames from one render-process to another, when these frames originate in the render process itself (i.e. MediaStreamTrack types a-e).

This of course does not mean that making MediaStreamTracks transferable is not desirable. It does sound like an overall good thing to me. I just don't think that the investment necessary to implement transferring of arbitrary tracks would be within scope for me atm. And I am not sure who else might need it enough to implement it in the near future. Perhaps @alvestrand knows.

zenhack commented 3 years ago

I have some interest in b (peer connection) as well as 1 and 2. The reason for the latter is that I also want to be able to block the inner iframe from establishing webrtc network connections itself (which requires changes to CSP, see https://github.com/w3c/webappsec-csp/pull/457), which would mean that the networking bits would also need to be mediated by the parent frame, and I'd need some way to pass network-obtained streams into the inner iframe. My gut tells me that that should be much easier than e.g. canvas, but I'm not familiar with the relevant internals.

jan-ivar commented 3 years ago

I think we'd first have to decide whether using this to circumvent CSP is a desirable or a concerning property.

zenhack commented 3 years ago

For my purposes it definitely falls into "desirable", but I can see the behavior might be surprising to someone who set webrtc-src: 'none', understandably expecting that to mean "this page cannot do webrtc at all." Perhaps this is an argument for a finer-grained webrtc csp policy than what is proposed in that pr. Maybe we could add a new source type for objects received from other frames or the like.

afaik, csp currently doesn't govern anything where it has to interact with transferability (correct me if I'm wrong), so perhaps that conversation is broader than just webrtc: it applies to anything that csp touches that might be transferable.

zenhack commented 3 years ago

@annevk, interested in your thoughts on the CSP interaction here as well.

(It's a bit awkward trying to coordinate issues across several repos that are interrelated like this, sorry for the disorganization...)

annevk commented 3 years ago

I think whether it would work depends on where CSP is enforced (e.g., if it was enforced in the constructor of transferred things this would not work), but it seems okay to me that this bypasses CSP.

zenhack commented 3 years ago

For my part, my preference would be to keep it simple and just allow this, though it would be sad if later somebody came along with a use case where they wanted to block even postMessage transfers, and for compatibility's sake we again had to build another corner case where just setting 'none' is not sufficient to get the most restrictive setting.

annevk commented 3 years ago

If you want to be that restrictive you really ought to disable postMessage() though as all kinds of data can flow through there if you allow it.

zenhack commented 3 years ago

Quoting Anne van Kesteren (2021-02-04 01:17:14)

If you want to be that restrictive you really ought to disable postMessage() though as all kinds of data can flow through there if you allow it.

Then perhaps for the purposes of webrtc's CSP policy we shouldn't try to block postMessage(), and if someone wants to do that they can raise the issue of disabling postMessage() somehow separately.

youennf commented 3 years ago

Started a PR at defining the transferring steps. The basic principles are:

annevk commented 3 years ago

Doesn't that allow observing GC? Does that work across agent clusters?

alvestrand commented 3 years ago

Commentary inline.

Started a PR at defining the transferring steps. The basic principles are:

  • If a track gets transferred from realm R1 to realm R2, the MediaStreamTrack in R1 goes into ended state, JS track.stop() in R1 is a no-op

This is good, and reflects that the relationship between track-in-R1 and track-in-R2 is exactly the same as if you call track-in-R2 = track-in-R1.clone(); track-in-R1.stop().

  • If a track that was created in R1 is transferred to R2 and R1 goes away, the track in R2 gets ended

This I'm nervous about. I'd rather phrase it as "If the source of the track created in R1 and transferred to R2 goes away, the track in R2 gets ended". This isn't an issue now (all sources are tightly bound to their realm), but in the case where a Transferable source existed, and R1 handed the source off to R3 before going away, it seems unreasonable for the track to end.

  • Capture indicators and so on stay attached to realm R1, even if capture track is transferred in R2

Agreed; again, capture indicators belong to sources, not tracks directly.

  • If a track is transferred from R1 to R2 then R3, and realm R2 goes away, track continues to be live. It gets ended when R1 goes away or R3 script decides to stop the track (or UA decides to stop it for whatever reason)

If we move to the definition above, this happens by default.
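
To make the four principles above concrete, here is a minimal toy model in plain JavaScript. It is only a sketch under stated assumptions: `Realm`, `Source`, `Track` and `transferTo` are hypothetical names invented for illustration, not the MediaStreamTrack API or the PR's actual algorithm. It follows the "source goes away" phrasing, so a track ends when the realm owning its source is destroyed.

```javascript
// Toy model only: Realm, Source and Track are hypothetical stand-ins,
// not the real MediaStreamTrack API.
class Realm {
  constructor(name) {
    this.name = name;
    this.sources = new Set(); // sources created in this realm
  }
  createTrack() {
    const source = new Source(this);
    this.sources.add(source);
    return new Track(source, this);
  }
  destroy() {
    // Tracks end because their source goes away with the realm.
    for (const source of this.sources) source.stop();
  }
}

class Source {
  constructor(realm) {
    this.realm = realm; // capture indicators stay attached to this realm
    this.tracks = new Set(); // live tracks consuming this source
  }
  stop() {
    for (const track of [...this.tracks]) track.end();
  }
}

class Track {
  constructor(source, realm) {
    this.source = source;
    this.realm = realm;
    this.readyState = "live";
    source.tracks.add(this);
  }
  end() {
    this.readyState = "ended";
    this.source.tracks.delete(this);
  }
  stop() {
    if (this.readyState === "live") this.end(); // no-op once ended
  }
  // Modeled as clone() into the target realm followed by stop() here.
  transferTo(realm) {
    // Simplification of this sketch: only live tracks can be transferred.
    if (this.readyState !== "live") throw new Error("track is not live");
    const moved = new Track(this.source, realm);
    this.end();
    return moved;
  }
}

const r1 = new Realm("R1");
const r2 = new Realm("R2");
const r3 = new Realm("R3");
const inR1 = r1.createTrack();
const inR2 = inR1.transferTo(r2);
const inR3 = inR2.transferTo(r3);

r2.destroy(); // R2 owns no source, so nothing ends
const stateAfterR2Gone = inR3.readyState;
r1.destroy(); // R1 owns the source, so the track in R3 ends
const stateAfterR1Gone = inR3.readyState;

console.log(inR1.readyState, stateAfterR2Gone, stateAfterR1Gone);
```

In this model the last bullet (R1 to R2 to R3, with R2 going away) needs no special rule, because only the realm that owns the source can end the track.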

youennf commented 3 years ago

Doesn't that allow observing GC? Does that work across agent clusters?

We can make sure to keep a strong reference like done for message ports. Not sure what agent clusters are.

youennf commented 3 years ago

This isn't an issue now (all sources are tightly bound to their realm), but in the case where a Transferable source existed, and R1 handed the source off to R3 before going away, it seems unreasonable for the track to end.

In the current spec, we have track ended -> source stopped. I haven't seen source stopped -> track ended, hence the current phrasing. I am fine relaxing the rules but I'd like to understand the potential use case or possibility of transferring sources and not tracks between realms.

annevk commented 3 years ago

See https://html.spec.whatwg.org/#agents-and-agent-clusters. They're the conceptual process boundary, if you will.

youennf commented 3 years ago

Are there already web objects for which transfer works in process but not over process boundary, as a design decision?

I think there are use cases for transferring MediaStreamTrack over process boundary. In practice, the capture is already living more and more out-of-process so User Agents have ways to transfer media over processes efficiently.

Implementations might start with a limited subset though, workers for instance.

annevk commented 3 years ago

SharedArrayBuffer and WebAssembly.Module (the latter is somewhat less principled than the former though, as it could conceivably be serialized).

alvestrand commented 3 years ago

The most likely case of a transferable source is probably a canvas capture (https://www.w3.org/TR/mediacapture-fromelement/). OffscreenCanvas is already defined as Transferable, so if you generate a MediaStreamTrack from an OffscreenCanvas (using HTMLCanvasElement.captureStream()), and then transfer the OffscreenCanvas elsewhere, it doesn't seem reasonable to stop the track just because the original context goes away.

I'm sure there are dragons here somewhere, though.

youennf commented 3 years ago

if you generate a MediaStreamTrack from an OffscreenCanvas (using HTMLCanvasElement.captureStream())

I see, so there might be a future API that allows generating MediaStreamTrack from an OffscreenCanvas. This new API will have to deal with OffscreenCanvas being transferable. We could probably make the whole thing work but this might add some spec and implementation complexity. A simple solution for that specific case would be to end the track when transferring the OffscreenCanvas.

alvestrand commented 3 years ago

Question about what model to choose for this operation:

Should we make MediaStreamTrack Serializable instead of Transferable?

We have precedent (RTCCertificate, for instance) for objects that are Serializable without their innards being observable; that example is also an example of an object that can be transferred between origins, but only used in its original origin, so we can build in the restrictions we want to have on the object.

If we define it in such a way that Serialize / Deserialize is the exact equivalent of Clone, except that the two may happen in different context, I think a lot of our definitional issues go away, and the PR defining behavior can be a lot shorter.

(call out to @dogben for coming up with the idea)

annevk commented 3 years ago

  1. Putting restrictions on the object goes against the design of postMessage() and friends. The security-model is very much supposed to be object-capability: https://html.spec.whatwg.org/#ports-as-the-basis-of-an-object-capability-model-on-the-web.
  2. The choice between serializable and transferable is really about whether you need detach semantics for the object that is serialized.

youennf commented 3 years ago

I am not sure what we would gain in terms of simplification by using serializable.

My understanding is that RTCCertificate is serializable so that it can be stored in say IDB. See for instance https://w3c.github.io/IndexedDB/#value-construct, in particular: "Each record is associated with a value. User agents must support any serializable object."

I do not think we want to store MediaStreamTrack in IDB so would keep using transferable.

annevk commented 3 years ago

@youennf you can use forStorage for that when defining the serializable steps, no? The key question around transferable is whether you need detach semantics.

youennf commented 3 years ago

forStorage

Maybe, by updating the algorithm then?

Focusing on semantics, I am unclear what would be the relationship between the original track and the serialized track. My understanding is we want them to be fully independent, stopping one would not stop the other for instance.

We can achieve this by using transferable. If the user wants to keep the original track, they can easily clone it and transfer the clone. It seems OK to require the cloning to be explicit, since tracks may be locking process-intensive resources.

jan-ivar commented 3 years ago

Conceptually, a certificate is a dead object, a read-only written record/contract. Serializing it into a storable form (e.g. as bytes + maybe encrypted bytes on a disk) makes sense, because imagining it in a stored form makes sense. A MediaStreamTrack is a live object often representing a realtime source that's been negotiated with the user right now, maybe unplugged tomorrow, and whose state may have live impact on things like hardware camera lights and browser privacy indicators. Having such a handle object exist in a serializable/storable form seems like a tenuous concept, as it would appear to challenge whether this handle represents a legitimate reason to keep the device open, and for how long (storing to disk a handle that keeps devices open seems like a bad idea). It also seems harder to track all open references. But maybe I missed the problem being solved?

annevk commented 3 years ago

Perhaps we should rename [Serializable] since it essentially comes down to a copy (and the object that is copied still functions; there's the separate aspect of storing that copy which is something each object can decide for itself through forStorage). [Transferable] is essentially a move (and the object that is moved becomes detached and is no longer usable).

cc @domenic

jan-ivar commented 3 years ago

So by detach semantics (if I understand correctly), we mean transfer of ownership, which I think is the most natural and conservative semantic here, since tracks explicitly ref-count their (often hardware) sources.

We already have copy-semantics of handles with track.clone(), and that API is so-so: it requires JS to hunt down every track clone (from each track.clone() that they've called) and call track.stop() on all of them to extinguish the camera hardware light and privacy indicators. This design exposes JS to ref-counting leaks, and app bug symptoms like camera indicators sticking around until GC.

Copy semantics would require JS to remember to track.stop() every time they wish to transfer ownership of a camera handle to e.g. a worker for processing, and wish the device to stop automatically once processing is done.

Move semantics would require JS only to remember to track.clone() if it wanted to continue to consume the track on main thread simultaneously.

The former seems easy to forget and hard to detect that it was forgotten, while the latter seems hard to ignore and easy to detect one's mistake (the non-performing/dead track being quite obvious).
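
A rough way to see the difference is a ref-counting toy model. This is a sketch, not browser code: `CameraSource`, `ToyTrack`, `postWithCopy` and `postWithMove` are hypothetical names, and the "indicator" is just a counter of live tracks on the source.

```javascript
// Toy model: CameraSource and ToyTrack are hypothetical, not browser APIs.
class CameraSource {
  constructor() {
    this.liveTracks = 0; // ref-count of live tracks on this source
  }
  get indicatorOn() {
    return this.liveTracks > 0;
  }
}

class ToyTrack {
  constructor(source) {
    this.source = source;
    this.readyState = "live";
    source.liveTracks += 1;
  }
  clone() {
    const copy = new ToyTrack(this.source);
    if (this.readyState === "ended") copy.stop(); // clone of an ended track is ended
    return copy;
  }
  stop() {
    if (this.readyState === "ended") return;
    this.readyState = "ended";
    this.source.liveTracks -= 1;
  }
}

// Copy semantics: the receiver gets a clone, the sender's track stays live.
function postWithCopy(track) {
  return track.clone();
}

// Move semantics: the receiver gets a live track, the sender's track is ended.
function postWithMove(track) {
  const received = track.clone();
  track.stop();
  return received;
}

// Scenario: hand the camera to a worker, which stops its track when done.
const camA = new CameraSource();
const mainA = new ToyTrack(camA);
postWithCopy(mainA).stop(); // worker finished; main forgot mainA.stop()
const copyLeavesIndicatorOn = camA.indicatorOn;

const camB = new CameraSource();
const mainB = new ToyTrack(camB);
postWithMove(mainB).stop(); // worker finished; nothing left to forget
const moveLeavesIndicatorOn = camB.indicatorOn;

console.log(copyLeavesIndicatorOn, moveLeavesIndicatorOn);
```

With copy semantics the forgotten `stop()` leaves the indicator on in this model, while with move semantics the sender's track is visibly dead and the source is released as soon as the worker is done.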

jan-ivar commented 3 years ago

Also, every track object carries its own set of constraints on the source, which means copies get out of sync over time with track.applyConstraints(), causing potentially unnecessary work on the part of the browser e.g. what multiple resolution(s) it needs to downscale to, or conflicts on things like pan+tilt+zoom, unless it is able to detect and optimize away tracks without sinks.
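
As an illustration of that extra work, here is a small sketch with made-up numbers. The track list, the `hasSink` flag and `requiredHeights` are all hypothetical; real user agents may decide this very differently.

```javascript
// Hypothetical model: each entry stands for one live track on the same camera,
// with the height it asked for through its own constraints.
const tracksOnCamera = [
  { label: "main-thread self-view", height: 360, hasSink: true },
  { label: "worker encoder", height: 720, hasSink: true },
  { label: "forgotten copy", height: 1080, hasSink: false },
];

// Distinct output heights the source would have to produce.
function requiredHeights(tracks, { skipSinkless }) {
  const heights = tracks
    .filter((t) => !skipSinkless || t.hasSink)
    .map((t) => t.height);
  return [...new Set(heights)].sort((a, b) => a - b);
}

const naive = requiredHeights(tracksOnCamera, { skipSinkless: false });
const optimized = requiredHeights(tracksOnCamera, { skipSinkless: true });

console.log(naive, optimized);
```

In this sketch the forgotten copy forces a third resolution unless the browser notices that it has no sink.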

alvestrand commented 3 years ago

Re https://github.com/w3c/mediacapture-extensions/issues/16#issuecomment-844093728 - the argument that "anyone who wants copy semantics can do track.clone" is completely analogous to "anyone who wants move semantics can do track.stop".

If, for instance, an application that uses a custom codec in a worker for encoding (a use case considered many times) also does a self-view, the naturally desired semantics of a message operation is the copy semantic, not the move semantic.

annevk commented 3 years ago

It's not completely analogous, right? Because the latter would take up twice the resources.

alvestrand commented 3 years ago

When we already support copy semantics (by .clone), supporting move semantics through a different mechanism is more conceptual clutter. The arguments about JavaScript needing to keep track of copies apply fully to tracks produced by .clone too. So again, doing transfer instead of serialize gives no additional simplification.

I come down to this: Sometimes one needs to take a copy and have one of the copies appear in a new context. Sometimes one needs to have the track appear in a new context, no copy is needed.

Both cases can be supported by either Serializable or Transferable, but the definition of Transferable requires three things to happen at once (destruction, moving and recreation), while Serializable provides you with the toolbox to build the operations you need out of conceptually simpler pieces.

Small, sharp tools.

youennf commented 3 years ago

I would mostly look from a web developer point of view. Spec wise and implementation wise, it should be fairly similar to me. The question that comes to mind: what would be the best default behavior, copy or move?

annevk commented 3 years ago

If you need both, you can also use both Serializable and Transferable. They are not mutually exclusive. (See ArrayBuffer for instance.) And they have a different invocation, so you can also add one of them later.

(Given that you have clone() I'm not sure why you'd not want at least Transferable as it seems that combination allows more (from a resource usage perspective), not less. But perhaps there's not a whole lot that ends up being copied anyway?)
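
For reference, the ArrayBuffer case can be tried directly. This is a small sketch that assumes a runtime with a global `structuredClone` (recent browsers, Node 17 or later); the variable names are only illustrative.

```javascript
const original = new Uint8Array([1, 2, 3, 4]).buffer;

// Serializable path: the result is a copy and the original stays usable.
const copy = structuredClone(original);
const lengthAfterCopy = original.byteLength;

// Transferable path: the result is a move and the original is detached.
const moved = structuredClone(original, { transfer: [original] });
const lengthAfterMove = original.byteLength;

console.log(copy.byteLength, lengthAfterCopy, moved.byteLength, lengthAfterMove);
// prints: 4 4 4 0
```

The two invocations differ only in the `transfer` list, which is the "different invocation" mentioned above.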

jan-ivar commented 3 years ago

Excessive copies are potentially harmful, so transferable seems preferable. We don't need serialization since we have clone.

youennf commented 3 years ago

Fixed by https://github.com/w3c/mediacapture-extensions/pull/24