Closed: youennf closed this issue 5 months ago.
It's not clear to me what "not as safe as" refers to here - what's the threat model that makes this a safety difference? The point of doing the decoupled model is that the result of consuming a ReadableStream does not need to be a WritableStream, and the source of a WritableStream does not need to be a ReadableStream; one can construct these from any source.
This is important for several applications, including using WebCodecs to encode/decode, and for using other transports than PeerConnection to move data around.
It's not clear to me what "not as safe as" refers to here

It refers to the fact that you might end up buffering frames or doing copies. Cloning a MediaStreamTrack does not trigger those issues since there are no JS objects involved. There is no hard guarantee that video frames will be exactly the same between cloned tracks, though the assumption is that they should be the same as much as possible. Depending on the implementation, cloned MediaStreamTracks with software-based decimation (of size or frame rate) might differ slightly, for instance.
one can construct these from any source.
That is indeed an important point. First, we can note this is only a problem for video; for audio, WebAudio already allows reading from and writing to a MediaStreamTrack.
While anyone can construct streams however they want, streams are not consumable by any existing media API (HTMLMediaElement, MediaRecorder, RTCPeerConnection). I doubt we will add new methods taking streams as input to these APIs (or is that the plan?). Instead, we might simply create the ability to feed a MediaStreamTrack with video frames through JS. Such a construct might not be difficult to implement and specify. Once we have that construct, it is unclear what benefits we get from exposing streams, at least in window environments. In worker environments, where the actual frame-by-frame processing would happen, we do need some natural way of exposing frames; a stream is one possibility.
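For concreteness, a minimal sketch of what such a JS frame-feeding construct could look like, expressed with the MediaStreamTrackGenerator interface discussed in this thread; the OffscreenCanvas source and the 30 fps timer are purely illustrative assumptions:

// Sketch: feeding a MediaStreamTrack with JS-produced video frames.
const generator = new MediaStreamTrackGenerator({ kind: 'video' });
const writer = generator.writable.getWriter();

const canvas = new OffscreenCanvas(640, 480);
const ctx = canvas.getContext('2d')!;

setInterval(async () => {
  ctx.fillRect(0, 0, canvas.width, canvas.height);   // draw something
  const frame = new VideoFrame(canvas, { timestamp: performance.now() * 1000 });
  await writer.write(frame);   // the generator track now carries this frame
}, 1000 / 30);

// generator is itself a MediaStreamTrack, so existing sinks can consume it:
const stream = new MediaStream([generator]);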
This is important for several applications, including using WebCodecs to encode/decode, and for using other transports than PeerConnection to move data around.
These are new APIs. Do they plan to use streams, frames or both as input? Looking at https://wicg.github.io/web-codecs/#mediastreamtrack-monitoring, it seems the envisioned way is to expose MediaStreamTrack frames with a callback-based API.
Is using ReadableStream/WritableStream the right approach?
At the risk of sounding like I have not been paying attention, can someone explain to me what the difference is between a) a separate read and write stream, and b) a TransformStream? The latter looks like it contains a read and a write stream anyway, so isn't this the same thing?
A TransformStream can be implemented natively or in JS. TransformStreams have been used to implement native transforms like compression or text encoding/decoding. If it is done in JS, each stream chunk is exposed to JS together with a TransformStreamDefaultController (https://streams.spec.whatwg.org/#ts-default-controller-class) used to continue filling the pipe.
If you expose a read and write stream, it is easy to use a transform, using something like read.pipeThrough(transform).pipeTo(write).
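For illustration, a minimal sketch of that wiring with the interfaces discussed in this thread; cameraTrack is assumed to come from getUserMedia and the transform body is just a placeholder:

// Sketch: piping a track through a JS transform and back into a track.
const processor = new MediaStreamTrackProcessor({ track: cameraTrack });
const generator = new MediaStreamTrackGenerator({ kind: 'video' });

const transform = new TransformStream<VideoFrame, VideoFrame>({
  async transform(frame, controller) {
    // ...per-frame processing would go here (via the default controller)...
    controller.enqueue(frame);   // forward the frame; close() it instead to drop it
  }
});

processor.readable.pipeThrough(transform).pipeTo(generator.writable);
// generator can then be played in a video element, recorded, or sent over a peer connection.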
The issue I am investigating here is whether we want to expose ReadableStream/WritableStream as a sort of replacement for MediaStreamTrack. For instance, with web audio, you can get the audio as a MediaStreamTrack, implement the transform as an AudioContext and continue manipulating the transformed data as a separate MediaStreamTrack object. Web audio also allows exporting the audio data, for instance compressing it and sending it to the network, without creating a transformed MediaStreamTrack object.
I would derive some use-cases/requirements for video:
An advantage of the readable/writable approach here is that processors and generators allow the creation of custom sources and sinks. For example, you can create a custom sink to transfer the track over the network using something other than a peer connection. I believe some people are currently experimenting with this.
For example, you can create a custom sink to transfer the track over the network using something other than a peer connection.
This seems to be use case 1:
- JS access to MediaStreamTrack video frames (ideally off the main thread)
https://wicg.github.io/web-codecs/#mediastreamtrack-monitoring can be used for the same purpose. It would be nice to evaluate the pros and cons of various alternatives.
VideoTrackReader in WebCodecs is being removed in favor of MediaStreamTrackProcessor because they have the same purpose, but MediaStreamTrackProcessor has better support for running on workers.
I think separating readable and writable, which allows implementing custom sources and sinks as well as transformations, provides more flexibility than the transform-only approach. It's also easy to understand in terms of the familiar source-track-sink model, and it allows flexibility with regard to workers. Its only disadvantage would be running on the main thread by default, but I don't think this is a major issue. Experience in the field shows that the main thread can work fine for many use cases, and moving to workers is easy as well.
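As a rough sketch of the "moving to workers" step (the worker file name and camera track are assumptions), the streams can be transferred and piped entirely off the main thread:

// Main thread: transfer the streams to a worker.
const processor = new MediaStreamTrackProcessor({ track: cameraTrack });
const generator = new MediaStreamTrackGenerator({ kind: 'video' });
const worker = new Worker('processing-worker.js');
worker.postMessage(
  { readable: processor.readable, writable: generator.writable },
  [processor.readable, generator.writable]   // ReadableStream/WritableStream are transferable
);

// processing-worker.js: all per-frame work now happens off the main thread.
self.onmessage = async ({ data: { readable, writable } }) => {
  const transform = new TransformStream({
    transform(frame, controller) { controller.enqueue(frame); }   // placeholder
  });
  await readable.pipeThrough(transform).pipeTo(writable);
};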
Its only disadvantage would be running on the main thread by default

That is one disadvantage.
ScriptProcessorNode is evidence of this issue for audio, and the plan is to obsolete it, not to complement it with AudioWorklet. Why should we reintroduce this potential issue?
Experience in the field shows that the main thread can work fine for many use cases, and moving to workers is easy as well.

On certain devices, in certain configurations. The more we see cameras shooting at higher frame rates, the worse it might get. I am not against main-thread processing, but it seems this should be opt-in, not the default.
There are other potential issues that might get solved (or become more acute) once the relationship between a track and its streams is made more precise. Currently the spec is light there, so it is difficult to assess:
- For instance, it is not clear how backpressure applies in general. For audio this might be a no-go: losing a frame will cause audio chirps. For camera video, this might be beneficial in case the processing pipe is too slow. It might also help the case of a camera that uses a buffer pool: if every buffer in the pool is sitting in the pipe, capture will stop. How do you anticipate MediaStreamTrackProcessor will use the backpressure mechanism? Should MediaStreamTrackProcessor be able to define its backpressure mechanism in terms of the number of frames that are queued?
- Similarly, how does MediaStreamTrack cloning relate to ReadableStream cloning? Is it guaranteed to get the same content with the two approaches?
Also, this somehow creates two APIs doing roughly the same thing: MediaStreamTrack and the ReadableStream/WritableStream pair. API-wise, we would prefer sticking with MediaStreamTrack producers and consumers. Or do we anticipate other APIs getting raw content directly from streams and replacing MediaStreamTrack progressively? Ditto for MediaStreamTrackProcessor.writableControl, which seems potentially somewhat redundant with existing MediaStreamTrack APIs like enabled/disabled/applyConstraints. It would be good to clarify this, at least for me.
Talking of the signal control API, I am not clear about what happens if the processor does not process the signals or is not prepared to understand some signals (given the set is extensible). I can imagine some transforms simply forgetting to send the signals up the chain. So they could potentially pile up in the stream, or would there be some backpressure as well to help the caller understand what is happening? Or they could be lost in the middle of the chain. I am not sure how feasible it is, but for those kinds of signals, it might be good to have a default behavior and allow the application to override it.
Maybe I should file separate issues but I want to ensure we pick the right design before we spend time on editing work. Maybe we should do the exercise of comparing this model with other envisioned models like VideoTrackReader and the benefits will get clearer.
Its only disadvantage would be running on the main thread by default

That is one disadvantage.
ScriptProcessorNode is evidence of this issue for audio, and the plan is to obsolete it, not to complement it with AudioWorklet. Why should we reintroduce this potential issue?
I don't think it's exactly the same case. Could you send a ScriptProcessorNode to a worker to free up the main thread?
Experience in the field shows that the main thread can work fine for many use cases, and moving to workers is easy as well.

On certain devices, in certain configurations. The more we see cameras shooting at higher frame rates, the worse it might get. I am not against main-thread processing, but it seems this should be opt-in, not the default.
There are other potential issues that might get solved (or become more acute) once the relationship between a track and its streams is made more precise. Currently the spec is light there, so it is difficult to assess:
- For instance, it is not clear how backpressure applies in general. For audio this might be a no-go: losing a frame will cause audio chirps. For camera video, this might be beneficial in case the processing pipe is too slow. It might also help the case of a camera that uses a buffer pool: if every buffer in the pool is sitting in the pipe, capture will stop. How do you anticipate MediaStreamTrackProcessor will use the backpressure mechanism? Should MediaStreamTrackProcessor be able to define its backpressure mechanism in terms of the number of frames that are queued?
Yes, the processor is able to specify the number of buffered frames (we recently updated the spec to reflect this).
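For reference, a sketch of what that looks like, assuming the maxBufferSize init member described in the current draft; the value 2 is arbitrary:

// Sketch: cap how many frames the processor may buffer so a slow consumer
// causes frames to be dropped instead of piling up.
(async () => {
  const processor = new MediaStreamTrackProcessor({ track: cameraTrack, maxBufferSize: 2 });
  const reader = processor.readable.getReader();
  while (true) {
    const { value: frame, done } = await reader.read();
    if (done) break;
    // ...slow per-frame processing here makes older frames get dropped upstream...
    frame.close();   // release the frame promptly so capture is not stalled
  }
})();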
- Similarly, how does MediaStreamTrack cloning relate to ReadableStream cloning? Is it guaranteed to get the same content with the two approaches?
Cloning a track in general gives you another track backed by the same source. Cloning a MediaStreamTrackGenerator (which is a track) returns a regular MediaStreamTrack that will carry the same frames written to the MediaStreamTrackGenerator via its writable. Cloning the readable of a MediaStreamTrackProcessor using the tee() method returns two streams that get the same frames (i.e., the same VideoFrame objects). Does that answer the question?
Minor correction: tee() sends the same VideoFrame object to both streams, not 2 different objects backed by the same underlying frame.
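A small sketch of what that implies for applications; the two branch roles are illustrative:

const [branchA, branchB] = processor.readable.tee();
// Both branches observe the very same VideoFrame objects, so a frame closed
// while handling branchA is also closed from branchB's point of view.
branchA.pipeTo(generator.writable);   // e.g. the rendering/transport path
branchB.pipeTo(new WritableStream({
  write(frame) {
    // ...analysis path: inspect pixels/metadata; be careful about who calls frame.close()...
  }
}));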
Also, this somehow creates two APIs doing roughly the same thing: MediaStreamTrack and the ReadableStream/WritableStream pair. API-wise, we would prefer sticking with MediaStreamTrack producers and consumers. Or do we anticipate other APIs getting raw content directly from streams and replacing MediaStreamTrack progressively?
I don't anticipate replacing MediaStreamTracks with streams, since they do not even function as tracks. Streams allow applications to define custom MediaStreamTrack producers and consumers (i.e., sources and sinks). These custom sources and sinks defined in JS are not intended to replace platform sources and sinks, let alone tracks.
Ditto for MediaStreamTrackProcessor.writableControl, which seems potentially somewhat redundant with existing MediaStreamTrack APIs like enabled/disabled/applyConstraints. It would be good to clarify this, at least for me.
The purpose of writableControl and readableControl is to expose a control plane that already exists behind the scenes between sinks and sources. The idea is to allow control feedback to flow between custom and platform sinks and sources. They are neither a replacement for nor an alternative to enabled/disabled/applyConstraints. There might be some overlap with applyConstraints, but the concept is different. For example, a peer connection platform implementation cannot disable a track, but it can (and in some practical implementations does) sometimes ask a source for a new frame. API-wise, readableControl and writableControl could be replaced with events and methods. We are proposing the stream approach partly for consistency with the media plane, and partly because streams are easy to transfer to a worker.
Talking of the signal control API, I am not clear about what happens if the processor does not process the signals or is not prepared to understand some signals (given it is extensible). I could think of some transforms just forgetting about sending the signals up in the chain.
Nothing should happen if a source (custom or platform) does not understand a signal, or if a sink (custom or platform) does not send a signal. Signals are intended to be hints to which sources can (but are not required to) react. If a signal does not make sense for a particular source or sink, it does not need to use it. For example, today some platform sinks can request a new frame from a platform source; some sources are designed to produce a new frame when that signal is received, and others just ignore it. This already happens behind the scenes with a regular MediaStreamTrack and platform sources and sinks, independently of the API proposed here. We consider it necessary to expose some of these signals to JS since we are allowing the creation of custom sources and sinks in JS.
So they could potentially pile up in the stream, or would there be some backpressure as well to help the caller understand what is happening?
The spec currently does not describe what to do in this case, which means it's up to the implementation. I agree that the spec should be clearer here. Signals are just hints, so we should not provide delivery guarantees, but we could allow the application to specify buffer sizes as is the case for media.
Or they could be lost in the middle of the chain. I am not sure how feasible it is, but for those kinds of signals, it might be good to have a default behavior and allow the application to override it.
An application producing frames using a generator can interpret the signals it receives via readableControl in any way it wants, so overriding is the only possible way to proceed there. Platform sources in practice already handle signals, either by ignoring them or by acting on them if it makes sense. For example, a camera capturer could ignore signals for new frames, while a screen capturer might produce a new frame if requested.
Maybe I should file separate issues but I want to ensure we pick the right design before we spend time on editing work. Maybe we should do the exercise of comparing this model with other envisioned models like VideoTrackReader and the benefits will get clearer.
The WebCodecs group compared VideoTrackReader with MediaStreamTrackProcessor and it was decided to remove VideoTrackReader from the WebCodecs spec (it's gone already). VideoTrackReader was basically the same as MediaStreamTrackProcessor, but always running on the main thread and without control signals. Processing in workers required the application to transfer the VideoFrames to the worker.
Another use case that the separate processor/generator approach enables is combining frames from different tracks into a single track. A possible example is a "weather report" application, where one track contains the camera feed of the presenter and the other contains the map/presentation. An application could merge both tracks (each with a processor) into a single track (using a generator). This would be harder to support with a transform-only approach tied to a single track.
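A rough sketch of that composition; cameraTrack, presentationTrack, the canvas layout and the lack of any synchronization are all simplifying assumptions:

// Sketch: merge two tracks into one composed track ("weather report" style).
(async () => {
  const camReader = new MediaStreamTrackProcessor({ track: cameraTrack }).readable.getReader();
  const presoReader = new MediaStreamTrackProcessor({ track: presentationTrack }).readable.getReader();
  const output = new MediaStreamTrackGenerator({ kind: 'video' });
  const writer = output.writable.getWriter();

  const canvas = new OffscreenCanvas(1280, 720);
  const ctx = canvas.getContext('2d')!;

  while (true) {
    const [{ value: preso }, { value: cam }] = await Promise.all([presoReader.read(), camReader.read()]);
    if (!preso || !cam) { preso?.close(); cam?.close(); break; }
    ctx.drawImage(preso, 0, 0, 1280, 720);    // presentation full screen
    ctx.drawImage(cam, 940, 400, 320, 300);   // presenter inset in a corner
    const composed = new VideoFrame(canvas, { timestamp: preso.timestamp });
    preso.close();
    cam.close();
    await writer.write(composed);             // "output" carries the merged frames
  }
})();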
To summarize the answer to the question of whether ReadableStream/WritableStream is the right approach: a) Are streams the right approach? I believe they are. Streams give us production-ready mechanisms with multiple advantages, such as transferability, a familiar interface and a well-known programming model. They are a natural match for what we want to do, which is exposing media data as a stream.
b) Is separating readable and writable the right approach? I believe it is. Unlike a transform-only approach, separating readable and writable lets us support additional use cases, such as custom sources, custom sinks, and combining multiple tracks into one.
c) Should we allow processing on the main thread? Streams make it convenient to do processing both on the main thread and in workers. The fact that many applications currently use video element + canvas + canvas capture with much (or all) of the processing on the main thread shows that the main thread works well in many production scenarios, so it's not something we should try to forbid or make excessively difficult.
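For comparison, the existing pattern referred to here looks roughly like the following; the 640x480 size, the 30 fps capture rate and cameraTrack are illustrative:

// The long-standing main-thread pattern: video element + canvas + canvas capture.
const video = document.createElement('video');
video.srcObject = new MediaStream([cameraTrack]);
video.play();

const canvas = document.createElement('canvas');
canvas.width = 640;
canvas.height = 480;
const ctx = canvas.getContext('2d')!;

function drawLoop() {
  ctx.drawImage(video, 0, 0, canvas.width, canvas.height);
  // ...per-frame pixel processing on the main thread...
  requestAnimationFrame(drawLoop);
}
drawLoop();

// A new track carrying the processed frames:
const processedTrack = canvas.captureStream(30).getVideoTracks()[0];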
A transform is what you plug in between a MediaStreamTrackProcessor and a MediaStreamTrackGenerator. If native transforms are defined, we can plug them straight in and use pipeThrough.
For a callback-based API, no such pluggability is possible except on a case-by-case basis. So separating readable and writable is a prerequisite for using native transforms, not an alternative to them.
I've been asked to present to this group about WebCodecs' experience using Streams. The discussion in this thread does not seem to be primarily about technical aspects of Streams vs other approaches, but if there are such questions I'm here to answer based on our experience.

We wrote a public document outlining our reasons for moving away from Streams in WebCodecs: https://docs.google.com/document/d/10S-p3Ob5snRMjBqpBf5oWn6eYij1vos7cujHoOCCCAw/view

WebCodecs also built a non-Streams API extremely similar to MediaStreamTrackGenerator/MediaStreamTrackProcessor which was removed in favor of the same (this was VideoTrackReader/VideoTrackWriter). We do not think our primary concerns about Streams in WebCodecs apply to these interfaces.
@sandersdan Your experience implementing VideoTrackReader/VideoTrackWriter and its subsequent replacement by MediaStreamTrackGenerator/MediaStreamTrackProcessor seems particularly germane to this discussion. In your presentation on Tuesday, could you show the VideoTrackReader/VideoTrackWriter interfaces and go over some of the reasons for the replacement?
interface VideoTrackReader {
constructor(MediaStreamTrack track);
void start(OutputCallback callback);
void stop();
};
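In JS terms, using this (since removed) interface looked roughly like the following; the per-frame body is illustrative:

// Illustrative use of the removed VideoTrackReader callback interface above.
const vtr = new VideoTrackReader(cameraTrack);
vtr.start((frame) => {
  // per-frame processing, always on the main thread
  frame.close();   // frames had to be released promptly to avoid stalling capture
});
// later, when done:
vtr.stop();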
At the time VideoTrackWriter was abandoned it was (still) using a WritableStream, so it looked very much like MediaStreamTrackGenerator does now. It had been proposed to switch to a write() method but this was never implemented:
interface VideoTrackWriter {
constructor(VideoTrackWriterInit init);
readonly attribute MediaStreamTrack track;
void write(VideoFrame frame);
};
I will review the discussion leading to the abandonment. My recollection is that the Insertable Streams proposal was developed in parallel, and it was uncontroversial to defer to WebRTC experts for MediaStream integration.
The decision was publicly documented in https://github.com/w3c/webcodecs/issues/131#issue-798838697.
The drawbacks of VTR/VTW seem like they will apply to any callback API, due to basic properties of the problem statement (conversion between tracks and VideoFrames and back):
- Streams allow the queueing of VideoFrames to be controlled via the MaxBufferSize argument. If the Callback approach does not include a similar argument, then it will lose data if the previous Callback does not complete in time.
- Callbacks delivering VideoFrames will encounter issues similar to those encountered in the WebSockets and RTCDataChannel APIs. Even if a MaxBufferSize argument is provided to limit the number of queued VideoFrames, timely delivery of data can be disrupted by other events in the event loop.
- Streams make it possible to define native transforms operating on VideoFrames that may be of interest to developers. These native transforms could potentially utilize an all-GPU pipeline for optimal performance.

With respect to major criticisms of the current API (e.g. memory leaks and lack of support for mute/unmute), the Callback approach does not appear more promising:
- A callback delivering VideoFrames does not intrinsically address the need to free VideoFrames in a timely way so as to avoid stalling capture. So the proposals seem equivalent in this respect.

Overall, it seems to me that the Callback approach will at best be "different but not better".
This issue is somehow a duplicate of https://github.com/w3c/mediacapture-extensions/issues/23. Maybe we should move the discussion over there?
The drawbacks of VTR/VTW seem like they will apply to any callback API,
@aboba, I am wondering what your thoughts are after the presentation made on a promise-based callback (a la TransformStream). In particular, it solves backpressure and video frame closing, while leaving open the possibility of MaxBufferSize et al.
Closing as "moot, we've shipped it".
The current proposed API is based on a ReadableStream of frames. It does not seem that the pros and cons of this approach have been documented. It would also be interesting to list what other approaches could be envisioned and evaluate them.
For WebRTC insertable streams, we anticipate defining native transforms, so it is convenient to be able to use stream algorithms like pipeTo/pipeThrough for free. And there is no existing stream-like object we can hook into.
If we look at media capture, we already have MediaStreamTrack, which is a kind of ReadableStream. For instance, it can be piped into web audio or generated from web audio or a canvas. It can also be cloned. If we want a native transform that does WebGL processing, we could simply have a method that takes a MediaStreamTrack and some WebGL state and creates a new MediaStreamTrack.
The downsides mentioned for WebRTC insertable streams can also be mentioned here. For instance, cloning a ReadableStream is not as safe as cloning a MediaStreamTrack. There is also the fact that this API makes it convenient to do processing on the main thread, which might be a real footgun.