w3c / mst-content-hint

This is the specification for the content-hint attribute. This optional hint permits MediaStreamTrack consumers such as PeerConnection or MediaRecorder to encode or process track media with methods more appropriate to the type of content that is being consumed.
https://w3c.github.io/mst-content-hint/

Why not RTCRtpSender? #18

Closed martinthomson closed 6 years ago

martinthomson commented 7 years ago

If this value is being used to guide encoding of the stream, why not make this a property of the RTCRtpSender instead?

pbos commented 7 years ago

I discussed this pretty extensively with alvestrand@. The idea is to provide a bit of a hint no matter who or what consumes it, not just RTCRtpSender; it's a description of the track itself. The behavior is intentionally vague/not strictly defined, and whoever consumes it is free to do what it likes with it (to behave as well as possible, given the hint).

If we make this an encoder property specifically, we would have to extend PeerConnection, RTCRtpSender, and MediaRecorder, as well as all appropriate future consumers of MediaStreamTrack (and the receiving parts of APIs, where appropriate, to treat content differently). This seems to scale a bit worse in my opinion.

These aren't intended to fully replace stricter/more defined encoder settings from future APIs, so if future API extensions want to add stricter controls such as disabling resizing, setting max quantization, etc., that's not mutually exclusive with this proposition.
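For illustration, a rough sketch of how a single track-level hint could reach multiple consumers without growing per-sink knobs (the hint value and the sink reactions here are assumptions; sinks keep full discretion):

```js
// A minimal sketch, assuming an async context and the contentHint
// attribute proposed in this spec:
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const [track] = stream.getAudioTracks();
track.contentHint = "music"; // one descriptive property, set once on the track

// Both consumers can see the same hint; neither API needs a new control:
const pc = new RTCPeerConnection();
pc.addTrack(track, stream);                 // sender may skip speech-only processing

const recorder = new MediaRecorder(stream); // recorder may pick music-friendly settings
recorder.start();
```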

pbos commented 7 years ago

.. @alvestrand notation, I guess..

alvestrand commented 7 years ago

In particular, applying this for "don't wah-wah my music" on the receive side (making NetEq less aggressive) would require extending RTCRtpReceiver - adding that many new interfaces when one could do the job didn't seem optimal to us.

pbos commented 7 years ago

This would affect many interfaces and require multiple additions if these were specific controls; I think this approach requires the least amending of existing standards:

For audio, it'd require exposing noise suppression, intelligibility enhancement, and packet loss concealment methods; any future implementation-specific signal processing that's only applicable to speech would also have to be exposed in APIs so it could be turned off for music. This means that any web application doing non-speech content would have to play catch-up with what's being added to the APIs/browsers to make sure no such treatment is applied to its "music" signal, and any speech application would have to keep up with them to turn these features on.

I'm not saying that any of these specifically would be bad APIs to add, but I don't think that all speech processing should have knobs/buttons, and if there's a hint, defaults can be applied appropriately (don't turn on noise suppression for music unless the user specifically asks for it).

For video it would require adding specific APIs such as max QP to RTCRtpSender, but perhaps also specifying how balanced should be treated as a degradationPreference.

And I do like that we don't have to add any such features to the MediaRecorder and PeerConnection standards, while the implementation can still benefit just the same from this signal. That would be the benefit of not having the contentHint on RTCRtpSender specifically.

jan-ivar commented 7 years ago

@alvestrand Where would contentHint come from on a received track?

jan-ivar commented 7 years ago

@pbos What happens if I do this:

```js
await sender.setParameters({degradationPreference: "maintain-framerate"});
sender.track.contentHint = "detailed";
```

or this

```js
await sender.setParameters({degradationPreference: "maintain-resolution"});
sender.track.contentHint = "fluid";
```

or this

```js
sender.track.contentHint = "detailed";
await sender.setParameters({degradationPreference: "balanced"});
```

?

pbos commented 7 years ago

I think contentHint applies when no other information is provided, and does not override properties set on the sink; e.g. maintain-framerate/maintain-resolution would not be overridden by contentHint, since that would make the RTCRtpSender non-compliant with its spec.

Since 'balanced' indicates a tradeoff, how to make that tradeoff could be influenced by contentHint. For example:

Fluid: 30fps HD -> 30fps VGA -> 15fps VGA -> 30fps QVGA -> 15fps QVGA -> 8fps QVGA -> 4fps QVGA...

Detailed: 30fps HD -> 15fps HD -> 8fps HD -> 15fps VGA -> 8fps VGA -> 4 fps VGA -> ...
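A sketch of how this reading answers the snippets above (following their shorthand for setParameters; the exact behavior stays implementation-defined):

```js
// Explicit degradationPreference wins; the hint cannot override it:
await sender.setParameters({degradationPreference: "maintain-framerate"});
sender.track.contentHint = "detailed"; // no effect on the framerate-vs-resolution choice

// "balanced" leaves the trade-off open, so the hint may steer it:
await sender.setParameters({degradationPreference: "balanced"});
sender.track.contentHint = "detailed"; // prefer the "Detailed" ladder above
```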

jan-ivar commented 7 years ago

@pbos What would this return?

```js
track.contentHint = "music";
console.log(track.getSettings().echoCancellation); // ?
```

Note that the spec at least implies echo cancellation is on by default.

Would `track.clone().contentHint = "music"` have any effect?

pbos commented 7 years ago

I believe that "music" could imply which algorithms are used for echoCancellation, this is outside my field though. Other than that it could turn on/off noise suppression and intelligibility enhancement and any other speech treatment that are not defined in the spec.

pbos commented 7 years ago

In short: if echoCancellation /should/ be true without contentHint set, it should also be true with contentHint set. If it's implementation-defined, or if there's a "default" setting which is either on or off, then contentHint could override that.

pbos commented 7 years ago

By override I mean that it could influence what's actually applied behind the scenes, so long as it doesn't invalidate the configured settings.
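Applied to the echoCancellation example, this reading would give (a sketch, not normative):

```js
// Explicitly constrained: the hint must not invalidate the setting.
const stream = await navigator.mediaDevices.getUserMedia({
  audio: { echoCancellation: true }
});
const [track] = stream.getAudioTracks();
track.contentHint = "music";
console.log(track.getSettings().echoCancellation); // still true

// Where a setting is implementation-defined, the hint may steer what
// actually runs behind the scenes, e.g. a music-friendly canceller,
// or noise suppression left off by default.
```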

jan-ivar commented 7 years ago

Sorry, I should have mentioned that (I believe) Chrome turns off all audio processing when `{ echoCancellation: false }` is specified.

pbos commented 7 years ago

That sounds like an implementation detail to me, per spec I'd expect it to only touch echoCancellation, so I believe the answer is still valid.

pthatcherg commented 7 years ago

Back to the original question of "why not on RtpSender":

In a way, it would make sense to have the contentHint be read-only on the track and settable in the RtpSender parameters. But there are a few downsides to that approach:

  1. If the content hint is also useful to other uses of tracks (such as MediaRecorder), then it's beneficial to have the hint in one place rather than many places (RtpSender, MediaRecorder, ...).

  2. When switching between tracks using RtpSender.setTrack, it would be nice to have the content hint switch automatically rather than needing to keep track of which hint goes with which track and apply the hint while applying the track (see the sketch after this comment).

  3. The intention of the hint is to give good defaults for things already controllable via the RtpSender parameters. Having both the hint and the controls in RtpParameters would be a bit confusing. Having them separate would be a bit more clear (the track gives hints, but the RtpSender has ultimate control). In other words, the RtpSender is a "do this" API and the track is a "you are this" API. Mixing them in the same object is messy.

TL;DR: Both could work, but I think going on MediaStreamTrack is clearer and easier to use for the user of the API, and is reusable in more situations.
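For example, for point 2 (a sketch; screenTrack and cameraTrack are assumed to exist, and replaceTrack is the name setTrack ended up with):

```js
// The hint travels with the track, so switching tracks needs no reconfiguration:
screenTrack.contentHint = "detailed"; // e.g. a screenshare
cameraTrack.contentHint = "fluid";    // e.g. a webcam

await sender.replaceTrack(screenTrack); // the sender now sees "detailed"
await sender.replaceTrack(cameraTrack); // ...and now "fluid"
```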

jesup commented 7 years ago

In general: I think we should have some form of content hinting to improve defaults used by content sinks. So thanks for that.

That said, we go back to the question of "where do we put it?" The basic options, as laid out here, are to add it to the sinks (e.g. RtpSender, MediaRecorder, etc.) and perhaps sources (RtpReceiver/gUM), or to the tracks.

There are a couple of sources for the hints. One is a true aspect of the captured source, such as a screenshare or a camera. Another is things that an application might know or be told, but that the system can't determine (or can only determine painfully or with errors), such as music vs. speech. In addition, some types of processing flows would strip that information, or change it. Examples of this would be audio processed through WebAudio, or video sourced from a canvas (or a VideoElement).

If we choose to put it on the sinks (RtpSender/etc.), an application would use its knowledge of where a source originates (camera, screenshare, etc. -- the application called gUM or captureStream or a PeerConnection) or what it knows about the source (such as a PeerConnection stream) to set the contentHint on the sink it attaches the track to.

If we put it on the tracks, some hints can be generated automatically (e.g. screenshare, camera) -- though the application knows them anyway, so you save a "track.contentHint = ". The rest (music vs. speech) can't be, and have to be set on the track by the application (as it would on the RtpSender in the other case). In addition, any track run through a transform (WebAudio/canvas/video element) will require the application to copy the source track's hint to the output track (output.contentHint = source.contentHint, or some such), or just set the type, since the application already knows the source type/hint anyway.
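For instance, with WebAudio (a sketch; micStream is assumed to come from gUM):

```js
const ctx = new AudioContext();
const source = ctx.createMediaStreamSource(micStream);
const dest = ctx.createMediaStreamDestination();
source.connect(dest);

// The graph strips provenance, so the app must carry the hint forward itself:
const [micTrack] = micStream.getAudioTracks();
const [outTrack] = dest.stream.getAudioTracks();
outTrack.contentHint = micTrack.contentHint;
```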

An application could attach the content type to the track on its own, of course, and read it to set the RtpSender/etc. Certainly some people will forget to copy over the hints, but in the RtpSender case, some will forget to set them when they should and will end up with the defaults.

In the track-based approach, in the unusual/edge case where the data is passed to two sinks and the application wants different hint settings for each, the application would have to clone the track. Given how unlikely this is, I don't care that they'd have to do this. See below though.

One concern with the track-based approach is an apparent assumption that setting contentHint on a track would flow back to the source in some manner and immediately affect all sinks. I.e. that the source may change processing (for RtpReceiver, change NetEq options; for gUM, change AEC/NR/etc. options), and the sink may change how it encodes "immediately". There's some question about how you would write this into the spec, due to the variance in how data is buffered and processed, but I imagine that part is doable (as with track.enabled). It does mean that internally the sources/sinks need to monitor or register for contentHint updates, or poll it on each pull/push of data. (We sample 'enabled' in all the sinks on data input -- so track.enabled affects all sink implementations, even if it doesn't mean spec changes to the sinks.)

So while the spec for sinks may not need to change, the implementations are more complex than for sink.contentHint = whatever (or source.contentHint = whatever). And unlike enabled, the action taken if it changes when we get input may be much 'bigger' - Perhaps starting and stopping stuff, reconfiguring/recreating objects, and those might have on-this-thread requirements, and in the meantime we'd need to implement buffers for the data until the reconfigure is done. Not that this is a major problem, but we've seen pain like this when handling SSRC changes where we had to buffer packets, ship a runnable/lambda to MainThread, reconfigure, ship another runnable/lambda back to the network thread, and then release all the packets.

Another real problem: if you clone a track from an RtpReceiver, then set the contentHint to music on one clone and speech on the other, you now have two tracks, one music, one speech (or unset, if you left the other alone). What options are being used with NetEq? Which controls, the last one set? This seems at odds with how cloning is supposed to work (though we have a somewhat similar problem with sharing a mic with different AEC/etc. options). Having the track be a control surface that affects the data in the stream, as opposed to a description of the data, adds confusion. Similar confusions might exist with gUM mic streams, or with screencapture sources. Also, this doesn't sound like a Hint anymore; it's a control, kinda. (More like an older name, contentType.)

We could state that track.contentHint doesn't affect sources, only sinks, and doesn't modify the data in the track itself, and for sources use direct settings (RtpReceiver.contentHint, or just a direct control -- RtpReceiver.pitchStability, perhaps). Or we can modify the sinks/sources. Can we just provide a partial interface that sinks (and maybe sources) of MediaStreams/Tracks implement?
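Roughly, the sink-side shape would be something like this (hypothetical syntax; no sink today has a contentHint member):

```js
// Hypothetical: the hint lives on each consumer instead of on the track.
const sender = pc.addTrack(track, stream);
sender.contentHint = "music";     // hypothetical attribute; affects only this sink

const recorder = new MediaRecorder(stream);
recorder.contentHint = "music";   // hypothetical; must be repeated per consumer
```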

Net: after considering it for a while I was thinking I didn't care a lot, but once I started really looking at what dynamic changes would cause to occur, and at the interactions with sources, I'm still leaning towards sink and source interfaces instead.

Thoughts?

jan-ivar commented 7 years ago

After initially being taken aback by track.contentHint having upstream consequences, I find some support for this in the spec:

"It is fairly obvious that changes to a given source will impact sink consumers. However, in some situations changes to a given sink may also cause implementations to adjust a source's settings."

and

"Implementations may choose to make such source-to-sink optimizations as long as they only do so within the constraints established by the application, ..."

This is the spec allowing browsers to do whatever they want within the envelope of app constraints.

When seen this way, that browsers are allowed to look at an entire media graph when optimizing it, I suppose it makes sense to hang a content-describing signal off of the object that's closest to the content, the track.

The key being that this is a weak signal to the browser, which maintains full discretion.

We probably want to clarify that implementers SHOULD reassess settings whenever track.contentHint changes, to avoid creating rituals where JS apps remove and re-attach things, hoping to trigger a change.

I think we can write off cases where track clones disagree on contentHint as a bad signal.

I agree with @jesup about the difficulties, but those are implementation problems.

I agree with @pthatcherg that if we do move it to RTCRtpSender, then it should be something more specific and about controlling narrow behavior, rather than a descriptive property like contentHint. So I don't see a direct conflict in having both.

pbos commented 7 years ago

I agree that the implementation complexity of track hints propagating (especially when multiple tracks are combined into one, etc.) makes it less attractive. RTCRtpSender and other sinks are the other natural place to put it; I do see that making sense and am not against it.

However, I don't agree that placing the contentHint on RTCRtpSender should make it more specific, as that would limit its purpose. The point is to describe what's important in the stream, to help the implementation decide what to do internally about enabling/disabling processing that is not exposed through standard interfaces (there are no controls for speech intelligibility, for instance). It also helps inform default actions for buttons/knobs exposed in future spec updates.

I think the spec should list specific examples of what an implementation should/should not do (don't downscale detail content, avoid frame dropping when preserving motion) without mandating it. The implementation should in "the same spirit" enable/disable all non-standard processing depending on whether it's useful for the type of content that's being encoded/processed.

Currently a lot of components in WebRTC (NetEq etc) assume speech. We might want to turn those off when processing music, without exposing buttons and knobs for all of them since they are too implementation specific.

jan-ivar commented 7 years ago

> However I don't agree that placing the contentHint on RTCRtpSender should make it more specific, as that would limit its purpose. The point is to describe what's important in the stream...

@pbos Isn't it a natural property of the stream then? Ergonomically (mock syntax for illustration):

```js
content.thisIsMusic = true;
```

seems superior to

```js
sender.configureForMusic = true;
```

and avoids confusion about what trumps what if we ever add specific controls, e.g.:

```js
sender.useNetEqMode = 3;
```

There may be implementation challenges, but these are lower in the priority of constituencies.

For a user, there are trade-offs either way: hint getting bleached through web-audio, vs. forgetting to update config after sender.replaceTrack.

pbos commented 7 years ago

It is definitely a natural property of the stream, but it's also problematic for users if they can't tell whether their hint will be applied or not, or can't grok how any back-propagation of track properties would work.

I think it's dual, so I keep flip-flopping back and forth. If the property is on the sender, you know that it will be applied. If it's on the track, you won't know whether it's propagating to other tracks, or affecting other components.

Though the track-hint propagation problem is already present, since we base "screencast"/non-"screencast" handling on tab capture vs. USB video capture (I assume Firefox does, or might consider doing, the same?). When you plug gUM() into a canvas and then get a capture stream, should this implicit "content hint" carry over?

Randell, how do you see gUM() implicit properties propagating in general? If we see a USB capture card that's a known mixing table, we might treat that as music rather than speech (which we might assume if the mic is from a Logitech camera). It seems to me that we already have this problem, since we already have a hint on the track in the implementation, just not exposed to/overridable from JavaScript.

If the problem is that the track's contentHint would propagate upstream, we could limit that by saying that contentHints can only propagate downstream:

```js
// A: no content hint
B = A.clone();
B.contentHint = "music";
```

A has no contentHint, and any sinks that consume A should act as if B does not exist (cloning A creates no implicit connection to it). Does any part of this make sense?

jan-ivar commented 7 years ago

Yeah the content hint not bleeding across clones makes sense to me, the whole point of clones being to get a clean slate. In contrast, I think it makes sense to try to preserve contentHints through transformations as much as possible, whether it's going through canvas or web audio. Naively, this seems like it will be correct to do most of the time, and when it's not, seems easily corrected in JS.

jan-ivar commented 7 years ago

That said, assuming A and B are gUM and not otherwise constrained, I think it would make sense for a UA to choose e.g. noiseReduction to be off for the mic feeding A and B in this case, given that one of them has hinted at it being music that's being consumed. So in that sense, there's a bleed of net effect, but that's not uncommon with constraints, so I think that's fine here too.

The question would be what to do downstream of A and B in this case of conflicting hints. We could either specify a resolution, or leave it up to implementations. This is an edge case after all.

alvestrand commented 7 years ago

Note - you can't combine tracks as such (except by compositing them into a stream). You can feed tracks to something that consumes tracks, and you can imagine having a track consumer that consumes multiple tracks and emits one track - but we haven't defined any such thing yet.

WRT the case of conflicting hints: I assume you mean a conflict like this:

```js
stream = await getUserMedia(...);
track1 = stream.getAudioTracks()[0];
track2 = track1.clone();
track1.contentHint = 'music';
track2.contentHint = 'speech';
```

One reasonable approach would be to ignore the second setting - "first come, first served". (If this had been a constraint, the second applyConstraints() would have failed due to a conflict.) Another would be to attempt to do audio processing on one track only; this would likely lead to rather large implementation complexity for a rather marginal case.

Do we need to specify the behavior for this, or can we leave it as "implementation defined"?

alvestrand commented 6 years ago

I'm closing this issue as "discussed enough", with the chairs having come down on accepting the present document, which places hints on tracks, as the starting point for further work. Let's open new issues on the particular areas lacking clarity in the text.