Face Detection: How metadata should be tied to MediaStreamTrack video frames

youennf commented 1 year ago

Following on from https://github.com/w3c/mediacapture-extensions/pull/69 and media capture transform, face detection metadata could be made available to MediaStreamTrack transforms. A few possibilities come to mind:

1. Attach FaceDetection metadata to VideoFrame with a dedicated face detection metadata getter/setter (a new VideoFrame slot that can be cloned/postMessaged).
2. Attach FaceDetection metadata to VideoFrame using a generic metadata mechanism (mechanism to be defined, see VideoTrackGenerator).
3. Make MediaStreamTrack transforms expose objects that have a VideoFrame and a metadata object.
4. Extend MediaStreamTrackProcessor.readable with a face detection metadata getter (related to the last read video frame), and VideoTrackGenerator.writable with a metadata setter.
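
To make the difference between these options concrete, here is a rough TypeScript sketch of the object shapes each one implies. Every name in it (`FaceBoxSketch`, `InBandFrameSketch`, `ProcessorReaderSketch`, and so on) is a hypothetical stand-in invented for illustration; none of them come from a specification or an implementation.

```typescript
// Hypothetical stand-ins; none of these names are taken from a specification.
interface FaceBoxSketch { x: number; y: number; width: number; height: number; }
interface FrameSketch { timestamp: number; }

// Options 1 and 2: the metadata lives on the frame itself (in-band).
interface InBandFrameSketch extends FrameSketch {
  metadata: { faces: FaceBoxSketch[] };
}

// Option 3: the stream carries wrapper objects pairing a frame with its metadata.
interface FrameWithMetadataSketch {
  frame: FrameSketch;
  metadata: { faces: FaceBoxSketch[] };
}

// Option 4: the reader exposes the metadata of the last frame it returned.
class ProcessorReaderSketch {
  private queue: FrameWithMetadataSketch[];
  private lastFaces: FaceBoxSketch[] = [];

  constructor(queue: FrameWithMetadataSketch[]) {
    this.queue = queue;
  }

  // Returns the next frame and remembers its faces as a side channel.
  read(): FrameSketch | undefined {
    const next = this.queue.shift();
    if (next === undefined) return undefined;
    this.lastFaces = next.metadata.faces;
    return next.frame;
  }

  // Faces detected in the frame most recently returned by read().
  lastFrameFaces(): FaceBoxSketch[] {
    return this.lastFaces;
  }
}
```

With options 1 and 2 the metadata follows the frame wherever it is cloned or posted; with options 3 and 4 the application has to keep the frame and its metadata together itself.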

youennf commented 1 year ago

Option 1 is probably easier than option 2, and is more natural than options 3 and 4.

sandersdan commented 1 year ago

The current situation with generic metadata in WebCodecs VideoFrame is that there is support for the idea, but no adequate technical solution has been proposed. I'm interested in any proposal that:

- Can be serialized to bytes. (I assume this excludes Symbol.)
  - Necessary because VideoFrames can be sent between workers, and also the model assumes that all properties of a VideoFrame are carried by the underlying video resource.
- Supports namespacing in some form.
  - Important to avoid conflicts between future specifications and current application developers.
- Has defined semantics for construction and inheritance (e.g. what happens with new VideoFrame(existingVideoFrame, {updatedMetadata: ..., visibleRect: ...})).
  - Presumably some metadata should be carried across this (e.g. extra timestamps), but other metadata is invalidated (e.g. face positions when the visible rect changes).
  - The basic rule of 'drop everything' may be good enough, and could be used like: new VideoFrame(oldFrame, {metadata: {...oldFrame.metadata, foo: 123}}).
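
The 'drop everything' rule can be sketched as follows. This is only an illustration under stated assumptions: `DropAllFrameSketch` is a hypothetical stand-in for VideoFrame that models nothing but metadata handling, and the `metadata` init member and the `captureTime` and `faces` keys are made up for the example.

```typescript
// Hypothetical stand-in for VideoFrame; only the metadata handling is modeled.
type MetadataSketch = Record<string, unknown>;

interface FrameInitSketch {
  visibleRect?: { x: number; y: number; width: number; height: number };
  metadata?: MetadataSketch;
}

class DropAllFrameSketch {
  readonly metadata: MetadataSketch;

  // Constructing from an existing frame never inherits its metadata:
  // the new frame gets only what the init dictionary supplies.
  constructor(source: DropAllFrameSketch | null, init: FrameInitSketch = {}) {
    const supplied: MetadataSketch = init.metadata ?? {};
    this.metadata = { ...supplied };
  }
}

const originalFrame = new DropAllFrameSketch(null, {
  metadata: { captureTime: 1000, faces: [{ x: 10, y: 10 }] },
});

// Cropping without passing metadata drops everything, including face
// positions that would be stale after the visible rect changes.
const croppedFrame = new DropAllFrameSketch(originalFrame, {
  visibleRect: { x: 0, y: 0, width: 320, height: 240 },
});

// The application opts in to carrying metadata across by spreading it.
const carriedFrame = new DropAllFrameSketch(originalFrame, {
  metadata: { ...originalFrame.metadata, foo: 123 },
});
```

The appeal of this rule is that the platform never has to decide which metadata survives a crop or resize; the application states it explicitly.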

Absent such a proposal, we are still recommending (3) or (4), passing the metadata out-of-band.

I don't think there is strong support for handling face metadata specially, but doing so would be the shortest path to in-band metadata.

youennf commented 1 year ago

> Can be serialized to bytes. (I assume this excludes Symbol)

Agreed, we need support for cloning/postMessaging metadata. I was thinking we could use structured cloning (https://html.spec.whatwg.org/multipage/structured-data.html#safe-passing-of-structured-data), which is what is used when postMessaging a value, say to workers.

For instance, we could add steps in the constructor to structured-clone the metadata input parameter, and the result would be stored in a VideoFrame object slot. The metadata accessor should either provide a copy of the metadata or the metadata object itself (maybe we should freeze it?).
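
A minimal sketch of those constructor steps, showing both accessor variants side by side. `ClonedMetadataFrameSketch` and its members are hypothetical names, and a private field stands in for the internal slot; `structuredClone` is the real global available in browsers and recent Node versions.

```typescript
// Declared so the sketch type-checks without DOM or Node typings;
// structuredClone itself is a real global in browsers and Node 17+.
declare function structuredClone<T>(value: T): T;

type FrameMetadataSketch = Record<string, unknown>;

class ClonedMetadataFrameSketch {
  // Stands in for an internal slot on VideoFrame.
  private readonly metadataSlot: Readonly<FrameMetadataSketch>;

  constructor(init: { metadata?: FrameMetadataSketch } = {}) {
    const source: FrameMetadataSketch = init.metadata ?? {};
    // Clone on the way in so later changes to the caller's object are not observed.
    // Object.freeze is shallow: only top-level members become read-only.
    this.metadataSlot = Object.freeze(structuredClone(source));
  }

  // Variant A: every call returns a fresh deep copy.
  metadataCopy(): FrameMetadataSketch {
    return structuredClone(this.metadataSlot);
  }

  // Variant B: every call returns the same frozen object.
  frozenMetadata(): Readonly<FrameMetadataSketch> {
    return this.metadataSlot;
  }
}
```

Variant A is safe against mutation but allocates on each access; variant B is cheap but would need a deep freeze to be fully immutable.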

> Supports namespacing in some form.

Good point. I am fine with either going with UA-defined metadata initially or adding support for web-app-specific metadata. In any case, both kinds of data should probably follow the same principles (e.g. the data being structured-cloneable).

In terms of spec editing, WebCodecs could define a WebCodecMetadata dictionary, either without any member or containing something like an any userDefinedMetadata member. The WebRTC spec would then define a partial WebCodecMetadata dictionary listing the face detection dictionary members.

> The basic rule of 'drop everything' may be good enough

+1

@sandersdan, how does this look to you? Is it precise enough to think about writing a PR?

sandersdan commented 1 year ago

> I was thinking we could use structured cloning

Structured clone by itself doesn't work because it assumes there can be side data (such as ports) in addition to the raw bytes. The forStorage variant might work, but I'm not familiar enough with it to say for sure.

It might actually make sense to just drop down to JSON here. I don't think metadata should need to be self-referential, for example.

> In terms of spec editing, WebCodecs could define a WebCodecMetadata dictionary, either without any member or containing something like an any userDefinedMetadata member.

Yes, this is about the best I was able to come up with as well, and I think it meets the requirements. I like that unlike a partial for VideoFrame, a partial for VideoFrameMetadata would be straightforward to splat.

{ metadata: { user: { ... } } } is a bit cumbersome, but the only alternative I have is { metadata: ..., userMetadata: ... }, which just trades that for extra complexity. One surprise could be that { metadata: { myMetadata: 123 } } would simply be dropped by the IDL binding, but good documentation can overcome that.
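
The dictionary idea and the 'silently dropped' surprise can be illustrated with a small sketch. It emulates dictionary conversion in plain TypeScript rather than using real IDL bindings, and the names `MetadataDictionarySketch`, `detectedFaces`, and `toMetadataDictionarySketch` are assumptions made up for the example.

```typescript
// Hypothetical dictionary shapes; member names are illustrative, not from any spec.
interface FaceEntrySketch { x: number; y: number; width: number; height: number; }

interface MetadataDictionarySketch {
  // Would be contributed by a partial dictionary in another spec.
  detectedFaces?: FaceEntrySketch[];
  // Single namespaced slot for application-defined data.
  user?: unknown;
}

const KNOWN_METADATA_MEMBERS: string[] = ["detectedFaces", "user"];

// Emulates how a dictionary conversion keeps only the declared members
// and ignores everything else without raising an error.
function toMetadataDictionarySketch(input: Record<string, unknown>): MetadataDictionarySketch {
  const result: Record<string, unknown> = {};
  KNOWN_METADATA_MEMBERS.forEach((key) => {
    if (input[key] !== undefined) result[key] = input[key];
  });
  return result as MetadataDictionarySketch;
}

// An unknown top-level member is dropped.
const droppedExample = toMetadataDictionarySketch({ myMetadata: 123 });

// The same value survives when placed under the user slot.
const keptExample = toMetadataDictionarySketch({ user: { myMetadata: 123 } });
```

Reserving one member for application data keeps every other top-level name free for future specifications, at the cost of the extra nesting noted above.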

> Is it precise enough to think about writing a PR?

I think the serialization part needs work before becoming a PR, but it could be at least proposed in the existing bug.

Edit: The existing bug is https://github.com/w3c/webcodecs/issues/189. There is a separate bug for EncodedChunk metadata, https://github.com/w3c/webcodecs/issues/245, but that also adds the complexity of possibly having to copy metadata from frames to chunks or the reverse.

youennf commented 1 year ago

> It might actually make sense to just drop down to JSON here

I could see metadata being an ArrayBuffer, in which case JSON is not great.

> I think the serialization part needs work before becoming a PR, but it could be at least proposed in the existing bug.

I think https://html.spec.whatwg.org/multipage/structured-data.html#structuredserialize is what we want. This is roughly what structuredClone is using under the hood (we do not want any transfer parameters since we want to ensure we can clone frames). forStorage=false is good here.
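
A small sketch of why JSON is awkward for binary metadata while structured cloning is not. The `depthMap` member is a made-up example of binary metadata, and `structuredClone` is used here as the closest script-visible equivalent of the serialization steps.

```typescript
// Declared so the sketch type-checks without DOM or Node typings;
// structuredClone itself is a real global in browsers and Node 17+.
declare function structuredClone<T>(value: T): T;

// Four bytes of hypothetical binary metadata.
const rawBuffer = new ArrayBuffer(4);
new Uint8Array(rawBuffer).set([1, 2, 3, 4]);
const binaryMetadata = { depthMap: rawBuffer };

// JSON has no representation for ArrayBuffer: the bytes are lost
// and the member serializes as an empty object.
const viaJson = JSON.stringify(binaryMetadata); // '{"depthMap":{}}'

// Structured clone produces a new ArrayBuffer with the same contents.
const viaStructuredClone = structuredClone(binaryMetadata);
const clonedBytes = new Uint8Array(viaStructuredClone.depthMap);
```

No transfer list is passed, so the original buffer stays usable after the clone, which matches the goal of being able to clone frames.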

> that also adds the complexity of possibly having to copy metadata from frames to chunks or the reverse.

I do not think we need to expose this to web pages, at least initially. It should be reasonably simple for the web app to copy metadata from a VideoFrame to its corresponding chunk. This is something we might want in WebRTC (metadata from track to encoded transform), but the WebRTC spec could handle this metadata passthrough on its own.
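
One way a web app could do this copy itself is to key the metadata by timestamp, assuming the encoded chunk reports the same timestamp as the frame it was produced from. The sketch below uses plain stand-in types rather than the real VideoFrame and EncodedVideoChunk, and `MetadataBridgeSketch` is a hypothetical helper.

```typescript
// Stand-ins for VideoFrame and EncodedVideoChunk; only the fields used here are modeled.
interface FrameStandIn { timestamp: number; metadata: Record<string, unknown>; }
interface ChunkStandIn { timestamp: number; metadata?: Record<string, unknown>; }

class MetadataBridgeSketch {
  // Metadata of frames that were sent to the encoder but not yet output.
  private pendingByTimestamp: Record<number, Record<string, unknown>> = {};

  // Call before handing the frame to the encoder.
  rememberFrame(frame: FrameStandIn): void {
    this.pendingByTimestamp[frame.timestamp] = frame.metadata;
  }

  // Call from the encoder output callback; returns the chunk paired with
  // the metadata of the frame that has the same timestamp, if any.
  attachToChunk(chunk: ChunkStandIn): ChunkStandIn {
    const metadata = this.pendingByTimestamp[chunk.timestamp];
    delete this.pendingByTimestamp[chunk.timestamp];
    return { timestamp: chunk.timestamp, metadata: metadata };
  }
}
```

Entries for frames the encoder drops are never removed in this sketch, so a real application would also need some eviction policy.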

w3c / mediacapture-extensions

Face Detection: How metadata should be tied to MediaStreamTrack video frames #70