w3c / mediacapture-record

MediaStream Recording
https://w3c.github.io/mediacapture-record/

Support producing raw encoded bitstreams via MIMEType option #67

Open yellowdoge opened 7 years ago

yellowdoge commented 7 years ago

MIME types for the application(s) at hand are usually of the form video/container;codecs="codec1,codec2" (respectively, audio). This format describes a [series of] encoded bitstreams multiplexed into the container format, e.g. webm, mkv, or mp4. It'd be interesting (see #60) to provide a way for the application to request a non-multiplexed encoded bitstream. This would allow the application to provide more sophisticated container formats (e.g. with seeking) itself.

I propose allowing the MIME type to specify the codec as the MIME subtype, indicating a request for the raw encoded format. This follows the widespread convention of video/mp4 versus video/h264, where the former implies a multiplexed MPEG-4 stream whereas the latter is a raw H.264 encoded bitstream.

For the Chrome implementation, this would translate into accepting, e.g., video/vp8, video/vp9, and audio/opus. I wrote a Chrome CL coupled with a demo page as a POC, and it seems to work. Of course there are a few TBDs, but I'd like to ask if this is the correct direction. So, WDYT, @foolip, @pehrsons, @uysalere, @martinthomson?
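
For illustration, a minimal sketch of how recording with a codec-only MIME type might look under this proposal; the video/vp8 value is the proposed extension, and saveChunk is a hypothetical application-defined helper:

    // Hypothetical usage under this proposal; codec-only MIME types are not
    // accepted by any shipped MediaRecorder implementation.
    const stream = await navigator.mediaDevices.getUserMedia({ video: true });
    if (MediaRecorder.isTypeSupported('video/vp8')) {
      const recorder = new MediaRecorder(stream, { mimeType: 'video/vp8' });
      recorder.ondataavailable = (event) => {
        // event.data would hold raw VP8 bitstream bytes, not a WebM container.
        saveChunk(event.data); // saveChunk: application-defined storage
      };
      recorder.start(100); // request data roughly every 100 ms
    }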

Pehrsons commented 7 years ago

What other sophisticated container features are you envisioning this for?

If you're only after seeking, I'd rather see something that just works (playback, in this case) in the browser, like chunks that are all individually playable. Then you could play an individual chunk right off the bat, or play multiple chunks with MSE.

That could also solve issues with tracks being added and removed mid-recording, at least on the recording side; I'm not sure how the relevant playback specs treat tracks.
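
As a rough sketch of that MSE path (assuming a <video> element named video, a recorder whose blobs are each self-contained playable fragments, and omitting the append queueing that MSE requires while a previous append is still updating):

    const mediaSource = new MediaSource();
    video.src = URL.createObjectURL(mediaSource);
    mediaSource.addEventListener('sourceopen', () => {
      const sourceBuffer = mediaSource.addSourceBuffer('video/webm;codecs="vp8"');
      recorder.ondataavailable = async (event) => {
        // Each blob is assumed to be an individually playable fragment.
        sourceBuffer.appendBuffer(await event.data.arrayBuffer());
      };
    });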

jnoring commented 7 years ago

Honestly, MediaRecorder probably should have produced this output from the beginning, for a couple reasons:

  1. The big issue is getting encoded audio and video. It's prohibitive to have JavaScript perform the encoding, for any number of reasons.
  2. The act of muxing data into a container, however, is not a big issue and is completely doable in JavaScript. There is no good reason a JavaScript implementation of the mp4 or mkv containers couldn't be completely viable. I'd imagine any number of existing muxer libraries (e.g. mp4v2) could be ported to JavaScript with relative ease.

What I'd really like to see is more fine-grained control over the process of encoding media, and less focus on the container. So I like this change. If I could get an H.264 encoded bitstream in Annex B format, for example, there's nothing stopping me from dumping that into an MP4 file quite easily.
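
To make that concrete: MP4 stores length-prefixed NAL units rather than start codes, so the first step in repackaging an Annex B stream is locating the 00 00 01 start codes. A sketch, assuming the whole stream is in memory as a Uint8Array and that no NAL unit ends in a zero byte:

    // Split an H.264 Annex B stream into NAL units by scanning for 00 00 01
    // start codes; the leading zero of a 4-byte 00 00 00 01 start code is
    // trimmed from the end of the preceding unit.
    function splitAnnexB(bytes) {
      const units = [];
      let start = -1;
      for (let i = 0; i + 2 < bytes.length; i++) {
        if (bytes[i] === 0 && bytes[i + 1] === 0 && bytes[i + 2] === 1) {
          if (start >= 0) {
            let end = i;
            if (bytes[end - 1] === 0) end--; // zero belongs to a 4-byte start code
            units.push(bytes.subarray(start, end));
          }
          start = i + 3;
          i += 2;
        }
      }
      if (start >= 0) units.push(bytes.subarray(start));
      return units;
    }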

yellowdoge commented 7 years ago

@pehrsons: it's not just the extra container features (though I can think of some, e.g. muxing subtitle tracks); it's also about not baking extra container-support code into the browser when, as @jnoring says, that code can perfectly well live in JS given its minimal size (the IVF container code in the example above is less than 50 lines), especially if not all container features are needed.
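
For reference, the fixed-layout parts of IVF really are tiny. A sketch of the two headers, following the layout documented at https://wiki.multimedia.cx/index.php/IVF and assuming VP8 payloads with a width, height, and timebase the application already knows:

    // 32-byte IVF file header (little-endian fields).
    function ivfFileHeader(width, height, frameCount, tbDen = 30, tbNum = 1) {
      const buf = new ArrayBuffer(32);
      const bytes = new Uint8Array(buf);
      const view = new DataView(buf);
      bytes.set([0x44, 0x4b, 0x49, 0x46], 0); // signature 'DKIF'
      view.setUint16(4, 0, true);             // version
      view.setUint16(6, 32, true);            // header size
      bytes.set([0x56, 0x50, 0x38, 0x30], 8); // fourcc 'VP80'
      view.setUint16(12, width, true);
      view.setUint16(14, height, true);
      view.setUint32(16, tbDen, true);        // timebase denominator
      view.setUint32(20, tbNum, true);        // timebase numerator
      view.setUint32(24, frameCount, true);   // bytes 28-31 stay unused
      return buf;
    }

    // 12-byte per-frame header: payload size, then a 64-bit timestamp.
    function ivfFrameHeader(frameSize, pts) {
      const buf = new ArrayBuffer(12);
      const view = new DataView(buf);
      view.setUint32(0, frameSize, true);
      view.setBigUint64(4, BigInt(pts), true);
      return buf;
    }

A file is then just ivfFileHeader(...) followed by ivfFrameHeader + payload for each frame.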

The questions I'd have are more along the lines of whether using the MIME type is the best way to achieve containerless recording, and whether (JS) multiplexers would need more information than the raw encoded bitstream.

cpearce commented 7 years ago

Can you achieve what you want today by demuxing the output of MediaRecorder in JS, and then re-muxing it in JS? That is, are you asking for something you can already achieve to be made more convenient, or are you asking for new capabilities to be added to the web platform?

tdaede commented 7 years ago

VP8, VP9, and Opus have no native bitstream format; they are packet based. Your demo seems to rely on ondataavailable firing exactly once per frame and producing exactly one frame's worth of data.

martinthomson commented 7 years ago

As @tdaede says, what is defined as "raw" will depend on the format, which will impose very different usage models. I'm actually more concerned about the loss of the metadata that comes with the container format (timing, etc.). It seems @cpearce has the simplest approach here. Indeed, if there is a container feature that you need to polyfill, why do you believe the browser is capable of providing the information you would need to produce it?

yellowdoge commented 7 years ago

@cpearce: You could indeed achieve the same final result by demultiplexing the multiplexed container, but why have the platform do all that unnecessary work? Configuring the MediaRecorder not to do one of the steps seems easy enough.

@martinthomson, @tdaede: indeed, VP8/VP9, H.264, and Opus are particularly well suited to this container-less recording mode, hence my second question:

whether (JS) multiplexers would need more information than the raw encoded bitstream.

I could envision providing the timestamp of the recorded raw frame, if at all available.

tdaede commented 7 years ago

My point is that those formats are rather ill suited for container-less recording as proposed, because the existing API returns a stream, not a sequence of packets. You have totally changed the semantics of the existing API by making the callback happen per packet. Likewise, I don't think MIME types make sense for non-stream data - if you save a bunch of VP9 or Opus packets to a file, it's unplayable.

why have the platform do all that unnecessary work?

The platform presumably already has to implement the muxing, unless you are proposing removing that from the API.

Your application can already be implemented with the existing API, so I think a polyfill for your proposed API would be a better proof of concept. Also, writing to a container other than IVF would be enlightening, because IVF contains much less data than most containers - it lacks seeking, timing, audio, etc.

For an example of a packet-based API, you might want to look at ffmpeg, for example their AVPacket: https://www.ffmpeg.org/doxygen/3.0/structAVPacket.html
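
For concreteness, a hypothetical per-packet shape for such an API, loosely mirroring AVPacket; none of these fields exist on MediaRecorder or its events today:

    const packet = {
      data: new Uint8Array(0), // one encoded frame/packet, no container framing
      pts: 0,                  // presentation timestamp, in some stated timebase
      dts: 0,                  // decode timestamp (differs from pts with B-frames)
      keyFrame: true,          // muxers need this to place seek points
      trackId: 0,              // which MediaStreamTrack the packet belongs to
    };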

yellowdoge commented 7 years ago

The platform presumably already has to implement the muxing,

This is correct: the platform has to ship a(t least one) muxer, but it is usually able to switch it off programmatically at run-time. Let me clarify my reply: yes, demux-remux in JS is doable, but having the platform multiplex the encoded bitstream(s) and then having the JS code demux and remux looks suboptimal versus telling the platform not to multiplex at all.

H.264, VP8, VP9, Opus: they all seem to work at encoded-packet granularity. So, again, my second question: what information would be needed for a JS multiplexer to work with a platform that produces encoded frames (perhaps not via ondataavailable)? How does the Firefox implementation handle this? What is generally understood when the MIME type is video/h264?

tdaede commented 7 years ago

It might be better to think of the muxed stream itself as an API. The JS is provided with the information as a byte array, out of which it can parse the data it needs. Sure, you can provide the information in your own way as JS objects, but I'm not convinced that is better than representing it as one of several standard containers (MP4, Matroska/WebM, Ogg).

what information would be needed for a JS multiplexer to work with a platform that produces encoded frames?

You are missing things like PAR for VP8, and channel configuration for Opus. Also, how is A/V sync going to work?

What is generally understood when the MIME type is video/h264?

An H.264 Annex B stream, which is basically a mini-container in its own right, comparable to IVF (though more feature-complete).

Is your goal mainly to reduce the amount of JS code (by increasing browser complexity)? If the JS is already muxing, is demuxing that much of an extra burden? (There are already examples of this for MSE, such as HLS.js.)

I think other improvements to the API, such as being able to read fragments as @Pehrsons suggested, would be more useful for those implementing custom muxers.

jnoring commented 7 years ago

I'm not convinced that is better than representing it as one of several standard containers (MP4, Matroska/WebM, Ogg)

As a video engineer, I would dramatically prefer raw encoded data to the current Blob output. Why? It's much more understandable and far more flexible: I have the ability to mux it into any container I want, buffer it however I see fit, and so on. It can easily be streamed and recorded in really unique ways.

No argument that it could be implemented with a JS shim to parse the current blobs, but it seems silly to mux and then demux when the middleman can be skipped. I view this as a far more flexible API than one whose output is a muxed container.

You are missing things like PAR for VP8, and channel configuration for Opus.

...these are pretty easy things to communicate. It's really no different from communicating SPS/PPS info in an H.264 stream or channel config in AAC; you can either insert it in-stream (so it's a frame unto itself), or have some other means of fetching it. Codec "special data" more or less. Not hard.

Also, how is A/V sync going to work?

Well, presumably timestamps are already converted into a common clock so they can be muxed into containers; it seems pretty simple to pass that through on per-frame data. Maybe I'm not following you, though.

Is your goal mainly to reduce the amount of JS code (by increasing browser complexity)?

My hope would be to have something that provides a lower level interface for people who want to go far beyond basic muxing into a file.

foolip commented 7 years ago

So, a simple concatenation of all the bytes that come out of a video encoder won't be intelligible, so I guess the choices are:

  1. Provide framing by means of the individual Blob objects and the events fired, either adding timestamps and other needed information to a new type of event, or delivering an object that carries both a Blob and the extra information (see the sketch after this list).
  2. Pick a container format that's trivial to parse and only needs to work with a single track, and add support for that.
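
A hypothetical shape for option 1, with the metadata fields invented for illustration (today's BlobEvent exposes only the data Blob and a timecode; mux is an application-defined muxer):

    recorder.ondataavailable = (event) => {
      // timestamp, duration, and isKeyFrame are hypothetical additions.
      const { data, timestamp, duration, isKeyFrame } = event;
      mux(data, { timestamp, duration, isKeyFrame });
    };
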
jnoring commented 7 years ago

Nobody is proposing a "simple concatenation of all bytes." Agreed, that would be unintelligible.

jnoring commented 7 years ago

In thinking about this more, it isn't clear to me that the current "blob" output could even be parsed. For example, most MP4 files are not parseable before they've been finalized, as the moov atom isn't complete. And the moov atom is often at the end of the file if the MP4 hasn't been optimized for streaming.

Is it known that the JS shim approach of decoding outgoing blobs would actually yield a stream that can be parsed, or would the file need to be finalized first?
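
One way to check, as a sketch: list the top-level box types of an in-progress recording and see whether a moov box is present yet. This assumes complete boxes with 32-bit sizes; size-0 (to end of file) and size-1 (64-bit) boxes are not handled:

    async function topLevelBoxes(blob) {
      const bytes = new Uint8Array(await blob.arrayBuffer());
      const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
      const types = [];
      for (let offset = 0; offset + 8 <= bytes.length; ) {
        const size = view.getUint32(offset); // big-endian box size
        types.push(String.fromCharCode(...bytes.subarray(offset + 4, offset + 8)));
        if (size < 8) break;
        offset += size;
      }
      return types; // e.g. ['ftyp', 'mdat'] while the moov atom is still pending
    }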

yellowdoge commented 7 years ago

I have been thinking about the arguments raised by @tdaede and @jnoring, and here's the situation as I see it:

  1. the container-less demo and the associated proposal of container-less mimeTypes would probably work well, but they make sense only as long as the API guarantees that ondataavailable is handed fully contained frames, and that extra constraint on the behaviour of ondataavailable might prove overbearing;
  2. alternatively, extending ondataavailable to support fragments of encoded+contained frames and/or the other necessary metadata (e.g. timestamps or a keyframe indication) would probably (over?)complicate the API;
  3. finally, adding a low-complexity container and/or this necessary metadata is equivalent to adding a JS webm demuxer (well, it would be suboptimal to mux and then demux, but that's a different story; then again, JS cannot directly access a Blob's data).

I've been trying to write a JS demo that demultiplexes a recorded, video-only webm and remultiplexes it into IVF to showcase 3., but I've had no success so far.

For all that, my original proposal of hijacking/extending mimeType and ondataavailable makes less sense to me now. Thoughts? Alternatives?

Pehrsons commented 7 years ago

For fragments in 2.: I imagine an extra member of MediaRecorderOptions saying that the blobs should be individually playable fragments. I don't think that complicates the API.
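
A sketch of what that could look like, with independentFragments as a purely hypothetical option name:

    // Hypothetical MediaRecorderOptions member; not part of the current spec.
    const recorder = new MediaRecorder(stream, {
      mimeType: 'video/webm;codecs=vp8',
      independentFragments: true, // each blob: an individually playable fragment
    });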

jnoring commented 7 years ago

@Pehrsons I like that idea. If I could get proper "fragments" that are guaranteed to be parseable, that's a good solution. Or if I could somehow specify that each blob always represents one closed, parseable GOP, that'd be interesting too.

tdaede commented 7 years ago

Sounds good to me too; it makes the API 1:1 with MSE, so it stays useful even if a lower-level API is added later. Like @jnoring said, it's also important to specify how this works with GOPs: although it's conventional to have one GOP per fragment, MSE allows multiple fragments per GOP, and you can even have one fragment per frame if you want. It might not need to be very configurable, but it does need to be well-defined.

alvestrand commented 7 years ago

Seems to me that a frame-producing API is a different API from MediaStreamRecorder, and should be specified separately. Trying to munge the per-frame mode into the present recording API is not likely to give a clean API for either function.