scottlamb / moonfire-nvr

Moonfire NVR, a security camera network video recorder

Live stream initialization messiness #168

Open clydebarrow opened 2 years ago

clydebarrow commented 2 years ago

I've been working on getting live streams to work nicely in my new UI. Reverse-engineering the existing UI Javascript code revealed that the undocumented X-Video-Sample-Entry-Id header in the websocket stream is used as an index to fetch an initialization segment via /api/init/<id>.mp4. This is messy, because it's necessary to subscribe to the websocket, wait for the first data packet, extract the id, then make another request to the init endpoint, then send that data segment to the mediasource before sending the first packet already received from the websocket.
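The messy sequencing described above can be sketched roughly as follows. This is a hypothetical illustration, not Moonfire NVR's actual client code: the `X-Video-Sample-Entry-Id` header and `/api/init/<id>.mp4` endpoint come from the issue text, while the function names and parsing details are assumptions. Each websocket message is treated as a MIME-like part: header lines, a blank CRLF line, then the raw `.mp4` segment bytes.

```typescript
interface LivePart {
  headers: Map<string, string>;
  body: Uint8Array; // the raw .mp4 media segment bytes
}

// Split one websocket message into its headers and binary body by scanning
// for the blank line (CRLF CRLF) that separates them.
function parseLivePart(msg: Uint8Array): LivePart {
  let split = -1;
  for (let i = 0; i + 3 < msg.length; i++) {
    if (msg[i] === 13 && msg[i + 1] === 10 && msg[i + 2] === 13 && msg[i + 3] === 10) {
      split = i;
      break;
    }
  }
  if (split < 0) throw new Error("no blank line found in live part");
  const headerText = new TextDecoder().decode(msg.subarray(0, split));
  const headers = new Map<string, string>();
  for (const line of headerText.split("\r\n")) {
    const colon = line.indexOf(":");
    if (colon >= 0) {
      headers.set(line.slice(0, colon).trim().toLowerCase(), line.slice(colon + 1).trim());
    }
  }
  return { headers, body: msg.subarray(split + 4) };
}

// The awkward startup sequence then looks roughly like:
//   1. open the websocket and wait for the first message;
//   2. id = parseLivePart(msg).headers.get("x-video-sample-entry-id");
//   3. fetch(`/api/init/${id}.mp4`) and append it to the SourceBuffer;
//   4. only then append the buffered first media segment.
```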

I notice that in the latest version (0.6.6) the API data structure includes integer id values for cameras and streams, and I had hoped that the stream ids would match those being sent in the websocket data headers for the same stream, but this does not appear to be the case. Here are some examples:

NorthEast (sub): getting init segment for stream(metadata.id 4) - id 5
Runway (sub): getting init segment for stream(metadata.id 8) - id 7
Driveway (main): getting init segment for stream(metadata.id 1) - id 8
NorthEast (main): getting init segment for stream(metadata.id 3) - id 9
NorthWest (sub): getting init segment for stream(metadata.id 6) - id 5

Curiously the same init segment id (5) is seen for two different streams.

It would be much more convenient if the initialization segment could be prefetched before opening the websocket. Since the first websocket data packet is guaranteed to contain a keyframe, it's important not to lose it.

scottlamb commented 2 years ago

Sorry for the lack of documentation.

> Curiously the same init segment id (5) is seen for two different streams.

The init segment describes properties of the stream like resolution, color depth, sample aspect ratio, frame rate (if the camera includes it), maximum H.264 bitrate, etc. It's common for two cameras of the same brand configured in the same manner to have the same data, and Moonfire NVR de-dupes identical init segments into the same id.

> It would be much more convenient if the initialization segment could be prefetched before opening the websocket. Since the first websocket data packet is guaranteed to contain a keyframe, it's important not to lose it.

The init segment for a particular stream can change. E.g., if you alter your camera's resolution, then after a likely gap in the live stream, it will start sending media segments with a different video sample entry id. My live view prototype probably doesn't handle this correctly, but the protocol is meant to support handling it in a race-free way. Fetching the init segment before opening the websocket wouldn't accomplish that.
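The race-free handling described here could be sketched as follows: segments are processed strictly in arrival order, and whenever a media segment names a sample entry id different from the one currently applied, the matching init segment is pushed to the decoder first. `SegmentSink` and `getInit` are stand-ins of my own invention, not Moonfire NVR's client API; a real client's `getInit` would asynchronously fetch `/api/init/<id>.mp4`, and appends would have to be serialized through the SourceBuffer's `updateend` events.

```typescript
interface SegmentSink {
  append(buf: Uint8Array): void;
}

class LiveAssembler {
  private appliedId: number | null = null;

  constructor(
    private sink: SegmentSink,
    private getInit: (id: number) => Uint8Array,
  ) {}

  onSegment(sampleEntryId: number, body: Uint8Array): void {
    if (sampleEntryId !== this.appliedId) {
      // First segment, or the camera's settings changed mid-stream: the new
      // init segment must reach the decoder before this media segment does.
      this.sink.append(this.getInit(sampleEntryId));
      this.appliedId = sampleEntryId;
    }
    this.sink.append(body);
  }
}
```

Because the init segment is injected inline at the point where the id changes, there is no window in which a media segment can reach the decoder ahead of the init segment that describes it.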

I could send the init segment over the websocket initially and any time it changes, so the client doesn't have to make an extra request. This wastes some bandwidth as the init segment is super-cacheable. But I just checked and the init segment's body on one of my cameras was only 677 bytes, which doesn't seem like enough to worry about. There will be more data once we support audio, but not a lot more. I can't really think of a good reason it'd get significantly larger. So it seems fine to switch in the name of improving convenience and avoiding an occasional round trip.

clydebarrow commented 2 years ago

Sending it as part of the websocket stream makes eminent sense to me. I was not concerned about changes in resolution since I would see that as an unusual event, and one that would justify reloading the UI. Avoiding the extra roundtrips will make for faster startup of a stream especially on a high-latency link.

clydebarrow commented 2 years ago

Also, eliminating the X-Video-Sample-Entry-Id header from each packet will save much more bandwidth than the occasional init segment will consume. Are all the other headers really necessary? Decoding them in JS is time-consuming apart from anything else.

scottlamb commented 2 years ago

The other headers are meant for creating a (so far unimplemented) single UI that supports both a scrub bar of history and live view. They (should) allow the streamed live view stuff to be properly placed in the same timeline/buffer as stuff returned via /recordings + /view.mp4, avoiding weird jumps or redundant fetches. For just live view alone, and not caring too much about seeking around or knowing the exact timestamp of the video you're viewing, no, they're not necessary. I could put in an option to strip them out if it's a problem. It doesn't seem like it's too much bandwidth, though. I wouldn't have tried to optimize the bandwidth of the init segments either if I'd thought through at the time how few bytes it actually was.

I could also switch that header info to a different format, like JSON, if that's significantly easier to deal with.

clydebarrow commented 2 years ago

The content-type header (or some other way of getting the info) is required to initialise the SourceBuffer. The rest I'm not using at present. From a consumption perspective it would be ideal if the data packet had no headers, or a known-size header; right now I have to scan for a blank line to find the start of the binary data. A fixed-size header would be hard given that the Content-Type is variable length, but perhaps a header-length byte or word at the start of each data packet would work.

Another option would be to provide a separate out-of-band channel to deliver the metadata.

I'd suggest mulling it over for now - things are working so no urgency to change. The only major issue right now I'm seeing is high CPU on a mobile device with several streams open.

> The other headers are meant for creating a (so far unimplemented) single UI that supports both a scrub bar of history and live view.

That sounds nice, but I suspect the implementation effort might outweigh the usefulness. Probably the same benefit could be achieved by having two simultaneous streams open - the live view and a corresponding recording. I wasn't planning to enable seeking (or even pausing) the live view.

scottlamb commented 2 years ago

4-byte header length + json header + mp4 wouldn't be too hard.
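A client-side parser for that proposed framing might look like the sketch below: a 4-byte big-endian header length, a JSON header, then the raw `.mp4` bytes. The JSON field names are invented for illustration; nothing here is a committed wire format.

```typescript
interface FramedPart {
  header: Record<string, unknown>; // e.g. { "contentType": ..., "sampleEntryId": ... }
  body: Uint8Array;                // the raw .mp4 media segment bytes
}

function parseFramedPart(msg: Uint8Array): FramedPart {
  if (msg.length < 4) throw new Error("message too short for length prefix");
  const view = new DataView(msg.buffer, msg.byteOffset, msg.byteLength);
  const headerLen = view.getUint32(0); // DataView reads big-endian by default
  if (4 + headerLen > msg.length) throw new Error("truncated header");
  const header = JSON.parse(new TextDecoder().decode(msg.subarray(4, 4 + headerLen)));
  return { header, body: msg.subarray(4 + headerLen) };
}
```

This replaces the linear scan for a blank line with constant-time offset arithmetic, and JSON.parse handles the header decoding that currently has to be done by hand.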

Have you profiled to know if the high CPU is in this parsing? My intuition suggests it should be cheap but profiles beat intuition.

> That sounds nice, but I suspect the implementation effort might outweigh the usefulness.

I don't think the unified form should be harder than separate high quality scrub bar UI and high quality live view. But those two pieces are certainly not trivial or done today either, and it's possible I'm wrong.

On a semi-related note: whenever you're ready, I'd be happy to chat about your goal for the UI you're implementing. Is it meant to be, e.g., your alternative take on a UI, a recommended UI but maybe not the one I prototype features in, or the only UI we bother with at all? I'm happy to support your development no matter what. If we settle on the last option, I at least need to make sure I can find my way around Kotlin when needed, and that we have compatible goals for features, user experience, supported platforms, etc.