moq-wg / moq-transport

draft-ietf-moq-transport

Broadcast publisher reconnect support #79

Open kpugin opened 1 year ago

kpugin commented 1 year ago

In the Model section it says that each Media object is uniquely identified within a broadcast, track, etc. I am wondering how reconnect would work - wouldn't it require the publisher to maintain state? What happens if one encoder box fails and the broadcast needs to be resumed from another box?

fluffy commented 9 months ago

What happens in an HTTP CDN today if a cache asks for a file from the origin, the origin says it is cacheable for a long time, and then the origin server crashes and, when it comes back up, generates a different file with the same name? What happens is pretty broken: different clients get different versions of the file.

We have the same issue here: if two clients publish an object with different data but the same full track name, group ID, and object ID, then it is undefined which version will be served to any given subscription. This is going to be the nature of any design that provides both aggregation and caching and uses unique names.
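To make the hazard concrete, here is a minimal sketch of why a relay cache keyed only on the name tuple cannot distinguish the two versions. The cache structure and first-writer-wins behavior are illustrative assumptions, not from the draft:

```typescript
// Hypothetical relay cache keyed on (full track name, group ID, object ID).
// Names, types, and behavior are invented for illustration.
type ObjectKey = string;

const cache = new Map<ObjectKey, Uint8Array>();

function keyFor(trackName: string, groupId: number, objectId: number): ObjectKey {
  return `${trackName}/${groupId}/${objectId}`;
}

function onPublish(trackName: string, groupId: number, objectId: number, payload: Uint8Array): void {
  const key = keyFor(trackName, groupId, objectId);
  // First writer wins: a second publish with the same key but different
  // payload is silently ignored, so which version a subscriber sees depends
  // entirely on which publish reached this relay first.
  if (!cache.has(key)) {
    cache.set(key, payload);
  }
}
```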

afrind commented 9 months ago

Individual Comment:

HTTP has ETags to help deal with this problem. This is not an endorsement that we should add ETags to moq.

wilaw commented 9 months ago

The OP raises two questions: how to handle reconnects and how to handle failovers. These are separate behaviors and should be addressed separately. Here's how each might work:

Reconnect A publisher connects for the first time and ANNOUNCEs a namespace+name. It receives a subscribe request, starts publishing, and then loses its connection midway through publishing a group and object. A minute later, its connection returns. Before resuming the track, it can publish a catalog update describing a discontinuity in the track: the last known good group and object number, and the group and object number on which it will resume publishing. It then resumes publishing at the new group and object number. Since end-clients should always be subscribed to catalog updates, they would be informed of the discontinuity and could purge their buffers and reset their decoders accordingly. This behavior gives publishers the choice of resuming where they left off (and thus falling behind in latency) or resuming at the live point.
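As a rough illustration of what such a catalog discontinuity entry could carry, here is a sketch; every field name is hypothetical, since the catalog format defines no such fields today:

```typescript
// Hypothetical shape of a catalog update announcing a discontinuity.
// All field names are invented for illustration.
interface TrackDiscontinuity {
  trackName: string;
  lastGroup: number;    // last known-good group before the outage
  lastObject: number;   // last known-good object within that group
  resumeGroup: number;  // group on which publishing resumes
  resumeObject: number; // object on which publishing resumes
}

// On receiving this, a subscriber would purge any buffered data between
// (lastGroup, lastObject) and (resumeGroup, resumeObject) and reset its
// decoder before consuming the resumed track.
const update: TrackDiscontinuity = {
  trackName: "video0",
  lastGroup: 1041,
  lastObject: 27,
  resumeGroup: 1045,
  resumeObject: 0,
};
```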

Failover I don't think 3rd-party networks should be responsible for seamlessly repairing tracks. This means the notion of "primary" and "backup" versions of a track exists only between publishers and end-subscribers. We can use the catalog, published by each redundant publisher, to describe the failover track and whether or not it is timeline-identical. If an end-subscriber times out against the primary source (or receives some sort of error message), it can switch itself to the backup source. These primary and backup sources must have different namespace+name tuples so that the network caches them independently.
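A minimal sketch of the subscriber side of that failover, assuming a hypothetical subscribe() API and a catalog that lists a backup tuple:

```typescript
// Hypothetical subscriber-side failover. subscribe(), resetDecoder(), and
// the catalog fields below are assumptions for illustration only.
declare function subscribe(namespace: string, name: string): Promise<void>;
declare function resetDecoder(): void;

interface RedundantSource {
  primary: { namespace: string; name: string };
  backup: { namespace: string; name: string };
  timelineIdentical: boolean; // does the backup share the primary's timeline?
}

async function subscribeWithFailover(src: RedundantSource): Promise<void> {
  try {
    await subscribe(src.primary.namespace, src.primary.name);
  } catch {
    // Timed out or got an error from the primary: switch to the backup.
    // The tuples differ, so the network has cached the two independently.
    if (!src.timelineIdentical) {
      resetDecoder(); // different timeline, so the decoder must be reset
    }
    await subscribe(src.backup.namespace, src.backup.name);
  }
}
```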

kixelated commented 9 months ago

I am wondering how reconnect would work - wouldn't it require the publisher to maintain state? What happens if one encoder box fails and the broadcast needs to be resumed from another box?

The publisher MUST retain state for any reconnect to work. Even something as trivial as group sequence numbers needs to be maintained; otherwise you run into split brain and potential cache poisoning. Consider a player that receives group 69 and then suddenly group 0 (because of a stateless reconnect): it will just throw it out. And of course tracks need to be consistent; the new encoder can't just use different encodings or identifiers on a whim.
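A sketch of the player-side check being described, with an invented onGroup() handler:

```typescript
// Illustrative player logic: groups must be monotonically increasing.
// A stateless publisher restart that begins again at group 0 is rejected.
let latestGroup = -1;

function onGroup(groupId: number): boolean {
  if (groupId <= latestGroup) {
    // e.g. latestGroup === 69 and groupId === 0 after a stateless
    // reconnect: the player discards it as stale or duplicate data.
    return false;
  }
  latestGroup = groupId;
  return true;
}
```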

We should absolutely support publishers that can seamlessly continue a broadcast if they can persist the required state. As @afrind mentioned, I think this would look more like an ETag: the new ANNOUNCE would specify that it's a continuation of the original ANNOUNCE if the two have matching IDs.
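One way that ETag-like continuation might look on the relay side; the continuationId field and the announce handling are hypothetical, not part of the draft:

```typescript
// Hypothetical relay handling of an ANNOUNCE that claims to continue an
// earlier broadcast via a matching ID.
interface Announce {
  namespace: string;
  continuationId?: string; // claims to continue the ANNOUNCE with this ID
}

const activeBroadcasts = new Map<string, string>(); // namespace -> broadcast ID

function onAnnounce(msg: Announce, newId: string): "resume" | "fresh" {
  const existing = activeBroadcasts.get(msg.namespace);
  if (existing !== undefined && msg.continuationId === existing) {
    // Matching ID: treat this as a seamless continuation, so existing
    // cached state (group numbers, track config) remains valid.
    return "resume";
  }
  // No match: a fresh broadcast replaces whatever came before.
  activeBroadcasts.set(msg.namespace, newId);
  return "fresh";
}
```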

suhasHere commented 9 months ago

IoT media publishers will certainly not maintain any state across restarts. A system that uses static names, maintains caches based on them, and expects this to keep working across restarts is brittle, and it doesn't scale well when the system is dynamic and networks/connections come and go.

ianswett commented 7 months ago

Possibly this can be fixed as part of @fluffy's PRs for make-before-break?