moq-wg / moq-transport

draft-ietf-moq-transport

Group IDs and gaps #427

Closed vasilvv closed 2 months ago

vasilvv commented 2 months ago

As far as I understand, we currently allow group IDs to have gaps in them, with the only requirement being that they are monotonically increasing. This presents a problem with caching. Assume I am a relay, and I receive a range request for [10, 20); if the origin has never produced group 15, then even if I have the entirety of that range in the cache, I would not know that, since 15 would simply be missing, and the only thing I can do with that is to request 15 from the origin every time.
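
To make the ambiguity concrete, here is a minimal sketch of a relay cache (hypothetical types, not from any MoQT implementation): with gaps allowed, an absent cache entry is indistinguishable from a group that was never produced, so the relay can never serve a range from cache alone.

```rust
use std::collections::BTreeMap;

/// Hypothetical relay cache keyed by group ID (illustrative only).
struct GroupCache {
    groups: BTreeMap<u64, Vec<u8>>, // group ID -> cached group data
}

impl GroupCache {
    /// Try to serve a FETCH for [start, end) purely from the cache.
    /// With gaps allowed, a missing key is ambiguous: it might mean
    /// "never produced" or "not cached yet", so the relay must treat
    /// every absent ID in the range as a potential upstream fetch.
    fn serve_range(&self, start: u64, end: u64) -> Option<Vec<&[u8]>> {
        let mut out = Vec::new();
        for id in start..end {
            match self.groups.get(&id) {
                Some(group) => out.push(group.as_slice()),
                // Group 15 in the example lands here even if it was
                // never produced, forcing a trip to the origin.
                None => return None,
            }
        }
        Some(out)
    }
}
```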

Possible solutions to this problem:

  1. Add some mechanism to indicate that 15 is not there (e.g. by putting the number of the previous group at the beginning of every group).
  2. Require group numbers to be incrementing by one.

I am a fan of approach number 2 since it's conceptually simpler, but I think we can make some version of 1 work.

suhasHere commented 2 months ago

I am not a fan of approach 2 since it is limiting and applies only to typical media applications.

I think we need to understand that the application controls the group IDs and that they are independent sync points. The catalog should provide that information to the players.

As for gaps due to an object being dropped, that information is identified within the object itself.

Let's take an application example where the group numbers increment by 2, so the groups would be 10, 12, 14, 16, 18, 20 in the above example. Now consider that group 12 was dropped by the publisher; then the relay's cache will be in this form: [10, 12 (not sent by the publisher), 14, 16, 18, 20]

Keeping the relay unaware of application semantics is something we should allow. Regardless, the end applications know exactly how the groups behave, and if there are gaps (either due to the broadcaster not producing or things being dropped for other reasons), they will be identified over the data plane as such.

afrind commented 2 months ago

@suhasHere: can you give a specific use case for how an application could use non-sequential group IDs for some benefit?

vasilvv commented 2 months ago

Let's take an application example where the group numbers increment by 2, so the groups would be 10, 12, 14, 16, 18, 20 in the above example. Now consider that group 12 was not sent by the publisher; then the relay's cache will be in this form: [10, 12 (not sent by the publisher), 14, 16, 18, 20]

The problem I have here is not with 12; the problem is that the relay cannot distinguish 11, 13, 15, 17, 19 being in the state "has not arrived yet" from the state "does not and will never exist", so it will always be forced to fetch those.

suhasHere commented 2 months ago

Let's take an application example where the group numbers increment by 2, so the groups would be 10, 12, 14, 16, 18, 20 in the above example. Now consider that group 12 was not sent by the publisher; then the relay's cache will be in this form: [10, 12 (not sent by the publisher), 14, 16, 18, 20]

The problem I have here is not with 12; the problem is that the relay cannot distinguish 11, 13, 15, 17, 19 being in the state "has not arrived yet" from the state "does not and will never exist", so it will always be forced to fetch those.

The relay should not worry about the semantic structure. The relay needs to fetch only those that have been marked as dropped. If something hasn't arrived, then it will eventually arrive or the upstream will mark it as dropped. In this example, the relay fetches 12 and sends the rest from its cache.

If the relay makes decisions on behalf of the application, it will end up fetching things that are not even part of the application (here the groups 11, 13, 15, 17, 19).

My proposal is to act on the things you know explicitly and not to guess about the unknown.

vasilvv commented 2 months ago

I'm confused. How is the relay supposed to know that when being asked for range [10, 20), it does not need to fetch 15?

suhasHere commented 2 months ago

I'm confused. How is the relay supposed to know that when being asked for range [10, 20), it does not need to fetch 15?

Say the relay cache is empty; it will ask upstream for the range [10-20). The upstream publisher will send 10, 12, 16, 18, 20, and the relay responds with that. OTOH, if the upstream provides 10, 12, 16, 18 (dropped due to TTL), 20 and the operation is a fetch, the relay will fetch 18 to fill in the gap, and once 18 arrives, it will be able to send 18 along with the others.

vasilvv commented 2 months ago

Sure, let's assume there are no drops, the first fetch is successful, and it puts 10, 12, 14, 16, 18 in cache.

Now the second fetch arrives for the exact same range, [10, 20). It looks at its cache; it has objects 10, 12, 14, 16, 18 that it can return immediately, but it doesn't know if those are all the objects that exist in that range or not (we know that because we know the previous request was [10, 20), but it is entirely possible that those objects ended up in the cache as the results of five different fetches), so it has to repeat the fetch for those.

kixelated commented 2 months ago

I strongly think we should have sequential groups.

Like Victor said, the group ID can be used to detect gaps at the moq-transport layer if it's sequential. I understand the desire to stuff application-specific metadata into these fields, but it's a layer violation. Put application-specific stuff in the object payload (e.g. timestamps) and put properties useful to relays in the moq-transport layer.
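
As a sketch of what sequential group IDs would buy the transport layer (hypothetical receiver-side bookkeeping, assuming IDs increase by exactly one, which the draft does not currently require): gap detection needs no application knowledge at all.

```rust
/// Hypothetical per-track receiver state, assuming group IDs must
/// increase by exactly one (not what the draft currently requires).
struct TrackReceiver {
    next_expected: u64,
}

impl TrackReceiver {
    /// `start` is the first group ID covered by the subscription.
    fn new(start: u64) -> Self {
        TrackReceiver { next_expected: start }
    }

    /// Returns the IDs of any groups skipped before `group_id` arrived,
    /// with no application knowledge needed.
    fn on_group(&mut self, group_id: u64) -> Vec<u64> {
        let missing: Vec<u64> = (self.next_expected..group_id).collect();
        self.next_expected = self.next_expected.max(group_id + 1);
        missing
    }
}
```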

suhasHere commented 2 months ago

Group IDs are just application sync points and are by definition independent of each other. Adding a correlation between them can be invalid/incorrect/misleading, as a relay is unaware of how the application defines its sync points.

OTOH, objects have dependencies within a group (which is why the app decided to send them as part of a given group), and having the relay maintain state to find gaps there may be useful.

wilaw commented 2 months ago

I am also a supporter of incrementally increasing Group IDs. Having such IDs does not break the contract that these points are independent of one another. It simply means that they occur in a sequence. And being part of a sequence is a fundamental attribute of all MOQT groups, since their parent, a track, represents a temporal flow of data.

Now the second fetch arrives for the exact same range, [10, 20). It looks at its cache; it has objects 10, 12, 14, 16, 18 that it can return immediately, but it doesn't know if those are all the objects that exist in that range or not (we know that because we know the previous request was [10, 20), but it is entirely possible that those objects ended up in the cache as the results of five different fetches), so it has to repeat the fetch for those.

This example from Victor illustrates the core problem. For anything other than a contiguous sequence of group numbers, a relay can never know whether the groups exist or not, and it would always have to make a request upstream to check. It could keep track of prior subscriptions (for example [10..12], [14..16], [18..20]); however, if any of those did not overlap, then it would be forced to make a fetch for [12..14] and [16..18] in case groups 13 and 17 existed. That is a wasteful and convoluted workflow.

I tried to think of a use-case where I would want non-incremental group numbers. One example would be to use epoch time as group number. So if I'm making variable length segments, and doing segment-per-group, I might number my groups:

1712302548317, 1712302550625, 1712302551901, 1712302553412

A player wanting to seek back in time would need a timeline track to tell it the precise group number it needed. That timeline track could equally map group numbers to epoch time, which would allow me to number my groups incrementally:

1,2,3,4

The meta-point here is that Group IDs only define sequence; any other relationship between those sync points can be expressed via an application-level timeline track and/or catalog.
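
A minimal sketch of that timeline idea, assuming sequential group IDs and a hypothetical application-level timeline structure (none of this is defined by the draft): the epoch timestamps live in the application layer, and the wire carries plain sequence numbers.

```rust
use std::collections::BTreeMap;

/// Hypothetical application-level timeline track: the wire carries
/// sequential group IDs (1, 2, 3, ...) and the timeline maps wall-clock
/// segment start times to those IDs.
struct Timeline {
    group_by_start_ms: BTreeMap<u64, u64>, // segment start (epoch ms) -> group ID
}

impl Timeline {
    /// Latest group whose segment starts at or before the seek target.
    fn group_for_time(&self, epoch_ms: u64) -> Option<u64> {
        self.group_by_start_ms
            .range(..=epoch_ms)
            .next_back()
            .map(|(_, &group_id)| group_id)
    }
}

fn main() {
    let timeline = Timeline {
        group_by_start_ms: BTreeMap::from([
            (1712302548317, 1),
            (1712302550625, 2),
            (1712302551901, 3),
            (1712302553412, 4),
        ]),
    };
    // Seek to a time inside the second segment: subscribe to group 2.
    assert_eq!(timeline.group_for_time(1712302551000), Some(2));
}
```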

suhasHere commented 2 months ago

And being part of a sequence is a fundamental attribute of all MOQT groups, since their parent, a track, represents a temporal flow of data.

A temporal flow of data doesn't imply a sequential numbering of groups either.

suhasHere commented 2 months ago

Now the second fetch arrives for the exact same range, [10, 20). It looks at its cache; it has objects 10, 12, 14, 16, 18 that it can return immediately, but it doesn't know if those are all the objects that exist in that range or not

The second fetch gets the exact same answer (10, 12, 14, 16, 18, since those fall in the range). If there were gaps due to groups being dropped on the path or a source deciding to skip a group, those should be marked so the relay can see whether it needs to satisfy the fetch request.

I feel it will help us to think about why gaps can occur and how they can be explicitly signaled. Relays then act on the data in the cache in an application-agnostic way.

kixelated commented 2 months ago

@suhas How does the subscriber know the SUBSCRIBE/FETCH is done? #424

We have a SUBSCRIBE_DONE message with end=20. The subscriber receives 10,12,14,16,18,20 but when does it stop waiting? Is group 19 still in transit? There's no drop notification for it since it doesn't exist.

suhasHere commented 2 months ago

@Suhas How does the subscriber know the SUBSCRIBE/FETCH is done? #424

We have a SUBSCRIBE_DONE message with end=20. The subscriber receives 10,12,14,16,18,20 but when does it stop waiting? Is group 19 still in transit? There's no drop notification for it since it doesn't exist.

The subscriber application knows its catalog, can deduce the group distribution, and can find out there will be no group 19.

kixelated commented 2 months ago

@Suhas How does the subscriber know the SUBSCRIBE/FETCH is done? #424 We have a SUBSCRIBE_DONE message with end=20. The subscriber receives 10,12,14,16,18,20 but when does it stop waiting? Is group 19 still in transit? There's no drop notification for it since it doesn't exist.

The subscriber application knows its catalog, can deduce the group distribution, and can find out there will be no group 19.

  1. The moq-transport library should tell the application that a subscription has terminated, and not the other way around. Especially since the moq-transport library can't release any state until this optional signal is provided by the application (which might not care about gaps).

  2. If the application needs to use a catalog/timeline to detect gaps, then it can use that same catalog/timeline to carry the timestamp that is being shoved into the group_id. Or it can put it in the object payload.

kixelated commented 2 months ago

So we're talking about two different fields and I'm going to use disambiguating names:

group_sequence: Increases by 1 each group. Provides ordering and gap detection.
group_epoch: Increases by an application-specific amount for each group (no units). Provides ordering.
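
A rough sketch of how the two fields could sit side by side on a group header (the field names are the ones coined above; nothing like this exists in the draft):

```rust
/// Hypothetical group header carrying both fields discussed above
/// (field names from this thread, not from the draft).
struct GroupHeader {
    /// Increases by exactly 1 per group: gives relays ordering plus
    /// gap detection with no application knowledge.
    group_sequence: u64,
    /// Increases by an application-chosen amount per group (no units):
    /// gives ordering only, e.g. a timestamp-like value.
    group_epoch: u64,
}
```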

There's a world where both are separate properties (and fields) of a group in the moq-transport layer. Maybe the relay could use the epoch (as a timestamp with units) to perform TTLs, or latency budgets, or fetches.

But I think we should punt timestamps and metadata in general to the application for now. I'm also not sure how this would work in practice anyway without an equivalent object_epoch, but object_ids are already sequential.

wilaw commented 2 months ago

A temporal flow of data doesn't imply a sequential numbering of groups either.

We define a track (our subscribable entity) as a "sequence of Groups". That is part of our object model. There is no part of a track which does not belong to a group. Therefore, a track consists of a sequence of groups.

[Screenshot (2024-04-05): diagram of three example tracks]

This diagram shows three tracks. The first two tracks are valid, one with temporally contiguous and the other with temporally non-contiguous groups. The third track is not valid, as we don't allow a publisher to produce objects at the same time in the same track which belong to different groups.

suhasHere commented 2 months ago

This diagram shows three tracks. The first two tracks are valid, one with temporally contiguous and the other with temporally non-contiguous groups. The third track is not valid, as we don't allow a publisher to produce objects at the same time in the same track which belong to different groups.

This looks like an example constructed incorrectly. None of my points above suggested overlapping group IDs. All I am saying is that group IDs are independent and don't have to increase in sequence.

As you pointed out, both track 1 and track 2 are valid, with which I 100% agree. I am proposing we shouldn't force an application to follow only what looks like the track 1 model, since track 2 is also a valid group distribution. MoQT should have the necessary machinery to allow both track 1 and track 2; if not, let's define it.

suhasHere commented 2 months ago

2. If the application needs to use a catalog/timeline to detect gaps, then it can use that same catalog/timeline to carry the timestamp that is being shoved into the group_id. Or it can put it in the object payload.

I think you missed the point of my comments. No, the catalog is not used to define gaps. Gaps are signaled in the data plane.

suhasHere commented 2 months ago
  1. The moq-transport library should tell the application that a subscription has terminated, and not the other way around. Especially since the moq-transport library can't release any state until this optional signal is provided by the application (which might not care about gaps).

Exactly. I am not arguing either. All I am saying is, when a SUBSCRIBE_DONE is processed and reported to the application, the application has enough information to know whether there needs to be a group 19 or not, from your example. The MoQ transport library reports the status of what it has seen, and only the application knows whether group 19 was even supposed to be produced for a given track, since that info is in the catalog. Please note, this has nothing to do with gaps. Gaps happen when things are dropped for a variety of reasons. Track properties/characteristics (like the number of groups, how group IDs are used, group duration) are catalog-scoped and not for moq-transport to enforce. This is just the principle of separation of concerns across layers.

kixelated commented 2 months ago
  1. The moq-transport library should tell the application that a subscription has terminated, and not the other way around. Especially since the moq-transport library can't release any state until this optional signal is provided by the application (which might not care about gaps).

Exactly. I am not arguing either. All I am saying is, when a SUBSCRIBE_DONE is processed and reported to the application, the application has enough information to know whether there needs to be a group 19 or not, from your example.

In the current state, since the moq-transport library cannot detect gaps, it has to immediately report any SUBSCRIBE_DONE to the application. OBJECTs may still be received during this state so the library does not drop state associated with the session. The application has to tell the library when the subscription is actually done so it can remove the lookup table entry.

With sequential groups and drop notifications, the moq-transport library can buffer the SUBSCRIBE_DONE until it has received and flushed all OBJECTs to the application. The application only learns the subscription is done after receiving the last object. This is how transport APIs generally work and how QUIC works itself.
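
A sketch of that buffering behavior, assuming sequential group IDs and explicit per-group drop notifications (hypothetical library state, not the draft's API): the library alone can decide when the subscription is truly finished and release its state.

```rust
use std::collections::BTreeSet;

/// Hypothetical per-subscription state inside a moq-transport library,
/// assuming sequential group IDs and explicit per-group drop signals.
struct Subscription {
    /// First group ID covered by this subscription.
    start_group: u64,
    /// Groups fully delivered to, or explicitly dropped for, the app.
    completed: BTreeSet<u64>,
    /// Final group ID announced by SUBSCRIBE_DONE, once received.
    final_group: Option<u64>,
}

impl Subscription {
    fn on_group_done(&mut self, group_id: u64) {
        self.completed.insert(group_id);
    }

    fn on_subscribe_done(&mut self, final_group: u64) {
        self.final_group = Some(final_group);
    }

    /// With sequential IDs the library itself can tell when everything
    /// up to `final_group` has been delivered or dropped, and only then
    /// surface "done" to the application and release its state.
    fn is_finished(&self) -> bool {
        match self.final_group {
            Some(last) => (self.start_group..=last).all(|id| self.completed.contains(&id)),
            None => false,
        }
    }
}
```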

kixelated commented 2 months ago

And just to draw inspiration from QUIC itself, it does not allow gaps in the Stream ID or Stream Offset (roughly analogous to group_id and object_id). QUIC could have allowed the application to introduce gaps in either but it just complicates everything.

suhasHere commented 2 months ago

And just to draw inspiration from QUIC itself, it does not allow gaps in the Stream ID or Stream Offset (roughly analogous to group_id and object_id). QUIC could have allowed the application to introduce gaps in either but it just complicates everything.

Not sure how to interpret this in the context of this discussion. QUIC is a transport-layer protocol and MOQT is an application protocol. Also, stream IDs and group IDs are semantically different.

afrind commented 2 months ago

Chair Comment:

I think the WG can build the system to support either sequential or non-sequential groups.

Several folks have put forward reasoning for why sequential groups provide benefit and/or why non-sequential groups add complexity.

@suhasHere as the lead proponent for non-sequential groups, can you state the use case and application benefits? This will help us evaluate the tradeoffs.

vasilvv commented 2 months ago

The second fetch gets the exact same answer (10, 12, 14, 16, 18, since those fall in the range). If there were gaps due to groups being dropped on the path or a source deciding to skip a group, those should be marked so the relay can see whether it needs to satisfy the fetch request.

The problem is how to resolve this at the cache. Right now, when the cache does not have any information about the state of group 15, it cannot tell apart the scenario where 15 does not exist from the scenario where 15 is missing from the cache because the cache got filled by one client doing a [10, 14] request and then another doing [16, 20].

For anything other than a contiguous sequence of group numbers, a relay can never know whether the groups exist or not, and it would always have to make a request upstream to check. It could keep track of prior subscriptions (for example [10..12], [14..16], [18..20]); however, if any of those did not overlap, then it would be forced to make a fetch for [12..14] and [16..18] in case groups 13 and 17 existed. That is a wasteful and convoluted workflow.

It is possible to work around this by putting a "does not exist" kind of entry into the cache on FETCH: if I fetch [10, 15] and get 10, 11, 15, I can cache "[12, 14] does not exist" from my knowledge of the properties FETCH provides. Besides being more complex, this also does not solve the entire problem, since our goal is to be able to cache-fill from SUBSCRIBEs as well as FETCHes. So non-sequential group IDs inherently make caching less efficient.
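
A sketch of that workaround (hypothetical cache entry type, not part of the protocol): the cache records "known nonexistent" IDs learned from a FETCH response, at the cost of extra state, and SUBSCRIBE fills still give no such guarantee.

```rust
use std::collections::BTreeMap;

/// Hypothetical cache entry: either a cached group or an explicit
/// "this group does not exist" marker learned from a FETCH response.
enum CacheEntry {
    Group(Vec<u8>),
    DoesNotExist,
}

struct GapAwareCache {
    entries: BTreeMap<u64, CacheEntry>,
}

impl GapAwareCache {
    /// After a FETCH for [start, end] returns, record every ID in the
    /// range that came back without data as known-nonexistent, e.g.
    /// fetching [10, 15] and receiving 10, 11, 15 records 12..=14.
    fn record_fetch_result(&mut self, start: u64, end: u64, received: &[(u64, Vec<u8>)]) {
        for (id, payload) in received {
            self.entries.insert(*id, CacheEntry::Group(payload.clone()));
        }
        for id in start..=end {
            self.entries.entry(id).or_insert(CacheEntry::DoesNotExist);
        }
    }
}
```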

suhasHere commented 2 months ago

The problem is how to resolve this at the cache. Right now, when the cache does not have any information about the state of group 15, it cannot tell apart the scenario where 15 does not exist from the scenario where 15 is missing from the cache because the cache got filled by one client doing a [10, 14] request and then another doing [16, 20].

Let's take the example of sequential groups for a similar use case.
Fetch 1: 10-14, cache is filled with 10, 11, 12, 13, 14.
Fetch 2: 18-20, cache is now filled with 10, 11, 12, 13, 14, 18, 19, 20.

Now fetch 3 arrives with 10-20. The cache doesn't know the state of 15, 16, 17, so, as it needs to, it will have to ask upstream for [15-18). In the case where 15 was never produced for whatever reason, the publisher will need to mark it as such in the response regardless (like 15 (not produced), 16, 17), as this will help subsequent requests avoid repeating the same request.

One observation on the cache fill: when the producer sends groups and objects, either in response to a subscribe or a fetch, the cache fill happens regardless. Let's say that when the subscription was made, an object was marked as dropped due to congestion. An eventual fetch request can end up replacing that object with the proper object if it is still available upstream.

vasilvv commented 2 months ago

In the case where 15 was never produced for whatever reason, the publisher will need to mark it as such in the response regardless (like 15 (not produced), 16, 17), as this will help subsequent requests avoid repeating the same request.

Are you saying that the publisher is allowed to skip group IDs, but it has to explicitly mark groups in the middle as missing? This de-facto makes them sequential.

suhasHere commented 2 months ago

In the case where 15 was never produced for whatever reason, the publisher will need to mark it as such in the response regardless (like 15 (not produced), 16, 17), as this will help subsequent requests avoid repeating the same request.

Are you saying that the publisher is allowed to skip group IDs, but it has to explicitly mark groups in the middle as missing? This de-facto makes them sequential.

Not exactly. This is the gap use case, which came from a requirement that players who expect a given group be told explicitly when the source drops it, thus allowing the player not to wait forever.

There are a couple of things we are talking about here, IIUC.

  1. Publisher explicitly marking things when it decides to drop.
  2. Publisher responding to a request for non-existent things (never produced, permanently gone, or dropped).
  3. Players knowing not to expect certain group IDs since they know from the catalog that the group distribution will not generate such group IDs.
  4. Players knowing not to wait for missing things.
  5. Cache-fill from subscribe and/or publish.
  6. Relay asking upstream for things that are not in its cache.
  7. Clients and Relays need to be able to process groups out of order.

All of these can be done or may be needed regardless of how group numbers are distributed.

suhasHere commented 2 months ago

the only thing I can do with that is to request 15 from the origin every time.

I want to clarify that one needs to request 15 only one time, and only if the fetch request asks for it. Also, a client who is aware that group 15 will not be generated, per the catalog, will not ask for it either. Regardless of the group distribution, relays can be asked for things that are not in their cache.

Asking the origin for something that evaluates to a non-existent thing (never produced, not produced because the origin decided to drop it, or permanently gone from the origin) needs to result in the cache being updated. A fetch can happen for things that fall into any of these cases, and a relay that doesn't have that information in its cache will need to go back and find out the answer, but it needs to do so only once.

kixelated commented 2 months ago

In the case where 15 was never produced for whatever reason, the publisher will need to mark it as such in the response regardless (like 15 (not produced), 16, 17), as this will help subsequent requests avoid repeating the same request.

Are you saying that the publisher is allowed to skip group IDs, but it has to explicitly mark groups in the middle as missing? This de-facto makes them sequential.

I want to second this. @suhasHere would allowing groups of size 0 work for you?

The problem today is the ambiguity. A subscriber doesn't know if 15 is in flight or will never arrive. If the publisher explicitly tells the subscriber that 15 is empty (or dropped) then that solves things.

suhasHere commented 2 months ago

I want to second this. @suhasHere would allowing groups of size 0 work for you?

Yes, if there is a reason for a producer to drop entire groups, that should be explicitly signaled.

The problem today is the ambiguity. A subscriber doesn't know if 15 is in flight or will never arrive. If the publisher explicitly tells the subscriber that 15 is empty (or dropped) then that solves things.

We need things to be explicitly marked when they are dropped at the publisher. This is needed for the reasons pointed out in https://github.com/moq-wg/moq-transport/issues/427#issuecomment-2041101690 and https://github.com/moq-wg/moq-transport/issues/427#issuecomment-2040965981

fluffy commented 2 months ago

I don't get what the problem is here. The things the relays cache are the objects.

I am very concerned that any assumption that groups are sequential or produced in order is not going to work with the way people want to use complex reference frames and layers in modern codecs like AV1. Of course, a given application like YouTube might specify that, for its usage of moq, everything is sequential. But the moq relays don't need to know or assume this, and keeping them flexible allows for all kinds of interesting use cases for media beyond video.

I have helped write a relay, and I don't see what problem this is causing for relays.

vasilvv commented 2 months ago

Yes, if there is a reason for a producer to drop entire groups, that should be explicitly signaled.

Ah, my concern was that someone would try to have groups numbered 0, 1000, 2000, or use the group ID as a timestamp, in which case you'd have to send hundreds of gap indicators. If that is out of scope, a simple "a group is missing" would work.

suhasHere commented 2 months ago

Yes, if there is a reason for a producer to drop entire groups, that should be explicitly signaled.

Ah, my concern was that someone would try to have groups numbered 0, 1000, 2000, or use the group ID as a timestamp, in which case you'd have to send hundreds of gap indicators. If that is out of scope, a simple "a group is missing" would work.

+1. PR #429 provides aspects of that for the cases where the publisher explicitly drops a group, IIRC.