moq-wg / moq-transport

draft-ietf-moq-transport

Relay latency caused by streams #38

Open kixelated opened 1 year ago

kixelated commented 1 year ago

This was brought up by @suhasHere, and also at IETF 115 by @huitema.

Let's assume a sender wants to transmit a stream with 3k bytes passed through a relay. The relay will read from the source stream and write any data to the destination stream. The sender fragments the stream into 3 packets of 1k bytes each.

  1. The 1st packet arrives. The QUIC library will flush the contents to the application, which then writes to the destination stream.
  2. The 2nd packet is dropped. It will be retransmitted later when a gap is detected.
  3. The 3rd packet arrives. The QUIC library will NOT flush the contents to the application due to the gap.
  4. The 2nd packet is retransmitted and arrives. The QUIC library will flush the stream contents of the 2nd and 3rd packet to the application, which then writes to the destination stream.

This means the relay does not immediately forward stream contents as packets arrive. Each additional relay hop exacerbates the issue.
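The head-of-line blocking in steps 1-4 can be sketched with a toy in-order receive buffer. This is purely illustrative, not a real QUIC stack; it just models "flush only the contiguous prefix":

```python
# Toy model of a receive buffer that only delivers contiguous data,
# as a QUIC byte-stream API does. Illustrative only.

class InOrderBuffer:
    def __init__(self):
        self.segments = {}   # offset -> bytes received at that offset
        self.flushed = 0     # next contiguous offset to deliver

    def on_packet(self, offset, data):
        """Buffer a segment; return whatever can now be flushed in order."""
        self.segments[offset] = data
        out = b""
        while self.flushed in self.segments:
            chunk = self.segments.pop(self.flushed)
            out += chunk
            self.flushed += len(chunk)
        return out

buf = InOrderBuffer()
print(len(buf.on_packet(0, b"a" * 1000)))     # 1000: packet 1 flushes immediately
print(len(buf.on_packet(2000, b"c" * 1000)))  # 0: the gap at offset 1000 blocks packet 3
print(len(buf.on_packet(1000, b"b" * 1000)))  # 2000: the retransmit unblocks both
```

Packet 3's kilobyte sits in the buffer for a full retransmission round trip even though it arrived on time, which is exactly the latency the relay inherits.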

I will note that this is not an issue specific to MoQ or even QUIC; TCP relays suffer from the same problem. However, it's a larger issue for live media, where latency is critical.

kixelated commented 1 year ago

This is not an issue with datagrams, as datagrams are typically relayed independently (much like IP packets). Using datagrams instead of streams is an option, although it incurs more complexity: it requires the application to own fragmentation, retransmissions, flow control, etc.

An alternative is to recommend that relays use QUIC libraries that support reading from and writing to streams based on offsets. This adds some complexity, primarily on the sender, but it is limited to relays only. This same approach should be used for HTTP/3.

fluffy commented 1 year ago

What's the status of QUIC implementations supporting this offset-based API for reading and writing? I need to learn more about QUIC.

huitema commented 1 year ago

In theory, everything is possible. In practice, passing fragments of streams through the API is hard in two ways: it requires a non-standard API, and it requires the application to own fragmentation, etc., which is much of the complexity of supporting datagrams.

The API point can be debated for a long time, but RFC 9000 is clear: Section 2 states that "Streams in QUIC provide a lightweight, ordered byte-stream abstraction to an application." Most implementations that I know of provide a simple byte-stream API. There is no standard API for QUIC except for WebTransport, which does not provide a "fragmented stream" API. There have been proposals for the related concept of unreliable streams, such as an abandoned draft from P. Tiesel et al. at TU Berlin, or a more recent draft from J. Chen et al. at ByteDance. These proposals come and go, and mostly fail to get traction. In contrast, there is lots of support for the datagram extension; see for example how it is used in MASQUE.

From an implementation perspective, I feel that a fragmented, unordered stream API will require much of the same application complexity as datagrams. The stack will manage retransmission and flow control, but the application will clearly have to manage fragmentation: receiving a set of bytes at a random offset is pretty much the same as receiving datagrams. Looking at my code for datagram support, managing fragmentation is much more complex than managing retransmissions, which relies on ACKs and timers from the QUIC stack. Flow control is definitely more complex, because datagrams are not subject to it. I think the application complexity for unordered streams will be very similar to that of supporting datagrams.
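For a sense of what that application-side fragmentation work looks like, here is a rough sketch. The framing tuple (object_id, offset, total_len, payload) is hypothetical, not the MoQ wire format, and a real mapping would also need the retransmission timers and flow control mentioned above:

```python
# Sketch of application-level fragmentation over datagrams.
# Hypothetical framing: (object_id, offset, total_len, payload).

MAX_PAYLOAD = 1200  # rough fit for a payload inside a typical QUIC datagram

def fragment(object_id, data, max_payload=MAX_PAYLOAD):
    """Split an object into offset-tagged fragments."""
    return [(object_id, off, len(data), data[off:off + max_payload])
            for off in range(0, len(data), max_payload)]

class Reassembler:
    """Collect fragments per object; deliver once every byte is present."""
    def __init__(self):
        self.partial = {}  # object_id -> {offset: payload}

    def on_fragment(self, object_id, offset, total_len, payload):
        frags = self.partial.setdefault(object_id, {})
        frags[offset] = payload
        if sum(len(p) for p in frags.values()) == total_len:
            # All bytes accounted for: reorder by offset and deliver.
            data = b"".join(p for _, p in sorted(frags.items()))
            del self.partial[object_id]
            return data
        return None
```

Fragments can arrive in any order; the reassembler returns the object only once it is complete. Even this minimal version has to track per-object state, which is the complexity Christian is pointing at.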

So in theory everything is possible, but in practice supporting datagrams is simpler than trying to add an unordered or unreliable extension to streams.

VMatrix1900 commented 1 year ago

I agree with Christian. A QUIC stream should be treated as a block of data (a message) which is either delivered in order or not at all. Splitting a QUIC stream into multiple parts and managing them is like reinventing QUIC's multiplexing feature on top of TCP, which is too complex. That is why in my draft (https://datatracker.ietf.org/doc/draft-shi-quic-dtp/), I map blocks to streams 1:1. On the other hand, implementing blocks on top of QUIC datagrams does not carry any of the constraints of QUIC streams.

kixelated commented 1 year ago

These would not be unreliable streams. Let me try to explain end-to-end:

The encoder would write the media bitstream to a QUIC stream in order. Let's just suppose we're using a QUIC stream per GoP, but it's the same with other fragmentation approaches.

Only relays would benefit from read/write using offsets, and only when latency is critical. I think that's important because it's not something a client or browser will ever need to support, for example.

On the receiving end, the relay would call a ReadChunk function that returns a byte array and an offset, effectively each STREAM frame. The QUIC library will ensure that there are no gaps unless the stream is reset early. This seems trivial to implement.

On the sending end, the relay would call a WriteChunk function providing the same byte array and offset. The QUIC library will block if the application tries to write beyond what flow control allows; gaps have to be filled eventually. This is the trickier side to implement, although it seems similar to how you implement retransmits. It's also more dangerous since the application can screw up.

Anyway, at the final hop, the decoder reads from each stream in order and blocks if there's a gap. This is required since the media decoder cannot handle gaps anyway. If you actually want a gap, ex. to drop a single frame, then you need to put just that frame on a separate stream.
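The relay loop I have in mind looks roughly like this. The read_chunk/write_chunk names and the stream stubs are hypothetical (no shipping QUIC library exposes exactly this API); the point is that a gap on the inbound stream does not stall later chunks on the outbound side:

```python
# Sketch of a relay loop over hypothetical ReadChunk/WriteChunk APIs.

class FakeRecvStream:
    """Stand-in for a QUIC receive stream yielding (offset, data) chunks."""
    def __init__(self, chunks):
        self.chunks = list(chunks)

    def read_chunk(self):
        return self.chunks.pop(0) if self.chunks else None  # None = FIN

class FakeSendStream:
    """Stand-in for a QUIC send stream accepting out-of-order writes."""
    def __init__(self):
        self.written = []
        self.finished = False

    def write_chunk(self, offset, data):
        self.written.append((offset, data))  # a real stack enforces flow control here

    def finish(self):
        self.finished = True

def relay_stream(src, dst):
    """Forward each chunk as it arrives, preserving its stream offset."""
    while (chunk := src.read_chunk()) is not None:
        dst.write_chunk(*chunk)
    dst.finish()

# Packet 2 (offset 1000) was lost upstream; packet 3 is still forwarded immediately.
src = FakeRecvStream([(0, b"a" * 1000), (2000, b"c" * 1000), (1000, b"b" * 1000)])
dst = FakeSendStream()
relay_stream(src, dst)
print([off for off, _ in dst.written])  # [0, 2000, 1000]
```

The QUIC library on each side still owns retransmission and flow control; the relay just copies (offset, bytes) pairs through.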

@huitema does this make sense?

I'm grappling with the cost of asking QUIC libraries to implement new APIs. I would like to try implementing it and see if it's a reasonable ask. If it's not, we always have the option to hit the eject button and use datagrams instead.

VMatrix1900 commented 1 year ago

On the receiving end, the relay would call a ReadChunk function that returns a byte array and an offset, effectively each STREAM frame. The QUIC library will ensure that there are no gaps unless the stream is reset early. This seems trivial to implement.

On the sending end, the relay would call a WriteChunk function providing the same byte array and offset. The QUIC library will block if the application tries to write beyond what flow control allows; gaps have to be filled eventually. This is the trickier side to implement, although it seems similar to how you implement retransmits. It's also more dangerous since the application can screw up.

Anyway, at the final hop, the decoder reads from each stream in order and blocks if there's a gap. This is required since the media decoder cannot handle gaps anyway. If you actually want a gap, ex. to drop a single frame, then you need to put just that frame on a separate stream.

ReadChunk and WriteChunk look like sending and receiving datagrams, and the datagram contains the offset so that the final hop can reassemble the stream.

The difference is the layer at which the offset is handled. If using datagrams, the offset is handled by the MoQ layer at the final hop. The relay just does blind forwarding and does not care about the offset, because it is in the datagram payload.

In your WriteChunk solution, each hop needs to get the offset information from the receiving-side QUIC stack and feed it into the sending-side QUIC stack, so the relay is naturally aware of the offset. Do you see other use-cases for the offset information at the relay?

I'm grappling with the cost of asking QUIC libraries to implement new APIs. I would like to try implementing it and see if it's a reasonable ask. If it's not, we always have the option to hit the eject button and use datagrams instead.

It seems the cost is fine, at least compared with implementing it using datagrams. Cost aside, we should consider the use case and benefit that a new change brings, and what it changes or breaks.

Does this new API introduce a new semantic into QUIC streams? Currently a QUIC stream is a FIFO byte stream, like a pipe. With these chunks, data inside a QUIC stream can arrive out of order. That feels like a big change to the QUIC stream semantics. Do we have strong motivation to introduce it? Maybe broader, more general use cases?

afrind commented 1 year ago

For me, streams vs. datagrams is a complexity tradeoff between the endpoints and the relays. There's no question that with datagrams relays become simpler, but the application endpoints pay the price of having to implement message reassembly and retransmission. The relays, however, may also need to do some reassembly if they need to find the header for a particular segment in order to know its transmission priority or other metadata that determines what to do with a particular datagram.

I expect there to be more endpoint implementers, with a greater need for simplicity, while relays are likely to be implemented by a smaller number of relative experts. Therefore I'd prefer to concentrate the complexity there. And there's no requirement that a relay implement out-of-order stream forwarding; it's just an optimization they are free to implement or not.

I don't view this as changing the semantics of a stream vs. what is described in the RFC. The RFC only requires that an application be able to read the stream in order; it doesn't preclude other mechanisms for accessing stream data. mvfst already has a peek API that allows the application to view stream data that has been received out of order.

Ultimately though, data wins arguments. We're hoping to implement an out-of-order stream relay as a proof of concept for the next interim or hackathon to assess the actual complexity and challenges with that design.

suhasHere commented 1 year ago

We are building the MoQ protocol for a set of use-cases, and I think it becomes a hard argument to say that if a stack supports a special API we get certain use-cases covered, and otherwise YMMV.

@afrind Glad that you are helping test out some of these ideas. I look forward to learning more and to a proposal being submitted for everyone's benefit once it's ready. Thanks!

afrind commented 1 year ago

We are building the MoQ protocol for a set of use-cases, and I think it becomes a hard argument to say that if a stack supports a special API we get certain use-cases covered, and otherwise YMMV.

I don't see a relay that is unable to forward stream data out of order as leaving a use-case uncovered. It might just be a suboptimal relay. One of the advantages of QUIC is that it's not baked into the kernel: there's a marketplace of stacks to choose from, and many of them are open source.

VMatrix1900 commented 1 year ago

I am afraid that the ultra low latency is in conflict with the reliable/in-order delivery. What if the end-to-end latency requirement is so tight that the reordering latency is unacceptable? If reordering is done at the relay, it may accumulate at each relay. If only some relays support out-of-order forwarding, then it creates fragmentation on the relay provider implementations. How do you chain those relays together?

kixelated commented 1 year ago

I am afraid that the ultra low latency is in conflict with the reliable/in-order delivery. What if the end-to-end latency requirement is so tight that the reordering latency is unacceptable? If reordering is done at the relay, it may accumulate at each relay. If only some relays support out-of-order forwarding, then it creates fragmentation on the relay provider implementations. How do you chain those relays together?

I want to +1 what @afrind said earlier.

There's a trade-off here. We can have a simpler protocol but require more work for an optimal relay. Or we can have a more complex protocol but require less work for an optimal relay. We should quantify that work with a proof-of-concept.

At the very least, using datagrams would dramatically increase the surface area of the protocol. It might actually be more work to build an optimal relay using datagrams since the protocol would be more complex. The relay would be responsible for optimally implementing retransmissions, prioritization, fragmentation, etc instead of delegating to an existing QUIC library.

As for a fragmented ecosystem, well don't use sub-optimal relays when latency is critical. That goes for any protocol really.

huitema commented 1 year ago

On 12/12/2022 7:02 PM, kixelated wrote:

I am afraid that the ultra low latency is in conflict with the reliable/in-order delivery. What if the end-to-end latency requirement is so tight that the reordering latency is unacceptable? If reordering is done at the relay, it may accumulate at each relay. If only some relays support out-of-order forwarding, then it creates fragmentation on the relay provider implementations. How do you chain those relays together?

I want to +1 what @afrind said earlier.

There's a trade-off here. We can have a simpler protocol but require more work for an optimal relay. Or we can have a more complex protocol but require less work for an optimal relay. We should quantify that work with a proof-of-concept.

At the very least, using datagrams would dramatically increase the surface area of the protocol. It might actually be more work to build an optimal relay using datagrams since the protocol would be more complex. The relay would be responsible for optimally implementing retransmissions, prioritization, fragmentation, etc instead of delegating to an existing QUIC library.

Having actually implemented such datagram relays, I don't believe it is that much harder than "delegating to the QUIC library". Each datagram carries a fragment of an object. The relays by default forward the fragments in the order they are received, unless congestion control tells them that the object shall be dropped; in that case, they just drop all the fragments of that object.
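That drop rule is simple enough to sketch in a few lines. The `should_drop_object` callback is a hypothetical stand-in for the congestion-control signal, not a real API:

```python
# Sketch of a datagram relay's forwarding rule: fragments go out in
# arrival order, and once an object is dropped, every remaining
# fragment of that object is dropped too.

def relay_datagrams(fragments, should_drop_object):
    """Forward (object_id, offset, payload) fragments, honoring the drop rule."""
    dropped = set()
    out = []
    for object_id, offset, payload in fragments:
        if object_id in dropped:
            continue  # rest of a dropped object: discard
        if should_drop_object(object_id):
            dropped.add(object_id)  # congestion signal: drop the whole object
            continue
        out.append((object_id, offset, payload))
    return out
```

Note there is no reassembly and no flow control in the forwarding path; the only state is the set of dropped object IDs.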

As for a fragmented ecosystem, well don't use a sub-optimal relay when latency is critical. That goes for any protocol really.

Well, yes. So, if we want the best performance, we end up sending fragments of objects as datagrams, doing reassembly end to end. That will avoid any head-of-line blocking in relays. It will also simplify implementations, by completely bypassing the flow control mechanisms of QUIC.

-- Christian Huitema

suhasHere commented 1 year ago

I would like to +1 Christian's comment. Reassembly only needs to happen at the end, unless relays want to store things as full objects instead of fragments, which is typically not the case. Relays get fragments in, send fragments out, and store/drop fragments as dictated by caching policy.

suhasHere commented 1 year ago

I am afraid that the ultra low latency is in conflict with the reliable/in-order delivery. What if the end-to-end latency requirement is so tight that the reordering latency is unacceptable? If reordering is done at the relay, it may accumulate at each relay. If only some relays support out-of-order forwarding, then it creates fragmentation on the relay provider implementations. How do you chain those relays together?

I don't see a need to reorder DATAGRAM fragments at the relays (which could be the case with streams, though). If there is a need for full objects, the publisher can say it needs reliable transport and use streams in such cases.

afrind commented 1 year ago

Reassembly only needs to happen at the end , unless Relays want to store the things as full objects instead of fragments

Aren't some relays actually caches, and wouldn't they want to store things as full objects?

VMatrix1900 commented 1 year ago

Aren't some relays actually caches, and wouldn't they want to store things as full objects?

The store and relay of datagrams can be done in two threads in parallel. The relay thread just forwards datagrams as they arrive, without reordering; the store thread does the reordering.
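A minimal sketch of that split, assuming a simple queue between the two threads (illustrative only; a real relay would hang this off its datagram receive path):

```python
# Two-path relay sketch: forward fragments immediately on the hot path,
# while a background thread reassembles them for the cache.
import queue
import threading

def run_relay(fragments):
    cache_q = queue.Queue()
    forwarded = []
    cache = {}

    def store_worker():
        # Reordering happens here, off the forwarding hot path.
        while (item := cache_q.get()) is not None:
            offset, payload = item
            cache[offset] = payload

    t = threading.Thread(target=store_worker)
    t.start()
    for offset, payload in fragments:
        forwarded.append((offset, payload))  # forward in arrival order
        cache_q.put((offset, payload))       # hand off to the store thread
    cache_q.put(None)  # sentinel: no more fragments
    t.join()
    ordered = b"".join(cache[o] for o in sorted(cache))
    return forwarded, ordered
```

Forwarding order is untouched by the cache; the stored copy comes out reordered by offset once the store thread drains the queue.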

ianswett commented 9 months ago

This seems like an optimization one can do at the relay if the necessary library API is present. Is this something that needs text and something like a RECOMMENDED normative statement, or just leave it up to relays to do their best?

I'll also note that packet loss is likely going to be lower upstream of the relay than downstream, at least based on my experience.

kixelated commented 9 months ago

This seems like an optimization one can do at the relay if the necessary library API is present. Is this something that needs text and something like a RECOMMENDED normative statement, or just leave it up to relays to do their best?

Yeah, I don't think we even need to mention it. It only matters for real-time latency over bandwidth-constrained relays; situational at best.