private-octopus / picoquic

Minimal implementation of the QUIC protocol
MIT License

Out of order stream data delivery #741

Open huitema opened 4 years ago

huitema commented 4 years ago

I am doing tests of transport over satellite. When testing with lossy links, I observed that the CPU load exploded and quickly traced that to a bug in the handling of out of order stream frames. The current API requires orderly delivery, and the current code was doing that using a linear search in the list of already queued frames. With large receive windows W, that means a cost of O(W**2). Of course that's a bug, and I should fix it and get a cost of O(W) or O(W.logW). But before I do that, I wonder whether I should fix the API and give the application a chance to take care of reordering itself.

I have two examples in mind: receiving large chunks of data in memory, and receiving video. When receiving data in memory, the application could easily copy out of order frames at the received location, and keep track of holes itself, for a cost way lower than O(W.logW). When receiving video, the application may want to skip over a hole and process received data without waiting a round trip delay or two for the correction to arrive. So before I rush to implement something complicated, maybe we should have a little discussion on doing the work in the transport stack versus the application layer.
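As a sketch of what the application side could look like, assuming a pre-allocated buffer of known size and a small sorted list of received ranges (all names below are invented for illustration; none of this is picoquic code):

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* Illustration only: application-side reassembly into a flat buffer,
     * with received ranges tracked in a sorted interval list. */
    typedef struct app_range {
        uint64_t offset;
        uint64_t length;
        struct app_range* next;
    } app_range_t;

    typedef struct app_stream_buffer {
        uint8_t* data;        /* pre-allocated to the expected object size */
        uint64_t size;
        app_range_t* ranges;  /* received ranges, sorted by offset */
    } app_stream_buffer_t;

    /* Copy an out-of-order chunk at its stream offset and record the range.
     * Merging adjacent or overlapping ranges is omitted for brevity. */
    static int app_receive_chunk(app_stream_buffer_t* b,
        uint64_t offset, const uint8_t* bytes, size_t length)
    {
        app_range_t* r;
        app_range_t** p;

        if (offset + length > b->size) {
            return -1; /* outside the expected object */
        }
        memcpy(b->data + offset, bytes, length);

        r = (app_range_t*)malloc(sizeof(app_range_t));
        if (r == NULL) {
            return -1;
        }
        r->offset = offset;
        r->length = length;
        p = &b->ranges;
        while (*p != NULL && (*p)->offset < offset) {
            p = &(*p)->next;
        }
        r->next = *p;
        *p = r;
        return 0;
    }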

I am thinking of allowing applications to "accept out of order data", maybe per stream, and then add a specific "out of order data" code point in the callback API. @igorlord, @steschu77 , @bkchr I understand that you have interest in out of order delivery. What would a desired API look like?

steschu77 commented 4 years ago

That is a pretty nice idea and would certainly help with my video experiments.

Regarding the API question, in the spirit of simplicity, I'd just hand out data as it comes, together with its offset within the stream, by providing a callback to the application very much like picoquic_stream_network_input() in frames.c.
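To make that concrete, a callback of roughly the following shape would carry the same parameters that picoquic_stream_network_input() works with. The name and exact prototype below are just a sketch for discussion, not an existing picoquic API:

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical out-of-order delivery callback -- a sketch, not a picoquic API. */
    typedef int (*app_out_of_order_data_fn)(
        void* app_ctx,         /* application context */
        uint64_t stream_id,    /* stream carrying the data */
        uint64_t offset,       /* offset of this chunk within the stream */
        const uint8_t* bytes,  /* chunk payload */
        size_t length,         /* chunk length */
        int fin);              /* non-zero if this chunk ends the stream */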

steschu77 commented 4 years ago

Of course, an application would need to signal which type of callback it wishes to receive (in-order or out-of-order delivery).

The kind of callback should be a per stream decision if possible and signaled whenever a stream is created, locally or remotely. For that we would need a "new stream" notification/callback which doesn't seem to exist in picoquic at the moment.

It could therefore be implemented in two steps: first, all streams are either in-order or out-of-order; later, once a "new stream" notification exists, it could become a per-stream decision.

huitema commented 4 years ago

I rewrote the handling of partially received stream data using a splay instead of a linear list in PR #742. The PR does solve the performance issue caused by the linear search, but does not include the new API. That will have to come in a separate check-in, with a specific test.

If we want partial reliability, we probably also want to control repeats and error corrections. On the receiver side, we can easily add a "not interested" API, that will say "don't bother me with stream X data at offset < N". That's purely a local action, and it can be done without new protocol elements. We could also do the same on the sender side, "don't bother repeating stream X data at offset < N", but that will cause the connection to break if the receiver is still waiting for the data. We could solve that by a new protocol element, but that will have to be defined in an extension.
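A hypothetical shape for that receiver-side call could be the following; the name and signature are invented here for discussion, not part of picoquic:

    #include <stdint.h>

    /* Hypothetical, purely local receiver-side call -- not an existing picoquic API.
     * "Don't bother me with stream data at offset < offset": the stack could free
     * already-queued chunks below the threshold and stop waiting for earlier holes
     * before delivering data. No new protocol element is needed. */
    int app_set_stream_not_interested(void* cnx, uint64_t stream_id, uint64_t offset);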

steschu77 commented 4 years ago

For video transmission, we should control re-transmission on the sender side. The sender knows what kind of data has been lost, and for video that makes a difference. While you would continue re-transmitting lost parts of a key-frame, you probably wouldn't do so for non-referenced inter frames.

That requires a more detailed control mechanism than "ignore re-transmits < N". I prefer some kind of callback mechanism where picoquic asks, for each loss on a stream, whether it is worth re-transmitting. And for cases where it is not worth re-transmitting, let's introduce a new protocol element, "gap notification", with offset and length, which means that data from offset to offset+length can be ignored on the receiver side, even though it has been received (late).

For a video receiver, a gap notification would mean not passing data from the gap to the video decoder, in order to avoid a broken reference-frame chain. The encoder on the sender side has to make sure that everything excluding the gap is still a valid, decodable video sequence.

huitema commented 4 years ago

Trying to summarize the design requirements:

1) Sender side: when ready to re-transmit, do a call to the application. "Considering re-transmit L bytes at offset X for stream S, proceed?" Possibly only do that if the stream is somehow marked as special.

2) Receiver side: when receiving data, leave re-sequencing and hole filling to the application. Just send a notification: "Here are L bytes at offset X for stream S". Possibly only do that if the stream is marked as special.

Potential issue: what if the receiver expects data in order, TCP style, but the sender decides to not re-transmit a past frame?
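Requirement 2 is essentially the offset-carrying delivery callback sketched earlier in this thread. Requirement 1 could be expressed as a prototype along these lines; the name is invented for discussion, not an existing picoquic API:

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical sender-side consultation, called before repeating stream data.
     * Return non-zero to proceed with the re-transmission, zero to skip it. */
    typedef int (*app_should_retransmit_fn)(void* app_ctx,
        uint64_t stream_id, uint64_t offset, size_t length);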

huitema commented 4 years ago

Thinking a bit more about this. I think I will proceed in steps:

1) Remove the requirement that the transport delivers stream data in order. That means changing the API from "here are 1200 bytes at the end of the stream" to "here are 1200 bytes at offset 12345".

2) Add a partial receive extension, with a "partial reset" frame, "PARTIAL RESET Stream X at offset 12345", indicating to the remote transport & the remote application that bytes before the selected offset will not be resent.

3) Change the packet re-transmission strategy, so as to not repeat bytes sent before the last PARTIAL RESET on the stream.

The first change is disruptive, but could be handled similarly to the "active stream" API. At some point, we may want to clean up the old code paths, but that can probably wait.

huitema commented 4 years ago

I would like to keep the design close to the "active stream" mark on the sending side. This is easy on the side of the stream creator: they can set the "active receiver" mark just when the stream is created. But that's less easy on the receive side. The application will be notified that data is ready on a new stream, and only then can it decide that the stream should be marked active. This suggests an API that could be called inside the "data received" callback.

That callback is currently only called when data arrives in sequence. Out-of-sequence data received first will be stored. That means the out-of-sequence data will have to be delivered in a subsequent callback.

The other difference is handling holes. In the simplest design, marking a stream as "active receiver" implies that the application takes control of arranging the data from a set of out-of-sequence packets, filling holes and removing duplicates. But that means that the application can never "hand back" the stream to the stack and have the stack manage the remainder of the conversation the old way. That means that the transition would be just one way: the application takes control, and at the end it closes the stream.

An option would be for apps to opt in on a per-connection basis. Is that practical? What, for example, about apps that combine media and data?

The out of sequence callback needs an offset parameter. Should we change the existing callback call to take another parameter, or should we just pass a <data, offset, length, fin> structure to the new API?
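One illustrative answer to the last question is to bundle the parameters into a single structure passed to the new API; the names below are invented, not an existing picoquic definition:

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical argument block for an out-of-sequence delivery callback. */
    typedef struct st_app_ooo_chunk_t {
        uint64_t stream_id;   /* stream the chunk belongs to */
        uint64_t offset;      /* offset of the chunk within the stream */
        const uint8_t* data;  /* chunk payload, valid only during the callback */
        size_t length;        /* payload length */
        int fin;              /* non-zero if offset + length ends the stream */
    } app_ooo_chunk_t;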

fluffy commented 4 years ago

I like the idea of being able to see out of order data. That really helps for video, and probably gaming too. For video, my long-term hope for QUIC is that it allows some parts of the video flow corresponding to a key part of an I-Frame to be sent with re-transmission, while other parts of the video flow are sent without re-transmission.

huitema commented 4 years ago

Yes, it is a good idea to think about re-transmission as well. Currently, re-transmission is handled by the transport logic:

The decision of whether a frame needs re-transmission depends on the frame type and the frame content. For example, ACK frames are never re-sent, and MAX DATA frames are only re-sent if they were not superseded by a frame advertising a larger limit.

For stream frames, the transport checks whether the data in the frame has already been acknowledged, and only re-sends the data that was not. This requires checking the data in acknowledged packets, and maintaining a data structure for each stream listing the acknowledged parts and the remaining holes. I spent quite a bit of time making sure that this works well, and does not cause performance issues in high loss conditions.

Instead of making its own decisions, the transport could call back the application and ask whether data at offset X still needs to be re-transmitted. That simple call would be easy to add to the decision logic in the current code: first check whether the data has already been acknowledged, and if it has not, also ask the application.

I wonder however if we should not go one step further, and delegate more data handling to the "active" application. In that scenario, the transport would not keep a copy of the data for re-transmission; it would just keep a pointer, "offset X, length L". Instead of a callback asking whether data at offset X should be re-transmitted, the transport would also ask for the actual data. There are potential performance gains: fewer memory copies and more efficient management of jumbo packets. But I am worried that this is pushing a lot of complexity to the application.
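As a sketch of that more delegated model (names invented for discussion, not an existing picoquic API), the transport would remember only (stream, offset, length) and pull the bytes from the application when it builds or repeats a packet:

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical "pull" callback: the transport keeps (stream_id, offset, length)
     * instead of a copy of the data, and asks the application for the bytes when a
     * packet is built or repeated. Returning 0 with *copied == 0 would mean "this
     * range no longer needs to be sent". */
    typedef int (*app_provide_stream_data_fn)(void* app_ctx,
        uint64_t stream_id, uint64_t offset, size_t length,
        uint8_t* buffer, size_t* copied);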

huitema commented 4 years ago

Initial implementation of out of order delivery in PR #871. The design needs a lot of review, so feel free to comment on that PR!

AbhishiktaP commented 4 years ago

I am currently trying to test the reduction in latency from sending data unreliably over QUIC. Can unreliability be achieved if the function picoquic_retransmit_needed in sender.c returns a null value?

huitema commented 4 years ago

@AbhishiktaP You have to be careful there, because messing with error recovery will affect all types of frames, not just stream data frames. For example, if you fail to retransmit flow control frames, the connection will eventually stop. You also want to make sure that both sides of the connection are modified to not do retransmission and not expect retransmission, otherwise the receiver will forever wait for holes to be filled before delivering any data.

I suggest that you proceed in steps. The first step is to make sure that your application uses the 'direct receive' API on the receive side. It is documented in picoquic.h, lines 615 to 655. That means making sure that your application declares streams as "direct receive" and provides a "direct receive" callback using picoquic_mark_direct_receive_stream().

The direct receive API will deliver data to the application as soon as it arrives, which lets you do latency measurements. However, missing data will still be retransmitted, which means that some amount of bandwidth will be wasted on unnecessary retransmissions. If the packet loss rate is not too high, I believe this is a second order effect. However, you may want to turn off stream data retransmission in order to measure that. This requires working on both the receiver side and the sender side. In theory, this requires negotiating a protocol extension, to make sure that both sides agree. For an experiment, it might be simpler to just modify the code.
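As a rough sketch of the receiver-side wiring, assuming a direct receive callback of roughly the shape below (check picoquic.h, lines 615 to 655, for the exact prototype; the parameter list here is an assumption made for illustration):

    #include "picoquic.h"

    /* Assumed callback shape for the direct receive API -- verify against picoquic.h. */
    static int my_direct_receive_cb(picoquic_cnx_t* cnx, uint64_t stream_id,
        int fin, const uint8_t* bytes, uint64_t offset, size_t length, void* ctx)
    {
        /* Data arrives here as soon as it is received, possibly out of order:
         * record the offset and the arrival time for latency measurements. */
        (void)cnx; (void)stream_id; (void)fin; (void)bytes;
        (void)offset; (void)length; (void)ctx;
        return 0;
    }

    /* Once the stream is known, e.g. inside the connection callback:
     * picoquic_mark_direct_receive_stream(cnx, stream_id, my_direct_receive_cb, app_ctx); */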

On the sender side, you want to look at picoquic_copy_before_retransmit in sender.c. The code answers the question, "shall this frame be resent?". There is a special case for stream data frames. Currently, the answer is "yes, retransmit" if the stream data has not been acked. You want that answer to be "no need to retransmit" if the stream is not reliable. You may however want to only do that if the stream "FIN" bit is not set, otherwise the stream will never close.

AbhishiktaP commented 4 years ago

Thank you so much! I will try making the modifications as you suggested. I request you to keep the issue open till then.

AbhishiktaP commented 4 years ago

In the sender.c code, will removing the if condition on line 1240 be enough to make the stream unreliable on the sender's side? Which part of the code looks into the receiver side to make the negotiation in the protocol to allow the receiver to be marked as direct receive?

huitema commented 4 years ago

Something like that. But you should rather look at the function picoquic_check_frame_needs_repeat in frames.c. In the case of stream frames, you could just always answer "no, it doesn't". Or you may want to test the stream number to make some streams unreliable and some others not.
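For instance, a small helper like the one below could encode the per-stream decision, and the stream-frame case of picoquic_check_frame_needs_repeat could consult it. The helper and its policy are illustrative only, not part of picoquic:

    #include <stdint.h>

    /* Example experiment policy: treat client-initiated unidirectional streams
     * (stream ID type bits == 2) as unreliable, everything else as reliable.
     * The stream-frame branch of picoquic_check_frame_needs_repeat could call
     * this and answer "no repeat needed" when it returns non-zero. */
    static int experiment_stream_is_unreliable(uint64_t stream_id)
    {
        return (stream_id & 0x3u) == 0x2u;
    }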

AbhishiktaP commented 4 years ago

Thank you! Which part of the code handles the receiver side, where I could implement the direct receive API? Should the frame type picoquic_frame_type_datagram in picoquic_internal.h be checked as a condition for "no retransmission" in picoquic_check_frame_needs_repeat?

huitema commented 4 years ago

@AbhishiktaP The datagram frames are not repeated, that's part of the definition.

Check the code in picoquic_stream_network_input() in frames.c, and you will see the graph of dependencies. You need to write your application so it sets a direct receive callback function, instead of relying on ordered delivery. This should be done with the picoquic_mark_direct_receive_stream() API.

AbhishiktaP commented 4 years ago

In frames.c, line 2226 has the function call to picoquic_check_sack_list, which returns 0 if the data fills a hole or -1 if it doesn't. Where can I stop the data from being resent to fill up the holes?

huitema commented 4 years ago

Yes, that's a problem with unreliable delivery. The sender does not know that the receiver does not need that data. I think that you solved the problem, at least partially, when you modified picoquic_check_frame_needs_repeat for your experiment. But you also need to simulate receiving an acknowledgement for the missing data.

The acknowledged list is updated when a packet containing a stream frame is acknowledged, in picoquic_process_ack_of_stream_frame in frames.c. If you want to not repeat a part of the stream data, you need to simulate receiving an acknowledgement, with code similar to that in picoquic_process_ack_of_stream_frame. Something like:

        /* simulate acknowledgement of the stream up to offset = X */
        stream = picoquic_find_stream(cnx, stream_id);
        if (stream != NULL) {
            (void)picoquic_update_sack_list(&stream->first_sack_item,
                0, offset-1);
            picoquic_delete_stream_if_closed(cnx, stream);
        }

You may want to insert that call in your modified version of picoquic_check_frame_needs_repeat. That will be enough for an experiment.

Once we have seen the results of your experiment, we may want to do something more complicated. For example, some applications will want to reliably transmit some parts of the stream, such as "I-Frames" for video. They will need to make sure that these get repeated if needed, but they may be ready to discard some of the other data. We will need to provide an API so the application can express this kind of thing. And we will need a protocol extension to enable sender and receiver to synchronize such decisions.

AbhishiktaP commented 4 years ago

For the purpose of experimentation, how should I introduce pseudo data loss to check for unreliability?

huitema commented 4 years ago

It depends on your setup. Many experimenters place a network simulator between client and server. You could also look at the code that receives or sends packets in picoquicdemo, and randomly drop some of the packets that you receive.
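For the second option, a minimal sketch of a random-loss filter that could sit where picoquicdemo reads packets from its socket (the helper name and the 10% rate are just an example):

    #include <stdlib.h>

    /* Return non-zero to drop the packet that was just read, zero to process it.
     * rand() is good enough for a rough test; use a seeded PRNG for repeatability. */
    static int should_drop_packet(double loss_rate)
    {
        return ((double)rand() / (double)RAND_MAX) < loss_rate;
    }

    /* Example use in the receive loop: if (should_drop_packet(0.10)) { continue; } */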

AbhishiktaP commented 4 years ago

I am completely new to using a network simulator. I just have started with ns-3. Could you guide me a bit as to how I can test picoquic on a network simulator?

huitema commented 4 years ago

@AbhishiktaP Not sure about NS3. @madi223 -- What simulation setup are you using for the tests that you describe in issue #943 ?

huitema commented 4 years ago

I asked the question to QUIC developers, and the suggestion is to use tc on Linux. You may look at https://daniel.haxx.se/blog/2010/12/14/add-latency-to-localhost/

huitema commented 4 years ago

More suggestions: rmarx 8:46 AM we've used tc + netem, mininet and ns-3 (quic interop runner also uses ns-3). Most of that is docker based, and open source here: https://github.com/moonfalir/quicSim-docker (warning though: very rough :wink:)

rmarx 8:57 AM On windows, I've seen https://github.com/WPO-Foundation/win-shaper being used, on Mac dummynet. This might be a bit more userfriendly on linux/mac as well: https://www.sitespeed.io/documentation/throttle/

Tom Jones 10:30 AM @huitema we use dummynet on freebsd, on mac they could use "network link conditioner" from the apple developer tools

My Windows colleagues also have a driver that implements these functions, if you are using Windows.

huitema commented 4 years ago

More info: mpiraux 2:18 PM For Linux, I would recommend ns-3 (with direct code execution or using QNS) over mininet. mininet does emulation while ns-3 does simulation. More importantly, we've found a number of oddities and errors which can greatly impact the results within mininet over the course of several phds within our lab :)

madi223 commented 4 years ago

@AbhishiktaP Not sure about NS3. @madi223 -- What simulation setup are you using for the tests that you describe in issue #943 ?

Hello @huitema ,

My setup is based on VMs interconnected via OVS. I'm using Linux tc in the middle to emulate the bottleneck link (token bucket filter for bandwidth, netem for loss and delay). All my VMs (server, client, middle-boxes) run on top of Qemu/KVM on a single host.

AbhishiktaP commented 4 years ago

Thank you.

timprepscius commented 3 years ago

Hello @huitema,

I've been reading through your library. I have a use-case question that relates to this thread; probably this is all obvious stuff once I dig in deeper.

This seems like it is handled, and perhaps the easiest way would be to establish 2 connections, one reliable and one unreliable. True? It would be nice if this were all on one connection; would I be reassembling the reliable data stream myself, or ... Is it obvious which packets are unreliable?

Does picoquic do the fragmentation?

I believe this is handled. Yes?

Is this possible?

--

This is a really amazing library. Thank you for all the time you've spent on it!

huitema commented 3 years ago

@timprepscius you can certainly handle all that with a single connection. But first, I would like you to dig deeper than your statement, "I need to send out of order unreliable data". Do you actually need the data to be unreliable? If there were no packet losses at all, all the data would be delivered. Would that be wrong? Would the application fail?

The simplest design would be to have one connection, open one reliable stream for the reliable data, and use the "datagram" API to send your unreliable data as independent packets. This will get pretty much everything you want: the reliable data will be delivered in order, without any limit to the size of application messages; the "unreliable" data will be delivered best effort, and will be sent with higher priority than the reliable stream.

The main limitation of this simple design is that the datagram frames must be smaller than the MTU. You mention then that "It would be nice if some unreliable data was larger than the MTU", and you cannot do that with the simple datagram design. The alternative is to send the "unreliable" application messages in separate QUIC streams, one stream per message. This will guarantee that the messages can have arbitrary lengths. It will keep some characteristics of the "unreliable" design: streams can be received in arbitrary order, e.g., stream 12 may well be received before stream 8. That means no head of queue blocking: if a stream packet is lost, the next stream will be received and immediately delivered.

There are two limitations with that "stream per message" design. First, you will have to ensure that the "unreliable" streams are served before the reliable one. The priority API might help achieve that. Then, you will observe that QUIC will try to reliably deliver all streams. This may well be just fine, since, after all, packet loss rates are small, and re-transmissions will only consume a small share of the transmission capacity. If that is a problem, we need to ensure that the sender of the messages can use the "unreliable delivery" API for the "unreliable streams".
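As a sketch of the "stream per message" side of this design, assuming picoquic_add_to_stream(cnx, stream_id, data, length, set_fin) as declared in picoquic.h, and leaving out the priority and unreliable-delivery marking discussed above (the helper and the stream-ID bookkeeping are the application's, not picoquic's):

    #include "picoquic.h"

    /* One unreliable message per stream: open a fresh local stream, queue the
     * whole message, and set FIN so the receiver knows where the message ends.
     * next_msg_stream_id is managed by the application; local stream IDs of the
     * same type increase by 4 in QUIC's numbering. */
    static int send_unreliable_message(picoquic_cnx_t* cnx,
        uint64_t* next_msg_stream_id, const uint8_t* msg, size_t msg_len)
    {
        uint64_t stream_id = *next_msg_stream_id;
        *next_msg_stream_id += 4;

        return picoquic_add_to_stream(cnx, stream_id, msg, msg_len, 1);
    }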