quicwg / datagram

In-progress version of draft-ietf-quic-datagram
https://quicwg.org/

Allow a Sender to Control Datagram ACKs #42

Closed nibanks closed 3 years ago

nibanks commented 3 years ago

In discussions with a few parties using MsQuic we've come across scenarios where ACKs for datagrams were either not necessary or should (almost must) not be sent until some other data was being sent as well. While thinking through these, the simplest solution I've been able to come up with would be for a sender to indicate to the peer that it should not treat datagram frames as ACK eliciting. How do folks feel about adding another (optional) transport parameter to this spec that, when present, indicates DATAGRAM frames are not ack eliciting? Obviously, the parameter is simply ignored if the peer does not advertise support to receive the frames.

nibanks commented 3 years ago

@LPardue brought up a good point on Slack:

since datagrams are congestion controlled, if they don't get ACK'd, what would happen?

My response:

Good point. A couple of thoughts:

In scenarios where the app is periodically (N datagrams a second) sending data, the ACKs will still be exchanged. Also, any other ACK eliciting data would trigger the exchange of ACKs as well.

But we still have the possible scenario of an app sending datagrams in a single direction, with no other feedback in response. I see two possible solutions:

wegylexy commented 3 years ago

How does not acking affect idle timeout? i.e. client keeps sending datagrams, server never acks.

nibanks commented 3 years ago

https://datatracker.ietf.org/doc/html/rfc9000#section-10.1

An endpoint restarts its idle timer when a packet from its peer is received and processed successfully. An endpoint also restarts its idle timer when sending an ack-eliciting packet if no other ack-eliciting packets have been sent since last receiving and processing a packet. Restarting this timer when sending a packet ensures that connections are not closed after new activity is initiated.

So receiving these packets would reset the idle timeout. Sending them would not, because they aren't ACK-eliciting.
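A minimal sketch of the restart rules quoted above may make the asymmetry easier to see. The `IdleTimer` class and its field names are hypothetical and only model the two rules from RFC 9000 section 10.1; they are not taken from any real QUIC stack.

```python
from dataclasses import dataclass


@dataclass
class IdleTimer:
    """Toy model of the RFC 9000 section 10.1 idle-timer restart rules."""
    timeout: float                                # negotiated idle timeout, seconds
    deadline: float = 0.0                         # absolute time the connection idles out
    sent_ack_eliciting_since_receive: bool = False

    def start(self, now: float) -> None:
        self.deadline = now + self.timeout

    def on_packet_received(self, now: float) -> None:
        # Any successfully processed packet from the peer restarts the timer.
        self.deadline = now + self.timeout
        self.sent_ack_eliciting_since_receive = False

    def on_packet_sent(self, now: float, ack_eliciting: bool) -> None:
        # Only the first ack-eliciting packet sent since the last receive
        # restarts the timer; non-ack-eliciting packets never do.
        if ack_eliciting and not self.sent_ack_eliciting_since_receive:
            self.deadline = now + self.timeout
            self.sent_ack_eliciting_since_receive = True

    def expired(self, now: float) -> bool:
        return now >= self.deadline
```

Under this model, a sender that emits only non-ack-eliciting datagrams never extends its own deadline, while the receiver's deadline is extended by every datagram that arrives.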

DavidSchinazi commented 3 years ago

We've been able to ship the DATAGRAM frame in production without this feature, so I'd suggest this should be written up as an extension to datagrams.

LPardue commented 3 years ago

+1

Datagrams has been tightly functionally scoped since adoption to the group. We have some IETF adopted protocols on top of QUIC that haven't seemed to need the suggested feature. Adding an optimization feature just as we're getting ready to ship has a bad track record. I think we'd want to see very strong evidence of usefulness and implementer intent of such a feature before considering it for inclusion.

nibanks commented 3 years ago

I think we'd want to see very strong evidence of usefulness and implementer intent of such a feature before considering it for inclusion.

Fair enough. We intend to implement because the usefulness of this extends beyond HTTP scenarios to very general scenarios, including:

  1. Embedded/IoT/portable devices where network usage increases translate to battery life decreases
  2. Devices on metered networks where network usage increases translate to higher monetary costs

Given how general these scenarios are, I think we should take them into account within the main spec and not delay this for another document. While the spec may be sufficient for HTTP scenarios as-is, it would be a shame for a general transport spec to require an additional extension so it can be optimized for its use by non-HTTP callers who care more about the packet overhead.

Would a repro of the packet reduction before and after the change be helpful to illustrate the impact of the change?

LPardue commented 3 years ago

I could be lacking imagination but I'm not seeing why these problems are unique to DATAGRAM as currently defined. Rather it seems like the ability to reduce any QUIC traffic can benefit such use cases.

As an example, we've encountered something similar with the recommended ack policies in the base drafts and how something like ack-frequency can improve upon that for a range of documents.

Seeing some numbers would be informative because right now this problem seems nebulous and the proposed solution a bit vague.

mjoras commented 3 years ago

This seems like a fundamental change to the premise underlying the design of DATAGRAM frames. The choices to make them congestion controlled and ACK-eliciting were pretty deliberate, and I think making that optional serves to confuse the dependency graph of other drafts depending on DATAGRAM, as @LPardue alludes to.

Controlling ACKs is definitely an interesting problem, but I don't think this is pressing enough or specific to DATAGRAM to be in the DATAGRAM spec. One way to work around this, as discussed, is to make a new DATAGRAM type which doesn't elicit ACKs (more closely matching UDP datagram behavior). Such a thing is something that belongs in another extension, IMO.

nibanks commented 3 years ago

I agree that the general problem of "how do we coalesce ACKs to minimize extra packets" could be solved several ways. The ACK frequency extension is one possible way (I initially started there, in fact). And yes, the problem could be more generalized to data beyond DATAGRAMS, but there's a natural reason that at least two completely separate groups have come to me and asked for this for just DATAGRAMS: the data is unreliable and the apps have no direct need for the acknowledgements. I don't see this kind of behavior being as necessary for reliable data exactly because it is reliable. It needs to be retransmitted on loss. Therefore you need to determine loss and need those ACKs.

Assuming then that we would only want to solve this for DATAGRAM, the question is "How?". Relying on the ACK frequency extension has complications because we want to control the behavior based on the presence of the DATAGRAM frame. It seems overly complicated to say "Please use these ACK frequency parameters, unless it's a DATAGRAM-only packet, and then use these instead." That produces a pretty complicated implementation and API configuration surface. Configuring the interpretation of how you acknowledge DATAGRAM frames seems simpler: elicit an ACK or not. Can there be issues with congestion control? Yes. Can't you have exactly the same problem if you configured too high of an ACK frequency parameter set (1s delay; 1000pkt threshold)? I haven't heard any problems about that discussed with ACK frequency.

I hoped to discuss possible solutions to the problem here. If the solution ends up requiring a new document, then so be it. I can write a new document. I would like folks' opinions on the best way they think to solve this though.

DavidSchinazi commented 3 years ago

Before we discuss solutions in the context of this draft, we should discuss problems in the context of this draft. The fact that there are multiple interoperable deployments of this draft that haven't experienced this problem indicates that this problem is not applicable to all uses of this draft.

nibanks commented 3 years ago

I agree that the document works as is, for existing HTTP based scenarios/deployments, but as I understand it, the QUIC WG charter is no longer limited to purely HTTP based workloads. Therefore, is "what we have works fine for HTTP" still an acceptable reason to reject proposals related to non-HTTP problems?

Just to restate the main problem: Battery operated embedded devices that periodically exchange DATAGRAM frames do not want to pay the power cost of waking up to send/receive the ACK-only packets that inevitably get exchanged, because the delayed ACK timer is less than the DATAGRAM send period. But simply increasing the delayed ACK timers has a negative side effect on reliable data that is occasionally exchanged. I understand that we don't want to have feature creep, but IMO, this is not an unreasonable problem to solve with this draft.
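The restated problem can be illustrated with a little arithmetic. The sketch below is a deliberately simplified model, not a measurement: it assumes both peers send one datagram per period, that the receiver's own sends are shifted by a fixed phase, and that an ACK is piggybacked only if the receiver's next own send happens before the delayed ACK timer fires. The function name and all parameters are hypothetical.

```python
def count_ack_only_packets(n_datagrams: int, send_period_ms: int,
                           peer_phase_ms: int, ack_delay_ms: int) -> int:
    """Count ACK-only packets a receiver emits for n periodic datagrams."""
    ack_only = 0
    for k in range(n_datagrams):
        arrival = k * send_period_ms
        ack_deadline = arrival + ack_delay_ms
        # Next time the receiver sends its own datagram (same period,
        # shifted by peer_phase_ms); ceil division on integers.
        periods = max(0, -(-(arrival - peer_phase_ms) // send_period_ms))
        next_own_send = peer_phase_ms + periods * send_period_ms
        if next_own_send > ack_deadline:
            # The ACK timer fires before there is anything to piggyback on.
            ack_only += 1
    return ack_only
```

With a 1000 ms send period, a 500 ms phase, and a 25 ms delayed ACK timer, every datagram in this model costs one extra ACK-only packet; raising the delay above the phase removes them, at the cost of delaying ACKs for reliable data as well.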

The simplest solution I've come up with is to allow for making DATAGRAMs not ACK-eliciting, as indicated in #44. Implementations (IMO) should be able to trivially implement the receiver part of this new TP (just treat it as an ACK or PADDING frame), and if they don't expect to ever send it, they don't need to add any extra code for it.

DavidSchinazi commented 3 years ago

@nibanks this isn't about HTTP. We also have a VPN over QUIC DATAGRAM in production that works well without your proposed change. I think your proposal is in scope for the QUIC WG (though I'll defer to the chairs to make that call), it's just not necessarily in scope for this document. The DATAGRAM extension is general-purpose, whereas your proposal is constrained to a specific class of device. Figuring out how to solve your issue will take time, and I really don't think we should delay the publication of this document because of your specific constrained use-case.

LPardue commented 3 years ago

The proposal max_datagram_no_ack leaves open a big question about interoperability. If such a client finds it important not to have to do DATAGRAM-related ACK sending or receiving, then it's not clear what they do when a server doesn't advertise max_datagram_no_ack. There are going to be other considerations like this that people would figure out through experience - waiting on that before declaring this document done is unfair to the people that are happy with the work we adopted in the first place.

Wearing no hats, I think this use case has the potential to be interesting work worth addressing. If that requires QUIC protocol changes or extensions, the QUIC WG is best placed to do it. It doesn't sound like DATAGRAM prevents future work in that area.

nibanks commented 3 years ago

One comment I have related to possibly putting this in a separate document: It would then require two new transport parameters instead of just 1:

This would probably double the implementation cost because of the additional negotiation logic that would be required to implement along with what already exists for normal datagrams. The feature support matrix gets a lot more complicated.

The proposal max_datagram_no_ack leaves open a big question about interoperability.

I don't quite follow the problem you outline here @LPardue. If my proposal were accepted, at a minimum an implementation that supports DATAGRAMs would have to support receiving the new TP, and simply modify their ACK logic accordingly. Implementations that have no need to send the TP don't need to do anything further.

LPardue commented 3 years ago

The problem statement is ambiguous then. You say that clients don't want to wake up to send or receive ACKs. If a server supports not ACKing the client DATAGRAM but doesn't send the extension to stop the client ack'ing server DATAGRAMs, is that ok?

nibanks commented 3 years ago

It's a unilateral extension where one side says "Don't ACK my datagrams" so they don't have to wake up just to process an ACK for the datagram. For protocols where it makes sense, both sides might enable it.

LPardue commented 3 years ago

Thanks for the clarification. That seems to put a lot of onus on the receiver of the TP to do work. I.e., you'd be asking server implementers to do work even if they have no intention to run deployments that require this feature.

nibanks commented 3 years ago

Yes, a receiver would have some burden, even if they don't have a corresponding scenario that enables the feature themselves. As I see it, the following changes are required (at a minimum):

  1. Add decoding logic for the new payload-less TP.
  2. Add a new connection-wide flag, that is set when the new TP is successfully decoded.
  3. When processing received DATAGRAM frames, read the flag and elicit an ACK accordingly.
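The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the transport parameter ID is made up (no such codepoint is assigned), the `Connection` class is hypothetical, and transport parameters are assumed to be already parsed into a mapping from ID to raw value. The set of non-ack-eliciting frames (ACK, PADDING, CONNECTION_CLOSE) is the one defined in RFC 9000.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical transport parameter ID, for illustration only.
TP_DATAGRAM_NO_ACK = 0xFF000042

# Frames that are not ack-eliciting in RFC 9000.
BASE_NON_ELICITING = frozenset({"ACK", "PADDING", "CONNECTION_CLOSE"})


@dataclass
class Connection:
    # Step 2: connection-wide flag, set once the TP is decoded.
    peer_datagrams_no_ack: bool = False

    def on_peer_transport_params(self, params: Dict[int, bytes]) -> None:
        # Step 1: decode the payload-less TP.
        value = params.get(TP_DATAGRAM_NO_ACK)
        if value is None:
            return
        if value != b"":
            raise ValueError("payload-less transport parameter carried a value")
        self.peer_datagrams_no_ack = True

    def packet_is_ack_eliciting(self, frame_types: List[str]) -> bool:
        # Step 3: consult the flag when classifying a received packet.
        non_eliciting = set(BASE_NON_ELICITING)
        if self.peer_datagrams_no_ack:
            non_eliciting.add("DATAGRAM")
        return any(f not in non_eliciting for f in frame_types)
```

A packet that mixes DATAGRAM with any ack-eliciting frame (for example STREAM) still elicits an ACK in this sketch, since ack-eliciting is a property of the packet as a whole.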

IMO, this is not "a lot of onus", especially if you compare it to some of the complexity involved in things like migration, ECN or SPA, which, strictly speaking, aren't explicitly required for a deployment to work, but still require a certain amount of work.

wegylexy commented 3 years ago

I think it is rather the reverse: server can opt to tell clients not to expect acks for datagrams that the client may send. This is essential for servers hosted in the cloud where egress is metered and ingress is free. If an old client doesn't support this, it will just assume the datagram is lost as usual. When this is enabled, the server will not ack datagrams at all, not even bundle with other frames. The app may already use stream to control the datagram payload from clients, e.g. server tells clients to lower voice quality via a control stream.

nibanks commented 3 years ago

@wegylexy yes, you could go that route, but I think it better for the sender of the DATAGRAMs to be in control, because only they know if they need ACKs or not.

MikeBishop commented 3 years ago

An alternative way to spell this is analogous to PADDING/PING -- have an ack-eliciting-DATAGRAM and a non-ack-eliciting-DATAGRAM codepoint.

nibanks commented 3 years ago

Yes, that would be an acceptable approach. To expand a bit on the design/differences:

  1. No new TP; just defines two code points for the two types of frames: DATAGRAM & DATAGRAM_NO_ACK.
  2. DATAGRAMs are ACK-eliciting; DATAGRAM_NO_ACKs are not.

If folks think that's easier to implement on the receiver side, I'd be fine with that as well.

LPardue commented 3 years ago

typed my answer but got overtaken by events, but posting anyway

The alternative I offered in Slack was to define a new frame type called DATAGRAM_NO_ACK that acts very much like DATAGRAM except it is not ack-eliciting. Endpoints advertise their willingness to receive the frame in a TP; if your peer doesn't support it then you know to either fall back to a less optimized DATAGRAM (maybe with some ACK tuning) or you terminate everything. The requirement on what to do with that individual frame is clear. The onus shifts to the sender to make sure they use the frames appropriately. It also allows a sender to mix in a regular DATAGRAM to elicit acks when it's needed, avoiding the need for pings.
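A rough sketch of the sender side of this alternative, assuming the peer's support has already been learned from a (hypothetical) transport parameter. The `DatagramSender` class and the "every Nth datagram elicits an ACK" policy are illustrative assumptions; the thread does not specify when a sender should mix in a regular DATAGRAM.

```python
from dataclasses import dataclass


@dataclass
class DatagramSender:
    peer_supports_no_ack: bool   # learned from the peer's (hypothetical) TP
    elicit_every: int = 10       # assumed policy: every Nth datagram elicits an ACK
    _since_elicit: int = 0

    def next_frame_type(self) -> str:
        if not self.peer_supports_no_ack:
            # Fallback: the peer only understands regular DATAGRAM frames.
            return "DATAGRAM"
        self._since_elicit += 1
        if self._since_elicit >= self.elicit_every:
            # Mix in a regular DATAGRAM to elicit an ACK without a PING.
            self._since_elicit = 0
            return "DATAGRAM"
        return "DATAGRAM_NO_ACK"
```

The frame choice stays entirely with the sender here, which matches the point that the onus shifts to the sender to use the frames appropriately.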

I think this goes to show that even if the WG were to agree to solve this problem, we'll take time to agree on the solution acceptable to everyone with an opinion, implementation or deployment concerns.

migration, ECN or SPA, which, strictly speaking, aren't explicitly required for a deployment to work, but still require a certain amount of work.

In those examples, the peer can't force the endpoints to use the feature. Those examples are also complicated and require page(s) of text to explain the expectations, tradeoffs etc. That's why I'm concerned about tying up the DATAGRAM progress with something that might have considerations we don't know about.

DavidSchinazi commented 3 years ago

+1. None of these proposals have anywhere near as much deployment as existing DATAGRAM. We should experiment with these, but not in a way that delays shipping the datagram document.

LPardue commented 3 years ago

One way to approach experimentation would be to collect the problem statement and the different proposals into a single I-D, and solicit feedback to gain a sense of whether the community shares the understanding of the problem and has any strong opinion for one of the proposals, or indeed has other ideas.

tfpauly commented 3 years ago

Yes, I think having a new document with a problem statement and proposal would be best here.

I'd point out that the document does give implementations a fair amount of leeway:

Receivers SHOULD support delaying ACK frames (within the limits specified by max_ack_delay) in response to receiving packets that only contain DATAGRAM frames, since the timing of these acknowledgements is not used for loss recovery.

So, implementations can choose to be fairly lazy in sending ACKs, and can configure a high max_ack_delay. If this really isn't sufficient, I think we need a new proposal.

wegylexy commented 3 years ago

@tfpauly What about low delay for streams but high delay for datagrams? And priority of packets and their acks between streams and datagrams?

tfpauly commented 3 years ago

An implementation certainly could ACK with larger delays for DATAGRAM-only packets, and shorter delays for packets with STREAM frames. The document already suggests this. However, since ACKs are for packets, not frames, I don't see how prioritization comes into play: once I receive a STREAM frame I can ACK more quickly, which then also covers any DATAGRAM-only packets I had received.
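One way to picture this receiver behavior is the sketch below. It assumes the receiver has advertised a large max_ack_delay (1 second here, an arbitrary choice) and voluntarily uses a much shorter delay for packets carrying anything other than DATAGRAM; 25 ms is the RFC 9000 default for max_ack_delay and is used only as a plausible short value. The `AckScheduler` class is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

NON_ELICITING = ("ACK", "PADDING", "CONNECTION_CLOSE")


@dataclass
class AckScheduler:
    max_ack_delay: float = 1.0       # advertised limit, used for DATAGRAM-only packets
    stream_ack_delay: float = 0.025  # shorter delay chosen for everything else
    pending: List[int] = field(default_factory=list)
    deadline: Optional[float] = None

    def on_packet(self, now: float, packet_number: int, frame_types: List[str]) -> None:
        # ACKs cover packets, so every received packet number is recorded.
        self.pending.append(packet_number)
        eliciting = [f for f in frame_types if f not in NON_ELICITING]
        if not eliciting:
            return  # non-ack-eliciting packets do not arm the timer
        datagram_only = all(f == "DATAGRAM" for f in eliciting)
        delay = self.max_ack_delay if datagram_only else self.stream_ack_delay
        candidate = now + delay
        if self.deadline is None or candidate < self.deadline:
            self.deadline = candidate

    def flush(self, now: float) -> List[int]:
        """Return the packet numbers to acknowledge if the timer has fired."""
        if self.deadline is None or now < self.deadline:
            return []
        acked, self.pending, self.deadline = self.pending, [], None
        return acked
```

When a STREAM-bearing packet arrives, the earlier deadline wins and the resulting ACK also covers the DATAGRAM-only packets that were waiting, which is why frame-level prioritization does not come into it.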

tfpauly commented 3 years ago

@nibanks are you okay to say this is not in scope for the main doc, and take it as a separate discussion in the WG?

nibanks commented 3 years ago

In my opinion, this should be included in the core datagram extension. There are two main questions in play here:

Should a DATAGRAM sender be allowed to control the ACK behavior?

There are many reasons that a QUIC-based protocol would want to control the ACK behavior:

  1. Low power (IoT) devices that have no need/desire to wake up to process an ACK (e.g. sensor periodically sending some state).
  2. A need to customize the ACK batching for non-power related reasons, such as more expensive ingress/egress traffic.
  3. Generally, more flexibility for existing UDP-based protocols to be ported to QUIC.

As already mentioned, packets with DATAGRAM frames are congestion controlled, so senders must be wary and account for this. IMO, there are several ways to handle this and it should not be a blocker.

If the WG agrees that we should support this, then the following, more contentious, question comes in:

Should all DATAGRAM receivers be REQUIRED to support this?

In other words, can we make this feature optional (i.e. put it in a separate extension) or not? The only part that would be required if this was added to the DATAGRAM spec is the receiving part, because an implementation doesn't have to send things it doesn't use itself; but it does have to ACK them.

Arguments for making it required:

Arguments against making it optional:

Responses to arguments that have been made against requiring it:

Existing deployments don't need it.

The spec should take into account more than the current deployments' usages. Other protocols actively looking to use QUIC will use it.

We require interop'ing code before requiring anything.

IMO, as things stand currently, this is effectively a requirement to interop with an HTTP/3 stack. As far as I know, beyond MsQuic, there are no other deployed QUIC-based protocols beyond HTTP/3. All the existing production implementations and deployments involve HTTP/3. So if there is a feature that existing HTTP/3 stacks have no interest in, there's no way to get interop with those stacks because they won't implement the feature.

Taking all this into account, my strong preference is to make this a requirement of the core DATAGRAM extension, ideally along the lines of https://github.com/quicwg/datagram/pull/45.

tfpauly commented 3 years ago

Personally, I think that the use case of QUIC being optimized for IoT devices to the point that it can be used for sending beacon packets without any ACKs or congestion control is out of scope of this document, and belongs as an extension. I think that kind of use case goes beyond just adding unreliable frame support, and needs a lot more advice and changes made to the protocol as a whole, particularly around loss recovery and congestion control.

As such, I do not believe this should be a required feature for unreliable DATAGRAM support.

nibanks commented 3 years ago

Thanks for the feedback Tommy.

can be used for sending beacon packets without any ACKs or congestion control is out of scope of this document

I think it's an important distinction to tease out: Congestion control still applies 100%. No changes there. Additionally, we aren't doing "without any ACKs". Congestion control will still limit the amount that can go into the network. We will still require ACKs to remove "bytes in flight" to free up the CC window.

QUIC being optimized for IoT devices ... is out of scope of this document

I'd like to clarify this statement further, if possible. Are you saying that in general QUIC for IoT devices is out of scope or were you referring specifically to the ACK/CC stuff (discussed above)? If IoT in general, there is nothing explicitly IoT for this from a protocol standpoint. Additionally, TCP, UDP, and TLS don't have "here's the IoT specific bits" associated specs or extensions that I know of. Why should QUIC be any different? If the statement was specific to the ACK/CC stuff, please see my above comments.

that kind of use case goes beyond just adding unreliable frame support

Assuming you ignore the statement about changing CC (which I am not proposing) and that ACKs are still generally required for all other aspects of the protocol, I strongly believe being able to modify that ACK behavior of the peer for datagrams is in scope for this spec. I see a similar correlation between PING/PADDING in the core spec.

Ralith commented 3 years ago

As far as I know, beyond MsQuic, there are no other deployed QUIC-based protocols beyond HTTP/3. All the existing production implementations and deployments involve HTTP/3.

FWIW, quinn implements features regardless of their relevance to HTTP/3, and has non-HTTP/3 users, and considers good datagram support a priority. We're a bit behind on interop setup, though.

tfpauly commented 3 years ago

If we are still requiring that DATAGRAMs contribute to bytes in flight, but are allowed to not be ACK'ed, we get into the warnings that are around PADDING:

To avoid a deadlock, a sender SHOULD ensure that other frames are sent periodically in addition to PADDING frames to elicit acknowledgments from the receiver.

Now, it seems unlikely that a QUIC sender would only ever be sending PADDING packets to the point where a deadlock occurs, but it's quite reasonable to only send DATAGRAM packets for quite a while. Thus this would require that senders also send PINGs to elicit ACKs, and be very careful about doing so to avoid deadlocks.
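The deadlock concern can be sketched as sender-side bookkeeping. The model below is an assumption-laden illustration, not a recommendation: `FlightTracker`, the `DATAGRAM_NO_ACK` frame name, and the "add a PING once half the congestion window is unsolicited" policy are all hypothetical, and the size of the PING frame itself is ignored.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FlightTracker:
    cwnd: int                     # congestion window, bytes
    ping_threshold: float = 0.5   # assumed policy: fraction of cwnd before forcing a PING
    bytes_in_flight: int = 0
    unsolicited_bytes: int = 0    # sent since the last ack-eliciting packet

    def frames_for_datagram(self, size: int) -> List[str]:
        """Return the frames to send for one datagram, or [] if blocked."""
        if self.bytes_in_flight + size > self.cwnd:
            return []  # blocked by congestion control until an ACK arrives
        self.bytes_in_flight += size
        self.unsolicited_bytes += size
        frames = ["DATAGRAM_NO_ACK"]
        if self.unsolicited_bytes >= self.ping_threshold * self.cwnd:
            # Bundle a PING so the peer sends an ACK that frees the window.
            frames.append("PING")
            self.unsolicited_bytes = 0
        return frames

    def on_ack(self, acked_bytes: int) -> None:
        self.bytes_in_flight = max(0, self.bytes_in_flight - acked_bytes)
```

Without the PING branch, a sender emitting only non-ack-eliciting datagrams would fill the window and then wait for an ACK that the receiver has no obligation to send.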

The other way to minimize ACK overhead is to instead delay ACKs, which is what the document already suggests. max_ack_delay can go up to 16 seconds if that's how a specific application wants to configure things. For the use cases you have, is there a hard requirement to have ACKs spread out longer than 16 seconds? The spec can support a truly huge amount of ACK batching.
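For reference, the 16-second figure follows from the max_ack_delay transport parameter in RFC 9000, which is expressed in milliseconds and treats values of 2^14 or greater as invalid. A small validation sketch (the function name is hypothetical):

```python
# RFC 9000: max_ack_delay is in milliseconds; values of 2^14 or greater are invalid.
MAX_ACK_DELAY_LIMIT_MS = 1 << 14


def validate_max_ack_delay(ms: int) -> int:
    """Return ms unchanged if it is a valid max_ack_delay, else raise."""
    if not 0 <= ms < MAX_ACK_DELAY_LIMIT_MS:
        raise ValueError(f"max_ack_delay {ms} ms is out of range")
    return ms
```

The largest value this accepts is 16383 ms, i.e. just under 16.4 seconds.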

Also, regarding implementation, our implementation for the Apple stack (which is now public API) is QUIC-native without requiring HTTP/3, so I am certainly looking at use cases beyond HTTP/3.

nibanks commented 3 years ago

The problem with increasing delayed ACK is its effect on reliable data. IMO, this further argues for why something like this should be in the DATAGRAM spec, and not something else. We don't want to affect reliable data acknowledgements; just the unreliable data.

huitema commented 3 years ago

@nibanks if you want datagrams to not be acknowledged, then logically they should be exempted from congestion control. And then, I would really not want that to be a default or must-support option. If the application really depends on that behavior, why is it not using DTLS instead of QUIC?

nibanks commented 3 years ago

@huitema they are still acknowledged. They just don't always need to elicit an immediate ACK. As far as using DTLS, QUIC brings loads of features and improvements compared to DTLS when considering reliable data delivery. Apps that only need unreliable delivery aren't the target scenario; apps/protocols that need a mix of reliable and unreliable are.

tfpauly commented 3 years ago

@nibanks the ACK delay is the max ACK delay. Essentially, you can ACK STREAM frames faster if you want—you don't have to delay—but you can delay ACKs for DATAGRAMs for quite a while. This seems like it's very doable for specific applications.

LPardue commented 3 years ago

The chairs have been monitoring the discussion on this issue, PRs, and Slack.

The DATAGRAM draft was adopted as a simple extension to QUIC that operated within the existing constraints and premises of the core QUIC transport. The ACK-eliciting property of DATAGRAM frames was included in the version of the document that this WG chose to adopt. Treating them this way provides similarity to STREAM frames, which aids user expectations of application data behaviour, and is consistent with the design principle that DATAGRAM frames are subject to congestion control.

DATAGRAM is an extension to QUIC. Endpoints advertise their ability to receive DATAGRAM frames. Applications that build on top of QUIC need to define handling of this extension negotiation and failure conditions.

The stated goals for this proposal can, in part, be achieved by existing QUIC capabilities such as ACK delaying. Where there is a capability gap, the core QUIC transport and DATAGRAM draft do not prevent additional extensions that can fulfil the stated goals. It has been noted that there may be pain for applications that require specific QUIC extensions in order to meet an operational target. The chairs do not believe this problem is unique to DATAGRAM; it is something that applications will have to accommodate in the long term.

Several members of the WG have noted that changing or augmenting the ACK behaviour of DATAGRAM introduces complication for the design of this extension specifically related to congestion control. For what is a fairly short draft, accommodating a robust and complete solution to this proposal would likely require broad changes of design and editorial nature to this specification. Furthermore, incorporating such a design change as a mandatory part of DATAGRAM risks requiring IETF work items dependent on this specification to also exert effort to accommodate the change. The chairs note that the current design meets the operational targets of IETF protocols such as MASQUE and WebTransport. Meanwhile, while discussion for audio/video use-cases of QUIC seems to be picking up in the IETF (for example, the non-WG MoQ list), it is still early in understanding how DATAGRAM meets or does not meet evolving application needs. Attempting to accommodate emerging needs puts at risk the WG's ability to complete work for dependent applications that already have their needs solved. This is a matter of trade-offs.

QUIC's strong versioning and extensibility support mean that it can be tailored to the needs or requirements of applications, without necessarily needing coordination between diverse implementers, endpoints, or the QUIC WG. When deciding what work to take on in the QUIC WG, the chairs consider interest from the WG members in both solving the problem and implementing the solution.

Based on our observations of the discussion on this issue, together with the progress status of the DATAGRAM specification, the chairs do not believe there is sufficient interest in the WG such that this problem needs to be solved as part of this document. The proposal therefore is to close the issue with no action. A consensus call will be sent to the mailing list.

The chairs would like to note that a non-ack eliciting DATAGRAM extension is in scope of the QUIC WG should the proponents wish to pursue the work as a separate item.

ianswett commented 3 years ago

I'm just coming into this discussion now, but I support not doing anything in the core datagram draft and instead punting this to the ack-frequency draft.

I already filed an issue for a frame that makes a packet non-ack-eliciting (https://github.com/quicwg/ack-frequency/issues/65), but there are other ways this could be done (i.e., a very high 'ACK-Eliciting Threshold') as well.

janaiyengar commented 3 years ago

Just a note to the proponents who might still want to see this work happen elsewhere. As you work through building a case for this, I think it might help if you:

In general, I'll note that a receiver is free to do as it chooses, and it can unilaterally ack (or not) when it wants. The entire point of the ack-frequency draft is for the sender to tell the receiver of its tolerance, to put limits on how long the receiver might delay acks. You might consider building something with a combination of sender tolerance signaling (through ack frequency) and general receiver behavior (through application negotiation, or through a TP, to indicate that a receiver does not need to ack some things immediately).

LPardue commented 3 years ago

Consensus is to close this with no action on the DATAGRAM spec.