projectcontour / contour

Contour is a Kubernetes ingress controller using Envoy proxy.
https://projectcontour.io
Apache License 2.0

internal/grpc: properly handle Envoy ACK/NACK protocol #1176

Open · davecheney opened this issue 5 years ago

davecheney commented 5 years ago

Envoy's stream protocol states that after each DiscoveryResponse message, the subsequent DiscoveryRequest will contain an ACK or NACK for the previous response. Contour currently doesn't pay attention to this ACK/NACK protocol, but for better compliance with the xDS spec we should.
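
For reference, a minimal sketch of how a control plane can tell the two apart from the fields of the next DiscoveryRequest, assuming the go-control-plane v3 types; the helper name and the surrounding stream handling are hypothetical, not Contour code:

```go
package xds

import (
	discovery "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
)

// classifyRequest is a hypothetical helper showing how a control plane can
// tell an ACK from a NACK. Per the xDS protocol, every DiscoveryRequest that
// follows a DiscoveryResponse echoes that response's nonce; a NACK also
// carries a populated error_detail and repeats the previously accepted
// version_info rather than the rejected one.
func classifyRequest(req *discovery.DiscoveryRequest, lastSentNonce string) string {
	switch {
	case req.GetResponseNonce() == "":
		// First request on the stream; there is nothing to ACK or NACK yet.
		return "initial request"
	case req.GetResponseNonce() != lastSentNonce:
		// Response to an older DiscoveryResponse; typically ignored.
		return "stale"
	case req.GetErrorDetail() != nil:
		return "NACK: " + req.GetErrorDetail().GetMessage()
	default:
		return "ACK of version " + req.GetVersionInfo()
	}
}
```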

phylake commented 5 years ago

NACKs happen (at least) when config validation fails. Envoy erroneously throws exceptions in its config constructors (example), which will leak memory for anything allocated in the constructor body (destructors aren't run if exceptions are thrown in the ctor body; see Item 10 in More Effective C++ for more detail). In the worst case, Contour continues to send the same resources and gets NACKs back while Envoy leaks memory until it's OOM-killed. Whether memory leaks, and how much, depends on which object in the deeply nested config classes throws an exception and where.

Contour needs to mitigate this for xDS with resource hints by not sending the resource anymore if it's NACKed. I don't know what mitigation it could do for LDS/CDS other than keeping the previous set of resources around so they can be sent if the most recent version is NACKed.

davecheney commented 5 years ago

For 0.15 I've landed #1350 and I plan to leave this issue here for the moment.

Sending invalid updates to Envoy is a problem -- we shouldn't do that, but I don't know what to do with the signal once we did send an invalid update. Logging is about the least worst thing we can do. Trying to cordon the offending update is complicated; it depends on which table it came through. Crashing Contour is a non-starter, so logging is about the best I can think to do at the moment. Hopefully logging at error level will give us a signal from the field about which updates are causing the problem, and we'll know if it's something we can systematically filter out on the Contour side or if we need to build a framework to trace NACK'd responses from Envoy back to their original k8s object.

jpeach commented 4 years ago

xref https://github.com/envoyproxy/go-control-plane/issues/46

sunjayBhatia commented 3 years ago

> Crashing Contour is a non-starter, so logging is about the best I can think to do at the moment. Hopefully logging at error level will give us a signal from the field

Have we gotten any signal on this from users? Another option, in addition to logging, would be to increment a metric to make this issue more visible to operators.
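
A rough sketch of what such a metric could look like with the Prometheus Go client; the metric name `contour_xds_nack_total` and the call site are assumptions for illustration, not something Contour exposes today:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// nackTotal is a hypothetical counter of NACKed xDS responses, labelled by
// type URL so operators can see which resource type Envoy is rejecting.
var nackTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "contour_xds_nack_total",
		Help: "Total number of xDS responses rejected (NACKed) by Envoy.",
	},
	[]string{"type_url"},
)

func init() {
	prometheus.MustRegister(nackTotal)
}

// RecordNACK would be called from the xDS stream handler whenever a
// DiscoveryRequest arrives with a non-nil error_detail.
func RecordNACK(typeURL string) {
	nackTotal.WithLabelValues(typeURL).Inc()
}
```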

jpeach commented 3 years ago

> Crashing Contour is a non-starter, so logging is about the best I can think to do at the moment. Hopefully logging at error level will give us a signal from the field

> Have we gotten any signal on this from users?

Generally, we have opted not to allow passing config straight through to Envoy due to the risk of generating errors. This has complicated development and prevented advanced users from taking advantage of some Envoy features directly.

> Another option in addition to logging would be to increment a metric to make this issue more visible to operators

Crashing doesn't help clear the underlying condition that causes the NACK. Contour would just reschedule and then send the bad config again. The desirable behavior is to fail safe. In the context of a configuration system, what we would want to do is keep the last good configuration, then bubble up the error associated with the failure to apply the new configuration. The difficulty here is knowing what the last configuration was (or should be), and how to propagate the error (since there's not always a straightforward relationship between Envoy and Kubernetes resources).
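
A minimal sketch of that last-known-good idea, assuming a simple cache keyed by xDS type URL and populated only from resource sets Envoy has ACKed; the type and method names are hypothetical:

```go
package xds

import (
	"sync"

	"google.golang.org/protobuf/proto"
)

// lastGoodCache remembers the most recently ACKed resource set per xDS type
// URL, giving the control plane something safe to fall back to when a newer
// update is NACKed.
type lastGoodCache struct {
	mu        sync.Mutex
	resources map[string][]proto.Message
}

// Accept records a resource set once Envoy has ACKed it.
func (c *lastGoodCache) Accept(typeURL string, res []proto.Message) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.resources == nil {
		c.resources = make(map[string][]proto.Message)
	}
	c.resources[typeURL] = res
}

// Fallback returns the last ACKed resource set for a type, if any, so it can
// be re-sent while the rejected update is surfaced as an error elsewhere.
func (c *lastGoodCache) Fallback(typeURL string) ([]proto.Message, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	res, ok := c.resources[typeURL]
	return res, ok
}
```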

youngnick commented 3 years ago

In order to implement what James is talking about, we will need to change the way we build the DAG and send it to Envoy: instead of building a DAG and sending it to Envoy, we would build a DAG, diff it against the last DAG, and send either the changes (which would allow us to do incremental xDS) or the whole config again (what we do now). In the case that a change is NACKed, we would then have to spray errors everywhere (on every object managed by Contour, in the logs, and in a metric as well) to make it clear that nothing is getting updated until the issue is resolved.

In the meantime, we absolutely should log and increment a metric, to at least make it possible to detect that something has happened. I don't think we can risk doing anything that may cause a NACK, though, until we have the above solution or something that does the same thing.
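
For illustration, a rough, order-sensitive sketch of the diff step described above, assuming the previous and new builds are flattened into per-type-URL resource maps; the function and its inputs are hypothetical, not existing Contour code:

```go
package xds

import (
	"google.golang.org/protobuf/proto"
)

// changedTypes compares the resources produced by the new build against the
// previous one, per xDS type URL, and reports which types actually changed.
// A caller could then send only those types (incremental) or the whole config
// (today's behaviour), and attribute any NACK to the changed types alone.
//
// The comparison is deliberately simple: it is order-sensitive and treats a
// reordered slice as a change.
func changedTypes(prev, next map[string][]proto.Message) []string {
	var changed []string
	for typeURL, nextRes := range next {
		prevRes, ok := prev[typeURL]
		if !ok || len(prevRes) != len(nextRes) {
			changed = append(changed, typeURL)
			continue
		}
		for i := range nextRes {
			if !proto.Equal(prevRes[i], nextRes[i]) {
				changed = append(changed, typeURL)
				break
			}
		}
	}
	// Types that disappeared entirely also count as changes.
	for typeURL := range prev {
		if _, ok := next[typeURL]; !ok {
			changed = append(changed, typeURL)
		}
	}
	return changed
}
```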

youngnick commented 3 years ago

Okay, I'm having a look into this, and I'm going to start by checking that it's still a problem. To do this, I'm going to do something like this:

youngnick commented 3 years ago

Okay, I ran the above process with both the "contour" and "envoy" xDS engines, and got the following results:

In both cases, I did not need to restart Envoy to have it pick up new changes.

So, this means that this process is a lot more forgiving than I thought. Which is great! But I think there's still more to investigate here.

youngnick commented 3 years ago

This one's waiting on @wrowe to have some bandwidth. Moving to parking lot 1.

izturn commented 1 year ago

#4276 says that if we want to support it, we must solve this issue first, but after reading all of the issues linked from this one, I have a question:

  1. With https://github.com/envoyproxy/envoy/pull/4787, Envoy added a new config option that slows down the "while true"-like loop, so it no longer leads to OOM.
  2. @youngnick's comments in this issue, such as:

     > so when the config becomes valid again, restarting Contour is no longer needed

After reading all of the above links and going through Contour's source code, I don't understand what the real problem with NACKs is now. It seems that all we need to do is let the user know there is a misconfiguration?

sunjayBhatia commented 1 year ago

We have stalled a bit on working on the ACK/NACK problem because, as you say, we've found that we pretty much validate all user input anyway, so we don't send invalid config to Envoy and it is very difficult to generate a NACK at all at the moment.

ACKs are maybe more interesting to figure out: they could allow us to set some status on HTTPProxy/Route etc. that tells users when their config is actually in effect. However, figuring out when all the nodes have ACKed a configuration, and making sure updates do not cross each other and are processed sequentially, is not necessarily straightforward.
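
As an illustration only, the per-node bookkeeping could look something like the sketch below; the names are hypothetical, and the hard part (sequencing updates so they don't cross) is deliberately left out:

```go
package xds

import "sync"

// ackTracker records the last snapshot version each Envoy node has ACKed,
// so the control plane could report when a change is actually in effect
// across the whole fleet.
type ackTracker struct {
	mu    sync.Mutex
	nodes map[string]string // Envoy node ID -> last ACKed snapshot version
}

func (t *ackTracker) RecordACK(nodeID, version string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.nodes == nil {
		t.nodes = make(map[string]string)
	}
	t.nodes[nodeID] = version
}

// AllACKed reports whether every known node has acknowledged the given
// snapshot version; status on HTTPProxy/Route could be set once this is true.
func (t *ackTracker) AllACKed(version string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if len(t.nodes) == 0 {
		return false
	}
	for _, v := range t.nodes {
		if v != version {
			return false
		}
	}
	return true
}
```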

izturn commented 1 year ago

@sunjayBhatia Sorry, I don't quite understand what you mean. Please explain it in more detail and more plainly, thanks.

sunjayBhatia commented 1 year ago

When we send a configuration update to a fleet of Envoys, they may return an ACK or a NACK to signify whether the update was accepted.

Because we heavily validate user input, the configuration we generate in practice is never NACKed at the moment, so we have held off on making much change here; it would be adding complication for not much benefit. Figuring out whether a change was ACKed would be useful so we can signify to the user exactly when their configuration has taken effect (useful for something like Knative, etc.).

However, in the future, if we support WASM modules, Lua scripting, or something else that we cannot fully validate up-front and that Envoy could return a NACK for, we will need a way to signify to users that their resources are invalid. We've again held off on adding these sorts of features and on "solving the NACK problem" because of the below:

Figuring out whether a specific change was ACKed or NACKed is not actually straightforward because of how we generate snapshots of xDS config to send to Envoy: changes in k8s resources basically get stacked on top of each other and sent asynchronously. For example, if a change to HTTPProxy A generates a new snapshot-1 that is NACKed, and then a change to HTTPProxy B generates snapshot-2 that is also NACKed, we don't currently have a good way to decipher whether the changes to HTTPProxy A and B were each valid or not, since snapshot-2 includes the changes to both resources. Getting this to work is a larger piece of work to modify how we process k8s resource updates and send them along to Envoy, likely introducing some sort of blocking or a complex accounting system, which is a change we have held off on making so far.
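
To make the accounting concrete, here is a sketch of the kind of ledger that would be needed, mapping each snapshot version to the k8s objects whose changes it absorbed; this is an illustration of the bookkeeping, not existing Contour code:

```go
package xds

import (
	"sync"

	"k8s.io/apimachinery/pkg/types"
)

// snapshotLedger records which k8s objects changed between the last ACKed
// snapshot and each newer one. Because changes stack, a NACK for snapshot-2
// in the example above maps back to {HTTPProxy A, HTTPProxy B}, not to a
// single object.
type snapshotLedger struct {
	mu      sync.Mutex
	changes map[string][]types.NamespacedName // snapshot version -> objects changed since last ACK
}

func (l *snapshotLedger) Record(version string, changed []types.NamespacedName) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.changes == nil {
		l.changes = make(map[string][]types.NamespacedName)
	}
	l.changes[version] = changed
}

// Suspects returns every object that might have caused a NACK of the given
// snapshot version; narrowing it further would need blocking or bisection.
func (l *snapshotLedger) Suspects(version string) []types.NamespacedName {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.changes[version]
}
```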

izturn commented 1 year ago

> For example, if a change to HTTPProxy A generates a new snapshot-1 that is NACKed, and then a change to HTTPProxy B generates snapshot-2 that is also NACKed, we don't currently have a good way to decipher whether the changes to HTTPProxy A and B were each valid or not, since snapshot-2 includes the changes to both resources.

I don't quite understand what you mean; according to the ACK and NACK semantics summary:

> The response_nonce field tells the server which of its responses the ACK or NACK is associated with.

The response_nonce can be used to distinguish which snapshot (1 or 2) is wrong.

Maybe we can cache the latest valid snapshot and, when the new one is misconfigured, restore it.
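
Sketched against recent go-control-plane v3 SnapshotCache APIs (the snapshot cache used by Contour's go-control-plane-based xDS server), the restore step might look like the code below, with `lastGood` assumed to be the most recently ACKed snapshot kept by the caller. As the reply below notes, Contour would also need the corresponding DAG state, not just the xDS snapshot:

```go
package xds

import (
	"context"

	cachev3 "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
)

// restoreLastGood re-installs the most recently ACKed snapshot for a node
// after a newer snapshot was NACKed, so Envoy keeps serving the last good
// config instead of being re-sent configuration it has already rejected.
func restoreLastGood(ctx context.Context, c cachev3.SnapshotCache, nodeID string, lastGood *cachev3.Snapshot) error {
	return c.SetSnapshot(ctx, nodeID, lastGood)
}
```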

sunjayBhatia commented 1 year ago

> I don't quite understand what you mean; according to the ACK and NACK semantics summary:

This doesn't have anything in particular to do with the xDS protocol, but rather with how Contour generates the snapshots: we take into account the current state of all the k8s resources in the cluster.

In the example above:

> The response_nonce can be used to distinguish which snapshot (1 or 2) is wrong.

The way we generate snapshots today, the changes from HTTPProxy A and B will both be applied to the DAG to generate snapshot-2, so NACKs will be for the aggregate of the changes, not an individual one.

> Maybe we can cache the latest valid snapshot and, when the new one is misconfigured, restore it.

This would involve caching the last good state of the DAG, so we can apply the k8s resource changes on top of it, rather than caching the xDS config snapshot.

In general, isolating which k8s resource change caused a NACK will require significant additional accounting logic, caching of the DAG, etc., and most likely blocking on rebuilding the DAG and sending resource updates until we get ACKs/NACKs, which we do not do today.

A separate issue is what we should do on startup: if there are invalid resources in the cluster when Contour restarts, we have no prior valid configuration to fall back to, so we may end up incorrectly dropping traffic when we shouldn't.

izturn commented 1 year ago

I/We'd like to design and implement it, but I think we need a community meeting to discuss the whole matter first. @skriss @sunjayBhatia WDYT?

sunjayBhatia commented 1 year ago

We're planning on starting up community meetings again soon, with an ad-hoc meeting coming next week or the week after. We will announce it on Slack and the community mailing list: https://groups.google.com/g/project-contour

izturn commented 1 year ago

Glad to hear that

github-actions[bot] commented 3 weeks ago

The Contour project currently lacks enough contributors to adequately respond to all Issues.

Please send feedback to the #contour channel in the Kubernetes Slack