Closed: @kyessenov closed this issue 6 years ago.
There are larger implications of this than simply eager computation. Long story short, work is happening to improve the caching, and the move to v2 has mitigated several of these cache issues, since we reuse computed config for an Envoy if it reconnects.
@ZackButcher commented on Fri Oct 27 2017
Today Pilot implements Envoy's v1 xDS APIs, which are pull based: Envoy sends a request, we compute the config it needs (and cache it), then reply to Envoy. As a result of the current implementation, we don't keep track of which Istio config was used to construct Envoy's config. This means we have poor caching (we invalidate the entire config cache when any Istio config changes), and that we don't know which Envoy configs need to be updated as Istio configs change.
In the world of v2 xDS APIs, Pilot will need to push config to Envoy, rather than send config as Envoy requests it. This means Pilot needs to understand which specific Istio config documents contribute to the config of each specific Envoy instance, so it can update and push config for specific Envoy instances as the underlying Istio configs change. As a side effect of tracking this usage, we can also improve our caching of config by invalidating only portions of the cache rather than the entire cache any time an Istio config changes.
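A minimal sketch of the kind of dependency tracking described above, using invented names (this is not Pilot's actual code): while generating config for a proxy, record which Istio config documents were consulted, so a later change to one of those documents invalidates, and triggers a push for, only the affected proxies.

```go
package config

// ConfigKey identifies an Istio config document (hypothetical shape).
type ConfigKey struct {
	Type, Name, Namespace string // e.g. "route-rule", "reviews-default", "default"
}

// DependencyIndex maps each proxy to the Istio config documents that were
// used to build its Envoy config.
type DependencyIndex struct {
	deps map[string]map[ConfigKey]bool // proxyID -> set of config documents
}

func NewDependencyIndex() *DependencyIndex {
	return &DependencyIndex{deps: make(map[string]map[ConfigKey]bool)}
}

// Record is called while generating config for a proxy, once per document read.
func (d *DependencyIndex) Record(proxyID string, key ConfigKey) {
	if d.deps[proxyID] == nil {
		d.deps[proxyID] = make(map[ConfigKey]bool)
	}
	d.deps[proxyID][key] = true
}

// AffectedProxies returns the proxies whose cached config must be rebuilt
// (and pushed, in the v2 model) when the given Istio config changes.
func (d *DependencyIndex) AffectedProxies(key ConfigKey) []string {
	var out []string
	for proxy, keys := range d.deps {
		if keys[key] {
			out = append(out, proxy)
		}
	}
	return out
}
```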
@ZackButcher commented on Fri Oct 27 2017
Unfortunately I don't have permissions to add labels or update assignees. cc @rshriram @kyessenov
@kyessenov commented on Fri Oct 27 2017
Something to keep in mind is that the cache has to keep track of old versions to be able to stage updates across xDS. We should remove stale versions only when no proxy is using them.
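A rough sketch of that idea, with invented names (not the real cache): each proxy stays pinned to the version it last applied, and a version is garbage-collected only once no proxy references it.

```go
package cache

import "sync"

// VersionedCache keeps every config version that at least one proxy still
// uses, and drops a version only when its reference count reaches zero.
type VersionedCache struct {
	mu       sync.Mutex
	configs  map[string][]byte // version -> serialized config
	refcount map[string]int    // version -> number of proxies pinned to it
	byProxy  map[string]string // proxyID -> version currently applied
}

func New() *VersionedCache {
	return &VersionedCache{
		configs:  make(map[string][]byte),
		refcount: make(map[string]int),
		byProxy:  make(map[string]string),
	}
}

// Pin records that a proxy now uses the given version, releasing whatever
// version it used before; stale versions are removed once unreferenced.
func (c *VersionedCache) Pin(proxyID, version string, config []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if old, ok := c.byProxy[proxyID]; ok && old != version {
		c.refcount[old]--
		if c.refcount[old] <= 0 {
			delete(c.refcount, old)
			delete(c.configs, old)
		}
	}
	c.byProxy[proxyID] = version
	c.configs[version] = config
	c.refcount[version]++
}
```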
@andraxylia commented on Fri Oct 27 2017
Before we make this optimization, we have to check whether it is needed and determine the breaking point: the number of pods or the conditions justifying it.
@ZackButcher commented on Fri Oct 27 2017
To be clear, the optimization is in the caching. Regardless of caching/performance, we need to invert config generation so it's eager, so that we're ready for the push-based world of xDS v2.
@kyessenov commented on Fri Oct 27 2017
To @andraxylia's point, we're stuck with v1 until we can prove v2 performs better. The work will have to happen in a separate Envoy adapter module (proxy/envoy2) since this is a large effort.
@ZackButcher commented on Fri Oct 27 2017
How are we sequencing things? I thought we had to move to v2 APIs to address the 404 issues, which means we need to land updated config handling in 0.3 as well.
@andraxylia commented on Fri Oct 27 2017
My point was about avoiding premature optimization, and relative to the caching/invalidation of the config. I would defer this to when/if needed. We still need to prepare the config for each Envoy instance.
We are not stuck with v1, we want to move to v2, and we do not need to prove it performs better than v1, just that the Pilot implementation of v2 is production ready. We should also have a way to switch between v1 and v2 and revert if needed, so @kyessenov's suggestion is relevant.
Though ADS will solve the 404 issues, they can also be handled separately, in case v2 is not ready with production grade quality for 0.3.
Sequencing-wise, I thought #1257 would cover everything, but if we need multiple issues and multiple owners, let's use an epic.
@kyessenov commented on Fri Oct 27 2017
The work has moved to a new envoy org project as there is a lot of complexity in the push/streaming interface. It's definitely far from a single issue/epic. I can't vouch that we can reach parity with v1, since there are some serious semantic differences (such as the deprecation of bind_to_port) that need to be reconciled.
@rshriram commented on Fri Oct 27 2017
@andraxylia how do you envision the 404s can be solved with v1? I would be interested in exploring that solution.
@costinm commented on Sat Oct 28 2017
There are several ways to solve the 404. The most direct is to fix Envoy, so it returns 503 as expected. AFAIK there is no technical reason to return a 404 if a route exists but the cluster is not yet available.
The solution that I am proposing is to change the proxy agent, and have it download the full config before starting envoy, and act as a local caching proxy. The primary goal of this will be to handle cases where the pilot becomes unavailable, in particular for hybrid/onprem cases.
Obviously, such a cache could directly expose a v2 API - converting from v1 into v2 is apparently what Envoy is doing internally as well if I understood it correctly. And the code can be extremely simple - only depends on grpc, http/json and the interfaces defined by envoy.
It is also OK to start with a v1 cache in the agent - the only special behavior is to not return routes or listeners until all dependent clusters/routes have been downloaded. On top of this it may allow some local overrides and/or templates.
Even if Envoy is fixed to not return 404 on misconfiguration, it is likely we'll still need to properly deal with cases where Pilot becomes unavailable; cascading failures can be pretty bad.
As a side note - the local cache + proxy can also greatly simplify push and reduce the complexity of pilot.
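A hypothetical sketch of the agent-side gating described above (the rule of not serving routes or listeners until the clusters they depend on are cached); all names are invented for illustration.

```go
package agent

// Snapshot holds the config the agent has polled from Pilot so far.
type Snapshot struct {
	Clusters  map[string][]byte // cluster name -> cluster config
	Routes    map[string][]byte // route config name -> route config
	Listeners [][]byte          // listener configs
	// RouteClusterRefs lists, for each route config, the clusters it references.
	RouteClusterRefs map[string][]string
}

// ReadyRoutes returns only the route configs whose referenced clusters are
// all present in the cache; incomplete ones are withheld until the next poll,
// so Envoy never sees a route pointing at a cluster it does not have.
func (s *Snapshot) ReadyRoutes() map[string][]byte {
	ready := make(map[string][]byte)
	for name, cfg := range s.Routes {
		complete := true
		for _, cluster := range s.RouteClusterRefs[name] {
			if _, exists := s.Clusters[cluster]; !exists {
				complete = false
				break
			}
		}
		if complete {
			ready[name] = cfg
		}
	}
	return ready
}
```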
@kyessenov commented on Sat Oct 28 2017
I think using the agent to serve xDS is an interesting solution for stickiness of the updates. The agent can sequence the polled updates so that Envoy always has an internally consistent view of the config. The logic to sequence updates is required in the v2 implementation anyway, so this just moves it closer to the proxy.
The downside is the increase in the complexity in the agent. We've been fighting for simplification of the agent since it's attached to every single pod.
Another tricky bit is the lack of ACKs/NACKs in v1: the server does not receive confirmation that an update has been applied.
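For context, this is what the v2 protocol adds. Roughly (a simplified shape, not the full generated API), each DiscoveryRequest echoes the nonce of the response it is acknowledging and the version the proxy actually applied, so the server can tell an ACK from a NACK.

```go
package xds

// DiscoveryRequest is a simplified, approximate view of the v2 request fields
// relevant to ACK/NACK; ErrorDetail is flattened to a string here.
type DiscoveryRequest struct {
	VersionInfo   string // last version successfully applied by the proxy
	ResponseNonce string // nonce of the response being acknowledged
	TypeURL       string // resource type, e.g. clusters or routes
	ErrorDetail   string // non-empty when the proxy rejected the update (NACK)
}

// IsAck reports whether the request acknowledges the given push: the nonce
// matches, the version advanced to the pushed one, and no error was reported.
func IsAck(req DiscoveryRequest, pushedVersion, nonce string) bool {
	return req.ResponseNonce == nonce &&
		req.VersionInfo == pushedVersion &&
		req.ErrorDetail == ""
}
```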
@rshriram commented on Sun Oct 29 2017
It is also taking us back to the state 8 months ago when we did this in the agent. I don't see any advantage of double caching in the proxy agent and Envoy. Envoy automatically falls back to the cached version if Pilot is unavailable; that has been our design since day 1, so the argument about helping in hybrid/on-prem doesn't seem convincing. I also don't see how it will reduce complexity (if any) in Pilot, because we don't do any sequencing in Pilot. Implementing the v2 API in the agent seems suboptimal - it is more like saying I see no use for the v2 APIs in Envoy for large-scale installations.
Finally, not everyone will use the agent, because they could be running Envoy in different environments (Cloud Foundry, for example) where there is no accompanying agent. This could be a temporary solution but not a permanent one.
@mattklein123 / @htuch do you think @costinm's concerns are real, about Envoys failing upon failure of management planes (like Pilot, or the one at Lyft, or what you are building at Google)? Do you have double caching, with a local agent and a central agent? I don't find it necessary.
@htuch commented on Sun Oct 29 2017
@rshriram no need for this local cache. As you correctly state, if a management update fails, the last valid config remains latched and in effect.
@mattklein123 commented on Sun Oct 29 2017
Agreed, there is no reason to cache anything on the node.
re: 404 vs. 503, there are legitimate reasons why someone may want to return 404 in this scenario. Having an option to return 503 instead in the case the cluster is not found is trivial. Someone just needs to do the work (https://github.com/envoyproxy/envoy/issues/1820).
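Purely for illustration (this is not Envoy code), the decision being discussed is roughly the following, with the status for a matched route whose cluster is missing made configurable instead of being hard-coded to 404.

```go
package router

import (
	"net/http"
	"strings"
)

// Route pairs a path prefix with the cluster it forwards to (illustrative only).
type Route struct {
	Prefix  string
	Cluster string
}

// Router models the relevant routing state: known routes, known clusters, and
// the configurable status code for "route matched but cluster not found".
type Router struct {
	Routes                []Route
	KnownClusters         map[string]bool
	ClusterNotFoundStatus int // e.g. http.StatusServiceUnavailable instead of 404
}

// Decide returns the status the proxy would emit for a request path.
func (r *Router) Decide(path string) int {
	for _, rt := range r.Routes {
		if strings.HasPrefix(path, rt.Prefix) {
			if !r.KnownClusters[rt.Cluster] {
				// The route exists but its cluster has not arrived yet
				// (e.g. config still propagating): use the configured code.
				return r.ClusterNotFoundStatus
			}
			return http.StatusOK // in the real proxy, forward to the cluster
		}
	}
	return http.StatusNotFound // no route matched at all
}
```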
@costinm commented on Mon Oct 30 2017
Envoy returning 503 plus the 'liveness' check at startup will take care of the 404 bug, and I think it is the right solution.
Regarding 'cache on the node': there are reasons beyond the 404, and it is not the same as what we had 8 months ago. Envoy does indeed gracefully deal with Pilot unavailability, but in case of a DDoS it is likely that both Pilot and some of the ingress and backends will flip and restart. I've seen this happen quite a few times, and avoiding cascading failures is not easy.
Envoy could save snapshots on disk, or maybe it already saves snapshots of the config in shared memory, and that would also be a good solution, but I don't think we should dismiss this problem; DDoS attacks are common, and there are other scenarios in hybrid/on-prem where similar behavior may happen.
In addition, having an on-prem caching proxy for Pilot will help in other use cases too, and likely we'll need to implement it anyway.
Either way - it's worth trying.
@costinm commented on Mon Oct 30 2017
One extra note on the subject: in a hosted environment with a very large number of domains (like 'custom domains' in the case of a hosted app), keeping all the routing/listener config in Pilot's memory becomes impossible, and it's not really needed since Envoy already caches the stable config (and the agent combined with some memcache could also do it at a local level).
So long term I don't think the current design of keeping the entire mesh config cached in Pilot's memory is viable, it will need to be sharded and locally cached/distributed, and Envoy will probably need to support more 'on demand' loading of config and certs.
The use case is something like an app that allows users to bring their own domains (something like hosted Gmail, for example): a single backend, but with 1M+ routes and certs.
@rshriram commented on Mon Oct 30 2017
I don't think

> Envoy returning 503 plus the 'liveness' check at startup will take care of the 404 bug, and I think it is the right solution.

takes care of 503s or 404s fully. Every time you change a route rule, you will hit this, and the chances of hitting it depend on your polling interval. You can solve this by having a temporary local cache, but that's a short-term solution until the v2 API lands, at which point the agent becomes useless. The Envoy crash happens if we restart it too frequently, which is caused by auth certificates changing rapidly. I think that's a question for the auth folks, and it too will become moot with the v2 API. My two cents. We could also totally rearchitect Pilot (in my mind this solution is basically what we had in Amalgam8 a year ago), finish this up, and call it production ready by the 0.3 time frame. Whether this is more stable than the v2 implementation is up for debate.
@andraxylia commented on Tue Oct 31 2017
@mattklein123 would you agree that adding support in Envoy for snapshotting/checkpointing the current running config is the right thing to do, to allow the data plane to function correctly if the control plane is partitioned and Envoy needs to restart or hot-restart on the same host? There will be no control plane to provide clusters and listeners, so traffic will be lost.
@mattklein123 commented on Tue Oct 31 2017
No. I think this is complexity that is not actually needed and is more likely to lead to an outage than the situation it is trying to cover for.
@costinm commented on Tue Oct 31 2017
Shriram: the 503 on route rule change happens only when the cluster is missing, i.e. for a brand new service. AFAIK it doesn't happen if you change rules for an existing service/cluster.
Having a local cache is not useless - the main use cases are hybrid and HA. In particular, since Matt indicated he doesn't believe a checkpoint in Envoy is the right solution, a cache is even more important to have, and will be needed with v2 as well. My proposal for the cache was to have it expose v2 from the start anyway (transcoding Pilot's v1).
I'm not aware of any crash caused by cert change - sorry if I missed that bug.
But I really don't see how we can call anything brand new 'production ready' - the whole point is to have extensive and broad testing and proven stability; we can't 'rearchitect' Pilot and immediately call it 'production ready'. And there is no need to do this in 0.3 - a liveness check at startup is still needed even with the v2 API (we don't want requests coming in before the v2 API has gotten the config), and even with the v2 API we want a 503 if a route is defined but not the cluster.
@htuch commented on Tue Oct 31 2017
Also, Envoy may be deployed in a diskless scenario (or read only storage), so there inherently needs to exist a story for configuration recovery that doesn't rely on this.
I think a useful first step in this discussion is to get a clear picture (and consensus) on what failure scenarios Istio considers important. Various ones have been suggested; can we get a doc written?
@costinm commented on Tue Oct 31 2017
I agree snapshot/cache is complexity that may not be needed in envoy, and is better handled by the DS infrastructure.
I've been involved in a few outages caused by this kind of cascading failure (and maybe caused a couple as well :-). The goal is to gracefully deal with cases where a component is down. Mixer already handles this (it fails open); Envoy partially handles it by using the last known config, but only until restart.
The cache doesn't have to be implemented in the agent, and it doesn't need to store the cached config on local disk, but
We can start a doc. The caching proxy can be an optional component as well; it will be more valuable to users who run hybrid clusters and are concerned about HA, so there is no need to add it for all users, and at this point liveness + 503 should be sufficient for 0.3 (IMHO). We'll document that after adding a service it may take up to refresh_interval until it becomes visible.
@rshriram commented on Tue Oct 31 2017
Yes, and that's why all the restarts are being eliminated. With v2, there is no restart.
Not in the agent or on disk - it means the DS server, and the server has the cache.
Let's use the word carefully. Hybrid means on-prem plus cloud; I don't know what you mean here. And what outage scenario, as Harvey indicates? Having a crystal clear fault model helps: what failures can be handled and what cannot.
> and are concerned about HA, there is no need to add it
You get 503 (or 404) anytime you create a new weighted routing rule and the receiving Envoy doesn’t have the clusters associated with the routes. It’s not for just brand new services.
It also makes no sense to change the Pilot architecture completely, do local fetch and the v2 API locally, and call this production ready. I would say that we live with high refresh rates in v1 and make it configurable in Envoy to return 404/503.
> We'll document that after adding a service it may take up to refresh_interval until it becomes visible.
Once again, this is not the only scenario. Set up a 60s refresh interval, create a weighted routing rule, and you will see the issue.
@costinm commented on Wed Nov 01 2017
Will move the discussion on how to 'fail open', HA, and avoiding cascading failures to a separate document; it's a different topic.
On the 404: I think it is clear that 503-if-cluster missing and liveness checks will mitigate some of the problems and will both be needed even with v2. While v2 may return a full and consistent config - it doesn't avoid the interval between startup and getting the initial config, nor the case where a cluster is simply missing due to replication delays.
The main remaining issue is the case Shriram describes, which I don't understand and was not able to reproduce if the service is created in advance. Can you provide more details or an example? And why would the 404 happen on 'weighted routing' if all the services and clusters are created and propagated before the routing rule is applied? The bookinfo samples are targeted at demos and typically don't wait for service/cluster propagation before applying route rules, but in a production environment it is quite reasonable to not route traffic to a new service immediately.
@rshriram commented on Wed Nov 01 2017
> While v2 may return a full and consistent config - it doesn't avoid the interval between startup and getting the initial config,

That needs to be solved with health checks.

> nor the case where a cluster is simply missing due to replication delays.

That's the whole point of ADS, where we get to sequence updates to Envoy over a single connection. Getting a 404 for a newly created service is acceptable. After all, the configuration has to propagate to all callers, and if external clients start accessing it before that is done, a 404 is what a server sends. In other words, end users are actually used to this scenario. What they are not used to is getting a 404 when they split an existing service into multiple clusters for the purpose of version routing. And this is the bigger issue that we need to solve. We would not be undertaking such a big effort just for the health check or initial startup thing :).
https://github.com/istio/pilot/issues/926 is the first occurrence of this issue (this user is trying to deploy Istio to production :) ). The problem only compounds when we increase the refresh interval from 1s to 60s. While a partial solution is to have high refresh rate, it burns a lot of CPU and tanks performance (evidence: https://github.com/istio/pilot/issues/1237 where we found "First tested with Istio 0.1.6. Big performance jump when refresh interval was increased from 10ms to 60s. " ).
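A simplified sketch (not the actual envoy control-plane library) of the sequencing point above: because ADS multiplexes all resource types over one stream, the server controls delivery order, so clusters and endpoints can be sent before the routes that reference them.

```go
package ads

// orderOnAdd is a "make before break" ordering when resources are added:
// clusters and endpoints first, routes last, so a route never points at a
// cluster the proxy has not yet received.
var orderOnAdd = []string{"clusters", "endpoints", "listeners", "routes"}

// SequenceAdds returns the pending per-kind updates in a safe delivery order
// over the single ADS stream.
func SequenceAdds(pending map[string][]byte) [][]byte {
	var out [][]byte
	for _, kind := range orderOnAdd {
		if payload, ok := pending[kind]; ok {
			out = append(out, payload)
		}
	}
	return out
}
```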
@costinm commented on Wed Nov 01 2017
I see the comment that "Yes, a route rule can introduce a new cluster (for a special version, for example)". I'm not familiar enough with the weighted routing rules; do you have any pointers on where the new cluster gets introduced? If we can decouple the creation of clusters/services from route changes, we would have a pretty good fix for this.
It seems we agree that healthchecks / liveness checks are needed for both v1 and v2 - and having some 503 when a new service is just added is ok.
A short refresh interval is not a solution; everything should work with 60s or more, with the understanding that all route changes must wait for refresh_interval to avoid 503s.
@rshriram commented on Wed Nov 01 2017
New clusters are created because of route changes (version split). An existing service's cluster is split into multiple newer clusters, and traffic is split by weights. This involves changing the route config to use weighted clusters and pushing those new clusters to Envoy. This is where the 404s/503s are occurring, and that's the whole reason for the v2 push API.
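To make the failure mode concrete, here is a hedged sketch with a made-up cluster naming scheme (not Pilot's real one): the weighted rule derives brand-new per-version clusters from the existing service cluster, so if the route update reaches Envoy before those clusters do, the route references unknown clusters and requests fail.

```go
package model

import "fmt"

// WeightedVersion is one arm of a hypothetical weighted route rule.
type WeightedVersion struct {
	Version string
	Weight  int
}

// WeightedCluster is a per-version cluster the generated route would reference.
type WeightedCluster struct {
	Name   string
	Weight int
}

// SplitClusters derives the per-version cluster names for a weighted route,
// e.g. "out.reviews.default|v1" and "out.reviews.default|v2" instead of the
// single pre-existing "out.reviews.default" cluster (naming is illustrative).
func SplitClusters(service string, versions []WeightedVersion) []WeightedCluster {
	var out []WeightedCluster
	for _, v := range versions {
		out = append(out, WeightedCluster{
			Name:   fmt.Sprintf("%s|%s", service, v.Version),
			Weight: v.Weight,
		})
	}
	return out
}
```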
@ldemailly commented on Thu Nov 02 2017
we should probably consolidate the various 404/503 open issues and move this to istio/istio
https://github.com/istio/istio/issues/1038 https://github.com/istio/istio/pull/1041 https://github.com/istio/old_pilot_repo/issues/1422
@costinm commented on Thu Nov 02 2017
Thanks Shriram, I see the problem.
I'll start a doc - there are a few relatively simple options for how to push clusters before routes (or ensure cluster stability when route rules are changed), but they may require small extensions/adjustments to the API. The current weight rules don't work on 'mesh expansion' (no labels) - I was not familiar with how it is implemented, but at least one of the options we were exploring for adding weight rules for 'raw VMs' will also solve this problem.
@kyessenov commented on Thu Nov 02 2017
Why would you not have labels on VMs? k8s, Consul, Eureka, and whatnot all have labels on workloads. An API change is a drastic measure to fix this problem. The actual solution will happen in the v2 library under the envoy org, once we work out the ADS protocol. This is the solution that will scale, be backwards/future compatible, and fix all the other "inconsistency" issues that are possible with v1. Once it's ready we'll replace the Pilot core with it, and we'll only need to translate Istio config into data-plane config to complete the picture.
@costinm commented on Thu Nov 02 2017
Well, it's hard to answer this question; non-orchestrated workloads don't have a lot of other things either, they're simply processes running on some machines. That's how things work.
Even if they did have labels - it's not clear this model can work with non-flat networks or if it can scale in the hybrid model, or if all orchestration systems use labels in the same way, or if keeping track of all global labels and pods in pilot's memory is a scalable architecture.
The important feature is to allow users to control weighted routing.
In general, there are multiple solutions, each with its own benefits. ADS can fix the current 404 bug, and there are other ways as well, but we also need to consider whether the current design of weighted routing is the right one (regardless of the bug). Getting fixated on one solution or model and finding reasons to dismiss any other option usually does not work very well :-)
@kyessenov commented on Thu Nov 02 2017
I'll be happy to hear concrete proposals :) in the meantime I'm investing in building v2 implementation. It will be a separate decision how to reconcile the two approaches.