istio / old_pilot_repo

Deprecated home of Istio's Pilot, now in istio/istio's pilot dir

Scale issues in istio-pilot discovery service, affecting alpha perf targets and installation #292

Closed kyessenov closed 6 years ago

kyessenov commented 7 years ago

With about 10 rules, the discovery service is at ~50% CPU; at 30 rules, proxies start dropping requests.

GregHanson commented 7 years ago

What CPU and memory requests/limits did you use in testing? I used

    spec:
      containers:
      - name: manager
        image: {{.hub}}/runtime:{{.tag}}
        imagePullPolicy: Always
        resources:
          requests:
            memory: "64Mi"
            cpu: "250m"
          limits:
            memory: "128Mi"
            cpu: "500m"

With 50 rules: Exec'ing into the discovery container, from top -c I saw about 22-25% CPU from the envoy process. Using the minikube dashboard with heapster, discovery was using 0.24 cores.

With 100 rules: top -c: 35-40%; dashboard: 0.47 cores.

Not sure if there have been some optimizations since the issue was originally created, but I figure we should document the scale/performance setup and test cases so we can keep track of these figures periodically going forward.

kyessenov commented 7 years ago

Can you confirm that the rules are used by checking the envoy configs (or RDS output)? Just want to make sure that rules are not filtered. My setup may have been bad since I realized that k8s clusters can go into a baseline state with high CPU usage. For the perf testing we should try to create fresh clusters to deal with this problem.

GregHanson commented 7 years ago

Hmm, not sure. The rules I created with istioctl are along the lines of:

type: route-rule
name: productpage-default1
spec:
  destination: productpage.default.svc.cluster.local
  precedence: 1
  route:
  - tags:
      version: v1
---
type: route-rule
name: reviews-default2
spec:
  destination: reviews.default.svc.cluster.local
  precedence: 1
  route:
  - tags:
      version: v1

repeated over and over with different names. The output doesn't change when I call the RDS API, but I can watch the CPU usage drop or spike as I create or delete rules. So the majority of the rules are just repeats, but the RDS API still needs to process them all.
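
For reference, a small Go sketch of how such a batch of near-identical rules could be generated; the rule template is copied from above, while the count, the naming scheme, and feeding the output to istioctl create are illustrative assumptions rather than the exact procedure used here:

    package main

    import (
    	"fmt"
    	"os"
    )

    // Emits numRules copies of the route rule shown above, varying only the
    // name, so the output can be saved and applied (e.g. with istioctl create).
    const ruleTemplate = "type: route-rule\n" +
    	"name: productpage-default%d\n" +
    	"spec:\n" +
    	"  destination: productpage.default.svc.cluster.local\n" +
    	"  precedence: 1\n" +
    	"  route:\n" +
    	"  - tags:\n" +
    	"      version: v1\n" +
    	"---\n"

    func main() {
    	const numRules = 100 // how many near-identical rules to generate
    	for i := 1; i <= numRules; i++ {
    		fmt.Fprintf(os.Stdout, ruleTemplate, i)
    	}
    }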

GregHanson commented 7 years ago

Regarding your "create fresh clusters" comment: since I am using minikube, are you suggesting doing minikube delete && minikube start prior to each test? Or just to ensure there are no items in other namespaces in use during performance testing?

ayj commented 7 years ago

Unless I'm missing something, (very) roughly speaking the number of RDS/CDS/SDS queries from envoy grows multiplicatively with the number of services and pods, e.g. M pods serving N services (where M > N) roughly translates to M envoy instances querying for N services via RDS/CDS/SDS, i.e. O(MN) queries. Adding more rules further increases the cost.
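
To make the rough O(MN) growth concrete, here is a back-of-the-envelope sketch that just multiplies out the model above (M sidecars, each polling SDS/CDS/RDS for N services once per refresh interval); the three-API-per-service assumption and the constants are illustrative, not measured:

    package main

    import "fmt"

    // discoveryQPS estimates steady-state discovery load under the rough
    // O(M*N) model: M Envoy sidecars, each polling SDS/CDS/RDS for N
    // services once per refresh interval.
    func discoveryQPS(pods, services int, refreshSec float64) float64 {
    	const discoveryAPIs = 3 // SDS + CDS + RDS
    	return float64(pods*services*discoveryAPIs) / refreshSec
    }

    func main() {
    	for _, pods := range []int{50, 200, 500} {
    		// Assumes 25 services and a ~1s effective refresh delay.
    		fmt.Printf("pods=%4d services=25 -> ~%6.0f discovery queries/sec\n",
    			pods, discoveryQPS(pods, 25, 1.0))
    	}
    }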

Having specific perf targets here would be useful (see https://github.com/istio/manager/issues/309), but we probably need some shorter term solution to carry us over since https://github.com/istio/manager/issues/309 is marked BETA and we've already observed failed RDS/CDS/SDS queries in the current integration test. Possibly some combination of (1), (2), and (4) below would work.

General solutions to the problem:

kyessenov commented 7 years ago

There is also work in Envoy to use gRPC for the discovery services. Then we can do something similar to (6).

ayj commented 7 years ago

Thanks for the reminder - ref gRPC streaming APIs for SDS/CDS/RDS/LDS/load reporting

Also, (7) move discovery service to per-pod (or per-node) to distribute SD load

kyessenov commented 7 years ago

@GregHanson I was using different "header match" prefix conditions to make sure that all rules end up in the final config. But @ayj is right about the real issue here, which is the total computation cost O(#services × #pods × #route rules), and that the load is not spread between RDS pods.

GregHanson commented 7 years ago

@kyessenov what do you imagine for the test environment, Google Cloud? Is the plan to use Heapster for metric recording during tests?

kyessenov commented 7 years ago

If we want to measure perf of the components, a single-node minikube should be OK. If we want to see how real-world apps behave, then a multi-node cluster is preferable. If we can do OK with a single-node cluster for big apps, then that is sufficient and it's unnecessary to test on a multi-node setup. We'll need to measure several dimensions, not all of them provided by heapster:

We can also stress test an app by increasing the network load, to find the limits and see how things start to fall apart.

GregHanson commented 7 years ago

In Amalgam8, we used a setup similar to this one: https://www.simonholywell.com/post/2015/06/parallel-benchmark-many-urls-with-apachebench/ in our request-handling benchmarks - perhaps a similar setup could be used here?
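
In case we want something self-contained rather than apachebench, a minimal Go sketch of the same idea (many URLs hit concurrently, failures counted) might look like the following; the URLs, concurrency, and request counts are placeholders:

    package main

    import (
    	"fmt"
    	"net/http"
    	"sync"
    	"sync/atomic"
    )

    // Fires requests at each URL from several concurrent workers and reports
    // the failure count, roughly mirroring the "parallel ab against many URLs"
    // setup linked above.
    func main() {
    	urls := []string{ // placeholder targets
    		"http://productpage.default.svc.cluster.local:9080/productpage",
    		"http://reviews.default.svc.cluster.local:9080/reviews",
    	}
    	const workers, requestsPerURL = 10, 1000

    	var failures int64
    	var wg sync.WaitGroup
    	for _, u := range urls {
    		for w := 0; w < workers; w++ {
    			wg.Add(1)
    			go func(url string) {
    				defer wg.Done()
    				for i := 0; i < requestsPerURL/workers; i++ {
    					resp, err := http.Get(url)
    					if err != nil || resp.StatusCode >= 500 {
    						atomic.AddInt64(&failures, 1)
    					}
    					if err == nil {
    						resp.Body.Close()
    					}
    				}
    			}(u)
    		}
    	}
    	wg.Wait()
    	fmt.Printf("failures: %d\n", failures)
    }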

ayj commented 7 years ago

Current plan for alpha is some combination of (1), (2), and (3). (1) and (2) should be configurable via manager flags (and eventually a ConfigMap) and k8s replicas. I'm working on a simple version of (3) which performs in-process caching of the most recent discovery responses. An out-of-process cache ((4) above) could also be explored if basic alpha targets can't be met, but it might require plumbing through certs for mTLS between proxy and manager (edit: or adding a sidecar proxy in front of the discovery service).
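
For context, the in-process cache in (3) can be as simple as memoizing the most recent response per discovery path with a short TTL. A minimal sketch of that idea (not the actual implementation; the type names and TTL handling are illustrative):

    package discovery

    import (
    	"sync"
    	"time"
    )

    // responseCache memoizes the most recently computed discovery response
    // per request path (e.g. the SDS/CDS/RDS URL) for a short TTL, so a burst
    // of identical polls doesn't recompute the same config repeatedly.
    // A real version would also need to invalidate entries when the service
    // registry or rule config changes.
    type responseCache struct {
    	mu  sync.Mutex
    	ttl time.Duration
    	out map[string]cachedResponse
    }

    type cachedResponse struct {
    	body    []byte
    	expires time.Time
    }

    func newResponseCache(ttl time.Duration) *responseCache {
    	return &responseCache{ttl: ttl, out: make(map[string]cachedResponse)}
    }

    // get returns the cached body for path, recomputing it via build when the
    // entry is missing or expired.
    func (c *responseCache) get(path string, build func() []byte) []byte {
    	c.mu.Lock()
    	defer c.mu.Unlock()
    	if e, ok := c.out[path]; ok && time.Now().Before(e.expires) {
    		return e.body
    	}
    	body := build()
    	c.out[path] = cachedResponse{body: body, expires: time.Now().Add(c.ttl)}
    	return body
    }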

andraxylia commented 7 years ago

(7) Or is it the same as (6)? The root cause of this issue seems to be that envoy polls istio-manager every 1s. Increasing that interval will offer only temporary relief, same as caching. Can this be fixed by re-architecting, such as having istio-manager push config changes to istio-proxy, with envoy polling istio-proxy (if polling is the only way)?

(2) Horizontal scaling: this is something we definitely need to do sooner or later, as long as a single instance can support a decent number of apps (waiting for the numbers). But it is not correct to add more replicas, or to force an Istio installation per namespace, just to hide scaling issues.

ayj commented 7 years ago

Yes, root cause is pull vs. push. The gRPC streaming API addresses that fundamental problem but isn't ready yet. The current assumption is that some combination of (1), (2), (3) will buy us sufficient time to avoid having to implement a more complex workaround and then reimplement it yet again against gRPC. Hopefully the proposed benchmarks for alpha will validate whether this assumption is true or not. And if it isn't, we have better justification for re-architecting configuration distribution as you suggested.
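
To illustrate the pull-vs-push distinction, here is a minimal conceptual sketch of a push-shaped flow in Go (a broker notifying subscribers when config changes, instead of proxies polling on a timer); this is not the gRPC streaming API, just the shape of the idea:

    package main

    import "fmt"

    // configBroker fans out new config versions to subscribers as soon as
    // they are produced, instead of each proxy polling on a fixed interval.
    type configBroker struct {
    	subscribers []chan string // one channel per connected proxy
    }

    func (b *configBroker) subscribe() <-chan string {
    	ch := make(chan string, 1)
    	b.subscribers = append(b.subscribers, ch)
    	return ch
    }

    func (b *configBroker) publish(version string) {
    	for _, ch := range b.subscribers {
    		select {
    		case ch <- version: // push immediately
    		default: // subscriber is behind; it will pick up the next version
    		}
    	}
    }

    func main() {
    	b := &configBroker{}
    	proxy := b.subscribe()
    	b.publish("route-config v2")
    	fmt.Println("proxy received:", <-proxy) // no polling interval involved
    }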

I'm curious why you say it is not correct to add more replicas? The DNS addon with kube-dns-autoscaler seems to work that way.

andraxylia commented 7 years ago

I was saying it is correct to add replicas to scale; this is a fundamental way of scaling. But it is not correct to add replicas to hide bugs or a flawed design. For instance, we would not want to add an istio-manager instance for each deployed app, or for each namespace, as has been suggested.

ayj commented 7 years ago

I don't believe anyone suggested having istio-manager per-app or per-namespace in this thread, but maybe I'm missing something? The current plan is specifically to keep discovery centralized with some stop-gap optimizations until push-based gRPC API is available (i.e. a better architecture option).

ayj commented 7 years ago

istio-manager discovery performance is sufficient for e2e integration and smoke tests. Moving to backlog for now until we complete benchmarking for alpha.

ayj commented 7 years ago

I spent a bit of time load testing SDS. CDS/RDS peak CPU% will likely be higher due to more complex request processing, but steady-state CPU% should be similar due to caching.

Setup

  • GKE 50-node cluster with machine type n1-standard-2
  • 25 services/deployments with a single port (HTTP)
  • Increased discovery refresh delay to 5 seconds (1-to-10s effective polling frequency)
  • Increased manager CPU request to 500m

Benchmark

  • Scale deployment replicas from 0 to 20 (0-5 pods per core, 0-500 pods total)
  • Observe CPU usage both while the deployments are scaling up/down and in steady state
  • Observe SDS failures via envoy stats (see the sketch after this list)

Observations

  • Kubernetes would not schedule 500 pods with the default 100m container resource limit. Reducing the default limit to >50m helped. This should be okay for this particular benchmark since individual proxy container load is minimal (<20m) and we aren't benchmarking QPS/latency.
  • Steady-state CPU% was <500m with 500 pods.
  • CPU% temporarily spiked to >1500m when all deployments were concurrently scaled from 0 to 20 replicas each. This might not be an issue in practice unless operators frequently deploy 500+ pods all at once and expect immediate service availability.
  • No SDS failures were observed when scaling from 0 to 500 pods.
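
For the "observe SDS failures via envoy stats" step, a small sketch of the kind of check that could be run against each proxy; both the admin port (15000) and the "update_failure" stat-name filter are assumptions here, not confirmed from this setup:

    package main

    import (
    	"bufio"
    	"fmt"
    	"net/http"
    	"strings"
    )

    // Scrapes the Envoy admin /stats endpoint and prints any discovery
    // update-failure counters, as a quick pass/fail signal while scaling
    // the deployments up and down.
    func main() {
    	resp, err := http.Get("http://localhost:15000/stats") // assumed admin port
    	if err != nil {
    		fmt.Println("could not reach envoy admin endpoint:", err)
    		return
    	}
    	defer resp.Body.Close()

    	scanner := bufio.NewScanner(resp.Body)
    	for scanner.Scan() {
    		line := scanner.Text()
    		if strings.Contains(line, "update_failure") { // assumed stat name pattern
    			fmt.Println(line)
    		}
    	}
    }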

rshriram commented 7 years ago

What is 500m? Btw, thanks for benchmarking this!! Would you be able to try the following setup: 10s for SDS, 60s for CDS and RDS?


ayj commented 7 years ago

What is 500m?

50% of a CPU core using Kubernetes resource terminology (see Meaning of CPU).

Would you be able to try the following setup: 10s for SDS, 60s for CDS and RDS?

Sure. 60s would be up to 120s of propagation time. We'd need to test 30s if we want max 60s.

rshriram commented 7 years ago

Related to #1237

andraxylia commented 7 years ago

We will have to tackle this seriously in 0.3, so I've changed it to a P0.

sdake commented 6 years ago

subscribe

kyessenov commented 6 years ago

Issue moved to istio/istio #1485 via ZenHub