istio / istio.io

Source for the istio.io site
https://istio.io/

ambient resource saving blog on istio.io #13250

Closed: linsun closed this issue 1 month ago

linsun commented 1 year ago

Needs to be updated to use more realistic sample apps.

https://github.com/istio/istio.io/pull/13179

kdorosh commented 1 year ago

@GregHanson can you add a standup-style update here async?

GregHanson commented 1 year ago

Still in the process of updating the perf scripts for the online-boutique example; focusing on the xDS evolution work.

linsun commented 1 year ago

Started to look into this; hoping to have the app running today!

linsun commented 1 year ago

Updating the scripts for this; got the app running with a test run completing. @GregHanson can send the data he has when the test finishes.

linsun commented 1 year ago

Got some results that appear to be consistent. May need help validating whether traffic is intercepted by the waypoints.

For L4: CPU shows about an 80% saving and memory about a 99.5% saving.

Setup: 4 namespaces, each with its own boutique app; each workload has 3 replicas.
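
For context on how numbers like these could be derived, below is a rough sketch of the Prometheus queries such a comparison might use. It assumes the standard cadvisor metric names exposed by kube-prometheus-stack, that the sidecar containers are named istio-proxy, that the four app namespaces match a hypothetical boutique-.* pattern, and that the ambient L4 data plane is the ztunnel DaemonSet in istio-system; the actual label values in our environment may differ.

# Sidecar mode: total CPU and working-set memory of the istio-proxy sidecars
# (namespace pattern is an assumption, adjust to the real namespace names)
sum(rate(container_cpu_usage_seconds_total{container="istio-proxy", namespace=~"boutique-.*"}[5m]))
sum(container_memory_working_set_bytes{container="istio-proxy", namespace=~"boutique-.*"})

# Ambient mode (L4 only): total CPU and working-set memory of the ztunnel pods
sum(rate(container_cpu_usage_seconds_total{namespace="istio-system", pod=~"ztunnel-.*"}[5m]))
sum(container_memory_working_set_bytes{namespace="istio-system", pod=~"ztunnel-.*"})

# saving = 1 - (ambient / sidecar), computed separately for CPU and memory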

kdorosh commented 1 year ago

We have proper runs now. The original blog draft was nixed, and we're waiting for John's feedback on this when he gets online.

kdorosh commented 1 year ago

Will be working on the code review comments on the blog today.

GregHanson commented 1 year ago

blog content updated here: https://github.com/istio/istio.io/pull/13179

kdorosh commented 1 year ago

John likes these numbers a lot better. We ran another test where the size of the environment (pods, services, etc.) was doubled, and the percentages scale roughly linearly.

kdorosh commented 1 year ago

Ran more tests yesterday; we wanted a single-node run. The results are similar, so the number of nodes does not affect them. Pinged John for review.

kdorosh commented 1 year ago

Keith Mattix has an intern also running some performance numbers; we don't think it will impact the blog. Lin will hopefully be reviewing the blog today.

linsun commented 1 year ago

Left a bunch of comments; the blog needs to be clearer and more precise on a few points, IMHO.

linsun commented 1 year ago

Addressed most of Lin's comments. Greg will do another review on percentage consistency.

kdorosh commented 1 year ago

Synced with Lin last week; most changes are already addressed. Likewise prepared for the Hoot tomorrow on the same topic. Today we will work on getting a live environment/demo ready for tomorrow.

linsun commented 1 year ago

Prepping for the Hoot; getting the environment up and running.

Key question: it's unclear why the online boutique app workloads use different CPU/memory when comparing sidecars vs ambient.

kdorosh commented 1 year ago

A bug in the query was the root cause of the odd results, but the results are largely unchanged, so the blog is still good. We just need to update some charts with the new data.

kdorosh commented 1 year ago

Lin added one more comment; Greg will address it today. Otherwise ready to ship. It's just wording in the blog, no tests need to be rerun.

kdorosh commented 1 year ago

Better ambient diagrams were requested.

We will see if the blog needs the diagrams or if we can move forward without them.

linsun commented 9 months ago

@GregHanson to set up a time with @craigbox and Lin to discuss next steps for the blog.

craigbox commented 9 months ago

I'm fine with this content going into a blog if it's useful to have it out now, but I've expressed an interest in having a very strong (and reproducible) "cost of ambient vs. sidecars and other mesh models" page on istio.io in Q1. I would hope that this blog post would provide almost all the source material required for that!

linsun commented 9 months ago

Was able to sync with Andrea Ma on accessing the bare-metal environment. Please reach out to Ihor directly to resolve the machine issue.

linsun commented 9 months ago

Got access; should be able to run the performance test once time permits.

linsun commented 9 months ago

Need to wait until Andre finishes the 1.20 performance tests.

linsun commented 8 months ago

Got some successful runs in the CNCF environment, but having trouble viewing the Grafana dashboard for containers.

GregHanson commented 8 months ago

There is an issue with cAdvisor when Kubernetes is installed on the CNCF hardware; see the issue here.

Kubernetes removed the cAdvisor container metrics from the kubelet. It appears KinD, k3d, and GKE have restored support for this (example). There is an upstream Kubernetes KEP that keeps getting pushed out: https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2371-cri-pod-container-stats/README.md#metricscadvisor

Workaround: deploy our own cAdvisor DaemonSet and configure Prometheus to scrape metrics from the new source:

kubectl apply -f - <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app: cadvisor
  name: cadvisor
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app: cadvisor
  name: cadvisor
rules:
- apiGroups:
  - policy
  resourceNames:
  - cadvisor
  resources:
  - podsecuritypolicies
  verbs:
  - use
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app: cadvisor
  name: cadvisor
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cadvisor
subjects:
- kind: ServiceAccount
  name: cadvisor
  namespace: kube-system
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    seccomp.security.alpha.kubernetes.io/pod: docker/default
  labels:
    app: cadvisor
  name: cadvisor
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: cadvisor
      name: cadvisor
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        app: cadvisor
        name: cadvisor
    spec:
      automountServiceAccountToken: false
      containers:
      - args:
        - --housekeeping_interval=10s
        - --max_housekeeping_interval=15s
        - --event_storage_event_limit=default=0
        - --event_storage_age_limit=default=0
        - --enable_metrics=app,cpu,disk,diskIO,memory,network,process
        - --docker_only
        - --store_container_labels=false
        - --whitelisted_container_labels=io.kubernetes.container.name,io.kubernetes.pod.name,io.kubernetes.pod.namespace
        image: gcr.io/cadvisor/cadvisor:v0.45.0
        name: cadvisor
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        resources:
          limits:
            cpu: 800m
            memory: 2000Mi
          requests:
            cpu: 400m
            memory: 400Mi
        volumeMounts:
        - mountPath: /rootfs
          name: rootfs
          readOnly: true
        - mountPath: /var/run
          name: var-run
          readOnly: true
        - mountPath: /sys
          name: sys
          readOnly: true
        - mountPath: /var/lib/docker
          name: docker
          readOnly: true
        - mountPath: /dev/disk
          name: disk
          readOnly: true
      priorityClassName: system-node-critical
      serviceAccountName: cadvisor
      terminationGracePeriodSeconds: 30
      tolerations:
      - key: node-role.kubernetes.io/controlplane
        value: "true"
        effect: NoSchedule
      - key: node-role.kubernetes.io/etcd
        value: "true"
        effect: NoExecute
      volumes:
      - hostPath:
          path: /
        name: rootfs
      - hostPath:
          path: /var/run
        name: var-run
      - hostPath:
          path: /sys
        name: sys
      - hostPath:
          path: /var/lib/docker
        name: docker
      - hostPath:
          path: /dev/disk
        name: disk
---
apiVersion: v1
kind: Service
metadata:
  name: cadvisor
  labels:
    app: cadvisor
  namespace: kube-system
spec:
  selector:
    app: cadvisor
  ports:
  - name: cadvisor
    port: 8080
    protocol: TCP
    targetPort: 8080
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app: cadvisor
  name: cadvisor
  namespace: kube-system
spec:
  endpoints:
  - metricRelabelings:
    - sourceLabels:
      - container_label_io_kubernetes_pod_name
      targetLabel: pod
    - sourceLabels:
      - container_label_io_kubernetes_container_name
      targetLabel: container
    - sourceLabels:
      - container_label_io_kubernetes_pod_namespace
      targetLabel: namespace
    - action: labeldrop
      regex: container_label_io_kubernetes_pod_name
    - action: labeldrop
      regex: container_label_io_kubernetes_container_name
    - action: labeldrop
      regex: container_label_io_kubernetes_pod_namespace
    port: cadvisor
    relabelings:
    - sourceLabels:
      - __meta_kubernetes_pod_node_name
      targetLabel: node
    - sourceLabels:
      - __metrics_path__
      targetLabel: metrics_path
      replacement: /metrics/cadvisor
    - sourceLabels:
      - job
      targetLabel: job
      replacement: kubelet
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      app: cadvisor
EOF

Updated helm install command:

helm upgrade --install kube-prometheus-stack \
prometheus-community/kube-prometheus-stack \
--version 55.5.1 \
--namespace monitoring \
--create-namespace \
--values - <<EOF
alertmanager:
  enabled: false
kubeStateMetrics:
  enabled: false
nodeExporter:
  enabled: true
kubelet:
  enabled: true
prometheus:
  prometheusSpec:
    ruleSelectorNilUsesHelmValues: false
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
EOF
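
Once the stack is up, a quick sanity check that the cAdvisor metrics are actually flowing. This is only a sketch: prometheus-operated is the headless Service the Prometheus operator creates for the Prometheus pods, and the job="kubelet" value comes from the relabeling in the ServiceMonitor above.

kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090

# then, in the Prometheus UI at http://localhost:9090, a query such as
#   container_memory_working_set_bytes{job="kubelet", container!=""}
# should return per-container series, with the pod/container/namespace labels
# mapped from the io.kubernetes.* container labels by the metricRelabelings above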

linsun commented 7 months ago

TODO: need to open a branch in istio.io to continue https://github.com/istio/istio.io/pull/13179

istio-policy-bot commented 1 month ago

🚧 This issue or pull request has been closed due to not having had activity from an Istio team member since 2024-01-31. If you feel this issue or pull request deserves attention, please reopen the issue. Please see this wiki page for more information. Thank you for your contributions.

Created by the issue and PR lifecycle manager.

linsun commented 1 month ago

Closing this as I am not actively working on it, and I don't think Craig is either. Please reopen if needed @craigbox