istio / istio

Istio proxy(1.11.3) memory leak #36710

Closed dltkr77 closed 2 years ago

dltkr77 commented 2 years ago

Bug Description

Istio proxy memory usage keeps increasing until an OOM error occurs. By that point the proxy is using more than 2 GB of memory. The service receives very little traffic (no more than 1 TPS).

Version

$ istioctl version
client version: 1.11.3
control plane version: 1.11.3
data plane version: 1.11.3 (29 proxies)

$ kubectl version --short
Client Version: v1.19.7
Server Version: v1.20.10-gke.1600

Additional Information

(screenshot, 2022-01-05 4:46:05 PM)

profile021.pdf

bug-report.txt

howardjohn commented 2 years ago

What type of traffic do you have? Any custom EnvoyFilters?

@lambdai I recall a long-lived HTTP retry issue. Are these the same symptoms?

dltkr77 commented 2 years ago

@howardjohn The types of traffic are as follows.

  1. Custom Metrics
  2. Traces
  3. Database connection (GCP Cloud SQL)

Custom EnvoyFilters are used by other services, but the service mentioned in this issue does not use any.

bianpengyuan commented 2 years ago

It would be great if you could describe how you customize your metrics and provide a dump of the proxy stats endpoint (curl localhost:15000/stats/prometheus in the istio-proxy container). This is probably caused by unbounded tags.

dltkr77 commented 2 years ago

@bianpengyuan An example of the IstioOperator metrics is shown below. (Istio metrics)

    telemetry:
      v2:
        prometheus:
          configOverride:
            inboundSidecar:
              metrics:
                - name: request_duration_milliseconds
                  dimensions:
                    client_type: request.headers['x-clbs-client-type']
                    client_name: request.headers['x-clbs-client-name']
                    http_method: request.method
                    url: request.url_path

I use the OpenTelemetry Collector for custom metrics. It runs as a sidecar in the Pod (using the OpenCensus interface).

See the attached file for a dump of the proxy stats endpoint: stats.txt

bianpengyuan commented 2 years ago

It looks like url, client_type and client_name are all unbounded. Adding unbounded tags to a metric is an anti-pattern, and it could be the root cause of your Envoy's memory growth, since the number of time series Envoy keeps in memory grows over time. Unbounded information like this is better kept in logs than in metrics.

dltkr77 commented 2 years ago

@bianpengyuan Thank you for the answer. I agree with what you said. However, since the tags above are used for internal calls, only a few distinct tag values occur. In this situation, could the memory leak still be due to unbounded tags?

bianpengyuan commented 2 years ago

It could be. The best way to verify is to see whether removing those customizations stops the leak. If it doesn't, the cause may be something else and we will need to debug further.

dltkr77 commented 2 years ago

I'll try removing the unbounded tags and verify. Thank you.

dltkr77 commented 2 years ago

I removed unbounded tags and I see no memory usage increase.

Before removing the unbounded tags (memory usage): (screenshot, 2022-01-10 8:55:56 AM)

After removing the unbounded tags (memory usage): (screenshot, 2022-01-10 8:56:20 AM)

But, as mentioned above, the number of tags used doesn't make much difference.

Before removing the unbounded tags (number of series: 3): (screenshot, 2022-01-10 4:12:12 PM)

After removing the unbounded tags (number of series: 12): (screenshot, 2022-01-10 4:12:19 PM)

Why is there such a big difference in memory usage?

bianpengyuan commented 2 years ago

Will there be more time series if you remove the app label from the query?

dltkr77 commented 2 years ago

With the app label removed, the query returns 1592 series.

bianpengyuan commented 2 years ago

Actually, you are customizing request_duration_milliseconds, not request_count.

dltkr77 commented 2 years ago

Oh, sorry. That was only a sample. The full configuration is below.

    telemetry:
      v2:
        prometheus:
          configOverride:
            inboundSidecar:
              metrics:
                - name: request_duration_milliseconds
                  dimensions:
                    client_type: request.headers['x-clbs-client-type']
                    client_name: request.headers['x-clbs-client-name']
                    http_method: request.method
                    url: request.url_path
                - name: request_bytes
                  dimensions:
                    client_type: request.headers['x-clbs-client-type']
                    client_name: request.headers['x-clbs-client-name']
                    http_method: request.method
                    url: request.url_path
                - name: response_bytes
                  dimensions:
                    client_type: request.headers['x-clbs-client-type']
                    client_name: request.headers['x-clbs-client-name']
                    http_method: request.method
                    url: request.url_path
                - name: requests_total
                  dimensions:
                    client_type: request.headers['x-clbs-client-type']
                    client_name: request.headers['x-clbs-client-name']
                    http_method: request.method
                    url: request.url_path
            outboundSidecar:
              metrics:
                - name: request_duration_milliseconds
                  dimensions:
                    client_type: request.headers['x-clbs-client-type']
                    client_name: request.headers['x-clbs-client-name']
                    http_method: request.method
                    url: request.url_path
                - name: request_bytes
                  dimensions:
                    client_type: request.headers['x-clbs-client-type']
                    client_name: request.headers['x-clbs-client-name']
                    http_method: request.method
                    url: request.url_path
                - name: response_bytes
                  dimensions:
                    client_type: request.headers['x-clbs-client-type']
                    client_name: request.headers['x-clbs-client-name']
                    http_method: request.method
                    url: request.url_path
                - name: requests_total
                  dimensions:
                    client_type: request.headers['x-clbs-client-type']
                    client_name: request.headers['x-clbs-client-name']
                    http_method: request.method
                    url: request.url_path
            gateway:
              metrics:
                - name: request_duration_milliseconds
                  dimensions:
                    client_type: request.headers['x-clbs-client-type']
                    client_name: request.headers['x-clbs-client-name']
                    http_method: request.method
                    url: request.url_path
                - name: request_bytes
                  dimensions:
                    client_type: request.headers['x-clbs-client-type']
                    client_name: request.headers['x-clbs-client-name']
                    http_method: request.method
                    url: request.url_path
                - name: response_bytes
                  dimensions:
                    client_type: request.headers['x-clbs-client-type']
                    client_name: request.headers['x-clbs-client-name']
                    http_method: request.method
                    url: request.url_path
                - name: requests_total
                  dimensions:
                    client_type: request.headers['x-clbs-client-type']
                    client_name: request.headers['x-clbs-client-name']
                    http_method: request.method
                    url: request.url_path

bianpengyuan commented 2 years ago

Also, I am not seeing any customized labels in your query result, though I may have missed them when eyeballing it. It would be easier to check if you pasted the query result as text.

dltkr77 commented 2 years ago

Because of various tests, the state captured above could not be preserved. The following is from another cluster in a similar situation:

promql: istio_requests_total{app="trigger-dev"} result: 18 series

istio_requests_total{app="trigger-dev", client_name="unknown", client_type="unknown", connection_security_policy="mutual_tls", destination_app="trigger-dev", destination_canonical_revision="v2-0-0-dev", destination_canonical_service="trigger-dev", destination_cluster="Kubernetes", destination_principal="spiffe://cluster.local/ns/service/sa/trigger-dev", destination_service="trigger-dev.service.svc.cluster.local", destination_service_name="trigger-dev", destination_service_namespace="service", destination_version="v2-0-0-dev", destination_workload="trigger-dev-v2-0-0-dev", destination_workload_namespace="service", http_method="POST", instance="10.36.9.78:15020", job="kubernetes-pods", kubernetes_namespace="service", kubernetes_pod_name="trigger-dev-v2-0-0-dev-79dd6b95fb-vpzfm", pod_template_hash="79dd6b95fb", reporter="destination", request_protocol="http", response_code="200", response_flags="-", security_istio_io_tlsMode="istio", service_istio_io_canonical_name="trigger-dev", service_istio_io_canonical_revision="v2-0-0-dev", source_app="admin-dev", source_canonical_revision="v2-0-0-dev", source_canonical_service="admin-dev", source_cluster="Kubernetes", source_principal="spiffe://cluster.local/ns/service/sa/admin-dev", source_version="v2-0-0-dev", source_workload="admin-dev-v2-0-0-dev", source_workload_namespace="service", url="/trigger/internal/segment/job/historys", version="v2-0-0-dev"} | 497968
istio_requests_total{app="trigger-dev", client_name="unknown", client_type="unknown", connection_security_policy="mutual_tls", destination_app="trigger-dev", destination_canonical_revision="v2-0-0-dev", destination_canonical_service="trigger-dev", destination_cluster="Kubernetes", destination_principal="spiffe://cluster.local/ns/service/sa/trigger-dev", destination_service="trigger-dev.service.svc.cluster.local", destination_service_name="trigger-dev", destination_service_namespace="service", destination_version="v2-0-0-dev", destination_workload="trigger-dev-v2-0-0-dev", destination_workload_namespace="service", http_method="unknown", instance="10.36.9.78:15020", job="kubernetes-pods", kubernetes_namespace="service", kubernetes_pod_name="trigger-dev-v2-0-0-dev-79dd6b95fb-vpzfm", pod_template_hash="79dd6b95fb", reporter="destination", request_protocol="http", response_code="0", response_flags="unknown", security_istio_io_tlsMode="istio", service_istio_io_canonical_name="trigger-dev", service_istio_io_canonical_revision="v2-0-0-dev", source_app="admin-dev", source_canonical_revision="v2-0-0-dev", source_canonical_service="admin-dev", source_cluster="Kubernetes", source_principal="spiffe://cluster.local/ns/service/sa/admin-dev", source_version="v2-0-0-dev", source_workload="admin-dev-v2-0-0-dev", source_workload_namespace="service", url="unknown", version="v2-0-0-dev"} | 0
istio_requests_total{app="trigger-dev", client_name="unknown", client_type="unknown", connection_security_policy="none", destination_app="trigger-dev", destination_canonical_revision="v2-0-0-dev", destination_canonical_service="trigger-dev", destination_cluster="Kubernetes", destination_principal="unknown", destination_service="10.36.9.78:13133", destination_service_name="InboundPassthroughClusterIpv4", destination_service_namespace="service", destination_version="v2-0-0-dev", destination_workload="trigger-dev-v2-0-0-dev", destination_workload_namespace="service", http_method="GET", instance="10.36.9.78:15020", job="kubernetes-pods", kubernetes_namespace="service", kubernetes_pod_name="trigger-dev-v2-0-0-dev-79dd6b95fb-vpzfm", pod_template_hash="79dd6b95fb", reporter="destination", request_protocol="http", response_code="0", response_flags="DC", security_istio_io_tlsMode="istio", service_istio_io_canonical_name="trigger-dev", service_istio_io_canonical_revision="v2-0-0-dev", source_app="unknown", source_canonical_revision="latest", source_canonical_service="unknown", source_cluster="unknown", source_principal="unknown", source_version="unknown", source_workload="unknown", source_workload_namespace="unknown", url="/", version="v2-0-0-dev"} | 7
istio_requests_total{app="trigger-dev", client_name="unknown", client_type="unknown", connection_security_policy="none", destination_app="trigger-dev", destination_canonical_revision="v2-0-0-dev", destination_canonical_service="trigger-dev", destination_cluster="Kubernetes", destination_principal="unknown", destination_service="10.36.9.78:13133", destination_service_name="InboundPassthroughClusterIpv4", destination_service_namespace="service", destination_version="v2-0-0-dev", destination_workload="trigger-dev-v2-0-0-dev", destination_workload_namespace="service", http_method="GET", instance="10.36.9.78:15020", job="kubernetes-pods", kubernetes_namespace="service", kubernetes_pod_name="trigger-dev-v2-0-0-dev-79dd6b95fb-vpzfm", pod_template_hash="79dd6b95fb", reporter="destination", request_protocol="http", response_code="200", response_flags="-", security_istio_io_tlsMode="istio", service_istio_io_canonical_name="trigger-dev", service_istio_io_canonical_revision="v2-0-0-dev", source_app="unknown", source_canonical_revision="latest", source_canonical_service="unknown", source_cluster="unknown", source_principal="unknown", source_version="unknown", source_workload="unknown", source_workload_namespace="unknown", url="/", version="v2-0-0-dev"} | 13898
istio_requests_total{app="trigger-dev", client_name="unknown", client_type="unknown", connection_security_policy="none", destination_app="trigger-dev", destination_canonical_revision="v2-0-0-dev", destination_canonical_service="trigger-dev", destination_cluster="Kubernetes", destination_principal="unknown", destination_service="10.36.9.78:13133", destination_service_name="InboundPassthroughClusterIpv4", destination_service_namespace="service", destination_version="v2-0-0-dev", destination_workload="trigger-dev-v2-0-0-dev", destination_workload_namespace="service", http_method="GET", instance="10.36.9.78:15020", job="kubernetes-pods", kubernetes_namespace="service", kubernetes_pod_name="trigger-dev-v2-0-0-dev-79dd6b95fb-vpzfm", pod_template_hash="79dd6b95fb", reporter="destination", request_protocol="http", response_code="503", response_flags="UC", security_istio_io_tlsMode="istio", service_istio_io_canonical_name="trigger-dev", service_istio_io_canonical_revision="v2-0-0-dev", source_app="unknown", source_canonical_revision="latest", source_canonical_service="unknown", source_cluster="unknown", source_principal="unknown", source_version="unknown", source_workload="unknown", source_workload_namespace="unknown", url="/", version="v2-0-0-dev"} | 3
istio_requests_total{app="trigger-dev", client_name="unknown", client_type="unknown", connection_security_policy="none", destination_app="trigger-dev", destination_canonical_revision="v2-0-0-dev", destination_canonical_service="trigger-dev", destination_cluster="Kubernetes", destination_principal="unknown", destination_service="10.36.9.78:13133", destination_service_name="InboundPassthroughClusterIpv4", destination_service_namespace="service", destination_version="v2-0-0-dev", destination_workload="trigger-dev-v2-0-0-dev", destination_workload_namespace="service", http_method="GET", instance="10.36.9.78:15020", job="kubernetes-pods", kubernetes_namespace="service", kubernetes_pod_name="trigger-dev-v2-0-0-dev-79dd6b95fb-vpzfm", pod_template_hash="79dd6b95fb", reporter="destination", request_protocol="http", response_code="503", response_flags="UF", security_istio_io_tlsMode="istio", service_istio_io_canonical_name="trigger-dev", service_istio_io_canonical_revision="v2-0-0-dev", source_app="unknown", source_canonical_revision="latest", source_canonical_service="unknown", source_cluster="unknown", source_principal="unknown", source_version="unknown", source_workload="unknown", source_workload_namespace="unknown", url="/", version="v2-0-0-dev"} | 2
istio_requests_total{app="trigger-dev", client_name="unknown", client_type="unknown", connection_security_policy="none", destination_app="trigger-dev", destination_canonical_revision="v2-0-0-dev", destination_canonical_service="trigger-dev", destination_cluster="Kubernetes", destination_principal="unknown", destination_service="trigger-dev.service.svc.cluster.local", destination_service_name="trigger-dev", destination_service_namespace="service", destination_version="v2-0-0-dev", destination_workload="trigger-dev-v2-0-0-dev", destination_workload_namespace="service", http_method="GET", instance="10.36.9.78:15020", job="kubernetes-pods", kubernetes_namespace="service", kubernetes_pod_name="trigger-dev-v2-0-0-dev-79dd6b95fb-vpzfm", pod_template_hash="79dd6b95fb", reporter="destination", request_protocol="http", response_code="0", response_flags="DC", security_istio_io_tlsMode="istio", service_istio_io_canonical_name="trigger-dev", service_istio_io_canonical_revision="v2-0-0-dev", source_app="unknown", source_canonical_revision="latest", source_canonical_service="unknown", source_cluster="unknown", source_principal="unknown", source_version="unknown", source_workload="unknown", source_workload_namespace="unknown", url="/health", version="v2-0-0-dev"} | 8
istio_requests_total{app="trigger-dev", client_name="unknown", client_type="unknown", connection_security_policy="none", destination_app="trigger-dev", destination_canonical_revision="v2-0-0-dev", destination_canonical_service="trigger-dev", destination_cluster="Kubernetes", destination_principal="unknown", destination_service="trigger-dev.service.svc.cluster.local", destination_service_name="trigger-dev", destination_service_namespace="service", destination_version="v2-0-0-dev", destination_workload="trigger-dev-v2-0-0-dev", destination_workload_namespace="service", http_method="GET", instance="10.36.9.78:15020", job="kubernetes-pods", kubernetes_namespace="service", kubernetes_pod_name="trigger-dev-v2-0-0-dev-79dd6b95fb-vpzfm", pod_template_hash="79dd6b95fb", reporter="destination", request_protocol="http", response_code="200", response_flags="-", security_istio_io_tlsMode="istio", service_istio_io_canonical_name="trigger-dev", service_istio_io_canonical_revision="v2-0-0-dev", source_app="unknown", source_canonical_revision="latest", source_canonical_service="unknown", source_cluster="unknown", source_principal="unknown", source_version="unknown", source_workload="unknown", source_workload_namespace="unknown", url="/health", version="v2-0-0-dev"} | 13952
istio_requests_total{app="trigger-dev", client_name="unknown", client_type="unknown", connection_security_policy="unknown", destination_app="unknown", destination_canonical_revision="latest", destination_canonical_service="unknown", destination_cluster="unknown", destination_principal="unknown", destination_service="metadata.google.internal", destination_service_name="metadata.google.internal", destination_service_namespace="unknown", destination_version="unknown", destination_workload="unknown", destination_workload_namespace="unknown", http_method="GET", instance="10.36.9.78:15020", job="kubernetes-pods", kubernetes_namespace="service", kubernetes_pod_name="trigger-dev-v2-0-0-dev-79dd6b95fb-vpzfm", pod_template_hash="79dd6b95fb", reporter="source", request_protocol="http", response_code="200", response_flags="-", security_istio_io_tlsMode="istio", service_istio_io_canonical_name="trigger-dev", service_istio_io_canonical_revision="v2-0-0-dev", source_app="trigger-dev", source_canonical_revision="v2-0-0-dev", source_canonical_service="trigger-dev", source_cluster="Kubernetes", source_principal="unknown", source_version="v2-0-0-dev", source_workload="trigger-dev-v2-0-0-dev", source_workload_namespace="service", url="/computeMetadata/v1/instance/attributes/cluster-name", version="v2-0-0-dev"} | 2
istio_requests_total{app="trigger-dev", client_name="unknown", client_type="unknown", connection_security_policy="unknown", destination_app="unknown", destination_canonical_revision="latest", destination_canonical_service="unknown", destination_cluster="unknown", destination_principal="unknown", destination_service="metadata.google.internal", destination_service_name="metadata.google.internal", destination_service_namespace="unknown", destination_version="unknown", destination_workload="unknown", destination_workload_namespace="unknown", http_method="GET", instance="10.36.9.78:15020", job="kubernetes-pods", kubernetes_namespace="service", kubernetes_pod_name="trigger-dev-v2-0-0-dev-79dd6b95fb-vpzfm", pod_template_hash="79dd6b95fb", reporter="source", request_protocol="http", response_code="200", response_flags="-", security_istio_io_tlsMode="istio", service_istio_io_canonical_name="trigger-dev", service_istio_io_canonical_revision="v2-0-0-dev", source_app="trigger-dev", source_canonical_revision="v2-0-0-dev", source_canonical_service="trigger-dev", source_cluster="Kubernetes", source_principal="unknown", source_version="v2-0-0-dev", source_workload="trigger-dev-v2-0-0-dev", source_workload_namespace="service", url="/computeMetadata/v1/instance/hostname", version="v2-0-0-dev"} | 1
istio_requests_total{app="trigger-dev", client_name="unknown", client_type="unknown", connection_security_policy="unknown", destination_app="unknown", destination_canonical_revision="latest", destination_canonical_service="unknown", destination_cluster="unknown", destination_principal="unknown", destination_service="metadata.google.internal", destination_service_name="metadata.google.internal", destination_service_namespace="unknown", destination_version="unknown", destination_workload="unknown", destination_workload_namespace="unknown", http_method="GET", instance="10.36.9.78:15020", job="kubernetes-pods", kubernetes_namespace="service", kubernetes_pod_name="trigger-dev-v2-0-0-dev-79dd6b95fb-vpzfm", pod_template_hash="79dd6b95fb", reporter="source", request_protocol="http", response_code="200", response_flags="-", security_istio_io_tlsMode="istio", service_istio_io_canonical_name="trigger-dev", service_istio_io_canonical_revision="v2-0-0-dev", source_app="trigger-dev", source_canonical_revision="v2-0-0-dev", source_canonical_service="trigger-dev", source_cluster="Kubernetes", source_principal="unknown", source_version="v2-0-0-dev", source_workload="trigger-dev-v2-0-0-dev", source_workload_namespace="service", url="/computeMetadata/v1/instance/id", version="v2-0-0-dev"} | 2
istio_requests_total{app="trigger-dev", client_name="unknown", client_type="unknown", connection_security_policy="unknown", destination_app="unknown", destination_canonical_revision="latest", destination_canonical_service="unknown", destination_cluster="unknown", destination_principal="unknown", destination_service="metadata.google.internal", destination_service_name="metadata.google.internal", destination_service_namespace="unknown", destination_version="unknown", destination_workload="unknown", destination_workload_namespace="unknown", http_method="GET", instance="10.36.9.78:15020", job="kubernetes-pods", kubernetes_namespace="service", kubernetes_pod_name="trigger-dev-v2-0-0-dev-79dd6b95fb-vpzfm", pod_template_hash="79dd6b95fb", reporter="source", request_protocol="http", response_code="200", response_flags="-", security_istio_io_tlsMode="istio", service_istio_io_canonical_name="trigger-dev", service_istio_io_canonical_revision="v2-0-0-dev", source_app="trigger-dev", source_canonical_revision="v2-0-0-dev", source_canonical_service="trigger-dev", source_cluster="Kubernetes", source_principal="unknown", source_version="v2-0-0-dev", source_workload="trigger-dev-v2-0-0-dev", source_workload_namespace="service", url="/computeMetadata/v1/instance/zone", version="v2-0-0-dev"} | 2
istio_requests_total{app="trigger-dev", client_name="unknown", client_type="unknown", connection_security_policy="unknown", destination_app="unknown", destination_canonical_revision="latest", destination_canonical_service="unknown", destination_cluster="unknown", destination_principal="unknown", destination_service="metadata.google.internal", destination_service_name="metadata.google.internal", destination_service_namespace="unknown", destination_version="unknown", destination_workload="unknown", destination_workload_namespace="unknown", http_method="GET", instance="10.36.9.78:15020", job="kubernetes-pods", kubernetes_namespace="service", kubernetes_pod_name="trigger-dev-v2-0-0-dev-79dd6b95fb-vpzfm", pod_template_hash="79dd6b95fb", reporter="source", request_protocol="http", response_code="200", response_flags="-", security_istio_io_tlsMode="istio", service_istio_io_canonical_name="trigger-dev", service_istio_io_canonical_revision="v2-0-0-dev", source_app="trigger-dev", source_canonical_revision="v2-0-0-dev", source_canonical_service="trigger-dev", source_cluster="Kubernetes", source_principal="unknown", source_version="v2-0-0-dev", source_workload="trigger-dev-v2-0-0-dev", source_workload_namespace="service", url="/computeMetadata/v1/project/project-id", version="v2-0-0-dev"} | 5
istio_requests_total{app="trigger-dev", client_name="unknown", client_type="unknown", connection_security_policy="unknown", destination_app="unknown", destination_canonical_revision="latest", destination_canonical_service="unknown", destination_cluster="unknown", destination_principal="unknown", destination_service="metadata.google.internal", destination_service_name="metadata.google.internal", destination_service_namespace="unknown", destination_version="unknown", destination_workload="unknown", destination_workload_namespace="unknown", http_method="GET", instance="10.36.9.78:15020", job="kubernetes-pods", kubernetes_namespace="service", kubernetes_pod_name="trigger-dev-v2-0-0-dev-79dd6b95fb-vpzfm", pod_template_hash="79dd6b95fb", reporter="source", request_protocol="http", response_code="404", response_flags="-", security_istio_io_tlsMode="istio", service_istio_io_canonical_name="trigger-dev", service_istio_io_canonical_revision="v2-0-0-dev", source_app="trigger-dev", source_canonical_revision="v2-0-0-dev", source_canonical_service="trigger-dev", source_cluster="Kubernetes", source_principal="unknown", source_version="v2-0-0-dev", source_workload="trigger-dev-v2-0-0-dev", source_workload_namespace="service", url="/computeMetadata/v1/instance/attributes/container-name", version="v2-0-0-dev"} | 1
istio_requests_total{app="trigger-dev", client_name="unknown", client_type="unknown", connection_security_policy="unknown", destination_app="unknown", destination_canonical_revision="latest", destination_canonical_service="unknown", destination_cluster="unknown", destination_principal="unknown", destination_service="metadata.google.internal", destination_service_name="metadata.google.internal", destination_service_namespace="unknown", destination_version="unknown", destination_workload="unknown", destination_workload_namespace="unknown", http_method="GET", instance="10.36.9.78:15020", job="kubernetes-pods", kubernetes_namespace="service", kubernetes_pod_name="trigger-dev-v2-0-0-dev-79dd6b95fb-vpzfm", pod_template_hash="79dd6b95fb", reporter="source", request_protocol="http", response_code="404", response_flags="-", security_istio_io_tlsMode="istio", service_istio_io_canonical_name="trigger-dev", service_istio_io_canonical_revision="v2-0-0-dev", source_app="trigger-dev", source_canonical_revision="v2-0-0-dev", source_canonical_service="trigger-dev", source_cluster="Kubernetes", source_principal="unknown", source_version="v2-0-0-dev", source_workload="trigger-dev-v2-0-0-dev", source_workload_namespace="service", url="/computeMetadata/v1/instance/attributes/namespace-id", version="v2-0-0-dev"} | 1
istio_requests_total{app="trigger-dev", client_name="unknown", client_type="unknown", connection_security_policy="unknown", destination_app="unknown", destination_canonical_revision="latest", destination_canonical_service="unknown", destination_cluster="unknown", destination_principal="unknown", destination_service="metadata.google.internal", destination_service_name="metadata.google.internal", destination_service_namespace="unknown", destination_version="unknown", destination_workload="unknown", destination_workload_namespace="unknown", http_method="GET", instance="10.36.9.78:15020", job="kubernetes-pods", kubernetes_namespace="service", kubernetes_pod_name="trigger-dev-v2-0-0-dev-79dd6b95fb-vpzfm", pod_template_hash="79dd6b95fb", reporter="source", request_protocol="http", response_code="404", response_flags="-", security_istio_io_tlsMode="istio", service_istio_io_canonical_name="trigger-dev", service_istio_io_canonical_revision="v2-0-0-dev", source_app="trigger-dev", source_canonical_revision="v2-0-0-dev", source_canonical_service="trigger-dev", source_cluster="Kubernetes", source_principal="unknown", source_version="v2-0-0-dev", source_workload="trigger-dev-v2-0-0-dev", source_workload_namespace="service", url="/computeMetadata/v1/instance/machine-type", version="v2-0-0-dev"} | 1
istio_requests_total{app="trigger-dev", client_name="unknown", client_type="unknown", connection_security_policy="unknown", destination_app="unknown", destination_canonical_revision="latest", destination_canonical_service="unknown", destination_cluster="unknown", destination_principal="unknown", destination_service="metadata.google.internal", destination_service_name="metadata.google.internal", destination_service_namespace="unknown", destination_version="unknown", destination_workload="unknown", destination_workload_namespace="unknown", http_method="GET", instance="10.36.9.78:15020", job="kubernetes-pods", kubernetes_namespace="service", kubernetes_pod_name="trigger-dev-v2-0-0-dev-79dd6b95fb-vpzfm", pod_template_hash="79dd6b95fb", reporter="source", request_protocol="http", response_code="404", response_flags="-", security_istio_io_tlsMode="istio", service_istio_io_canonical_name="trigger-dev", service_istio_io_canonical_revision="v2-0-0-dev", source_app="trigger-dev", source_canonical_revision="v2-0-0-dev", source_canonical_service="trigger-dev", source_cluster="Kubernetes", source_principal="unknown", source_version="v2-0-0-dev", source_workload="trigger-dev-v2-0-0-dev", source_workload_namespace="service", url="/computeMetadata/v1/instance/name", version="v2-0-0-dev"} | 1
istio_requests_total{app="trigger-dev", client_name="unknown", client_type="unknown", connection_security_policy="unknown", destination_app="unknown", destination_canonical_revision="latest", destination_canonical_service="unknown", destination_cluster="unknown", destination_principal="unknown", destination_service="metadata.google.internal", destination_service_name="metadata.google.internal", destination_service_namespace="unknown", destination_version="unknown", destination_workload="unknown", destination_workload_namespace="unknown", http_method="GET", instance="10.36.9.78:15020", job="kubernetes-pods", kubernetes_namespace="service", kubernetes_pod_name="trigger-dev-v2-0-0-dev-79dd6b95fb-vpzfm", pod_template_hash="79dd6b95fb", reporter="source", request_protocol="http", response_code="404", response_flags="-", security_istio_io_tlsMode="istio", service_istio_io_canonical_name="trigger-dev", service_istio_io_canonical_revision="v2-0-0-dev", source_app="trigger-dev", source_canonical_revision="v2-0-0-dev", source_canonical_service="trigger-dev", source_cluster="Kubernetes", source_principal="unknown", source_version="v2-0-0-dev", source_workload="trigger-dev-v2-0-0-dev", source_workload_namespace="service", url="/latest/dynamic/instance-identity/document", version="v2-0-0-dev"} | 1

bianpengyuan commented 2 years ago

Hmm, I am wondering whether there is a regression in the metric customization code path. Have you used this customization on a version before 1.11? Was there any leak on that earlier version? One thing that would help us pinpoint the issue: customize the metric with only a bounded dimension, such as the request method (http_method: request.method), and see if it still leaks. If it does, that suggests something is wrong in the metric customization path.
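
A minimal sketch of what that bounded-only test could look like, reusing the configOverride structure posted earlier in this thread. Restricting it to requests_total on the inbound sidecar is an assumption made here to keep the example small; the same single dimension could be applied to the other metrics and to outboundSidecar and gateway.

```yaml
# Hypothetical test configuration: only one bounded dimension is added.
# request.method takes a small, fixed set of values (GET, POST, ...),
# so it cannot grow the number of time series without limit.
telemetry:
  v2:
    prometheus:
      configOverride:
        inboundSidecar:
          metrics:
            - name: requests_total
              dimensions:
                http_method: request.method
```

If memory still grows with this configuration, the unbounded url, client_type and client_name dimensions are unlikely to be the only cause.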

dltkr77 commented 2 years ago

Thank you. I believe there is no regression in the customization code path. And I didn't use customization like this in previous versions. I'll run some tests related to the memory leak and comment again if I think it's a bug or I have any questions.

dltkr77 commented 2 years ago

The tests above were run in an environment that was not strictly controlled, which caused some confusion. I'm using the OpenTelemetry Collector for custom metrics (https://github.com/open-telemetry/opentelemetry-collector). After a final check, the memory issue occurs when the OpenTelemetry agent/collector uses the gRPC protocol, regardless of unbounded tags.

There is an issue I found on the envoy side, could it be related to this? https://github.com/envoyproxy/envoy/issues/15904

  1. Configuration where the memory issue occurred: (screenshots, 2022-01-25 10:35:22 AM and 1:04:16 PM)
  2. Configuration with no memory issue: (screenshots, 2022-01-25 10:35:35 AM and 1:04:28 PM)

bianpengyuan commented 2 years ago

There is an issue I found on the envoy side, could it be related to this? envoyproxy/envoy#15904

Yeah seems like it.

dltkr77 commented 2 years ago

A comment on the envoyproxy/envoy#15904 issue says to configure the overload manager. Is it possible to configure the overload manager in Istio? If I have to use gRPC, how do I configure it in Istio?

bianpengyuan commented 2 years ago

You should be able to configure it with an EnvoyFilter, which allows you to customize the bootstrap configuration.
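
As a rough illustration only, an EnvoyFilter along these lines might merge an overload manager into the sidecar bootstrap. This is a sketch under assumptions, not a tested fix for this issue: the resource name, namespace, workload label, heap limit and threshold are all placeholders chosen for the example, and support for the BOOTSTRAP patch target should be checked against the Istio version in use.

```yaml
# Hypothetical EnvoyFilter: name, namespace, selector and numeric values
# are illustrative assumptions, not settings taken from this thread.
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: sidecar-overload-manager
  namespace: service
spec:
  workloadSelector:
    labels:
      app: trigger-dev
  configPatches:
    - applyTo: BOOTSTRAP
      patch:
        operation: MERGE
        value:
          overload_manager:
            refresh_interval: 0.25s
            resource_monitors:
              # Tracks heap usage as a fraction of max_heap_size_bytes.
              - name: envoy.resource_monitors.fixed_heap
                typed_config:
                  "@type": type.googleapis.com/envoy.extensions.resource_monitors.fixed_heap.v3.FixedHeapConfig
                  max_heap_size_bytes: 1073741824  # 1 GiB, placeholder
            actions:
              # Asks Envoy to release free heap memory once usage
              # passes 90% of the configured limit.
              - name: envoy.overload_actions.shrink_heap
                triggers:
                  - name: envoy.resource_monitors.fixed_heap
                    threshold:
                      value: 0.9
```

Bootstrap settings are normally read when the proxy starts, so the affected pods would likely need a restart before a change like this takes effect.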

dltkr77 commented 2 years ago

I will close this issue because the cause and a way to address it have been identified. I'll reopen it if necessary. Thank you!