What type of traffic do you have? Any custom EnvoyFilters?
@lambdai I recall some long-lived HTTP retry issue; are these the same symptoms?
@howardjohn The types of traffic are as follows.
Custom EnvoyFilters are used by other services, but the service mentioned in this issue does not use them.
It would be great if you could provide info about how you customize your metrics, and a dump of the proxy stats endpoint (curl localhost:15000/stats/prometheus in the istio-proxy container). This is probably because of unbounded tags.
@bianpengyuan An example of the IstioOperator metrics configuration is shown below. (Istio metrics)
telemetry:
  v2:
    prometheus:
      configOverride:
        inboundSidecar:
          metrics:
          - name: request_duration_milliseconds
            dimensions:
              client_type: request.headers['x-clbs-client-type']
              client_name: request.headers['x-clbs-client-name']
              http_method: request.method
              url: request.url_path
I use the OpenTelemetry Collector for custom metrics. It runs as a sidecar in the Pod (using the OpenCensus interface).
See the attached file for a dump of the proxy stats endpoint: stats.txt
Looks like url, client_type, and client_name are all unbounded. It is an anti-pattern to add unbounded tags to a metric, and this could be the root cause of your Envoy's memory growth, since the number of time series Envoy keeps in memory grows over time. Unbounded information is better kept in logs instead of metrics.
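If you still need those fields, one option (a rough sketch; the x-clbs-* header names are taken from your config above, and I have not verified this in your setup) is to surface them through the Envoy access log via meshConfig instead of metric dimensions:

meshConfig:
  accessLogFile: /dev/stdout
  accessLogFormat: |
    [%START_TIME%] "%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%" %RESPONSE_CODE% client_type=%REQ(X-CLBS-CLIENT-TYPE)% client_name=%REQ(X-CLBS-CLIENT-NAME)%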
@bianpengyuan Thank you for the answer. I agree with what you said. However, since the above tags are used for internal calls, only a few kinds of tag values are used. In this situation, is the memory leak really due to unbounded tags?
It could be. The best way to verify is to see if removing those customizations stops the leak. If not, then maybe it's something else and we need to debug further.
I'll try removing the unbounded tags and verify. Thank you.
I removed the unbounded tags and I see no memory usage increase.
Before removing unbounded tags (memory usage)
After removing unbounded tags (memory usage)
But, as mentioned above, the number of tags in use doesn't differ much.
Before removing unbounded tags (number of series: 3)
After removing unbounded tags (number of series: 12)
Why is there such a big difference in memory usage?
Will there be more time series if you remove the app label in the query?
Removing the app label, 1592 series are shown.
Actually, you are customizing request_duration_milliseconds, not request_count.
Oh, sorry. That was a sample. The full configuration is as below.
telemetry:
  v2:
    prometheus:
      configOverride:
        inboundSidecar:
          metrics:
          - name: request_duration_milliseconds
            dimensions:
              client_type: request.headers['x-clbs-client-type']
              client_name: request.headers['x-clbs-client-name']
              http_method: request.method
              url: request.url_path
          - name: request_bytes
            dimensions:
              client_type: request.headers['x-clbs-client-type']
              client_name: request.headers['x-clbs-client-name']
              http_method: request.method
              url: request.url_path
          - name: response_bytes
            dimensions:
              client_type: request.headers['x-clbs-client-type']
              client_name: request.headers['x-clbs-client-name']
              http_method: request.method
              url: request.url_path
          - name: requests_total
            dimensions:
              client_type: request.headers['x-clbs-client-type']
              client_name: request.headers['x-clbs-client-name']
              http_method: request.method
              url: request.url_path
        outboundSidecar:
          metrics:
          - name: request_duration_milliseconds
            dimensions:
              client_type: request.headers['x-clbs-client-type']
              client_name: request.headers['x-clbs-client-name']
              http_method: request.method
              url: request.url_path
          - name: request_bytes
            dimensions:
              client_type: request.headers['x-clbs-client-type']
              client_name: request.headers['x-clbs-client-name']
              http_method: request.method
              url: request.url_path
          - name: response_bytes
            dimensions:
              client_type: request.headers['x-clbs-client-type']
              client_name: request.headers['x-clbs-client-name']
              http_method: request.method
              url: request.url_path
          - name: requests_total
            dimensions:
              client_type: request.headers['x-clbs-client-type']
              client_name: request.headers['x-clbs-client-name']
              http_method: request.method
              url: request.url_path
        gateway:
          metrics:
          - name: request_duration_milliseconds
            dimensions:
              client_type: request.headers['x-clbs-client-type']
              client_name: request.headers['x-clbs-client-name']
              http_method: request.method
              url: request.url_path
          - name: request_bytes
            dimensions:
              client_type: request.headers['x-clbs-client-type']
              client_name: request.headers['x-clbs-client-name']
              http_method: request.method
              url: request.url_path
          - name: response_bytes
            dimensions:
              client_type: request.headers['x-clbs-client-type']
              client_name: request.headers['x-clbs-client-name']
              http_method: request.method
              url: request.url_path
          - name: requests_total
            dimensions:
              client_type: request.headers['x-clbs-client-type']
              client_name: request.headers['x-clbs-client-name']
              http_method: request.method
              url: request.url_path
Also, I am not seeing any customized labels in your query result, though maybe I missed them while eyeballing it. It would be easier to check if you paste the query result.
Due to various tests, the situation captured above could not be preserved, so the data below is from another cluster in a similar situation.
promql: istio_requests_total{app="trigger-dev"} (result: 18 series)
istio_requests_total{app="trigger-dev", client_name="unknown", client_type="unknown", connection_security_policy="mutual_tls", destination_app="trigger-dev", destination_canonical_revision="v2-0-0-dev", destination_canonical_service="trigger-dev", destination_cluster="Kubernetes", destination_principal="spiffe://cluster.local/ns/service/sa/trigger-dev", destination_service="trigger-dev.service.svc.cluster.local", destination_service_name="trigger-dev", destination_service_namespace="service", destination_version="v2-0-0-dev", destination_workload="trigger-dev-v2-0-0-dev", destination_workload_namespace="service", http_method="POST", instance="10.36.9.78:15020", job="kubernetes-pods", kubernetes_namespace="service", kubernetes_pod_name="trigger-dev-v2-0-0-dev-79dd6b95fb-vpzfm", pod_template_hash="79dd6b95fb", reporter="destination", request_protocol="http", response_code="200", response_flags="-", security_istio_io_tlsMode="istio", service_istio_io_canonical_name="trigger-dev", service_istio_io_canonical_revision="v2-0-0-dev", source_app="admin-dev", source_canonical_revision="v2-0-0-dev", source_canonical_service="admin-dev", source_cluster="Kubernetes", source_principal="spiffe://cluster.local/ns/service/sa/admin-dev", source_version="v2-0-0-dev", source_workload="admin-dev-v2-0-0-dev", source_workload_namespace="service", url="/trigger/internal/segment/job/historys", version="v2-0-0-dev"} | 497968
istio_requests_total{app="trigger-dev", client_name="unknown", client_type="unknown", connection_security_policy="mutual_tls", destination_app="trigger-dev", destination_canonical_revision="v2-0-0-dev", destination_canonical_service="trigger-dev", destination_cluster="Kubernetes", destination_principal="spiffe://cluster.local/ns/service/sa/trigger-dev", destination_service="trigger-dev.service.svc.cluster.local", destination_service_name="trigger-dev", destination_service_namespace="service", destination_version="v2-0-0-dev", destination_workload="trigger-dev-v2-0-0-dev", destination_workload_namespace="service", http_method="unknown", instance="10.36.9.78:15020", job="kubernetes-pods", kubernetes_namespace="service", kubernetes_pod_name="trigger-dev-v2-0-0-dev-79dd6b95fb-vpzfm", pod_template_hash="79dd6b95fb", reporter="destination", request_protocol="http", response_code="0", response_flags="unknown", security_istio_io_tlsMode="istio", service_istio_io_canonical_name="trigger-dev", service_istio_io_canonical_revision="v2-0-0-dev", source_app="admin-dev", source_canonical_revision="v2-0-0-dev", source_canonical_service="admin-dev", source_cluster="Kubernetes", source_principal="spiffe://cluster.local/ns/service/sa/admin-dev", source_version="v2-0-0-dev", source_workload="admin-dev-v2-0-0-dev", source_workload_namespace="service", url="unknown", version="v2-0-0-dev"} | 0
istio_requests_total{app="trigger-dev", client_name="unknown", client_type="unknown", connection_security_policy="none", destination_app="trigger-dev", destination_canonical_revision="v2-0-0-dev", destination_canonical_service="trigger-dev", destination_cluster="Kubernetes", destination_principal="unknown", destination_service="10.36.9.78:13133", destination_service_name="InboundPassthroughClusterIpv4", destination_service_namespace="service", destination_version="v2-0-0-dev", destination_workload="trigger-dev-v2-0-0-dev", destination_workload_namespace="service", http_method="GET", instance="10.36.9.78:15020", job="kubernetes-pods", kubernetes_namespace="service", kubernetes_pod_name="trigger-dev-v2-0-0-dev-79dd6b95fb-vpzfm", pod_template_hash="79dd6b95fb", reporter="destination", request_protocol="http", response_code="0", response_flags="DC", security_istio_io_tlsMode="istio", service_istio_io_canonical_name="trigger-dev", service_istio_io_canonical_revision="v2-0-0-dev", source_app="unknown", source_canonical_revision="latest", source_canonical_service="unknown", source_cluster="unknown", source_principal="unknown", source_version="unknown", source_workload="unknown", source_workload_namespace="unknown", url="/", version="v2-0-0-dev"} | 7
istio_requests_total{app="trigger-dev", client_name="unknown", client_type="unknown", connection_security_policy="none", destination_app="trigger-dev", destination_canonical_revision="v2-0-0-dev", destination_canonical_service="trigger-dev", destination_cluster="Kubernetes", destination_principal="unknown", destination_service="10.36.9.78:13133", destination_service_name="InboundPassthroughClusterIpv4", destination_service_namespace="service", destination_version="v2-0-0-dev", destination_workload="trigger-dev-v2-0-0-dev", destination_workload_namespace="service", http_method="GET", instance="10.36.9.78:15020", job="kubernetes-pods", kubernetes_namespace="service", kubernetes_pod_name="trigger-dev-v2-0-0-dev-79dd6b95fb-vpzfm", pod_template_hash="79dd6b95fb", reporter="destination", request_protocol="http", response_code="200", response_flags="-", security_istio_io_tlsMode="istio", service_istio_io_canonical_name="trigger-dev", service_istio_io_canonical_revision="v2-0-0-dev", source_app="unknown", source_canonical_revision="latest", source_canonical_service="unknown", source_cluster="unknown", source_principal="unknown", source_version="unknown", source_workload="unknown", source_workload_namespace="unknown", url="/", version="v2-0-0-dev"} | 13898
istio_requests_total{app="trigger-dev", client_name="unknown", client_type="unknown", connection_security_policy="none", destination_app="trigger-dev", destination_canonical_revision="v2-0-0-dev", destination_canonical_service="trigger-dev", destination_cluster="Kubernetes", destination_principal="unknown", destination_service="10.36.9.78:13133", destination_service_name="InboundPassthroughClusterIpv4", destination_service_namespace="service", destination_version="v2-0-0-dev", destination_workload="trigger-dev-v2-0-0-dev", destination_workload_namespace="service", http_method="GET", instance="10.36.9.78:15020", job="kubernetes-pods", kubernetes_namespace="service", kubernetes_pod_name="trigger-dev-v2-0-0-dev-79dd6b95fb-vpzfm", pod_template_hash="79dd6b95fb", reporter="destination", request_protocol="http", response_code="503", response_flags="UC", security_istio_io_tlsMode="istio", service_istio_io_canonical_name="trigger-dev", service_istio_io_canonical_revision="v2-0-0-dev", source_app="unknown", source_canonical_revision="latest", source_canonical_service="unknown", source_cluster="unknown", source_principal="unknown", source_version="unknown", source_workload="unknown", source_workload_namespace="unknown", url="/", version="v2-0-0-dev"} | 3
istio_requests_total{app="trigger-dev", client_name="unknown", client_type="unknown", connection_security_policy="none", destination_app="trigger-dev", destination_canonical_revision="v2-0-0-dev", destination_canonical_service="trigger-dev", destination_cluster="Kubernetes", destination_principal="unknown", destination_service="10.36.9.78:13133", destination_service_name="InboundPassthroughClusterIpv4", destination_service_namespace="service", destination_version="v2-0-0-dev", destination_workload="trigger-dev-v2-0-0-dev", destination_workload_namespace="service", http_method="GET", instance="10.36.9.78:15020", job="kubernetes-pods", kubernetes_namespace="service", kubernetes_pod_name="trigger-dev-v2-0-0-dev-79dd6b95fb-vpzfm", pod_template_hash="79dd6b95fb", reporter="destination", request_protocol="http", response_code="503", response_flags="UF", security_istio_io_tlsMode="istio", service_istio_io_canonical_name="trigger-dev", service_istio_io_canonical_revision="v2-0-0-dev", source_app="unknown", source_canonical_revision="latest", source_canonical_service="unknown", source_cluster="unknown", source_principal="unknown", source_version="unknown", source_workload="unknown", source_workload_namespace="unknown", url="/", version="v2-0-0-dev"} | 2
istio_requests_total{app="trigger-dev", client_name="unknown", client_type="unknown", connection_security_policy="none", destination_app="trigger-dev", destination_canonical_revision="v2-0-0-dev", destination_canonical_service="trigger-dev", destination_cluster="Kubernetes", destination_principal="unknown", destination_service="trigger-dev.service.svc.cluster.local", destination_service_name="trigger-dev", destination_service_namespace="service", destination_version="v2-0-0-dev", destination_workload="trigger-dev-v2-0-0-dev", destination_workload_namespace="service", http_method="GET", instance="10.36.9.78:15020", job="kubernetes-pods", kubernetes_namespace="service", kubernetes_pod_name="trigger-dev-v2-0-0-dev-79dd6b95fb-vpzfm", pod_template_hash="79dd6b95fb", reporter="destination", request_protocol="http", response_code="0", response_flags="DC", security_istio_io_tlsMode="istio", service_istio_io_canonical_name="trigger-dev", service_istio_io_canonical_revision="v2-0-0-dev", source_app="unknown", source_canonical_revision="latest", source_canonical_service="unknown", source_cluster="unknown", source_principal="unknown", source_version="unknown", source_workload="unknown", source_workload_namespace="unknown", url="/health", version="v2-0-0-dev"} | 8
istio_requests_total{app="trigger-dev", client_name="unknown", client_type="unknown", connection_security_policy="none", destination_app="trigger-dev", destination_canonical_revision="v2-0-0-dev", destination_canonical_service="trigger-dev", destination_cluster="Kubernetes", destination_principal="unknown", destination_service="trigger-dev.service.svc.cluster.local", destination_service_name="trigger-dev", destination_service_namespace="service", destination_version="v2-0-0-dev", destination_workload="trigger-dev-v2-0-0-dev", destination_workload_namespace="service", http_method="GET", instance="10.36.9.78:15020", job="kubernetes-pods", kubernetes_namespace="service", kubernetes_pod_name="trigger-dev-v2-0-0-dev-79dd6b95fb-vpzfm", pod_template_hash="79dd6b95fb", reporter="destination", request_protocol="http", response_code="200", response_flags="-", security_istio_io_tlsMode="istio", service_istio_io_canonical_name="trigger-dev", service_istio_io_canonical_revision="v2-0-0-dev", source_app="unknown", source_canonical_revision="latest", source_canonical_service="unknown", source_cluster="unknown", source_principal="unknown", source_version="unknown", source_workload="unknown", source_workload_namespace="unknown", url="/health", version="v2-0-0-dev"} | 13952
istio_requests_total{app="trigger-dev", client_name="unknown", client_type="unknown", connection_security_policy="unknown", destination_app="unknown", destination_canonical_revision="latest", destination_canonical_service="unknown", destination_cluster="unknown", destination_principal="unknown", destination_service="metadata.google.internal", destination_service_name="metadata.google.internal", destination_service_namespace="unknown", destination_version="unknown", destination_workload="unknown", destination_workload_namespace="unknown", http_method="GET", instance="10.36.9.78:15020", job="kubernetes-pods", kubernetes_namespace="service", kubernetes_pod_name="trigger-dev-v2-0-0-dev-79dd6b95fb-vpzfm", pod_template_hash="79dd6b95fb", reporter="source", request_protocol="http", response_code="200", response_flags="-", security_istio_io_tlsMode="istio", service_istio_io_canonical_name="trigger-dev", service_istio_io_canonical_revision="v2-0-0-dev", source_app="trigger-dev", source_canonical_revision="v2-0-0-dev", source_canonical_service="trigger-dev", source_cluster="Kubernetes", source_principal="unknown", source_version="v2-0-0-dev", source_workload="trigger-dev-v2-0-0-dev", source_workload_namespace="service", url="/computeMetadata/v1/instance/attributes/cluster-name", version="v2-0-0-dev"} | 2
istio_requests_total{app="trigger-dev", client_name="unknown", client_type="unknown", connection_security_policy="unknown", destination_app="unknown", destination_canonical_revision="latest", destination_canonical_service="unknown", destination_cluster="unknown", destination_principal="unknown", destination_service="metadata.google.internal", destination_service_name="metadata.google.internal", destination_service_namespace="unknown", destination_version="unknown", destination_workload="unknown", destination_workload_namespace="unknown", http_method="GET", instance="10.36.9.78:15020", job="kubernetes-pods", kubernetes_namespace="service", kubernetes_pod_name="trigger-dev-v2-0-0-dev-79dd6b95fb-vpzfm", pod_template_hash="79dd6b95fb", reporter="source", request_protocol="http", response_code="200", response_flags="-", security_istio_io_tlsMode="istio", service_istio_io_canonical_name="trigger-dev", service_istio_io_canonical_revision="v2-0-0-dev", source_app="trigger-dev", source_canonical_revision="v2-0-0-dev", source_canonical_service="trigger-dev", source_cluster="Kubernetes", source_principal="unknown", source_version="v2-0-0-dev", source_workload="trigger-dev-v2-0-0-dev", source_workload_namespace="service", url="/computeMetadata/v1/instance/hostname", version="v2-0-0-dev"} | 1
istio_requests_total{app="trigger-dev", client_name="unknown", client_type="unknown", connection_security_policy="unknown", destination_app="unknown", destination_canonical_revision="latest", destination_canonical_service="unknown", destination_cluster="unknown", destination_principal="unknown", destination_service="metadata.google.internal", destination_service_name="metadata.google.internal", destination_service_namespace="unknown", destination_version="unknown", destination_workload="unknown", destination_workload_namespace="unknown", http_method="GET", instance="10.36.9.78:15020", job="kubernetes-pods", kubernetes_namespace="service", kubernetes_pod_name="trigger-dev-v2-0-0-dev-79dd6b95fb-vpzfm", pod_template_hash="79dd6b95fb", reporter="source", request_protocol="http", response_code="200", response_flags="-", security_istio_io_tlsMode="istio", service_istio_io_canonical_name="trigger-dev", service_istio_io_canonical_revision="v2-0-0-dev", source_app="trigger-dev", source_canonical_revision="v2-0-0-dev", source_canonical_service="trigger-dev", source_cluster="Kubernetes", source_principal="unknown", source_version="v2-0-0-dev", source_workload="trigger-dev-v2-0-0-dev", source_workload_namespace="service", url="/computeMetadata/v1/instance/id", version="v2-0-0-dev"} | 2
istio_requests_total{app="trigger-dev", client_name="unknown", client_type="unknown", connection_security_policy="unknown", destination_app="unknown", destination_canonical_revision="latest", destination_canonical_service="unknown", destination_cluster="unknown", destination_principal="unknown", destination_service="metadata.google.internal", destination_service_name="metadata.google.internal", destination_service_namespace="unknown", destination_version="unknown", destination_workload="unknown", destination_workload_namespace="unknown", http_method="GET", instance="10.36.9.78:15020", job="kubernetes-pods", kubernetes_namespace="service", kubernetes_pod_name="trigger-dev-v2-0-0-dev-79dd6b95fb-vpzfm", pod_template_hash="79dd6b95fb", reporter="source", request_protocol="http", response_code="200", response_flags="-", security_istio_io_tlsMode="istio", service_istio_io_canonical_name="trigger-dev", service_istio_io_canonical_revision="v2-0-0-dev", source_app="trigger-dev", source_canonical_revision="v2-0-0-dev", source_canonical_service="trigger-dev", source_cluster="Kubernetes", source_principal="unknown", source_version="v2-0-0-dev", source_workload="trigger-dev-v2-0-0-dev", source_workload_namespace="service", url="/computeMetadata/v1/instance/zone", version="v2-0-0-dev"} | 2
istio_requests_total{app="trigger-dev", client_name="unknown", client_type="unknown", connection_security_policy="unknown", destination_app="unknown", destination_canonical_revision="latest", destination_canonical_service="unknown", destination_cluster="unknown", destination_principal="unknown", destination_service="metadata.google.internal", destination_service_name="metadata.google.internal", destination_service_namespace="unknown", destination_version="unknown", destination_workload="unknown", destination_workload_namespace="unknown", http_method="GET", instance="10.36.9.78:15020", job="kubernetes-pods", kubernetes_namespace="service", kubernetes_pod_name="trigger-dev-v2-0-0-dev-79dd6b95fb-vpzfm", pod_template_hash="79dd6b95fb", reporter="source", request_protocol="http", response_code="200", response_flags="-", security_istio_io_tlsMode="istio", service_istio_io_canonical_name="trigger-dev", service_istio_io_canonical_revision="v2-0-0-dev", source_app="trigger-dev", source_canonical_revision="v2-0-0-dev", source_canonical_service="trigger-dev", source_cluster="Kubernetes", source_principal="unknown", source_version="v2-0-0-dev", source_workload="trigger-dev-v2-0-0-dev", source_workload_namespace="service", url="/computeMetadata/v1/project/project-id", version="v2-0-0-dev"} | 5
istio_requests_total{app="trigger-dev", client_name="unknown", client_type="unknown", connection_security_policy="unknown", destination_app="unknown", destination_canonical_revision="latest", destination_canonical_service="unknown", destination_cluster="unknown", destination_principal="unknown", destination_service="metadata.google.internal", destination_service_name="metadata.google.internal", destination_service_namespace="unknown", destination_version="unknown", destination_workload="unknown", destination_workload_namespace="unknown", http_method="GET", instance="10.36.9.78:15020", job="kubernetes-pods", kubernetes_namespace="service", kubernetes_pod_name="trigger-dev-v2-0-0-dev-79dd6b95fb-vpzfm", pod_template_hash="79dd6b95fb", reporter="source", request_protocol="http", response_code="404", response_flags="-", security_istio_io_tlsMode="istio", service_istio_io_canonical_name="trigger-dev", service_istio_io_canonical_revision="v2-0-0-dev", source_app="trigger-dev", source_canonical_revision="v2-0-0-dev", source_canonical_service="trigger-dev", source_cluster="Kubernetes", source_principal="unknown", source_version="v2-0-0-dev", source_workload="trigger-dev-v2-0-0-dev", source_workload_namespace="service", url="/computeMetadata/v1/instance/attributes/container-name", version="v2-0-0-dev"} | 1
istio_requests_total{app="trigger-dev", client_name="unknown", client_type="unknown", connection_security_policy="unknown", destination_app="unknown", destination_canonical_revision="latest", destination_canonical_service="unknown", destination_cluster="unknown", destination_principal="unknown", destination_service="metadata.google.internal", destination_service_name="metadata.google.internal", destination_service_namespace="unknown", destination_version="unknown", destination_workload="unknown", destination_workload_namespace="unknown", http_method="GET", instance="10.36.9.78:15020", job="kubernetes-pods", kubernetes_namespace="service", kubernetes_pod_name="trigger-dev-v2-0-0-dev-79dd6b95fb-vpzfm", pod_template_hash="79dd6b95fb", reporter="source", request_protocol="http", response_code="404", response_flags="-", security_istio_io_tlsMode="istio", service_istio_io_canonical_name="trigger-dev", service_istio_io_canonical_revision="v2-0-0-dev", source_app="trigger-dev", source_canonical_revision="v2-0-0-dev", source_canonical_service="trigger-dev", source_cluster="Kubernetes", source_principal="unknown", source_version="v2-0-0-dev", source_workload="trigger-dev-v2-0-0-dev", source_workload_namespace="service", url="/computeMetadata/v1/instance/attributes/namespace-id", version="v2-0-0-dev"} | 1
istio_requests_total{app="trigger-dev", client_name="unknown", client_type="unknown", connection_security_policy="unknown", destination_app="unknown", destination_canonical_revision="latest", destination_canonical_service="unknown", destination_cluster="unknown", destination_principal="unknown", destination_service="metadata.google.internal", destination_service_name="metadata.google.internal", destination_service_namespace="unknown", destination_version="unknown", destination_workload="unknown", destination_workload_namespace="unknown", http_method="GET", instance="10.36.9.78:15020", job="kubernetes-pods", kubernetes_namespace="service", kubernetes_pod_name="trigger-dev-v2-0-0-dev-79dd6b95fb-vpzfm", pod_template_hash="79dd6b95fb", reporter="source", request_protocol="http", response_code="404", response_flags="-", security_istio_io_tlsMode="istio", service_istio_io_canonical_name="trigger-dev", service_istio_io_canonical_revision="v2-0-0-dev", source_app="trigger-dev", source_canonical_revision="v2-0-0-dev", source_canonical_service="trigger-dev", source_cluster="Kubernetes", source_principal="unknown", source_version="v2-0-0-dev", source_workload="trigger-dev-v2-0-0-dev", source_workload_namespace="service", url="/computeMetadata/v1/instance/machine-type", version="v2-0-0-dev"} | 1
istio_requests_total{app="trigger-dev", client_name="unknown", client_type="unknown", connection_security_policy="unknown", destination_app="unknown", destination_canonical_revision="latest", destination_canonical_service="unknown", destination_cluster="unknown", destination_principal="unknown", destination_service="metadata.google.internal", destination_service_name="metadata.google.internal", destination_service_namespace="unknown", destination_version="unknown", destination_workload="unknown", destination_workload_namespace="unknown", http_method="GET", instance="10.36.9.78:15020", job="kubernetes-pods", kubernetes_namespace="service", kubernetes_pod_name="trigger-dev-v2-0-0-dev-79dd6b95fb-vpzfm", pod_template_hash="79dd6b95fb", reporter="source", request_protocol="http", response_code="404", response_flags="-", security_istio_io_tlsMode="istio", service_istio_io_canonical_name="trigger-dev", service_istio_io_canonical_revision="v2-0-0-dev", source_app="trigger-dev", source_canonical_revision="v2-0-0-dev", source_canonical_service="trigger-dev", source_cluster="Kubernetes", source_principal="unknown", source_version="v2-0-0-dev", source_workload="trigger-dev-v2-0-0-dev", source_workload_namespace="service", url="/computeMetadata/v1/instance/name", version="v2-0-0-dev"} | 1
istio_requests_total{app="trigger-dev", client_name="unknown", client_type="unknown", connection_security_policy="unknown", destination_app="unknown", destination_canonical_revision="latest", destination_canonical_service="unknown", destination_cluster="unknown", destination_principal="unknown", destination_service="metadata.google.internal", destination_service_name="metadata.google.internal", destination_service_namespace="unknown", destination_version="unknown", destination_workload="unknown", destination_workload_namespace="unknown", http_method="GET", instance="10.36.9.78:15020", job="kubernetes-pods", kubernetes_namespace="service", kubernetes_pod_name="trigger-dev-v2-0-0-dev-79dd6b95fb-vpzfm", pod_template_hash="79dd6b95fb", reporter="source", request_protocol="http", response_code="404", response_flags="-", security_istio_io_tlsMode="istio", service_istio_io_canonical_name="trigger-dev", service_istio_io_canonical_revision="v2-0-0-dev", source_app="trigger-dev", source_canonical_revision="v2-0-0-dev", source_canonical_service="trigger-dev", source_cluster="Kubernetes", source_principal="unknown", source_version="v2-0-0-dev", source_workload="trigger-dev-v2-0-0-dev", source_workload_namespace="service", url="/latest/dynamic/instance-identity/document", version="v2-0-0-dev"} | 1
Hmm, I am wondering if there is any regression in the metric customization code path. Have you used this customization on a version before 1.11? Was there any leak with earlier versions? One thing that would be great to try out, to help us pinpoint the issue, is to customize the metric with only a bounded dimension, like http_method: request.method, and see if it still leaks. If it still leaks, that suggests something is wrong with the metric customization path.
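For example, a minimal override for that test could look like the following (a sketch derived from your sample above, keeping only the bounded dimension):

telemetry:
  v2:
    prometheus:
      configOverride:
        inboundSidecar:
          metrics:
          - name: request_duration_milliseconds
            dimensions:
              http_method: request.method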
Thank you. I believe there is no regression in the customization code path. And I didn't use customization like this in previous versions. I'll run some tests related to the memory leak and comment again if I think it's a bug or I have any questions.
The above tests were conducted in an environment that was not strictly controlled, so there was some confusion. I'm using the OpenTelemetry Collector for custom metrics (https://github.com/open-telemetry/opentelemetry-collector). As a result of the final check, the memory issue occurred when using the gRPC protocol in the OpenTelemetry agent/collector, regardless of unbounded tags.
There is an issue I found on the envoy side, could it be related to this? https://github.com/envoyproxy/envoy/issues/15904
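For quick comparison, the essential difference between the two config sets below is how the agent exports metrics and traces to the collector (and the matching collector receivers); the rest of the pipelines are unchanged. A condensed view of the agent's metrics exporter in each set:

# first set (where the issue occurred): export over gRPC (OpenCensus)
opencensus/metrics:
  endpoint: otel-collector.monitoring.svc.cluster.local:55678
  insecure: true
# second set: export over OTLP/HTTP
otlphttp/metrics:
  endpoint: http://otel-collector.monitoring.svc.cluster.local:4318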
opentelemetry agent (exporting over gRPC/OpenCensus)
{{- range $.Values.namespaces }}
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-agent-conf
  namespace: {{ .name | title | lower }}
  labels:
    app: otel-agent
    component: otel-agent-conf
data:
  otel-agent-config: |
    receivers:
      opencensus:
        endpoint: 0.0.0.0:55678
      zipkin:
        endpoint: 0.0.0.0:9411
      zipkin/test:
        endpoint: 0.0.0.0:9412
      prometheus:
        config:
          scrape_configs:
          - job_name: "app_infra"
            scrape_interval: 10s
            metrics_path: "/actuator/prometheus"
            static_configs:
            - targets: ['localhost:9999']
    exporters:
      logging:
        loglevel: debug
      zipkin/otelcol:
        endpoint: "http://otel-collector.istio-system.svc.cluster.local:9411/api/v2/spans" # Replace with a real endpoint.
        format: proto
      opencensus/trace:
        endpoint: otel-collector.monitoring.svc.cluster.local:55678
        insecure: true
        sending_queue:
          enabled: false
          num_consumers: 10
          queue_size: 50000
        retry_on_failure:
          enabled: true
          initial_interval: 5s
          max_interval: 30s
          max_elapsed_time: 2m
      opencensus/metrics:
        endpoint: otel-collector.monitoring.svc.cluster.local:55678
        insecure: true
        sending_queue:
          enabled: false
          num_consumers: 10
          queue_size: 50000
        retry_on_failure:
          enabled: true
          initial_interval: 5s
          max_interval: 30s
          max_elapsed_time: 2m
      stackdriver/customname:
        project: nm-prod-global-userdata-hub
    processors:
      memory_limiter:
        {{- toYaml .memory_limiter | nindent 8 }}
    extensions:
      memory_ballast:
        {{- toYaml .memory_ballast | nindent 8 }}
      zpages: {}
      health_check: {}
    service:
      extensions: [zpages, health_check]
      pipelines:
        traces:
          receivers: [zipkin]
          processors: [memory_limiter]
          exporters: [opencensus/trace]
        metrics:
          receivers: [opencensus]
          processors: [memory_limiter]
          exporters: [opencensus/metrics]
{{ end }}
opentelemetry collector (receiving over gRPC/OpenCensus)
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-conf
  namespace: {{ .Values.namespace }}
  labels:
    app: otel-collector
    version: {{ $.Chart.AppVersion }}
data:
  otel-collector-config: |
    receivers:
      opencensus:
        endpoint: 0.0.0.0:55678
      zipkin:
        endpoint: 0.0.0.0:9411
    processors:
      probabilistic_sampler:
        hash_seed: 22
        sampling_percentage: 15.3
      span/custom:
        name:
          from_attributes: [http.url]
      attributes/tracetag:
        actions:
          # The following demonstrates how to set an attribute on all spans.
          # Any spans that already had `region` now have value `planet-earth`.
          # This can be done to set properties for all traces without
          # requiring an instrumentation change.
          # - key: region
          #   value: "planet-earth"
          #   action: upsert
          - key: "project"
            value: "userdatahub"
            action: upsert
          - key: "http.url"
            from_attribute: "/http/url"
            action: upsert
          - key: "x-request-id"
            from_attribute: "guid:x-request-id"
            action: upsert
          - key: "/http/url"
            action: delete
          - key: "guid:x-request-id"
            action: delete
      tail_sampling:
        decision_wait: 30s
        num_traces: 500000
        policies: # OR condition
          [
            {
              name: namespace-data,
              type: string_attribute,
              string_attribute: {key: istio.namespace, values: [data, service, default]}
            }
          ]
      tail_sampling/2:
        decision_wait: 5s
        num_traces: 500000
        policies: # OR condition
          [
            {
              name: http-not-success,
              type: string_attribute,
              string_attribute: {key: http.status_code, values: [30*,40*,41*,50*], enabled_regex_matching: true}
            },
            {
              name: delay-policy,
              type: latency,
              latency: {threshold_ms: 3000}
            }
            # ,
          ]
      batch:
        send_batch_size: 500000
        timeout: 30s
      batch/metrics:
        send_batch_size: 1000
        timeout: 60s
      batch/appmetrics:
        send_batch_size: 1000
        timeout: 60s
      memory_limiter:
        check_interval: {{ $.Values.memLimit.checkInterval }}
        limit_mib: {{ $.Values.memLimit.limitMib }}
      filter/k8s:
        metrics:
          # include:
          #   match_type: regexp
          #   metric_names:
          #     - prefix/.*
          #     - prefix_.*
          exclude:
            match_type: regexp
            metric_names:
              - go_info
              - istio_response_bytes
              - istio_request_bytes
              - istio_request_duration_milliseconds
    extensions:
      memory_ballast:
        size_mib: {{ $.Values.memLimit.ballastSizeMib }}
      health_check:
      pprof:
        endpoint: :1888
      zpages:
        endpoint: :55679
    exporters:
      logging:
        loglevel: debug
      prometheus/istio:
        endpoint: "0.0.0.0:9090"
        const_labels:
          project: userdatahub
      zipkin:
        endpoint: "http://zipkin.monitoring.svc.cluster.local:9411/api/v2/spans"
      stackdriver/tracing:
        project: nm-prod-global-userdata-hub
      stackdriver/metrics_app:
        project: nm-prod-global-userdata-hub
        metric:
          prefix: {{ $.Values.metric.prefix }}
          skip_create_descriptor: true
    service:
      extensions: [pprof, zpages, health_check]
      pipelines:
        traces:
          receivers: [opencensus, zipkin]
          processors: [memory_limiter, tail_sampling, tail_sampling/2, attributes/tracetag, batch]
          exporters: [zipkin]
        metrics/app:
          receivers: [opencensus]
          processors: [memory_limiter]
          exporters: [prometheus/istio]
opentelemetry agent (exporting over OTLP/HTTP)
{{- range $.Values.namespaces }}
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-agent-conf
  namespace: {{ .name | title | lower }} # must be installed in the default, service, and data namespaces
  labels:
    app: otel-agent
    component: otel-agent-conf
data:
  otel-agent-config: |
    receivers:
      opencensus:
        endpoint: 0.0.0.0:55678
      zipkin:
        endpoint: 0.0.0.0:9411
      zipkin/test:
        endpoint: 0.0.0.0:9412
      prometheus:
        config:
          scrape_configs:
          - job_name: "app_infra"
            scrape_interval: 10s
            metrics_path: "/actuator/prometheus"
            static_configs:
            - targets: ['localhost:9999']
    exporters:
      logging:
        loglevel: debug
      zipkin/otelcol:
        endpoint: "http://otel-collector.istio-system.svc.cluster.local:9411/api/v2/spans" # Replace with a real endpoint.
        format: proto
      opencensus/trace:
        endpoint: otel-collector.monitoring.svc.cluster.local:55678
        insecure: true
        sending_queue:
          enabled: false
          num_consumers: 10
          queue_size: 50000
        retry_on_failure:
          enabled: true
          initial_interval: 5s
          max_interval: 30s
          max_elapsed_time: 2m
      opencensus/metrics:
        endpoint: otel-collector.monitoring.svc.cluster.local:55678
        insecure: true
        sending_queue:
          enabled: false
          num_consumers: 10
          queue_size: 50000
        retry_on_failure:
          enabled: true
          initial_interval: 5s
          max_interval: 30s
          max_elapsed_time: 2m
      stackdriver/customname:
        project: nm-prod-global-userdata-hub
      otlphttp/metrics:
        endpoint: http://otel-collector.monitoring.svc.cluster.local:4318
        insecure: true
      otlphttp/traces:
        endpoint: http://otel-collector.monitoring.svc.cluster.local:4319
        traces_endpoint: http://otel-collector.monitoring.svc.cluster.local:4319/v1/traces
        insecure: true
    processors:
      memory_limiter:
        {{- toYaml .memory_limiter | nindent 8 }}
    extensions:
      memory_ballast:
        {{- toYaml .memory_ballast | nindent 8 }}
      zpages: {}
      health_check: {}
    service:
      extensions: [zpages, health_check]
      pipelines:
        traces:
          receivers: [zipkin]
          processors: [memory_limiter]
          exporters: [otlphttp/traces]
        metrics:
          receivers: [opencensus]
          processors: [memory_limiter]
          exporters: [otlphttp/metrics]
{{ end }}
opentelemetry collector (receiving over OTLP/HTTP)
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-conf
  namespace: {{ .Values.namespace }}
  labels:
    app: otel-collector
    version: {{ $.Chart.AppVersion }}
data:
  otel-collector-config: |
    receivers:
      opencensus:
        endpoint: 0.0.0.0:55678
      zipkin:
        endpoint: 0.0.0.0:9411
      otlp/metrics:
        protocols:
          http:
            endpoint: 0.0.0.0:4318
      otlp/traces:
        protocols:
          http:
            endpoint: 0.0.0.0:4319
    processors:
      probabilistic_sampler:
        hash_seed: 22
        sampling_percentage: 15.3
      span/custom:
        name:
          from_attributes: [http.url]
      attributes/tracetag:
        actions:
          # The following demonstrates how to set an attribute on all spans.
          # Any spans that already had `region` now have value `planet-earth`.
          # This can be done to set properties for all traces without
          # requiring an instrumentation change.
          # - key: region
          #   value: "planet-earth"
          #   action: upsert
          - key: "project"
            value: "userdatahub"
            action: upsert
          - key: "http.url"
            from_attribute: "/http/url"
            action: upsert
          - key: "x-request-id"
            from_attribute: "guid:x-request-id"
            action: upsert
          - key: "/http/url"
            action: delete
          - key: "guid:x-request-id"
            action: delete
      tail_sampling:
        decision_wait: 30s
        num_traces: 500000
        policies: # OR condition
          [
            {
              name: namespace-data,
              type: string_attribute,
              string_attribute: {key: istio.namespace, values: [data, service, default]}
            }
          ]
      tail_sampling/2:
        decision_wait: 5s
        num_traces: 500000
        policies: # OR condition
          [
            {
              name: http-not-success,
              type: string_attribute,
              string_attribute: {key: http.status_code, values: [30*,40*,41*,50*], enabled_regex_matching: true}
            },
            {
              name: delay-policy,
              type: latency,
              latency: {threshold_ms: 3000}
            }
            # ,
          ]
      batch:
        send_batch_size: 500000
        timeout: 30s
      batch/metrics:
        send_batch_size: 1000
        timeout: 60s
      batch/appmetrics:
        send_batch_size: 1000
        timeout: 60s
      memory_limiter:
        check_interval: {{ $.Values.memLimit.checkInterval }}
        limit_mib: {{ $.Values.memLimit.limitMib }}
      filter/k8s:
        metrics:
          # include:
          #   match_type: regexp
          #   metric_names:
          #     - prefix/.*
          #     - prefix_.*
          exclude:
            match_type: regexp
            metric_names:
              - go_info
              - istio_response_bytes
              - istio_request_bytes
              - istio_request_duration_milliseconds
    extensions:
      memory_ballast:
        size_mib: {{ $.Values.memLimit.ballastSizeMib }}
      health_check:
      pprof:
        endpoint: :1888
      zpages:
        endpoint: :55679
    exporters:
      logging:
        loglevel: debug
      prometheus/istio:
        endpoint: "0.0.0.0:9090"
        const_labels:
          project: userdatahub
      zipkin:
        endpoint: "http://zipkin.monitoring.svc.cluster.local:9411/api/v2/spans"
      stackdriver/tracing:
        project: nm-prod-global-userdata-hub
      stackdriver/metrics_app:
        project: nm-prod-global-userdata-hub
        metric:
          prefix: {{ $.Values.metric.prefix }}
          skip_create_descriptor: true
    service:
      extensions: [pprof, zpages, health_check]
      pipelines:
        traces:
          receivers: [otlp/traces, zipkin]
          processors: [memory_limiter, tail_sampling, tail_sampling/2, attributes/tracetag, batch]
          exporters: [zipkin]
        metrics/app:
          receivers: [otlp/metrics]
          processors: [memory_limiter]
          exporters: [prometheus/istio]
There is an issue I found on the envoy side, could it be related to this? envoyproxy/envoy#15904
Yeah seems like it.
A comment on the envoyproxy/envoy#15904 issue says to configure the overload manager. Is it possible to configure the overload manager in Istio? If I have to use gRPC, how do I configure it in Istio?
You should be able to configure it with EnvoyFilter, which allows you to customize Bootstrap.
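For example, something along these lines (an untested sketch; the fixed_heap size and thresholds are placeholder values you would need to tune, following the overload manager example in the Envoy docs):

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: overload-manager
  namespace: istio-system
spec:
  configPatches:
  - applyTo: BOOTSTRAP
    patch:
      operation: MERGE
      value:
        overload_manager:
          refresh_interval: 0.25s
          resource_monitors:
          - name: envoy.resource_monitors.fixed_heap
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.resource_monitors.fixed_heap.v3.FixedHeapConfig
              max_heap_size_bytes: 1073741824  # 1 GiB placeholder; size to your proxy's memory limit
          actions:
          - name: envoy.overload_actions.shrink_heap
            triggers:
            - name: envoy.resource_monitors.fixed_heap
              threshold:
                value: 0.95
          - name: envoy.overload_actions.stop_accepting_requests
            triggers:
            - name: envoy.resource_monitors.fixed_heap
              threshold:
                value: 0.98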
I will close this issue because the cause and the solution have been identified. I'll reopen it if necessary. Thank you!
Bug Description
Istio proxy memory usage keeps increasing until an OOM error occurs; in the end, the Istio proxy's memory exceeds 2GB. The service is called very rarely (no more than 1 TPS).
Version
Additional Information
profile021.pdf
bug-report.txt