prometheus / cloudwatch_exporter

Metrics exporter for Amazon AWS CloudWatch
Apache License 2.0

[metrics]: Exporter randomly detached from service-account #670

Open b-lancaster opened 6 months ago

b-lancaster commented 6 months ago

Context information

Exporter configuration

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cloudwatch-monitoring-general
  namespace: monitoring
data:
  config.yml: |
    ---
    region: us-east-1
    delay_seconds: 0
    set_timestamp: false
    use_get_metric_data: true
    metrics:
    - aws_namespace: AWS/Lambda
      aws_metric_name: Errors
      aws_dimensions: [FunctionName]
      aws_statistics: [Sum]
    - aws_namespace: AWS/Lambda
      aws_metric_name: Invocations
      aws_dimensions: [FunctionName]
      aws_statistics: [Sum]
    - aws_namespace: AWS/Lambda
      aws_metric_name: Duration
      aws_dimensions: [FunctionName]
      aws_statistics: [Average]
    - aws_namespace: AWS/Lambda
      aws_metric_name: Throttles
      aws_dimensions: [FunctionName]
      aws_statistics: [Sum]
    - aws_namespace: AWS/Lambda
      aws_metric_name: OffsetLag
      aws_dimensions: [FunctionName]
      aws_statistics: [Maximum]
    - aws_namespace: AWS/ES
      aws_metric_name: ThreadpoolIndexQueue
      aws_dimensions: [ClientId, DomainName]
      aws_extended_statistics: [p95]
    - aws_namespace: AWS/ES
      aws_metric_name: ThreadpoolWriteQueue
      aws_dimensions: [ClientId, DomainName]
      aws_extended_statistics: [p95]
    - aws_namespace: AWS/ES
      aws_metric_name: ThreadpoolSearchQueue
      aws_dimensions: [ClientId, DomainName]
      aws_extended_statistics: [p95]
    - aws_namespace: AWS/ES
      aws_metric_name: ThreadpoolIndexQueue
      aws_dimensions: [ClientId, DomainName]
      aws_statistics: [Average]
    - aws_namespace: AWS/ES
      aws_metric_name: ThreadpoolWriteQueue
      aws_dimensions: [ClientId, DomainName]
      aws_statistics: [Average]
    - aws_namespace: AWS/ES
      aws_metric_name: ThreadpoolSearchQueue
      aws_dimensions: [ClientId, DomainName]
      aws_statistics: [Average]
    - aws_namespace: AWS/ES
      aws_metric_name: WriteLatency
      aws_dimensions: [ClientId, DomainName]
      aws_extended_statistics: [p95]
    - aws_namespace: AWS/ES
      aws_metric_name: WriteLatency
      aws_dimensions: [ClientId, DomainName]
      aws_statistics: [Average]
    - aws_namespace: AWS/ES
      aws_metric_name: ReadLatency
      aws_dimensions: [ClientId, DomainName]
      aws_extended_statistics: [p95]
    - aws_namespace: AWS/ES
      aws_metric_name: SearchLatency
      aws_dimensions: [ClientId, DomainName]
      aws_extended_statistics: [p95]
    - aws_namespace: AWS/ES
      aws_metric_name: SearchLatency
      aws_dimensions: [ClientId, DomainName]
      aws_statistics: [Average]
    - aws_namespace: AWS/ES
      aws_metric_name: IndexingLatency
      aws_dimensions: [ClientId, DomainName]
      aws_extended_statistics: [p95]
    - aws_namespace: AWS/ES
      aws_metric_name: IndexingLatency
      aws_dimensions: [ClientId, DomainName]
      aws_statistics: [Average]
    - aws_namespace: AWS/ES
      aws_metric_name: IndexingRate
      aws_dimensions: [ClientId, DomainName]
      aws_statistics: [Average]
    - aws_namespace: AWS/ES
      aws_metric_name: IndexingRate
      aws_dimensions: [ClientId, DomainName]
      aws_extended_statistics: [p95]
    - aws_namespace: AWS/ES
      aws_metric_name: SearchRate
      aws_dimensions: [ClientId, DomainName]
      aws_statistics: [Average]
    - aws_namespace: AWS/ES
      aws_metric_name: SearchRate
      aws_dimensions: [ClientId, DomainName]
      aws_extended_statistics: [p95]
    - aws_namespace: AWS/ES
      aws_metric_name: 5xx
      aws_dimensions: [ClientId, DomainName]
      aws_statistics: [Sum]
    - aws_namespace: AWS/ES
      aws_metric_name: 2xx
      aws_dimensions: [ClientId, DomainName]
      aws_statistics: [Sum]
    - aws_namespace: AWS/ES
      aws_metric_name: 3xx
      aws_dimensions: [ClientId, DomainName]
      aws_statistics: [Sum]
    - aws_namespace: AWS/ES
      aws_metric_name: 4xx
      aws_dimensions: [ClientId, DomainName]
      aws_statistics: [Sum]
    - aws_namespace: AWS/ES
      aws_metric_name: ClusterStatus.red
      aws_dimensions: [ClientId, DomainName]
      aws_statistics: [Maximum]
    - aws_namespace: AWS/ES
      aws_metric_name: ClusterStatus.yellow
      aws_dimensions: [ClientId, DomainName]
      aws_statistics: [Maximum]
    - aws_namespace: AWS/ES
      aws_metric_name: ClusterIndexWritesBlocked
      aws_dimensions: [ClientId, DomainName]
      aws_statistics: [Average]
    - aws_namespace: AWS/ES
      aws_metric_name: Nodes
      aws_dimensions: [ClientId, DomainName]
      aws_statistics: [Minimum]
    - aws_namespace: AWS/ES
      aws_metric_name: AutomatedSnapshotFailure
      aws_dimensions: [ClientId, DomainName]
      aws_statistics: [Maximum]
    - aws_namespace: AWS/ES
      aws_metric_name: KibanaHealthyNodes
      aws_dimensions: [ClientId, DomainName]
      aws_statistics: [Minimum]
    - aws_namespace: AWS/ES
      aws_metric_name: CPUUtilization
      aws_dimensions: [ClientId, DomainName]
      aws_statistics: [Average]
    - aws_namespace: AWS/ES
      aws_metric_name: FreeStorageSpace
      aws_dimensions: [ClientId, DomainName]
      aws_statistics: [Minimum]
    - aws_namespace: AWS/ES
      aws_metric_name: JVMMemoryPressure
      aws_dimensions: [ClientId, DomainName]
      aws_statistics: [Maximum]
```
Service Account

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/name: cloudwatch-exporter
  name: cloudwatch-exporter
  namespace: monitoring
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam:::role/CloudWatchMetricsReadOnlyRole
```
IAM Role

```yaml
# Role for cloudwatch metrics exporter
rCloudWatchMetricsReadOnlyRole:
  Type: 'AWS::IAM::Role'
  Properties:
    AssumeRolePolicyDocument: !Sub
      - |
        {
          "Version": "2012-10-17",
          "Statement": [
            {
              "Effect": "Allow",
              "Principal": { "Federated": "${IamOidcProviderArn}" },
              "Action": "sts:AssumeRoleWithWebIdentity",
              "Condition": {
                "StringEquals": {
                  "${OidcProviderEndpoint}:sub": "system:serviceaccount:monitoring:cloudwatch-exporter"
                }
              }
            }
          ]
        }
      - IamOidcProviderArn: !Ref pOidcProviderArn
        OidcProviderEndpoint: !Ref pIssuerHostPath
    Path: /
    ManagedPolicyArns:
      - arn:aws:iam::aws:policy/CloudWatchReadOnlyAccess
```
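As an aside, the managed `CloudWatchReadOnlyAccess` policy is much broader than what the exporter needs. If you want a narrower grant (and a clearer signal when a specific permission such as `cloudwatch:GetMetricData` is missing), an inline policy along these lines should cover the exporter's API calls per its README; the statement name below is illustrative:

```yaml
    Policies:
      - PolicyName: cloudwatch-exporter-minimal  # illustrative name
        PolicyDocument:
          Version: "2012-10-17"
          Statement:
            - Effect: Allow
              Action:
                # Calls made by the exporter; tag:GetResources is only
                # needed if aws_tag_select is used in the config.
                - cloudwatch:ListMetrics
                - cloudwatch:GetMetricStatistics
                - cloudwatch:GetMetricData
                - tag:GetResources
              Resource: "*"
```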
Deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cloudwatch-metrics-exporter-general
  labels:
    app.kubernetes.io/name: cloudwatch-metrics-exporter
    app.kubernetes.io/instance: cloudwatch-metrics-exporter-general
  namespace: monitoring
  annotations:
    reloader.stakater.com/auto: "true"
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: cloudwatch-metrics-exporter
      app.kubernetes.io/instance: cloudwatch-metrics-exporter-general
  template:
    metadata:
      labels:
        app.kubernetes.io/name: cloudwatch-metrics-exporter
        app.kubernetes.io/instance: cloudwatch-metrics-exporter-general
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: k8s.swacorp.com/instancegroup
                operator: In
                values:
                - operations-job-nodes
                - arm-operations-job-nodes
              - key: kubernetes.io/arch
                operator: In
                values:
                - arm64
                - amd64
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values:
                - arm64
      tolerations:
      - effect: NoExecute
        key: k8s.swacorp.com/dedicated
        operator: Equal
        value: operations-server
      - effect: NoSchedule
        key: kubernetes.io/arch
        operator: Equal
        value: arm64
      serviceAccountName: cloudwatch-exporter
      containers:
      - name: cloudwatch-metrics-exporter
        image: quay.io/prometheus/cloudwatch-exporter:v0.15.5
        ports:
        - containerPort: 9106
        resources:
          requests:
            cpu: 100m
            memory: 600Mi
        volumeMounts:
        - mountPath: /config
          name: cloudwatch-metric-general
      volumes:
      - configMap:
          name: cloudwatch-monitoring-general
        name: cloudwatch-metric-general
```
Exporter logs

```log
Mar 15, 2024 4:15:46 PM io.prometheus.cloudwatch.CloudWatchCollector collect
WARNING: CloudWatch scrape failed
software.amazon.awssdk.services.cloudwatch.model.CloudWatchException: User: arn:aws:sts:::assumed-role// is not authorized to perform: cloudwatch:GetMetricData because no identity-based policy allows the cloudwatch:GetMetricData action (Service: CloudWatch, Status Code: 403, Request ID: 7079164c-7404-48cf-98e2-8b13d5ccf27a)
    at software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handleErrorResponse(CombinedResponseHandler.java:125)
    at software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handleResponse(CombinedResponseHandler.java:82)
    at software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handle(CombinedResponseHandler.java:60)
    at software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handle(CombinedResponseHandler.java:41)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:50)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:38)
    at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:72)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:42)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:78)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:40)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:55)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:39)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:81)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:36)
    at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
    at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:56)
    at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:36)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.executeWithTimer(ApiCallTimeoutTrackingStage.java:80)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:60)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:42)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:50)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:32)
    at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
    at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:37)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26)
    at software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:224)
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:103)
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:173)
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:80)
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:182)
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:74)
    at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
    at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:53)
    at software.amazon.awssdk.services.cloudwatch.DefaultCloudWatchClient.getMetricData(DefaultCloudWatchClient.java:1249)
    at io.prometheus.cloudwatch.GetMetricDataDataGetter.fetchAllDataPoints(GetMetricDataDataGetter.java:138)
    at io.prometheus.cloudwatch.GetMetricDataDataGetter.<init>(GetMetricDataDataGetter.java:185)
    at io.prometheus.cloudwatch.CloudWatchCollector.scrape(CloudWatchCollector.java:486)
    at io.prometheus.cloudwatch.CloudWatchCollector.collect(CloudWatchCollector.java:642)
    at io.prometheus.client.Collector.collect(Collector.java:45)
    at io.prometheus.client.CollectorRegistry$MetricFamilySamplesEnumeration.findNextElement(CollectorRegistry.java:204)
    at io.prometheus.client.CollectorRegistry$MetricFamilySamplesEnumeration.<init>(CollectorRegistry.java:162)
    at io.prometheus.client.CollectorRegistry$MetricFamilySamplesEnumeration.<init>(CollectorRegistry.java:190)
    at io.prometheus.client.CollectorRegistry.metricFamilySamples(CollectorRegistry.java:129)
    at io.prometheus.client.servlet.common.exporter.Exporter.doGet(Exporter.java:75)
    at io.prometheus.client.servlet.jakarta.exporter.MetricsServlet.doGet(MetricsServlet.java:52)
    at jakarta.servlet.http.HttpServlet.service(HttpServlet.java:500)
    at jakarta.servlet.http.HttpServlet.service(HttpServlet.java:587)
    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:764)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:529)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:221)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1381)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:176)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:484)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:174)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1303)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:129)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
    at org.eclipse.jetty.server.Server.handle(Server.java:563)
    at org.eclipse.jetty.server.HttpChannel$RequestDispatchable.dispatch(HttpChannel.java:1598)
    at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:753)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:501)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:287)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:314)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:100)
    at org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:53)
    at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.runTask(AdaptiveExecutionStrategy.java:421)
    at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.consumeTask(AdaptiveExecutionStrategy.java:390)
    at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.tryProduce(AdaptiveExecutionStrategy.java:277)
    at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.run(AdaptiveExecutionStrategy.java:199)
    at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:411)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:969)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.doRunJob(QueuedThreadPool.java:1194)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1149)
    at java.base/java.lang.Thread.run(Unknown Source)
```

What do you expect to happen?

I expected the cloudwatch-exporter to use the attached service-account with the permissions necessary to retrieve metric data.

What happened instead?

What actually happened was that the cloudwatch-exporter stopped using the service-account and fell back to the k8s node's IAM role. Nothing in our setup changed; we simply stopped receiving the metrics in Prometheus and only then found the errors in the logs.

Restarting the deployment fixed the problem and the exporter started using the service-account again. The concern is that if this had happened in a production environment, the Prometheus alerts we've set up to monitor these metrics wouldn't have met the threshold needed to fire.

Also, without looking at the logs, the pod appeared to be running as normal.
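For anyone hitting this, one way to catch the failure mode above (pod healthy, metrics silently gone) is to alert on the exporter's own health series rather than only on thresholds of the collected metrics. A sketch of such rules, assuming the exporter's `/metrics` endpoint is scraped; the exporter exposes `cloudwatch_exporter_scrape_error`, while `aws_lambda_errors_sum` follows the exporter's metric-naming convention for the config above (rule and group names are illustrative):

```yaml
groups:
- name: cloudwatch-exporter-health  # illustrative name
  rules:
  - alert: CloudWatchExporterScrapeFailing
    # Set to 1 by the exporter when a CloudWatch scrape throws (e.g. the 403 above).
    expr: cloudwatch_exporter_scrape_error > 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "cloudwatch_exporter failed its last CloudWatch scrape; check IAM/IRSA credentials"
  - alert: CloudWatchLambdaMetricsAbsent
    # Fires if an expected series disappears entirely, which thresholds on
    # the series itself can never detect.
    expr: absent(aws_lambda_errors_sum)
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Expected CloudWatch Lambda metrics are no longer being exported"
```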

saarw-opti commented 5 months ago

Same here; I don't know why it's not using the service account, and it prefers to use the Node role.

Ca-moes commented 5 months ago

Same here, on a fresh deployment using https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus-cloudwatch-exporter

Restarting the Deployment does not fix it; it keeps using the Karpenter role.