prometheus-community / helm-charts

Prometheus community Helm charts

[prometheus] AWS EKS Prometheus helm failed pod CrashLoopBackOff #3673

Open harry-hathorn opened 1 year ago

harry-hathorn commented 1 year ago

Describe the bug

Prometheus service failing in CrashLoopBackOff with the logs of level=error ts=2023-08-04T18:22:42.675721017Z caller=runutil.go:100 msg="function failed. Retrying in next tick" err="trigger reload: reload request failed: Post \"http://127.0.0.1:9090/-/reload\": dial tcp 127.0.0.1:9090: connect: connection refused"

All other parts of the release are running normally; only prometheus-server is failing.

What's your helm version?

v3.11.2

What's your kubectl version?

v4.5.7

Which chart?

prometheus-community/prometheus

What's the chart version?

appVersion: v2.46.0
apiVersion: v2
kubeVersion: '>=1.16.0-0'

What happened?

I followed the steps here https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-onboard-ingest-metrics-new-Prometheus.html

In a nutshell, I installed the Prometheus community helm chart on my EKS cluster in AWS https://github.com/prometheus-community/helm-charts.

My Kubernetes version is 1.27

The cluster node group has two t3.large nodes (8 GiB memory, 2 vCPUs each)

After creation, I have:

Pods:

kubectl get pods -n prometheus

NAME                                                        READY   STATUS             RESTARTS         AGE
prometheus-alertmanager-0                            1/1     Running            0                142m
prometheus-kube-state-metrics-69b5df7dc4-kfqc2       1/1     Running            0                142m
prometheus-prometheus-node-exporter-dkksq            1/1     Running            0                142m
prometheus-prometheus-node-exporter-fzr5z            1/1     Running            0                142m
prometheus-prometheus-pushgateway-847c6f4d57-dz5wx   1/1     Running            0                142m
prometheus-server-5fbb47548d-sxm72                   1/2     CrashLoopBackOff   32 (3m58s ago)   142m

Pod logs:

 kubectl logs prometheus-server-5fbb47548d-sxm72 -n prometheus

....
level=error ts=2023-08-04T17:45:27.636671778Z caller=runutil.go:100 msg="function failed. Retrying in next tick" err="trigger reload: reload request failed: Post \"http://127.0.0.1:9090/-/reload\": dial tcp 127.0.0.1:9090: connect: connection refused"

Deployments:

kubectl get deployments -n prometheus
NAME                                       READY   UP-TO-DATE   AVAILABLE   AGE
prometheus-kube-state-metrics       1/1     1            1           144m
prometheus-prometheus-pushgateway   1/1     1            1           144m
prometheus-server                   0/1     1            0           144m

Describe failed deployment:

kubectl describe deployments prometheus-server -n prometheus

...
 Type           Status  Reason
  ----           ------  ------
  Available      False   MinimumReplicasUnavailable
  Progressing    False   ProgressDeadlineExceeded

The EBS volumes have successfully bound.

It seems that the prometheus-server pod is stuck in CrashLoopBackOff because it can't connect to itself on 127.0.0.1:9090

I have been searching for an answer all day and trying different things but can't solve the issue.

My Helm values look like this:

## The following is a set of default values for prometheus server helm chart which enable remoteWrite to AMP
## For the rest of prometheus helm chart values see: https://github.com/prometheus-community/helm-charts/blob/main/charts/prometheus/values.yaml
##
serviceAccounts:
  server:
    name: amp-iamproxy-ingest-service-account
    annotations: 
      eks.amazonaws.com/role-arn: arn:aws:iam:###########:role/service-role/Amazon_EventBridge_Invoke_Api_Destination_###########
server:
  remoteWrite:
    - url: https://aps-workspaces.eu-west-1.amazonaws.com/workspaces/#############/api/v1/remote_write
      sigv4:
        region: eu-west-1
      queue_config:
        max_samples_per_send: 1000
        max_shards: 200
        capacity: 2500
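
As a sanity check, the IRSA annotation can be verified on the live ServiceAccount like this (a sketch; the account name and namespace are the ones used above):

kubectl get serviceaccount amp-iamproxy-ingest-service-account -n prometheus -o yaml
# The eks.amazonaws.com/role-arn annotation should point at the AMP ingest role.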

What you expected to happen?

I expect the prometheus-server pod to run

How to reproduce it?

Add helm chart repos:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add kube-state-metrics https://kubernetes.github.io/kube-state-metrics
helm repo update

Create the namespace:

kubectl create namespace prometheus-namespace

For Amazon EKS, set up service roles for the ingestion of metrics from Amazon EKS clusters.

Create your values yaml file:

serviceAccounts:
  server:
    name: amp-iamproxy-ingest-service-account
    annotations:
      eks.amazonaws.com/role-arn: ${IAM_PROXY_PROMETHEUS_ROLE_ARN}
server:
  remoteWrite:

Run the command:

helm install prometheus-chart-name prometheus-community/prometheus -n prometheus-namespace \
  -f my_prometheus_values_yaml

Run the command kubectl get deployments -n prometheus and find the failing prometheus service pod

Enter the changed values of values.yaml?

serviceAccounts:
  server:
    name: amp-iamproxy-ingest-service-account
    annotations:
      eks.amazonaws.com/role-arn: ${IAM_PROXY_PROMETHEUS_ROLE_ARN}
server:
  remoteWrite:

Enter the command that you execute that is failing/misfunctioning.

helm install prometheus-chart-name prometheus-community/prometheus -n prometheus-namespace \
  -f my_prometheus_values_yaml

Anything else we need to know?

No response

zeritti commented 1 year ago

> It seems that the prometheus-server pod is stuck in CrashLoopBackOff because it can't connect to itself on 127.0.0.1:9090

The errors you see are being produced by config-reloader, a sidecar container, not by prometheus. It is failing to connect to prometheus because prometheus itself is not running, as seen in the output. You should be able to get more information on the cause of the crash from the prometheus container by specifying the container name (default is prometheus-server) with the -c option, e.g.

kubectl logs POD_NAME -c prometheus-server 

or for all containers

kubectl logs POD_NAME --all-containers=true
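
Since the container keeps restarting, the logs of the previous run are often the most telling; for example (pod name as in the output above):

kubectl logs prometheus-server-5fbb47548d-sxm72 -n prometheus -c prometheus-server --previous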
laszlolaszlo commented 1 year ago

Hello, I have a very similar error, not in AWS but in minikube. The prometheus-server container logs are:

ts=2023-08-21T10:28:33.653Z caller=main.go:590 level=info build_context="(go=go1.20.6, platform=linux/amd64, user=root@42454fc0f41e, date=20230725-12:31:24, tags=netgo,builtinassets,stringlabels)"
ts=2023-08-21T10:28:33.653Z caller=main.go:591 level=info host_details="(Linux 6.2.0-26-generic #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Jul 13 16:27:29 UTC 2 x86_64 prometheus-1692613669-server-b7f497754-4vpxj (none))"
ts=2023-08-21T10:28:33.653Z caller=main.go:592 level=info fd_limits="(soft=1048576, hard=1048576)"
ts=2023-08-21T10:28:33.653Z caller=main.go:593 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2023-08-21T10:28:33.653Z caller=query_logger.go:93 level=error component=activeQueryTracker msg="Error opening query log file" file=/data/queries.active err="open /data/queries.active: permission denied"
panic: Unable to create mmap-ed active query log

goroutine 1 [running]:
github.com/prometheus/prometheus/promql.NewActiveQueryTracker({0x7ffde774e368, 0x5}, 0x14, {0x3e97c00, 0xc0006dcaa0})
    /app/promql/query_logger.go:123 +0x42d
main.main()
    /app/cmd/prometheus/main.go:647 +0x74d3
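
If the data directory simply is not writable by the non-root prometheus user, a values override along these lines usually addresses it. This is a sketch only, and the exact key name (podSecurityContext vs. securityContext) depends on the chart version:

server:
  # Sketch: let Kubernetes chown the data volume to the group the prometheus
  # process runs as (65534 = nobody/nogroup in the upstream image).
  podSecurityContext:
    runAsUser: 65534
    runAsNonRoot: true
    runAsGroup: 65534
    fsGroup: 65534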
laszlolaszlo commented 1 year ago

In chart there is a persistentVolume definition:

persistentVolume:
    accessModes:
    - ReadWriteOnce
    annotations: {}
    enabled: true
    existingClaim: ""
    labels: {}
    mountPath: /data
    size: 8Gi
    statefulSetNameOverride: ""
    subPath: ""

In Kubernetes the PV and PVC are created correctly.

kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                                     STORAGECLASS   REASON   AGE
pvc-72df14b9-b2ef-4a71-b5bd-3623ceb33d77   2Gi        RWO            Delete           Bound    monitoring/storage-prometheus-1692620441-alertmanager-0   standard                10m
pvc-8204b163-b495-49f9-8a4b-6f4bd20e38d2   8Gi        RWO            Delete           Bound    monitoring/prometheus-1692620441-server                   standard                10m
kubectl get pvc -n monitoring
NAME                                           STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
prometheus-1692620441-server                   Bound    pvc-8204b163-b495-49f9-8a4b-6f4bd20e38d2   8Gi        RWO            standard       10m
storage-prometheus-1692620441-alertmanager-0   Bound    pvc-72df14b9-b2ef-4a71-b5bd-3623ceb33d77   2Gi        RWO            standard       10m

In the pod's prometheus-server container there is no PV mounted at /data. Maybe this is a Minikube issue?
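
One way to confirm what the server container actually has mounted at /data (a sketch; POD_NAME is a placeholder for the server pod in the monitoring namespace):

kubectl get pod POD_NAME -n monitoring \
  -o jsonpath='{.spec.containers[?(@.name=="prometheus-server")].volumeMounts}'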

carn1x commented 1 year ago

@laszlolaszlo I'm getting the exact same issue in EKS, so not just a Minikube issue I think.

darvein commented 1 year ago

I'm seeing the same. It worked in a single-node minikube cluster, but when I tried a 3-node cluster it started failing exactly as described in this issue.

tibz7 commented 11 months ago

> I'm seeing the same. It worked in a single-node minikube cluster, but when I tried a 3-node cluster it started failing exactly as described in this issue.

I have the exact same problem on the exact same setup.
@darvein did you find a solution yet?

easy1481437320 commented 8 months ago

I hit the same issue on AWS EKS, but my issue was caused by a missing OpenID Connect provider. I added it back and then Prometheus became normal again.

kubectl logs POD_NAME -c prometheus-server

ts=2024-03-14T06:16:21.228Z caller=main.go:1350 level=error msg="Failed to apply configuration" err="could not get SigV4 credentials: WebIdentityErr: failed to retrieve credentials\ncaused by: InvalidIdentityToken: No OpenIDConnect provider found in your account for https://oidc.eks.xxxxxxx\n\tstatus code: 400, request id: a8260318-e9f7-4fff-b428-6e3da1428276"
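
For anyone hitting the same WebIdentity error, a rough sketch of how the missing provider can be checked and created (the cluster name is a placeholder; the region matches the values above):

# List OIDC identity providers already registered in the IAM account:
aws iam list-open-id-connect-providers

# Associate the EKS cluster's OIDC issuer with IAM (creates the provider if missing):
eksctl utils associate-iam-oidc-provider --cluster MY_CLUSTER --region eu-west-1 --approve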
rahilk82 commented 3 months ago

The issue was resolved when I removed the user restrictions and ran it as the root user:

runAsUser: 0
runAsNonRoot: false
runAsGroup: 0
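
In values.yaml terms that workaround would look roughly like this (a sketch only; running as root avoids the volume-permission error rather than fixing it, and the key name may differ between chart versions):

server:
  # Sketch only: run the server pod as root so /data is always writable.
  podSecurityContext:
    runAsUser: 0
    runAsNonRoot: false
    runAsGroup: 0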

prasanthcavli commented 3 months ago

@harry-hathorn Is the issue resolved? I'm facing the same problem. Can someone help?