giantswarm / prometheus

Kubernetes Setup for Prometheus and Grafana
Apache License 2.0

Alertmanager CrashLoopBackOff #107

Open 80kk opened 5 years ago

80kk commented 5 years ago

I can't get Alertmanager running, and it is causing prometheus-core to go into CrashLoopBackOff as well:

kubectl describe pods alertmanager-6c6947f55f-h9wc9 -n monitoring
Name:               alertmanager-6c6947f55f-h9wc9
Namespace:          monitoring
Priority:           0
PriorityClassName:  <none>
Node:               node3/192.168.101.15
Start Time:         Wed, 14 Nov 2018 11:28:02 +0000
Labels:             app=alertmanager
                    pod-template-hash=6c6947f55f
Annotations:        <none>
Status:             Running
IP:                 10.233.64.6
Controlled By:      ReplicaSet/alertmanager-6c6947f55f
Containers:
  alertmanager:
    Container ID:  docker://0507cdf93fefdfe12b964117ae9d5e3685eb20d50f2bd504e8e9105963650435
    Image:         quay.io/prometheus/alertmanager:v0.15.3
    Image ID:      docker-pullable://quay.io/prometheus/alertmanager@sha256:27410e5c88aaaf796045e825b257a0857cca0876ca3804ba61175dd8a9f5b798
    Port:          9093/TCP
    Host Port:     0/TCP
    Args:
      -config.file=/etc/alertmanager/config.yml
      -storage.path=/alertmanager
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 14 Nov 2018 11:43:57 +0000
      Finished:     Wed, 14 Nov 2018 11:43:57 +0000
    Ready:          False
    Restart Count:  8
    Environment:    <none>
    Mounts:
      /alertmanager from alertmanager (rw)
      /etc/alertmanager from config-volume (rw)
      /etc/alertmanager-templates from templates-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-nvb7c (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      alertmanager
    Optional:  false
  templates-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      alertmanager-templates
    Optional:  false
  alertmanager:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:  
  default-token-nvb7c:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-nvb7c
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  16m                 default-scheduler  Successfully assigned monitoring/alertmanager-6c6947f55f-h9wc9 to node3
  Normal   Pulling    16m                 kubelet, node3     pulling image "quay.io/prometheus/alertmanager:v0.15.3"
  Normal   Pulled     15m                 kubelet, node3     Successfully pulled image "quay.io/prometheus/alertmanager:v0.15.3"
  Normal   Created    14m (x5 over 15m)   kubelet, node3     Created container
  Normal   Pulled     14m (x4 over 15m)   kubelet, node3     Container image "quay.io/prometheus/alertmanager:v0.15.3" already present on machine
  Normal   Started    14m (x5 over 15m)   kubelet, node3     Started container
  Warning  BackOff    53s (x73 over 15m)  kubelet, node3     Back-off restarting failed container

pipo02mix commented 5 years ago

Can you share which errors you get in the logs?

$ kubectl logs -l app=alertmanager -n monitoring
# Add the -p flag if it is crash looping to see the previous container's error

hennow commented 5 years ago

Even though this issue is a bit older, I think I can help here. The CrashLoopBackOff appears because you changed the Alertmanager release from 0.7.1 to 0.15.3. Since 0.12 or 0.13, the arguments need two dashes in front of them. The config has to be changed to:

'--config.file=/etc/alertmanager/config.yml'
'--storage.path=/alertmanager'

That should fix the issue, at least for Alertmanager. Changing other releases within the manifest causes new issues.
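For illustration, here is roughly how the container section of the Alertmanager Deployment might look with the double-dash arguments. This is only a sketch: the surrounding layout is assumed, and only the container name, image, and argument values are taken from the `kubectl describe` output above, so the actual manifest in this repo may be structured differently.

```yaml
# Hypothetical excerpt of the alertmanager Deployment's pod template.
# Only the container name, image, and args values come from this thread;
# the rest of the manifest is omitted.
containers:
  - name: alertmanager
    image: quay.io/prometheus/alertmanager:v0.15.3
    args:
      # Double dashes instead of the single-dash form shown in the pod description
      - '--config.file=/etc/alertmanager/config.yml'
      - '--storage.path=/alertmanager'
```

After re-applying the manifest, the `kubectl logs` command from the earlier comment should show whether the container still exits on startup.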