grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Loki Backend hangs indefinitely during shutdown #9523

Open uhthomas opened 1 year ago

uhthomas commented 1 year ago

Describe the bug

The Loki Backend container will hang indefinitely when attempting to shut down. The rollout only progresses because of the 5-minute grace period from Kubernetes.

❯ k -n loki get po
NAME                           READY   STATUS    RESTARTS   AGE
loki-backend-0                 1/1     Running   0          3m16s
loki-backend-1                 1/1     Running   0          9m7s
loki-backend-2                 1/1     Running   0          14m

To Reproduce

Steps to reproduce the behavior:

  1. Install the Loki helm chart.
  2. Restart the stateful set.
k -n loki rollout restart sts loki-backend
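If it helps with triage, a quick way to observe the hang (a sketch only; the label selector is the chart's standard app.kubernetes.io/component=backend label and may differ per release):

# Watch the backend pods: the restarted pod sits in Terminating until the grace period expires.
kubectl -n loki get pods -l app.kubernetes.io/component=backend -w

# Confirm the grace period that eventually unblocks the rollout (300s, per the pod spec below).
kubectl -n loki get sts loki-backend -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}'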

Expected behavior

It should shut down gracefully in a timely manner.

Environment:

Screenshots, Promtail config, or terminal output

N/A - see above.

buroa commented 1 year ago

+1, seeing this in my lab as well. I'm unable to get the Helm chart to roll out successfully (version update) because the loki-backend pods never terminate, so it gets stuck in a loop. I started noticing this behavior about a week ago. The only way to "update" is to delete everything (deployments, statefulsets) and let it deploy fresh.

uhthomas commented 1 year ago

I forgot to include the logs.... Hopefully they're helpful in understanding what's getting stuck.

level=info ts=2023-05-25T16:07:39.299200425Z caller=signals.go:55 msg="=== received SIGINT/SIGTERM ===\n*** exiting"
level=info ts=2023-05-25T16:07:39.299498457Z caller=manager.go:265 msg="stopping user managers"
level=info ts=2023-05-25T16:07:39.29951218Z caller=manager.go:279 msg="all user managers stopped"
level=info ts=2023-05-25T16:07:39.299529947Z caller=mapper.go:47 msg="cleaning up mapped rules directory" path=/var/loki/rules-temp
level=info ts=2023-05-25T16:07:39.29956959Z caller=module_service.go:114 msg="module stopped" module=ruler
level=info ts=2023-05-25T16:07:39.299720754Z caller=compactor.go:393 msg="compactor exiting"
level=info ts=2023-05-25T16:07:39.299748845Z caller=basic_lifecycler.go:202 msg="ring lifecycler is shutting down" ring=compactor
level=info ts=2023-05-25T16:07:39.299881392Z caller=module_service.go:114 msg="module stopped" module=query-scheduler
level=warn ts=2023-05-25T16:07:39.299908949Z caller=grpc_logging.go:64 duration=1h10m37.310410219s method=/schedulerpb.SchedulerForQuerier/QuerierLoop err="queue is stopped" msg=gRPC
level=warn ts=2023-05-25T16:07:39.299932647Z caller=grpc_logging.go:64 method=/schedulerpb.SchedulerForQuerier/QuerierLoop duration=1h10m37.310393585s err="queue is stopped" msg=gRPC
level=warn ts=2023-05-25T16:07:39.299917799Z caller=grpc_logging.go:64 method=/schedulerpb.SchedulerForQuerier/QuerierLoop duration=1h4m57.68348322s err="queue is stopped" msg=gRPC
level=warn ts=2023-05-25T16:07:39.299991222Z caller=grpc_logging.go:64 method=/schedulerpb.SchedulerForQuerier/QuerierLoop duration=1h5m1.146566658s err="queue is stopped" msg=gRPC
level=warn ts=2023-05-25T16:07:39.300020361Z caller=grpc_logging.go:64 method=/schedulerpb.SchedulerForQuerier/QuerierLoop duration=1h5m0.942151554s err="queue is stopped" msg=gRPC
level=warn ts=2023-05-25T16:07:39.300103127Z caller=grpc_logging.go:64 method=/schedulerpb.SchedulerForQuerier/QuerierLoop duration=1h10m34.362229459s err="queue is stopped" msg=gRPC
level=warn ts=2023-05-25T16:07:39.300090077Z caller=grpc_logging.go:64 method=/schedulerpb.SchedulerForQuerier/QuerierLoop duration=1h10m39.892106686s err="queue is stopped" msg=gRPC
level=info ts=2023-05-25T16:07:39.300105737Z caller=module_service.go:114 msg="module stopped" module=index-gateway
level=info ts=2023-05-25T16:07:39.3001402Z caller=basic_lifecycler.go:372 msg="unregistering instance from ring" ring=compactor
level=warn ts=2023-05-25T16:07:39.300160241Z caller=grpc_logging.go:64 duration=1h5m0.94236336s method=/schedulerpb.SchedulerForQuerier/QuerierLoop err="queue is stopped" msg=gRPC
level=warn ts=2023-05-25T16:07:39.300197945Z caller=grpc_logging.go:64 method=/schedulerpb.SchedulerForQuerier/QuerierLoop duration=1h10m34.36239909s err="queue is stopped" msg=gRPC
level=info ts=2023-05-25T16:07:39.300219547Z caller=module_service.go:114 msg="module stopped" module=store
level=info ts=2023-05-25T16:07:39.300234251Z caller=basic_lifecycler.go:242 msg="instance removed from the ring" ring=compactor
level=info ts=2023-05-25T16:07:39.300277319Z caller=module_service.go:114 msg="module stopped" module=ingester-querier
level=info ts=2023-05-25T16:07:39.300283645Z caller=module_service.go:114 msg="module stopped" module=compactor
level=info ts=2023-05-25T16:07:39.300293796Z caller=basic_lifecycler.go:202 msg="ring lifecycler is shutting down" ring=index-gateway
level=info ts=2023-05-25T16:07:39.300330834Z caller=module_service.go:114 msg="module stopped" module=usage-report
level=info ts=2023-05-25T16:07:39.300348543Z caller=module_service.go:114 msg="module stopped" module=ring
level=info ts=2023-05-25T16:07:39.300513653Z caller=basic_lifecycler.go:372 msg="unregistering instance from ring" ring=index-gateway
level=info ts=2023-05-25T16:07:39.300631501Z caller=basic_lifecycler.go:242 msg="instance removed from the ring" ring=index-gateway
level=info ts=2023-05-25T16:07:39.300691522Z caller=module_service.go:114 msg="module stopped" module=index-gateway-ring
level=info ts=2023-05-25T16:07:39.300763845Z caller=module_service.go:114 msg="module stopped" module=runtime-config
level=info ts=2023-05-25T16:07:39.300804459Z caller=memberlist_client.go:641 msg="leaving memberlist cluster"
level=warn ts=2023-05-25T16:07:39.809692654Z caller=grpc_logging.go:64 method=/schedulerpb.SchedulerForQuerier/QuerierLoop duration=204.959µs err="scheduler is not running" msg=gRPC
level=warn ts=2023-05-25T16:07:39.811677406Z caller=grpc_logging.go:64 method=/schedulerpb.SchedulerForQuerier/QuerierLoop duration=110.869µs err="scheduler is not running" msg=gRPC
level=warn ts=2023-05-25T16:07:39.817323365Z caller=grpc_logging.go:64 method=/schedulerpb.SchedulerForQuerier/QuerierLoop duration=51.33µs err="scheduler is not running" msg=gRPC
level=warn ts=2023-05-25T16:07:39.946732016Z caller=grpc_logging.go:64 duration=198.244µs method=/schedulerpb.SchedulerForQuerier/QuerierLoop err="scheduler is not running" msg=gRPC
level=warn ts=2023-05-25T16:07:39.949491523Z caller=grpc_logging.go:64 method=/schedulerpb.SchedulerForQuerier/QuerierLoop duration=54.409µs err="scheduler is not running" msg=gRPC
level=warn ts=2023-05-25T16:07:39.951243087Z caller=grpc_logging.go:64 method=/schedulerpb.SchedulerForQuerier/QuerierLoop duration=56.497µs err="scheduler is not running" msg=gRPC
level=warn ts=2023-05-25T16:07:39.956390936Z caller=grpc_logging.go:64 method=/schedulerpb.SchedulerForQuerier/QuerierLoop duration=204.032µs err="scheduler is not running" msg=gRPC
level=warn ts=2023-05-25T16:07:39.987964926Z caller=grpc_logging.go:64 method=/schedulerpb.SchedulerForQuerier/QuerierLoop duration=139.639µs err="scheduler is not running" msg=gRPC
level=info ts=2023-05-25T16:07:40.016124479Z caller=module_service.go:114 msg="module stopped" module=memberlist-kv

AndreasSko commented 1 year ago

We have experienced a similar (possibly the same) problem. For us, the pods finally terminate when we restart the loki-read pods. It seems they hold an active gRPC connection to the backend, which prevents the backend from gracefully terminating the server module. When we set publishNotReadyAddresses to false in the query-scheduler-discovery service, the backend pods terminate within a couple of seconds. We are, however, not sure whether this is the proper solution, as the documentation states that publishNotReadyAddresses should be set to true:

> Doing so eliminates a race condition in which the query frontend address is needed before the query frontend becomes ready when at least one querier connects.
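For reference, a minimal sketch of the change we made, assuming the chart renders the discovery service as loki-query-scheduler-discovery in the loki namespace (the exact name may differ per release):

# Experiment only, not a recommended fix: stop publishing not-ready addresses so queriers
# drop their connections to a terminating backend.
kubectl -n loki patch service loki-query-scheduler-discovery --type merge -p '{"spec":{"publishNotReadyAddresses":false}}'

Note that a chart upgrade will revert this patch, since the service is managed by the chart.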

szinn commented 1 year ago

I have noticed this issue as well.

tyriis commented 1 year ago

Currently this is a really annoying issue, as it requires me to manually apply Loki updates on all GitOps-managed clusters :disappointed:

icereed commented 1 year ago

Do you have the embedded cache enabled? When we switched to an external cache using Memcached, we saw better behaviour. Maybe the embedded cache doesn't terminate some connections?

jseiser commented 1 year ago

Same issue here, though we also have the same problem with the loki-write pods. Terminating the read pods does nothing for us; we have to go and terminate each backend and write pod. We are using AWS EFS for backend storage, S3 for the other storage, and no memcached.

This is Flux, showing an upgrade that timed out.

  - lastTransitionTime: "2023-06-09T02:16:06Z"
    message: |-
      Helm upgrade failed: timed out waiting for the condition

      Last Helm logs:

      resetting values to the chart's original version
      performing update for loki
      creating upgraded release for loki
      waiting for release loki resources (created: 0 updated: 32  deleted: 0)
      warning: Upgrade "loki" failed: timed out waiting for the condition
    reason: UpgradeFailed
    status: "False"
    type: Released
  failures: 75
  helmChart: flux-system/loki-loki
  lastAppliedRevision: 5.5.11
  lastAttemptedRevision: 5.6.4

Backend Stuck

❯ kubectl get pods -n loki
NAME                                           READY   STATUS        RESTARTS   AGE
loki-backend-0                                 0/2     Terminating   0          8d
loki-backend-1                                 2/2     Running       0          50m
loki-backend-2                                 2/2     Running       0          58m
loki-gateway-5c877d69d-7tb5p                   2/2     Running       0          11h
loki-gateway-5c877d69d-flpk9                   2/2     Running       0          57m
loki-grafana-agent-operator-5555fc45d8-8nlhv   2/2     Running       0          8d
loki-read-fb944c84-brn8b                       2/2     Running       0          11h
loki-read-fb944c84-f24rq                       2/2     Running       0          119m
loki-read-fb944c84-sf9cl                       2/2     Running       0          11h
loki-write-0                                   0/2     Terminating   0          8d
loki-write-1                                   2/2     Running       0          55m
loki-write-2                                   2/2     Running       0          11h

Pod describe

❯ kubectl describe pod/loki-backend-0 -n loki
Name:                      loki-backend-0
Namespace:                 loki
Priority:                  0
Service Account:           loki-sa
Node:                      ip-10-16-2-160.us-gov-west-1.compute.internal/10.16.2.160
Start Time:                Wed, 31 May 2023 15:07:32 -0400
Labels:                    app.kubernetes.io/component=backend
                           app.kubernetes.io/instance=loki
                           app.kubernetes.io/name=loki
                           app.kubernetes.io/part-of=memberlist
                           controller-revision-hash=loki-backend-5c8db54c8b
                           linkerd.io/control-plane-ns=linkerd
                           linkerd.io/proxy-statefulset=loki-backend
                           linkerd.io/workload-ns=loki
                           statefulset.kubernetes.io/pod-name=loki-backend-0
Annotations:               checksum/config: cf83c3c210bb43ee1a6b348ac3e1a817508fa5aa3a753d47d5add77cf62e776d
                           linkerd.io/created-by: linkerd/proxy-injector stable-2.13.3
                           linkerd.io/inject: enabled
                           linkerd.io/proxy-version: stable-2.13.3
                           linkerd.io/trust-root-sha256: a74d25fec78ad055f32243a15a92f754746dcf72c23c64fc95fea4c2dd6c5227
                           viz.linkerd.io/tap-enabled: true
Status:                    Terminating (lasts 46m)
Termination Grace Period:  300s
IP:                        10.16.2.167
IPs:
  IP:           10.16.2.167
Controlled By:  StatefulSet/loki-backend
Init Containers:
  linkerd-init:
    Container ID:  containerd://a23ff938eaf1656bc04ea5028b5ab62821e199a6d1dfe9055d91f71bb1112a88
    Image:         cr.l5d.io/linkerd/proxy-init:v2.2.1
    Image ID:      cr.l5d.io/linkerd/proxy-init@sha256:20349a461f9fb76fde33741a90f9de2a647068f506325ac5e0faf7b7bc2eea72
    Port:          <none>
    Host Port:     <none>
    Args:
      --incoming-proxy-port
      4143
      --outgoing-proxy-port
      4140
      --proxy-uid
      2102
      --inbound-ports-to-ignore
      4190,4191,4567,4568
      --outbound-ports-to-ignore
      4567,4568
      --log-format
      json
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 31 May 2023 15:07:34 -0400
      Finished:     Wed, 31 May 2023 15:07:35 -0400
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  20Mi
    Requests:
      cpu:     100m
      memory:  20Mi
    Environment:
      AWS_STS_REGIONAL_ENDPOINTS:   regional
      AWS_DEFAULT_REGION:           us-gov-west-1
      AWS_REGION:                   us-gov-west-1
      AWS_ROLE_ARN:                 arn:aws-us-gov:iam::125005550194:role/role-prod-loki
      AWS_WEB_IDENTITY_TOKEN_FILE:  /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /run from linkerd-proxy-init-xtables-lock (rw)
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nz4jn (ro)
Containers:
  linkerd-proxy:
    Container ID:   containerd://5d2935fc77b591423182707a177181a03cd7c77cff0d12a4dcfb4f29de6c32da
    Image:          cr.l5d.io/linkerd/proxy:stable-2.13.3
    Image ID:       cr.l5d.io/linkerd/proxy@sha256:faed350dae1e4ffcba2f0676288c89557b376199c63eff2037bc780fed9d44c3
    Ports:          4143/TCP, 4191/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 31 May 2023 15:07:36 -0400
      Finished:     Fri, 09 Jun 2023 09:03:23 -0400
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:      100m
      memory:   20Mi
    Liveness:   http-get http://:4191/live delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:4191/ready delay=2s timeout=1s period=10s #success=1 #failure=3
    Environment:
      _pod_name:                                                loki-backend-0 (v1:metadata.name)
      _pod_ns:                                                  loki (v1:metadata.namespace)
      _pod_nodeName:                                             (v1:spec.nodeName)
      LINKERD2_PROXY_LOG:                                       warn,linkerd=info,trust_dns=error
      LINKERD2_PROXY_LOG_FORMAT:                                json
      LINKERD2_PROXY_DESTINATION_SVC_ADDR:                      linkerd-dst-headless.linkerd.svc.cluster.local.:8086
      LINKERD2_PROXY_DESTINATION_PROFILE_NETWORKS:              10.0.0.0/8,100.64.0.0/10,172.16.0.0/12,192.168.0.0/16
      LINKERD2_PROXY_POLICY_SVC_ADDR:                           linkerd-policy.linkerd.svc.cluster.local.:8090
      LINKERD2_PROXY_POLICY_WORKLOAD:                           $(_pod_ns):$(_pod_name)
      LINKERD2_PROXY_INBOUND_DEFAULT_POLICY:                    all-unauthenticated
      LINKERD2_PROXY_POLICY_CLUSTER_NETWORKS:                   10.0.0.0/8,100.64.0.0/10,172.16.0.0/12,192.168.0.0/16
      LINKERD2_PROXY_INBOUND_CONNECT_TIMEOUT:                   100ms
      LINKERD2_PROXY_OUTBOUND_CONNECT_TIMEOUT:                  1000ms
      LINKERD2_PROXY_CONTROL_LISTEN_ADDR:                       0.0.0.0:4190
      LINKERD2_PROXY_ADMIN_LISTEN_ADDR:                         0.0.0.0:4191
      LINKERD2_PROXY_OUTBOUND_LISTEN_ADDR:                      127.0.0.1:4140
      LINKERD2_PROXY_INBOUND_LISTEN_ADDR:                       0.0.0.0:4143
      LINKERD2_PROXY_INBOUND_IPS:                                (v1:status.podIPs)
      LINKERD2_PROXY_INBOUND_PORTS:                             3100,7946,9095
      LINKERD2_PROXY_DESTINATION_PROFILE_SUFFIXES:              svc.cluster.local.
      LINKERD2_PROXY_INBOUND_ACCEPT_KEEPALIVE:                  10000ms
      LINKERD2_PROXY_OUTBOUND_CONNECT_KEEPALIVE:                10000ms
      LINKERD2_PROXY_INBOUND_PORTS_DISABLE_PROTOCOL_DETECTION:  25,587,3306,4444,5432,6379,9300,11211
      LINKERD2_PROXY_DESTINATION_CONTEXT:                       {"ns":"$(_pod_ns)", "nodeName":"$(_pod_nodeName)"}

      _pod_sa:                                                   (v1:spec.serviceAccountName)
      _l5d_ns:                                                  linkerd
      _l5d_trustdomain:                                         cluster.local
      LINKERD2_PROXY_IDENTITY_DIR:                              /var/run/linkerd/identity/end-entity
      LINKERD2_PROXY_IDENTITY_TRUST_ANCHORS:                    -----BEGIN CERTIFICATE-----
                                                                MIIBizCCATGgAwIBAgIRAPqmcyoDmzFOkMveNhvVNC0wCgYIKoZIzj0EAwIwJTEj
                                                                MCEGA1UEAxMacm9vdC5saW5rZXJkLmNsdXN0ZXIubG9jYWwwHhcNMjMwNTAxMTMx
                                                                ODE4WhcNMjMwNzMwMTMxODE4WjAlMSMwIQYDVQQDExpyb290LmxpbmtlcmQuY2x1
                                                                c3Rlci5sb2NhbDBZMBMGByqGSM49AgEGCCqGSM49AwEHA0IABFX9i+cfgNAxIjnb
                                                                cAfJ7l3yIsMIjv0qu/p9RdGZUsK4aiDxKDgWitfqMBxIOBXnUeUROnsJE2pffgw9
                                                                KW/tZYWjQjBAMA4GA1UdDwEB/wQEAwICpDAPBgNVHRMBAf8EBTADAQH/MB0GA1Ud
                                                                DgQWBBTOOAjOqefPASSXYUiA0H6P8ZXrZjAKBggqhkjOPQQDAgNIADBFAiEAn5Nc
                                                                YwXPXkP8qtKJb2fYeVzGXN/zK57c6b5KVvN6w1YCIC0lu55Sz8Y6tpGmaC7QJMCc
                                                                BZcpsTE1FcX5fqpMKXSV
                                                                -----END CERTIFICATE-----

      LINKERD2_PROXY_IDENTITY_TOKEN_FILE:                       /var/run/secrets/tokens/linkerd-identity-token
      LINKERD2_PROXY_IDENTITY_SVC_ADDR:                         linkerd-identity-headless.linkerd.svc.cluster.local.:8080
      LINKERD2_PROXY_IDENTITY_LOCAL_NAME:                       $(_pod_sa).$(_pod_ns).serviceaccount.identity.linkerd.cluster.local
      LINKERD2_PROXY_IDENTITY_SVC_NAME:                         linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local
      LINKERD2_PROXY_DESTINATION_SVC_NAME:                      linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local
      LINKERD2_PROXY_POLICY_SVC_NAME:                           linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local
      LINKERD2_PROXY_TAP_SVC_NAME:                              tap.linkerd-viz.serviceaccount.identity.linkerd.cluster.local
      AWS_STS_REGIONAL_ENDPOINTS:                               regional
      AWS_DEFAULT_REGION:                                       us-gov-west-1
      AWS_REGION:                                               us-gov-west-1
      AWS_ROLE_ARN:                                             arn:aws-us-gov:iam::125005550194:role/role-prod-loki
      AWS_WEB_IDENTITY_TOKEN_FILE:                              /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /var/run/linkerd/identity/end-entity from linkerd-identity-end-entity (rw)
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nz4jn (ro)
      /var/run/secrets/tokens from linkerd-identity-token (rw)
  loki:
    Container ID:  containerd://5a6d8ee723a9e14e98afe418b6963ea9598da15b62e7f8821714249e75629cb5
    Image:         docker.io/grafana/loki:2.8.2
    Image ID:      docker.io/grafana/loki@sha256:b1da1d23037eb1b344cccfc5b587e30aed60ab4cad33b42890ff850aa3c4755d
    Ports:         3100/TCP, 9095/TCP, 7946/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Args:
      -config.file=/etc/loki/config/config.yaml
      -target=backend
      -legacy-read-mode=false
    State:          Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Wed, 31 May 2023 15:07:39 -0400
      Finished:     Fri, 09 Jun 2023 09:06:23 -0400
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:      100m
      memory:   128Mi
    Readiness:  http-get http://:http-metrics/ready delay=30s timeout=1s period=10s #success=1 #failure=3
    Environment:
      AWS_STS_REGIONAL_ENDPOINTS:   regional
      AWS_DEFAULT_REGION:           us-gov-west-1
      AWS_REGION:                   us-gov-west-1
      AWS_ROLE_ARN:                 arn:aws-us-gov:iam::125005550194:role/role-prod-loki
      AWS_WEB_IDENTITY_TOKEN_FILE:  /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /etc/loki/config from config (rw)
      /etc/loki/runtime-config from runtime-config (rw)
      /tmp from tmp (rw)
      /var/loki from data (rw)
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nz4jn (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  aws-iam-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-loki-backend-0
    ReadOnly:   false
  tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      loki
    Optional:  false
  runtime-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      loki-runtime
    Optional:  false
  kube-api-access-nz4jn:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
  linkerd-proxy-init-xtables-lock:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  linkerd-identity-end-entity:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  linkerd-identity-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                 From     Message
  ----     ------     ----                ----     -------
  Normal   Killing    51m                 kubelet  Stopping container linkerd-proxy
  Normal   Killing    51m                 kubelet  Stopping container loki
  Warning  Unhealthy  49m (x11 over 51m)  kubelet  Readiness probe failed: Get "http://10.16.2.167:3100/ready": dial tcp 10.16.2.167:3100: connect: connection refused
  Warning  Unhealthy  49m (x12 over 51m)  kubelet  Readiness probe failed: Get "http://10.16.2.167:4191/ready": dial tcp 10.16.2.167:4191: connect: connection refused

jseiser commented 1 year ago

Wanted to add: we deployed Memcached for reads/chunks and also added some port exclusions for Linkerd.

End result was no change, same issue.

❯ kubectl get pods -n loki
NAME                                           READY   STATUS        RESTARTS   AGE
chunk-cache-memcached-66bb59d584-frt47         3/3     Running       0          4d3h
loki-backend-0                                 0/2     Terminating   0          5d22h
Last Helm logs:

resetting values to the chart's original version
performing update for loki
creating upgraded release for loki
waiting for release loki resources (created: 0 updated: 32  deleted: 0)
warning: Upgrade "loki" failed: context deadline exceeded

stdmje commented 1 year ago

I have noticed the same issue.

NilsGriebner commented 1 year ago

Same here.

jseiser commented 1 year ago

We migrated to autoscaling, and I have not hit this issue again. Autoscaling comes with a lifecycle hook, which I assume is working around whatever is causing the normal deployment to hang.

https://github.com/grafana/loki/blob/main/production/helm/loki/values.yaml#L735

> -- The default /flush_shutdown preStop hook is recommended as part of the ingester scaledown process so it's added to the template by default when autoscaling is enabled, but it's disabled to optimize rolling restarts in instances that will never be scaled down or when using chunks storage with WAL disabled.

https://github.com/grafana/loki/blob/main/docs/sources/operations/storage/wal.md#how-to-scale-updown

This is just a guess though.
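For anyone wanting to try that hook by hand before enabling autoscaling, a rough manual equivalent (a sketch only: it assumes the flush endpoint is /ingester/flush_shutdown on the default HTTP port 3100, and loki-write-0 is just an example pod):

# Ask the write pod's ingester to flush and shut down before the pod is deleted.
kubectl -n loki port-forward pod/loki-write-0 3100:3100 &
curl -XPOST http://127.0.0.1:3100/ingester/flush_shutdown
kill %1   # stop the port-forward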

tyriis commented 1 year ago

Any chance of getting some official feedback or a possible workaround? I have currently disabled automerge for Loki upgrades in Renovate, which is not a solution :( For anyone interested in the config: https://github.com/tyriis/renovate-config/blob/main/flux/prevent-automerge-loki.json5

epneo-banuprakash commented 10 months ago

(screenshot attached)

I am facing the above issue with the grafana/loki-stack Helm chart on AKS 1.25. Any help?

cdancy commented 9 months ago

We're seeing similar issues here as well. Tagging some developers in the hope of getting some traction, or at least a response of some kind.

@JStickler @chaudum @kavirajk

tyriis commented 9 months ago

I have found a way to live with it: accepting that a rollout with 3 backends will take ~20 min to reconcile :(

epneo-banuprakash commented 9 months ago

> (screenshot attached) I am facing the above issue with the grafana/loki-stack Helm chart on AKS 1.25. Any help?

After some deeper digging, that issue has been resolved. Thanks.

uhthomas commented 9 months ago

> (screenshot attached) I am facing the above issue with the grafana/loki-stack Helm chart on AKS 1.25. Any help?
>
> After some deeper digging, that issue has been resolved. Thanks.

Off-topic. The on-topic issue is not resolved.

andrewgkew commented 6 months ago

I am facing this exact issue in EKS. Any updates on how I can get my Loki log pods to terminate on a helm uninstall?

tyriis commented 6 months ago

> I am facing this exact issue in EKS. Any updates on how I can get my Loki log pods to terminate on a helm uninstall?

Hey @andrewgkew, I am using Flux; increasing the timeout for the Helm release from 5 to 15 minutes solved it for me (I use 3 replicas). If you have more, you need to find a timeout that allows each replica to be killed, accounting for wait + healthcheck + start time. Hope it helps.
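For anyone not on Flux, a sketch of the same workaround with plain Helm (release name, chart, and namespace are assumptions based on this thread); with Flux the equivalent is raising spec.timeout on the HelmRelease:

# Give the upgrade enough time for each replica to sit out its termination grace period.
helm upgrade loki grafana/loki -n loki --reuse-values --wait --timeout 15m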

andrewgkew commented 6 months ago

> I am facing this exact issue in EKS. Any updates on how I can get my Loki log pods to terminate on a helm uninstall?
>
> Hey @andrewgkew, I am using Flux; increasing the timeout for the Helm release from 5 to 15 minutes solved it for me (I use 3 replicas). If you have more, you need to find a timeout that allows each replica to be killed, accounting for wait + healthcheck + start time. Hope it helps.

@tyriis thanks for the tip. But this means removing the pods could take 15 minutes? That's crazy. I assume this is just a workaround? Any idea what's causing it and why Grafana isn't fixing the issue?

tyriis commented 6 months ago

@andrewgkew Yeah, of course this is a workaround (and it does not scale). Since this thread has had no response, I live with it for now.

clawoflight commented 6 months ago

We also experience this issue, and it prevents rolling patches of the cluster because nodes cannot be drained properly. How can we help debug this?

EraYaN commented 5 months ago

This also seems to break single-binary mode when a node shuts down, and it corrupts all data. Loki just hangs for too long and then Linux kills the process. It really needs to respect the first SIGTERM it gets, in both backend mode and single-binary mode: flush and shut down, because the process is going to be killed soon anyway.

cyriltovena commented 5 months ago

Can someone provide a goroutine dump when that happens, please?

curl http://localhost:<port>/debug/pprof/goroutine?debug=2
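For a pod stuck in Terminating, a sketch of capturing that dump from outside the pod, assuming the default HTTP port of 3100 (the same port the readiness probe above uses):

kubectl -n loki port-forward pod/loki-backend-0 3100:3100 &
curl -s 'http://127.0.0.1:3100/debug/pprof/goroutine?debug=2' > goroutines.txt
kill %1   # stop the port-forward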

elburnetto-intapp commented 5 months ago

If it helps, we've upgraded to V3 and we haven't seen this issue since.

cyriltovena commented 5 months ago

Another way to get the dump is using kubectl exec:

kubectl exec loki-logs-2wj2f -n monitoring -- kill -6 1
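kill -6 sends SIGABRT, which the Go runtime answers by printing every goroutine's stack before the process exits, so the dump ends up in the container logs. A sketch of retrieving it (pod name taken from the example above):

# If the container was restarted after the abort, read the previous instance's logs.
kubectl -n monitoring logs loki-logs-2wj2f --previous > goroutines.txt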

andrewgkew commented 5 months ago

@cyriltovena I can confirm that this issue goes away after upgrading to v3 of Loki. Thanks @elburnetto-intapp for the tip.

onedr0p commented 5 months ago

@uhthomas You can probably close this issue. I can confirm v3 doesn't have this issue either.