Open uhthomas opened 1 year ago
+1, seeing this in my lab as well. I'm unable to have the Helm chart roll out successfully (version update) because the loki-backend pods never terminate, so it gets stuck in a loop. I started noticing this behavior about a week ago. The only way to "update" is to delete everything (deployments, statefulsets) and let it deploy fresh.
I forgot to include the logs.... Hopefully they're helpful in understanding what's getting stuck.
level=info ts=2023-05-25T16:07:39.299200425Z caller=signals.go:55 msg="=== received SIGINT/SIGTERM ===\n*** exiting"
level=info ts=2023-05-25T16:07:39.299498457Z caller=manager.go:265 msg="stopping user managers"
level=info ts=2023-05-25T16:07:39.29951218Z caller=manager.go:279 msg="all user managers stopped"
level=info ts=2023-05-25T16:07:39.299529947Z caller=mapper.go:47 msg="cleaning up mapped rules directory" path=/var/loki/rules-temp
level=info ts=2023-05-25T16:07:39.29956959Z caller=module_service.go:114 msg="module stopped" module=ruler
level=info ts=2023-05-25T16:07:39.299720754Z caller=compactor.go:393 msg="compactor exiting"
level=info ts=2023-05-25T16:07:39.299748845Z caller=basic_lifecycler.go:202 msg="ring lifecycler is shutting down" ring=compactor
level=info ts=2023-05-25T16:07:39.299881392Z caller=module_service.go:114 msg="module stopped" module=query-scheduler
level=warn ts=2023-05-25T16:07:39.299908949Z caller=grpc_logging.go:64 duration=1h10m37.310410219s method=/schedulerpb.SchedulerForQuerier/QuerierLoop err="queue is stopped" msg=gRPC
level=warn ts=2023-05-25T16:07:39.299932647Z caller=grpc_logging.go:64 method=/schedulerpb.SchedulerForQuerier/QuerierLoop duration=1h10m37.310393585s err="queue is stopped" msg=gRPC
level=warn ts=2023-05-25T16:07:39.299917799Z caller=grpc_logging.go:64 method=/schedulerpb.SchedulerForQuerier/QuerierLoop duration=1h4m57.68348322s err="queue is stopped" msg=gRPC
level=warn ts=2023-05-25T16:07:39.299991222Z caller=grpc_logging.go:64 method=/schedulerpb.SchedulerForQuerier/QuerierLoop duration=1h5m1.146566658s err="queue is stopped" msg=gRPC
level=warn ts=2023-05-25T16:07:39.300020361Z caller=grpc_logging.go:64 method=/schedulerpb.SchedulerForQuerier/QuerierLoop duration=1h5m0.942151554s err="queue is stopped" msg=gRPC
level=warn ts=2023-05-25T16:07:39.300103127Z caller=grpc_logging.go:64 method=/schedulerpb.SchedulerForQuerier/QuerierLoop duration=1h10m34.362229459s err="queue is stopped" msg=gRPC
level=warn ts=2023-05-25T16:07:39.300090077Z caller=grpc_logging.go:64 method=/schedulerpb.SchedulerForQuerier/QuerierLoop duration=1h10m39.892106686s err="queue is stopped" msg=gRPC
level=info ts=2023-05-25T16:07:39.300105737Z caller=module_service.go:114 msg="module stopped" module=index-gateway
level=info ts=2023-05-25T16:07:39.3001402Z caller=basic_lifecycler.go:372 msg="unregistering instance from ring" ring=compactor
level=warn ts=2023-05-25T16:07:39.300160241Z caller=grpc_logging.go:64 duration=1h5m0.94236336s method=/schedulerpb.SchedulerForQuerier/QuerierLoop err="queue is stopped" msg=gRPC
level=warn ts=2023-05-25T16:07:39.300197945Z caller=grpc_logging.go:64 method=/schedulerpb.SchedulerForQuerier/QuerierLoop duration=1h10m34.36239909s err="queue is stopped" msg=gRPC
level=info ts=2023-05-25T16:07:39.300219547Z caller=module_service.go:114 msg="module stopped" module=store
level=info ts=2023-05-25T16:07:39.300234251Z caller=basic_lifecycler.go:242 msg="instance removed from the ring" ring=compactor
level=info ts=2023-05-25T16:07:39.300277319Z caller=module_service.go:114 msg="module stopped" module=ingester-querier
level=info ts=2023-05-25T16:07:39.300283645Z caller=module_service.go:114 msg="module stopped" module=compactor
level=info ts=2023-05-25T16:07:39.300293796Z caller=basic_lifecycler.go:202 msg="ring lifecycler is shutting down" ring=index-gateway
level=info ts=2023-05-25T16:07:39.300330834Z caller=module_service.go:114 msg="module stopped" module=usage-report
level=info ts=2023-05-25T16:07:39.300348543Z caller=module_service.go:114 msg="module stopped" module=ring
level=info ts=2023-05-25T16:07:39.300513653Z caller=basic_lifecycler.go:372 msg="unregistering instance from ring" ring=index-gateway
level=info ts=2023-05-25T16:07:39.300631501Z caller=basic_lifecycler.go:242 msg="instance removed from the ring" ring=index-gateway
level=info ts=2023-05-25T16:07:39.300691522Z caller=module_service.go:114 msg="module stopped" module=index-gateway-ring
level=info ts=2023-05-25T16:07:39.300763845Z caller=module_service.go:114 msg="module stopped" module=runtime-config
level=info ts=2023-05-25T16:07:39.300804459Z caller=memberlist_client.go:641 msg="leaving memberlist cluster"
level=warn ts=2023-05-25T16:07:39.809692654Z caller=grpc_logging.go:64 method=/schedulerpb.SchedulerForQuerier/QuerierLoop duration=204.959µs err="scheduler is not running" msg=gRPC
level=warn ts=2023-05-25T16:07:39.811677406Z caller=grpc_logging.go:64 method=/schedulerpb.SchedulerForQuerier/QuerierLoop duration=110.869µs err="scheduler is not running" msg=gRPC
level=warn ts=2023-05-25T16:07:39.817323365Z caller=grpc_logging.go:64 method=/schedulerpb.SchedulerForQuerier/QuerierLoop duration=51.33µs err="scheduler is not running" msg=gRPC
level=warn ts=2023-05-25T16:07:39.946732016Z caller=grpc_logging.go:64 duration=198.244µs method=/schedulerpb.SchedulerForQuerier/QuerierLoop err="scheduler is not running" msg=gRPC
level=warn ts=2023-05-25T16:07:39.949491523Z caller=grpc_logging.go:64 method=/schedulerpb.SchedulerForQuerier/QuerierLoop duration=54.409µs err="scheduler is not running" msg=gRPC
level=warn ts=2023-05-25T16:07:39.951243087Z caller=grpc_logging.go:64 method=/schedulerpb.SchedulerForQuerier/QuerierLoop duration=56.497µs err="scheduler is not running" msg=gRPC
level=warn ts=2023-05-25T16:07:39.956390936Z caller=grpc_logging.go:64 method=/schedulerpb.SchedulerForQuerier/QuerierLoop duration=204.032µs err="scheduler is not running" msg=gRPC
level=warn ts=2023-05-25T16:07:39.987964926Z caller=grpc_logging.go:64 method=/schedulerpb.SchedulerForQuerier/QuerierLoop duration=139.639µs err="scheduler is not running" msg=gRPC
level=info ts=2023-05-25T16:07:40.016124479Z caller=module_service.go:114 msg="module stopped" module=memberlist-kv
We have experienced a similar/same problem. For us, the pods finally terminate when we restart the loki-read pods. It seems that they have an active gRPC connection to the backend, which prevents the backend from gracefully terminating the server module.
When we set `publishNotReadyAddresses` in the `query-scheduler-discovery` service to `false`, the backend pods terminate within a couple of seconds. We are, however, not sure if this is the proper solution, as the documentation states that `publishNotReadyAddresses` should be set to `true`:
Doing so eliminates a race condition in which the query frontend address is needed before the query frontend becomes ready when at least one querier connects.
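For anyone who wants to reproduce the same experiment, a minimal sketch. The namespace and the Service name (`loki-query-scheduler-discovery`) are assumptions based on a default Helm install; adjust them to your release, and note that the next Helm sync will revert the change.

```shell
# Flip publishNotReadyAddresses to false on the scheduler discovery
# Service (service name and namespace are assumptions; check yours
# with `kubectl get svc -n loki`).
kubectl patch service loki-query-scheduler-discovery \
  -n loki \
  --type merge \
  -p '{"spec":{"publishNotReadyAddresses":false}}'
```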
I have noticed this issue as well.
Currently this is a really annoying issue, as it requires me to manually apply Loki updates on all GitOps-managed clusters :disappointed:
Do you have embedded cache enabled? When we switched to external cache using Memcached we saw a better behaviour. Maybe the embedded cache doesn't terminate some connections?
Same issue here. We have the same problem with the loki-write pods as well, though. Terminating the read pods does nothing for us. We have to go and terminate each backend and write pod. We are using AWS EFS for backend storage, S3 for the other storage, no memcached.
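In case it saves someone a step, the manual termination described above can be scripted. This is only a sketch: the label selector is an assumption based on the chart's default labels, and it deletes every pod of each component, not just the stuck ones.

```shell
# Force-delete backend and write pods. --grace-period=0 --force skips
# the graceful shutdown entirely, so reserve this for pods that are
# already stuck in Terminating.
for component in backend write; do
  kubectl delete pod -n loki \
    -l app.kubernetes.io/component="$component" \
    --grace-period=0 --force
done
```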
This is Flux, showing that an upgrade timed out.
- lastTransitionTime: "2023-06-09T02:16:06Z"
message: |-
Helm upgrade failed: timed out waiting for the condition
Last Helm logs:
resetting values to the chart's original version
performing update for loki
creating upgraded release for loki
waiting for release loki resources (created: 0 updated: 32 deleted: 0)
warning: Upgrade "loki" failed: timed out waiting for the condition
reason: UpgradeFailed
status: "False"
type: Released
failures: 75
helmChart: flux-system/loki-loki
lastAppliedRevision: 5.5.11
lastAttemptedRevision: 5.6.4
Backend Stuck
❯ kubectl get pods -n loki
NAME READY STATUS RESTARTS AGE
loki-backend-0 0/2 Terminating 0 8d
loki-backend-1 2/2 Running 0 50m
loki-backend-2 2/2 Running 0 58m
loki-gateway-5c877d69d-7tb5p 2/2 Running 0 11h
loki-gateway-5c877d69d-flpk9 2/2 Running 0 57m
loki-grafana-agent-operator-5555fc45d8-8nlhv 2/2 Running 0 8d
loki-read-fb944c84-brn8b 2/2 Running 0 11h
loki-read-fb944c84-f24rq 2/2 Running 0 119m
loki-read-fb944c84-sf9cl 2/2 Running 0 11h
loki-write-0 0/2 Terminating 0 8d
loki-write-1 2/2 Running 0 55m
loki-write-2 2/2 Running 0 11h
Pod describe
❯ kubectl describe pod/loki-backend-0 -n loki
Name: loki-backend-0
Namespace: loki
Priority: 0
Service Account: loki-sa
Node: ip-10-16-2-160.us-gov-west-1.compute.internal/10.16.2.160
Start Time: Wed, 31 May 2023 15:07:32 -0400
Labels: app.kubernetes.io/component=backend
app.kubernetes.io/instance=loki
app.kubernetes.io/name=loki
app.kubernetes.io/part-of=memberlist
controller-revision-hash=loki-backend-5c8db54c8b
linkerd.io/control-plane-ns=linkerd
linkerd.io/proxy-statefulset=loki-backend
linkerd.io/workload-ns=loki
statefulset.kubernetes.io/pod-name=loki-backend-0
Annotations: checksum/config: cf83c3c210bb43ee1a6b348ac3e1a817508fa5aa3a753d47d5add77cf62e776d
linkerd.io/created-by: linkerd/proxy-injector stable-2.13.3
linkerd.io/inject: enabled
linkerd.io/proxy-version: stable-2.13.3
linkerd.io/trust-root-sha256: a74d25fec78ad055f32243a15a92f754746dcf72c23c64fc95fea4c2dd6c5227
viz.linkerd.io/tap-enabled: true
Status: Terminating (lasts 46m)
Termination Grace Period: 300s
IP: 10.16.2.167
IPs:
IP: 10.16.2.167
Controlled By: StatefulSet/loki-backend
Init Containers:
linkerd-init:
Container ID: containerd://a23ff938eaf1656bc04ea5028b5ab62821e199a6d1dfe9055d91f71bb1112a88
Image: cr.l5d.io/linkerd/proxy-init:v2.2.1
Image ID: cr.l5d.io/linkerd/proxy-init@sha256:20349a461f9fb76fde33741a90f9de2a647068f506325ac5e0faf7b7bc2eea72
Port: <none>
Host Port: <none>
Args:
--incoming-proxy-port
4143
--outgoing-proxy-port
4140
--proxy-uid
2102
--inbound-ports-to-ignore
4190,4191,4567,4568
--outbound-ports-to-ignore
4567,4568
--log-format
json
State: Terminated
Reason: Completed
Exit Code: 0
Started: Wed, 31 May 2023 15:07:34 -0400
Finished: Wed, 31 May 2023 15:07:35 -0400
Ready: True
Restart Count: 0
Limits:
cpu: 100m
memory: 20Mi
Requests:
cpu: 100m
memory: 20Mi
Environment:
AWS_STS_REGIONAL_ENDPOINTS: regional
AWS_DEFAULT_REGION: us-gov-west-1
AWS_REGION: us-gov-west-1
AWS_ROLE_ARN: arn:aws-us-gov:iam::125005550194:role/role-prod-loki
AWS_WEB_IDENTITY_TOKEN_FILE: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
Mounts:
/run from linkerd-proxy-init-xtables-lock (rw)
/var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nz4jn (ro)
Containers:
linkerd-proxy:
Container ID: containerd://5d2935fc77b591423182707a177181a03cd7c77cff0d12a4dcfb4f29de6c32da
Image: cr.l5d.io/linkerd/proxy:stable-2.13.3
Image ID: cr.l5d.io/linkerd/proxy@sha256:faed350dae1e4ffcba2f0676288c89557b376199c63eff2037bc780fed9d44c3
Ports: 4143/TCP, 4191/TCP
Host Ports: 0/TCP, 0/TCP
State: Terminated
Reason: Completed
Exit Code: 0
Started: Wed, 31 May 2023 15:07:36 -0400
Finished: Fri, 09 Jun 2023 09:03:23 -0400
Ready: False
Restart Count: 0
Requests:
cpu: 100m
memory: 20Mi
Liveness: http-get http://:4191/live delay=10s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:4191/ready delay=2s timeout=1s period=10s #success=1 #failure=3
Environment:
_pod_name: loki-backend-0 (v1:metadata.name)
_pod_ns: loki (v1:metadata.namespace)
_pod_nodeName: (v1:spec.nodeName)
LINKERD2_PROXY_LOG: warn,linkerd=info,trust_dns=error
LINKERD2_PROXY_LOG_FORMAT: json
LINKERD2_PROXY_DESTINATION_SVC_ADDR: linkerd-dst-headless.linkerd.svc.cluster.local.:8086
LINKERD2_PROXY_DESTINATION_PROFILE_NETWORKS: 10.0.0.0/8,100.64.0.0/10,172.16.0.0/12,192.168.0.0/16
LINKERD2_PROXY_POLICY_SVC_ADDR: linkerd-policy.linkerd.svc.cluster.local.:8090
LINKERD2_PROXY_POLICY_WORKLOAD: $(_pod_ns):$(_pod_name)
LINKERD2_PROXY_INBOUND_DEFAULT_POLICY: all-unauthenticated
LINKERD2_PROXY_POLICY_CLUSTER_NETWORKS: 10.0.0.0/8,100.64.0.0/10,172.16.0.0/12,192.168.0.0/16
LINKERD2_PROXY_INBOUND_CONNECT_TIMEOUT: 100ms
LINKERD2_PROXY_OUTBOUND_CONNECT_TIMEOUT: 1000ms
LINKERD2_PROXY_CONTROL_LISTEN_ADDR: 0.0.0.0:4190
LINKERD2_PROXY_ADMIN_LISTEN_ADDR: 0.0.0.0:4191
LINKERD2_PROXY_OUTBOUND_LISTEN_ADDR: 127.0.0.1:4140
LINKERD2_PROXY_INBOUND_LISTEN_ADDR: 0.0.0.0:4143
LINKERD2_PROXY_INBOUND_IPS: (v1:status.podIPs)
LINKERD2_PROXY_INBOUND_PORTS: 3100,7946,9095
LINKERD2_PROXY_DESTINATION_PROFILE_SUFFIXES: svc.cluster.local.
LINKERD2_PROXY_INBOUND_ACCEPT_KEEPALIVE: 10000ms
LINKERD2_PROXY_OUTBOUND_CONNECT_KEEPALIVE: 10000ms
LINKERD2_PROXY_INBOUND_PORTS_DISABLE_PROTOCOL_DETECTION: 25,587,3306,4444,5432,6379,9300,11211
LINKERD2_PROXY_DESTINATION_CONTEXT: {"ns":"$(_pod_ns)", "nodeName":"$(_pod_nodeName)"}
_pod_sa: (v1:spec.serviceAccountName)
_l5d_ns: linkerd
_l5d_trustdomain: cluster.local
LINKERD2_PROXY_IDENTITY_DIR: /var/run/linkerd/identity/end-entity
LINKERD2_PROXY_IDENTITY_TRUST_ANCHORS: -----BEGIN CERTIFICATE-----
MIIBizCCATGgAwIBAgIRAPqmcyoDmzFOkMveNhvVNC0wCgYIKoZIzj0EAwIwJTEj
MCEGA1UEAxMacm9vdC5saW5rZXJkLmNsdXN0ZXIubG9jYWwwHhcNMjMwNTAxMTMx
ODE4WhcNMjMwNzMwMTMxODE4WjAlMSMwIQYDVQQDExpyb290LmxpbmtlcmQuY2x1
c3Rlci5sb2NhbDBZMBMGByqGSM49AgEGCCqGSM49AwEHA0IABFX9i+cfgNAxIjnb
cAfJ7l3yIsMIjv0qu/p9RdGZUsK4aiDxKDgWitfqMBxIOBXnUeUROnsJE2pffgw9
KW/tZYWjQjBAMA4GA1UdDwEB/wQEAwICpDAPBgNVHRMBAf8EBTADAQH/MB0GA1Ud
DgQWBBTOOAjOqefPASSXYUiA0H6P8ZXrZjAKBggqhkjOPQQDAgNIADBFAiEAn5Nc
YwXPXkP8qtKJb2fYeVzGXN/zK57c6b5KVvN6w1YCIC0lu55Sz8Y6tpGmaC7QJMCc
BZcpsTE1FcX5fqpMKXSV
-----END CERTIFICATE-----
LINKERD2_PROXY_IDENTITY_TOKEN_FILE: /var/run/secrets/tokens/linkerd-identity-token
LINKERD2_PROXY_IDENTITY_SVC_ADDR: linkerd-identity-headless.linkerd.svc.cluster.local.:8080
LINKERD2_PROXY_IDENTITY_LOCAL_NAME: $(_pod_sa).$(_pod_ns).serviceaccount.identity.linkerd.cluster.local
LINKERD2_PROXY_IDENTITY_SVC_NAME: linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local
LINKERD2_PROXY_DESTINATION_SVC_NAME: linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local
LINKERD2_PROXY_POLICY_SVC_NAME: linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local
LINKERD2_PROXY_TAP_SVC_NAME: tap.linkerd-viz.serviceaccount.identity.linkerd.cluster.local
AWS_STS_REGIONAL_ENDPOINTS: regional
AWS_DEFAULT_REGION: us-gov-west-1
AWS_REGION: us-gov-west-1
AWS_ROLE_ARN: arn:aws-us-gov:iam::125005550194:role/role-prod-loki
AWS_WEB_IDENTITY_TOKEN_FILE: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
Mounts:
/var/run/linkerd/identity/end-entity from linkerd-identity-end-entity (rw)
/var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nz4jn (ro)
/var/run/secrets/tokens from linkerd-identity-token (rw)
loki:
Container ID: containerd://5a6d8ee723a9e14e98afe418b6963ea9598da15b62e7f8821714249e75629cb5
Image: docker.io/grafana/loki:2.8.2
Image ID: docker.io/grafana/loki@sha256:b1da1d23037eb1b344cccfc5b587e30aed60ab4cad33b42890ff850aa3c4755d
Ports: 3100/TCP, 9095/TCP, 7946/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP
Args:
-config.file=/etc/loki/config/config.yaml
-target=backend
-legacy-read-mode=false
State: Terminated
Reason: Error
Exit Code: 137
Started: Wed, 31 May 2023 15:07:39 -0400
Finished: Fri, 09 Jun 2023 09:06:23 -0400
Ready: False
Restart Count: 0
Requests:
cpu: 100m
memory: 128Mi
Readiness: http-get http://:http-metrics/ready delay=30s timeout=1s period=10s #success=1 #failure=3
Environment:
AWS_STS_REGIONAL_ENDPOINTS: regional
AWS_DEFAULT_REGION: us-gov-west-1
AWS_REGION: us-gov-west-1
AWS_ROLE_ARN: arn:aws-us-gov:iam::125005550194:role/role-prod-loki
AWS_WEB_IDENTITY_TOKEN_FILE: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
Mounts:
/etc/loki/config from config (rw)
/etc/loki/runtime-config from runtime-config (rw)
/tmp from tmp (rw)
/var/loki from data (rw)
/var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nz4jn (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
aws-iam-token:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 86400
data:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: data-loki-backend-0
ReadOnly: false
tmp:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: loki
Optional: false
runtime-config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: loki-runtime
Optional: false
kube-api-access-nz4jn:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
linkerd-proxy-init-xtables-lock:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
linkerd-identity-end-entity:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
SizeLimit: <unset>
linkerd-identity-token:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 86400
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Killing 51m kubelet Stopping container linkerd-proxy
Normal Killing 51m kubelet Stopping container loki
Warning Unhealthy 49m (x11 over 51m) kubelet Readiness probe failed: Get "http://10.16.2.167:3100/ready": dial tcp 10.16.2.167:3100: connect: connection refused
Warning Unhealthy 49m (x12 over 51m) kubelet Readiness probe failed: Get "http://10.16.2.167:4191/ready": dial tcp 10.16.2.167:4191: connect: connection refused
Wanted to add: we deployed Memcached for the read/chunk caches and also added some port exclusions for Linkerd.
End result was no change, same issue.
❯ kubectl get pods -n loki
NAME READY STATUS RESTARTS AGE
chunk-cache-memcached-66bb59d584-frt47 3/3 Running 0 4d3h
loki-backend-0 0/2 Terminating 0 5d22h
Last Helm logs:
resetting values to the chart's original version
performing update for loki
creating upgraded release for loki
waiting for release loki resources (created: 0 updated: 32 deleted: 0)
warning: Upgrade "loki" failed: context deadline exceeded
I have noticed the same issue.
Same here.
We migrated to autoscaling, and I have not hit this issue since. Autoscaling comes with a lifecycle hook, which I assume works around the problem that's causing the normal deployment to hang.
https://github.com/grafana/loki/blob/main/production/helm/loki/values.yaml#L735
-- The default /flush_shutdown preStop hook is recommended as part of the ingester
scaledown process so it's added to the template by default when autoscaling is enabled,
but it's disabled to optimize rolling restarts in instances that will never be scaled
down or when using chunks storage with WAL disabled.
https://github.com/grafana/loki/blob/main/docs/sources/operations/storage/wal.md#how-to-scale-updown
This is just a guess though.
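If someone wants to try the same hook without enabling autoscaling, a sketch of the idea is below. The `write.lifecycle` value and the `http-metrics` port name are assumptions; check your chart version's values.yaml before relying on this. The `/ingester/flush_shutdown` endpoint itself is documented in the WAL scaling guide linked above.

```shell
# Manually attach the flush-shutdown preStop hook to the write pods
# (value path and port name are assumptions; verify against your
# chart version). Requires Helm >= 3.10 for --set-json.
helm upgrade loki grafana/loki --reuse-values \
  --set-json 'write.lifecycle={"preStop":{"httpGet":{"path":"/ingester/flush_shutdown","port":"http-metrics"}}}'
```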
Any chance of getting some official feedback or a possible workaround? I have currently disabled automerge for Loki upgrades in Renovate. This is not a solution :( For anyone interested in the config: https://github.com/tyriis/renovate-config/blob/main/flux/prevent-automerge-loki.json5
I am facing the above issue with the grafana/loki-stack Helm chart on AKS 1.25. Any help?
We're seeing similar issues here as well. Tagging some developers in hope of getting some traction or at least a response of some kind here.
@JStickler @chaudum @kavirajk
I have found a way to live with it, by accepting that a rollout with 3 backends will take ~20min to reconcile :(
After some deep digging, that issue has been resolved. Thanks!
Off-topic. The on-topic issue is not resolved.
I am facing this exact issue in EKS, any updates on how I can get my loki log pods to terminate on a helm uninstall?
Hey @andrewgkew, I am using Flux; increasing the timeout for the Helm release from 5 to 15 min solved it for me (I use 3 replicas). If you have more, you need to find a timeout within which each replica can be killed after wait + healthcheck + start time. Hope it helps.
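For those not on Flux, the plain-Helm equivalent of raising the release timeout is a sketch like this (the 15m value matches what worked above for 3 replicas; scale it with your replica count and grace period).

```shell
# Give the upgrade long enough for every replica's termination grace
# period to elapse before Helm declares the rollout failed.
helm upgrade loki grafana/loki --reuse-values --wait --timeout 15m
```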
@tyriis thanks for the tip. But this means removing the pods could take 15 minutes? That's crazy. I assume this is just a workaround? Any idea what's causing it and why Grafana isn't fixing the issue?
@andrewgkew yeah, of course this is a workaround (and it does not scale). As this thread has not received a response, I live with it for now.
We also experience this issue, and it prevents rolling patches of the cluster because nodes cannot be drained properly. How can we help debug this?
This also seems to break single-binary mode when a node shuts down, and it corrupts all data. Loki hangs for too long and then Linux kills the process. It really needs to respect the first SIGTERM it gets, in both backend mode and single-binary mode: flush and shut down, because the process is going to be killed soon.
Can someone provide a goroutine dump when that happens, please?
curl http://localhost:<port>/debug/pprof/goroutine?debug=2
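One way to grab the requested dump from a stuck pod is to port-forward to it first. A sketch, assuming Loki's default HTTP port 3100; adjust the pod name and namespace to your cluster.

```shell
# Forward the stuck pod's HTTP port locally, then pull the full
# goroutine dump from Go's pprof endpoint.
kubectl port-forward -n loki pod/loki-backend-0 3100:3100 &
sleep 2
curl -s "http://localhost:3100/debug/pprof/goroutine?debug=2" > goroutines.txt
kill %1
```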
If it helps, we've upgraded to V3 and we haven't seen this issue since.
Another way to get the dump is using kubectl exec:
kubectl exec loki-logs-2wj2f -n monitoring -- kill -6 1
@cyriltovena I can confirm that this issue goes away with upgrading to v3 of Loki, thanks @elburnetto-intapp for the tip.
@uhthomas You can probably close this issue. I can confirm v3 doesn't have this issue either.
Describe the bug
The Loki backend container will hang indefinitely when attempting to shut down. The rollout only progresses because of the 5 minute grace period from Kubernetes.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
It should gracefully shut down in a timely manner.
Environment:
Screenshots, Promtail config, or terminal output
N/A - see above.