grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki

Loki ingester stays in LEAVING state after node reboot #1806

Closed. amkartashov closed this issue 4 years ago.

amkartashov commented 4 years ago

Describe the bug

After or during a node reboot (preemptible node in GKE, where reboot time is limited to ~30 seconds), the ingester pod goes into the LEAVING state and stays there until the pod is deleted manually.

To Reproduce

Deploy with Helm on preemptible nodes. Every time a node is preempted, one of the ingesters goes to LEAVING and keeps running in that state (/ready returns Ready, so the pod is marked as healthy and ready).

Expected behavior

The ingester should either heal itself or exit when, after an unclean restart, it finds its previous ring entry in the LEAVING state.

Environment:

Deployment is done with the Loki Helm chart 0.25.1

Loki version string: version=v1.3.0, branch=HEAD, revision=5e527a03

Helm values file:

loki:
  extraArgs:
    log.level: debug
  serviceMonitor:
    enabled: true
    interval: 10s
    scrapeTimeout: 5s
  config:
    schema_config:
      configs:
      - from: "2019-07-01"
        store: cassandra
        object_store: cassandra
        schema: v10
        index:
          prefix: index
          period: 360h
        chunks:
          prefix: chunk
          period: 360h
    storage_config:
      cassandra:
        username: cassandra
        password: cassandra
        addresses: cassandra
        auth: true
        keyspace: loki
    ingester:
      lifecycler:
        ring:
          replication_factor: 2
          kvstore:
            store: etcd
            etcd:
              endpoints:
              - etcd:2379
  nameOverride: loki
  fullnameOverride: loki
  replicas: 5
  resources:
    limits:
      cpu: 500m
      #memory: 256Mi
    requests:
      cpu: 100m
      memory: 128Mi
  persistence:
    enabled: true
    size: 10Gi
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: cloud.google.com/gke-preemptible
            operator: Exists
        weight: 100
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: loki
        topologyKey: "kubernetes.io/hostname"

Promtail (DaemonSet), Cassandra (StatefulSet), and etcd (StatefulSet) are deployed separately to the same preemptible nodes and experience no issues, except that Promtail cannot push logs to Loki after 3 of the 5 ingesters go to LEAVING.

Screenshots, Promtail config, or terminal output

Discussion in slack: https://grafana.slack.com/archives/CEPJRLQNL/p1584242451454400

Logs from the Loki instance starting up after the node reboot:

level=info ts=2020-03-15T02:20:21.386901677Z caller=loki.go:149 msg=initialising module=server
level=info ts=2020-03-15T02:20:21.387793334Z caller=server.go:121 http=[::]:3100 grpc=[::]:9095 msg="server listening on addresses"
level=info ts=2020-03-15T02:20:21.388305543Z caller=loki.go:149 msg=initialising module=overrides
level=info ts=2020-03-15T02:20:21.388499931Z caller=override.go:53 msg="per-tenant overrides disabled"
level=info ts=2020-03-15T02:20:21.388729433Z caller=loki.go:149 msg=initialising module=table-manager
level=info ts=2020-03-15T02:20:21.636389303Z caller=loki.go:149 msg=initialising module=store
level=info ts=2020-03-15T02:20:21.639229202Z caller=table_manager.go:220 msg="synching tables" expected_tables=36
...
level=info ts=2020-03-15T02:20:22.067044325Z caller=loki.go:149 msg=initialising module=ring
level=info ts=2020-03-15T02:20:22.067743921Z caller=loki.go:149 msg=initialising module=querier
level=debug ts=2020-03-15T02:20:22.067846777Z caller=modules.go:166 msg="initializing querier worker" config="{Address: Parallelism:10 DNSLookupDuration:10s GRPCClientConfig:{MaxRecvMsgSize:104857600 MaxSendMsgSize:16777216 UseGzipCompression:false RateLimit:0 RateLimitBurst:0 BackoffOnRatelimits:false BackoffConfig:{MinBackoff:100ms MaxBackoff:10s MaxRetries:10}}}"
level=info ts=2020-03-15T02:20:22.067909526Z caller=worker.go:72 msg="no address specified, not starting worker"
level=info ts=2020-03-15T02:20:22.086818825Z caller=loki.go:149 msg=initialising module=ingester
level=info ts=2020-03-15T02:20:22.087478269Z caller=loki.go:149 msg=initialising module=distributor
level=info ts=2020-03-15T02:20:22.087612976Z caller=loki.go:149 msg=initialising module=all
level=info ts=2020-03-15T02:20:22.087631552Z caller=main.go:70 msg="Starting Loki" version="(version=v1.3.0, branch=HEAD, revision=5e527a03)"
level=info ts=2020-03-15T02:20:22.089785918Z caller=lifecycler.go:471 msg="not loading tokens from file, tokens file path is empty"
level=info ts=2020-03-15T02:20:22.100261899Z caller=lifecycler.go:505 msg="existing entry found in ring" state=LEAVING tokens=128 ring=ingester
level=debug ts=2020-03-15T02:20:22.100369162Z caller=lifecycler.go:359 msg="JoinAfter expired" ring=ingester
level=debug ts=2020-03-15T02:20:26.168712738Z caller=logging.go:44 traceID=17132fd9b604a73b msg="GET /metrics (200) 27.310659ms"
level=debug ts=2020-03-15T02:20:32.345468852Z caller=logging.go:44 traceID=74f88119023ce136 msg="GET /metrics (200) 9.792214ms"

The problem might come from Cortex not having a path out of the LEAVING state on startup: https://github.com/cortexproject/cortex/blob/3374e33bde0e805f9955fc8ba4caf0a4b96e04ca/pkg/ring/lifecycler.go#L363

cyriltovena commented 4 years ago

So this is something we do experience from time to time. The workaround for now is to forget the leaving ingester.
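
For reference, a minimal sketch of forgetting a stale entry (an assumption, not taken from this thread: it assumes the ring status page is exposed at /ring on the Loki HTTP port, that the stuck ingester's ring ID is the pod name loki-3, and that the page's Forget button submits a "forget" form field):

  # hypothetical ingester ID and address; using the Forget button on the /ring page does the same thing
  curl -X POST -d 'forget=loki-3' http://loki:3100/ring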

I also want to point out that the Helm chart wasn't meant to be scaled; for that we recommend Tanka with the production/ksonnet setup.

cyriltovena commented 4 years ago

Also let's ask our ring expert @pstibrany.

pstibrany commented 4 years ago

Are you running ingesters on those preemptible nodes? From what I'm reading, I would strongly recommend not doing that. Ingesters take a while to shut down correctly (during a rollout of a new version, a new ingester starts while another one is stopping, and the stopping ingester then tries to hand over its data to the starting one). You can disable the handover with -ingester.max-transfer-retries=0 (or the max_transfer_retries field under ingester in YAML). This tells the ingester to only flush data to storage. Note that it can still take a while.
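
For reference, a sketch of where that field would sit in the config block from the Helm values above (based only on the field name given here, not a verified config):

  config:
    ingester:
      max_transfer_retries: 0   # 0 disables handover; flush chunks to the store on shutdown instead
      lifecycler:
        ...                     # existing lifecycler/ring settings unchanged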

If an ingester is left in the ring in the LEAVING state, it means it failed to perform its shutdown correctly, and then it's best to find out why. When running on preemptible nodes, it is very likely that the ingester was killed before it was able to hand over or flush its data properly, since pods on preemptible nodes are not shut down gracefully. This is bad and can lead to data loss.

amkartashov commented 4 years ago

@pstibrany yep, it's a preemptible pool. But:

Anyway, setting aside the pros and cons of preemptible nodes, do you think staying in LEAVING forever is normal behavior? If I delete the pod that is in LEAVING, it shuts down gracefully and comes back ACTIVE after restart, so why doesn't it shut itself down after some meaningful timeout while in LEAVING? I don't see the need for manual action here.

In other words, I'm OK with partial data loss (logs from a single node due to an unexpected shutdown); that's why I use preemptible nodes. I'm not OK with the application requiring a manual restart to heal: the data loss has already happened, so why wait further?

amkartashov commented 4 years ago

So this is something we do experience from time to time. The workaround for now is to forget the leaving ingester.

Forget does not work for me; the ingester just reappears as LEAVING in a matter of seconds.

pstibrany commented 4 years ago

If an ingester is OOM-killed, that's bad and can result in loss of data. That's also why we use replication. Other Loki components are not stateful (well, query frontends keep a queue of queries, but if you lose queries, it's not a big deal). Ingesters keep a lot of data in memory and need a proper shutdown.

It's interesting that you point out that the ingester reappears with the LEAVING state... it means it's running. I think that's because when the new ingester started, it simply picked up the state that was in the ring (LEAVING) and is now running with that state. You can confirm this by checking for an "existing entry found in ring" message in the log.

I think the best you can do is to 1) stop this ingester, 2) forget it from the ring if it is still there after stopping, and 3) restart it again.

In the upcoming Loki release, this is handled automatically: if an ingester finds its entry in the LEAVING state and the number of tokens matches what it is configured with, it will switch the state to ACTIVE instead.

amkartashov commented 4 years ago

it means it's running

It is. The thing is that I'm using the 'all in one' image; I'm not sure how it would work in the case of a single ingester process.

In the upcoming Loki release, this is handled automatically: if an ingester finds its entry in the LEAVING state and the number of tokens matches what it is configured with, it will switch the state to ACTIVE instead.

@pstibrany thanks Peter! Looking forward to it.

pstibrany commented 4 years ago

For reference, here is the relevant change: https://github.com/cortexproject/cortex/pull/1921.

eraac commented 4 years ago

In the upcoming Loki release, this is handled automatically: if an ingester finds its entry in the LEAVING state and the number of tokens matches what it is configured with, it will switch the state to ACTIVE instead.

Sounds good, any date for the next release?

pstibrany commented 4 years ago

Sounds good, any date for the next release?

real soon now, I think 😉

/cc @slim-bean

amkartashov commented 4 years ago

UPDATE: unfortunately, I was testing it on non-preemptible nodes by mistake. On preemptible nodes it keeps failing even with -ingester.max-transfer-retries=0 :( I'm moving this to non-preemptible nodes.

@pstibrany, I've been testing the workaround below for a few days now and it looks like it works: I haven't seen an ingester in the LEAVING state since changing the configuration. It looks like 30 seconds (the preemption time) is enough for the ingester to flush data to storage.

You can disable the handover with -ingester.max-transfer-retries=0 (or the max_transfer_retries field under ingester in YAML). This tells the ingester to only flush data to storage. Note that it can still take a while.
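
For completeness, a sketch of how I assume this flag can be passed through the chart's extraArgs (mirroring the log.level flag in the values above; not a verified excerpt from the actual values file):

  loki:
    extraArgs:
      log.level: debug
      ingester.max-transfer-retries: "0"   # rendered as -ingester.max-transfer-retries=0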

stvnwrgs commented 4 years ago

This behavior can also be forced by sending a SIGKILL or deleting pods with grace-period=0.
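
For example (hypothetical pod name), an ungraceful kill similar to a preemption can be simulated with:

  # --force together with --grace-period=0 skips the graceful termination period entirely
  kubectl delete pod loki-3 --grace-period=0 --force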

Maybe it makes sense to test this in a chaoskube way?

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

eraac commented 4 years ago

The problem is still here (non-preemptible machines aren't a fix).

Whyeasy commented 4 years ago

I would like to add something regarding SIGTERM/SIGKILL with GKE preemptible nodes. The explanation in the docs about preemptibles applies to "normal" instances, not to GKE workers.

It seems GKE does not pick up this signal to terminate pods gracefully. What happens is that the pod is killed instantly, which makes -ingester.max-transfer-retries=0 unusable; there is no time for any retry whatsoever.

To be able to notify your pods of termination, you will need the k8s-node-termination-handler DaemonSet running on your preemptibles.

I just installed it and haven't done any tests yet, but I wanted to let you know 😄

eraac commented 4 years ago

I've experimented with k8s-node-termination-handler. It's better, but it still needs some tests (and time) to make sure it "solves" the problem.

A few notes to make this "work":

  1. I configured the termination handler to maximize the graceful shutdown window for regular pods (--system-pod-grace-period=0s), which gives them the full 30 seconds for graceful shutdown.
  2. Set server.graceful_shutdown_timeout=5s, ingester.lifecycler.final_sleep=0s, and ingester.max_transfer_retries=20 (see the config sketch after this list).
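
A sketch of those Loki settings in config form, placed according to the field names in step 2 (not a full values excerpt):

  server:
    graceful_shutdown_timeout: 5s
  ingester:
    max_transfer_retries: 20
    lifecycler:
      final_sleep: 0s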

FYI, when SIGTERM happens, the ingester does the following (in this order):

  1. sets the LEAVING state in the ring
  2. if max_transfer_retries > 0, tries to transfer chunks to the new ingester
  3. if max_transfer_retries = 0, flushes chunks to the store (GCS, S3, ...)
  4. if max_transfer_retries > 0 and the transfer fails, then flushes to the store
  5. waits ingester.lifecycler.final_sleep seconds
  6. removes the ingester from the ring

But it's important to mention that this only works in the best-case scenario; if the node crashes or the shutdown takes longer than 30 seconds, the ingester will be stuck in an unhealthy state.

EDIT: in the last hour, 4 ingesters were stuck in an unhealthy state, so I guess this isn't a viable solution...

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.