grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

store-gateway: keep running after receiving shutdown signal during startup #7433

Closed. dimitarvdimitrov closed this issue 8 months ago.

dimitarvdimitrov commented 8 months ago

Describe the bug

The store-gateway doesn't stop if it receives SIGTERM while starting up.

To Reproduce

The store-gateway pod received SIGINT about 30s after starting, while it was still in its startup procedure (the pod was being rescheduled). The pod keeps returning 503 on HTTP GET /ready and reports that the store-gateway service is still in the Starting state.
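
The non-ready state can be observed by polling the readiness endpoint. Below is a minimal, hypothetical Go sketch (not part of Mimir) that watches /ready; the host and port are assumptions and depend on how `-server.http-listen-port` is configured (the excerpt below shows the server listening on `[::]:80`).

```go
// Hypothetical helper (not part of Mimir): poll the store-gateway readiness
// endpoint while reproducing the issue. The URL is an assumption; use the
// address configured via -server.http-listen-port.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	for {
		resp, err := http.Get("http://store-gateway-zone-b-24:8080/ready")
		if err != nil {
			fmt.Println("request failed:", err)
		} else {
			body, _ := io.ReadAll(resp.Body)
			resp.Body.Close()
			// While the issue reproduces, this keeps printing 503 and the
			// readiness report shows the store-gateway service as Starting.
			fmt.Printf("status=%d body=%q\n", resp.StatusCode, string(body))
		}
		time.Sleep(5 * time.Second)
	}
}
```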

Expected behavior

The store-gateway recognizes the SIGINT, aborts the startup procedure, and shuts down.
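
To make the expectation concrete, here is a minimal Go sketch of the intended wiring, not Mimir's actual code: the SIGTERM/SIGINT handler cancels a context that the starting function observes, so a signal received mid-startup aborts the work instead of leaving the service stuck in Starting. The function names, block counts, and timings are illustrative assumptions.

```go
// Sketch only: a signal-aware startup context.
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// startingFn stands in for the store-gateway's startup work (e.g. the initial
// TSDB block sync). It checks the context between units of work.
func startingFn(ctx context.Context) error {
	for i := 0; i < 100; i++ {
		select {
		case <-ctx.Done():
			return ctx.Err() // abort startup: surfaces as "context canceled"
		case <-time.After(time.Second):
			fmt.Println("loaded block", i)
		}
	}
	return nil
}

func main() {
	// Cancel the startup context on SIGTERM/SIGINT.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, os.Interrupt)
	defer stop()

	if err := startingFn(ctx); err != nil {
		fmt.Println("startup aborted:", err)
		os.Exit(1)
	}
	fmt.Println("startup finished, serving")
}
```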

Environment

Additional Context

logs.txt (includes one clean shutdown at 17:21:22.793 and one hanging shutdown at 17:23:28.702)

Excerpt:

Details

```
2024-02-20 17:23:28.702 ts=2024-02-20T17:23:28.702464128Z caller=signals.go:62 level=info msg="=== received SIGINT/SIGTERM ===\n*** exiting"
2024-02-20 17:23:28.702 ts=2024-02-20T17:23:28.702876938Z caller=basic_lifecycler.go:238 level=info msg="ring lifecycler is shutting down" ring=store-gateway
2024-02-20 17:23:28.703 ts=2024-02-20T17:23:28.703412352Z caller=basic_lifecycler.go:272 level=info msg="keeping instance the ring" ring=store-gateway
2024-02-20 17:23:29.224 ts=2024-02-20T17:23:29.224734713Z caller=spanlogger.go:109 user=TENANT_ID method=indexheader.NewStreamBinaryReader level=info msg="loading from sparse index-header from disk" id=01HHXD7WKWFXMRGGWKW1WK6QRT path=/data/tsdb/TENANT_ID/01HHXD7WKWFXMRGGWKW1WK6QRT/sparse-index-header
2024-02-20 17:23:29.439 ts=2024-02-20T17:23:29.439053103Z caller=spanlogger.go:109 user=TENANT_ID method=indexheader.NewStreamBinaryReader level=info msg="loaded sparse index-header from disk" id=01HHXD7WKWFXMRGGWKW1WK6QRT path=/data/tsdb/TENANT_ID/01HHXD7WKWFXMRGGWKW1WK6QRT/sparse-index-header elapsed=214.323297ms
2024-02-20 17:23:31.041 ts=2024-02-20T17:23:31.041114247Z caller=bucket.go:407 level=info user=TENANT_ID msg="loaded new block" elapsed=6.596829693s id=01H5P2P8WPJCW2Q9SY21E9TGHB
2024-02-20 17:23:33.557 ts=2024-02-20T17:23:33.557610655Z caller=bucket.go:407 level=info user=TENANT_ID msg="loaded new block" elapsed=7.073626918s id=01H71VTNH5G4Q9966JTZ8HG2A8
2024-02-20 17:23:34.285 ts=2024-02-20T17:23:34.284952225Z caller=bucket.go:407 level=info user=TENANT_ID msg="loaded new block" elapsed=5.717122746s id=01HHXD7WKWFXMRGGWKW1WK6QRT
2024-02-20 17:23:34.825 ts=2024-02-20T17:23:34.825689226Z caller=bucket.go:407 level=info user=TENANT_ID msg="loaded new block" elapsed=8.085276076s id=01HC1K75ZFWS1JZG259X7NWVBK
2024-02-20 17:24:03.213 ts=2024-02-20T17:24:03.213297492Z caller=bucket_stores.go:186 level=warn msg="failed to synchronize TSDB blocks" err="context canceled"
2024-02-20 17:24:03.213 ts=2024-02-20T17:24:03.213420377Z caller=mimir.go:870 level=error msg="module failed" module=store-gateway err="starting module store-gateway: context canceled"
2024-02-20 17:24:03.213 ts=2024-02-20T17:24:03.213430039Z caller=module_service.go:120 level=info msg="module stopped" module=runtime-config
2024-02-20 17:24:03.213 ts=2024-02-20T17:24:03.213464897Z caller=memberlist_client.go:720 level=info msg="leaving memberlist cluster"
2024-02-20 17:24:05.156 ts=2024-02-20T17:24:05.1561227Z caller=module_service.go:120 level=info msg="module stopped" module=memberlist-kv
2024-02-20 17:24:05.156 ts=2024-02-20T17:24:05.156199619Z caller=module_service.go:120 level=info msg="module stopped" module=distributor-bi-push-wrapper
2024-02-20 17:24:09.373 ts=2024-02-20T17:24:09.372919495Z caller=server_service.go:55 level=info msg="server stopped"
2024-02-20 17:24:09.373 ts=2024-02-20T17:24:09.372956117Z caller=module_service.go:120 level=info msg="module stopped" module=server
2024-02-20 17:24:09.373 ts=2024-02-20T17:24:09.373006148Z caller=module_service.go:120 level=info msg="module stopped" module=sanity-check
2024-02-20 17:24:09.373 ts=2024-02-20T17:24:09.373016691Z caller=module_service.go:120 level=info msg="module stopped" module=usage-stats
2024-02-20 17:24:09.373 ts=2024-02-20T17:24:09.373031091Z caller=module_service.go:120 level=info msg="module stopped" module=license-manager
2024-02-20 17:24:09.374 ts=2024-02-20T17:24:09.374042889Z caller=module_service.go:120 level=info msg="module stopped" module=activity-tracker
2024-02-20 17:24:09.374 ts=2024-02-20T17:24:09.374087122Z caller=mimir.go:859 level=info msg="Application stopped"
2024-02-20 17:24:09.374 ts=2024-02-20T17:24:09.37417627Z caller=server_util.go:26 level=error original_caller=main.go:266 msg="error running Grafana Enterprise Metrics" err="failed services\ngithub.com/grafana/mimir/pkg/mimir.(*Mimir).Run\n\t/drone/src/vendor/github.com/grafana/mimir/pkg/mimir/mimir.go:913\ngithub.com/grafana/backend-enterprise/pkg/enterprise/mimir/init.(*MimirEnterprise).Run\n\t/drone/src/pkg/enterprise/mimir/init/mimir.go:196\nmain.main\n\t/drone/src/cmd/enterprise-metrics/main.go:263\nruntime.main\n\t$GOROOT/src/runtime/proc.go:267\nruntime.goexit\n\t$GOROOT/src/runtime/asm_amd64.s:1650"
2024-02-20 17:24:11.704 2024/02/20 17:24:11 proto: duplicate proto type registered: httpgrpc.HTTPRequest
2024-02-20 17:24:11.704 2024/02/20 17:24:11 proto: duplicate proto type registered: httpgrpc.HTTPResponse
2024-02-20 17:24:11.704 2024/02/20 17:24:11 proto: duplicate proto type registered: httpgrpc.Header
2024-02-20 17:24:11.719 ts=2024-02-20T17:24:11.719778345Z caller=main.go:262 level=info msg="starting Grafana Enterprise Metrics" version="(version=r276-b5a67b33, branch=weekly-r276, revision=b5a67b33, boring_crypto=false)"
2024-02-20 17:24:11.788 ts=2024-02-20T17:24:11.78816214Z caller=added_modules.go:197 level=info msg="management object storage client not starting, no backend configured"
2024-02-20 17:24:11.792 ts=2024-02-20T17:24:11.792507166Z caller=manager.go:168 level=warn msg="unable to load local license" error="license token file not found: ./license.jwt"
2024-02-20 17:24:11.801 ts=2024-02-20T17:24:11.800950159Z caller=filesystem.go:46 level=info msg="excluding mount points" mounts=^/(dev|proc|sys|var/lib/docker/.+)($|/)
2024-02-20 17:24:11.801 ts=2024-02-20T17:24:11.801047842Z caller=filesystem.go:49 level=info msg="excluding filesystem types" types=^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$
2024-02-20 17:24:11.801 ts=2024-02-20T17:24:11.801628369Z caller=quota.go:299 level=info msg="registered quota collector"
2024-02-20 17:24:11.801 ts=2024-02-20T17:24:11.801747233Z caller=added_modules.go:1075 level=warn msg="unable to load config extensions due to no valid license being found"
2024-02-20 17:24:11.802 ts=2024-02-20T17:24:11.802117924Z caller=server.go:338 level=info msg="server listening on addresses" http=[::]:80 grpc=[::]:9095
2024-02-20 17:24:11.872 ts=2024-02-20T17:24:11.872287439Z caller=added_modules.go:1261 level=warn msg="no admin client configured, instance-level API limits will be disabled"
2024-02-20 17:24:11.873 ts=2024-02-20T17:24:11.873447094Z caller=memberlist_client.go:435 level=info msg="Using memberlist cluster label and node name" cluster_label=prod-us-central-0.mimir-dedicated-30 node=store-gateway-zone-b-24-34eb7a21
2024-02-20 17:24:11.886 ts=2024-02-20T17:24:11.885840395Z caller=memberlist_client.go:541 level=info msg="memberlist fast-join starting" nodes_found=394 to_join=12
2024-02-20 17:24:11.950 ts=2024-02-20T17:24:11.950603157Z caller=remote.go:71 level=info msg="created remote index cache"
2024-02-20 17:24:11.957 ts=2024-02-20T17:24:11.957678754Z caller=module_service.go:82 level=info msg=starting module=sanity-check
2024-02-20 17:24:11.957 ts=2024-02-20T17:24:11.957785104Z caller=sanity_check.go:32 level=info msg="Checking directories read/write access"
2024-02-20 17:24:11.958 ts=2024-02-20T17:24:11.957986081Z caller=module_service.go:82 level=info msg=starting module=usage-stats
2024-02-20 17:24:11.958 ts=2024-02-20T17:24:11.958077717Z caller=module_service.go:82 level=info msg=starting module=distributor-bi-push-wrapper
2024-02-20 17:24:11.958 ts=2024-02-20T17:24:11.958188733Z caller=module_service.go:82 level=info msg=starting module=activity-tracker
2024-02-20 17:24:11.958 ts=2024-02-20T17:24:11.958385885Z caller=module_service.go:82 level=info msg=starting module=license-manager
2024-02-20 17:24:11.958 ts=2024-02-20T17:24:11.958454919Z caller=manager.go:130 level=info msg="license manager up and running"
2024-02-20 17:24:11.991 ts=2024-02-20T17:24:11.991664018Z caller=sanity_check.go:37 level=info msg="Directories read/write access successfully checked"
2024-02-20 17:24:11.991 ts=2024-02-20T17:24:11.991727069Z caller=sanity_check.go:39 level=info msg="Checking object storage config"
2024-02-20 17:24:12.090 ts=2024-02-20T17:24:12.089984236Z caller=reporter.go:132 level=info msg="usage stats reporter initialized" cluster_id=340956c3-9091-4559-a75b-18ecca8eebf3
2024-02-20 17:24:12.095 ts=2024-02-20T17:24:12.095415136Z caller=sanity_check.go:44 level=info msg="Object storage config successfully checked"
2024-02-20 17:24:12.095 ts=2024-02-20T17:24:12.095474867Z caller=module_service.go:82 level=info msg=starting module=server
2024-02-20 17:24:12.097 ts=2024-02-20T17:24:12.097304832Z caller=module_service.go:82 level=info msg=starting module=memberlist-kv
2024-02-20 17:24:12.097 ts=2024-02-20T17:24:12.097340564Z caller=module_service.go:82 level=info msg=starting module=runtime-config
2024-02-20 17:24:12.098 ts=2024-02-20T17:24:12.098685566Z caller=module_service.go:82 level=info msg=starting module=store-gateway
2024-02-20 17:24:13.000 ts=2024-02-20T17:24:12.999901464Z caller=memberlist_client.go:561 level=info msg="memberlist fast-join finished" joined_nodes=12 elapsed_time=1.125417476s
2024-02-20 17:24:13.000 ts=2024-02-20T17:24:12.999964979Z caller=memberlist_client.go:573 level=info phase=startup msg="joining memberlist cluster" join_members=dns+gossip-ring.mimir-dedicated-30.svc.cluster.local.:7946
2024-02-20 17:24:13.000 ts=2024-02-20T17:24:13.000132316Z caller=basic_lifecycler.go:295 level=info msg="instance found in the ring" instance=store-gateway-zone-b-24 ring=store-gateway state=LEAVING tokens=512 registered_at="2023-06-03 01:45:54 +0000 UTC"
2024-02-20 17:24:13.000 ts=2024-02-20T17:24:13.000173702Z caller=basic_lifecycler_delegates.go:69 level=info msg="not loading tokens from file, instance already in the ring"
2024-02-20 17:24:13.018 ts=2024-02-20T17:24:13.018545627Z caller=gateway.go:216 level=info msg="waiting until store-gateway is JOINING in the ring"
2024-02-20 17:24:13.524 ts=2024-02-20T17:24:13.524814713Z caller=gateway.go:220 level=info msg="store-gateway is JOINING in the ring"
2024-02-20 17:24:13.524 ts=2024-02-20T17:24:13.524863817Z caller=gateway.go:230 level=info msg="waiting until store-gateway ring topology is stable" min_waiting=1m0s max_waiting=5m0s
2024-02-20 17:24:47.812 ts=2024-02-20T17:24:47.812260079Z caller=memberlist_client.go:580 level=info phase=startup msg="joining memberlist cluster succeeded" reached_nodes=394 elapsed_time=34.812280636s
2024-02-20 17:25:20.525 ts=2024-02-20T17:25:20.525196144Z caller=gateway.go:234 level=info msg="store-gateway ring topology is stable"
2024-02-20 17:25:20.525 ts=2024-02-20T17:25:20.5252633Z caller=bucket_stores.go:181 level=info msg="synchronizing TSDB blocks for all users"
2024-02-20 17:25:20.526 ts=2024-02-20T17:25:20.526740526Z caller=bucket_stores.go:434 level=info user=TENANT_ID msg="creating user bucket store"
2024-02-20 17:25:20.612 ts=2024-02-20T17:25:20.612647394Z caller=bucket_stores.go:434 level=info user=load-generator-1 msg="creating user bucket store"
```

dimitarvdimitrov commented 8 months ago

After sending another SIGTERM to the same pod via `kubectl delete pod`, the store-gateway logged the following and successfully stopped:

Logs

```
ts=2024-02-20T18:34:47.510242666Z caller=bucket.go:407 level=info user=TENANT_ID msg="loaded new block" elapsed=1m39.684969953s id=01H94SF3H0G3BWPV58CHC90G2K
ts=2024-02-20T18:35:28.612765576Z caller=bucket.go:407 level=info user=TENANT_ID msg="loaded new block" elapsed=5m53.154192184s id=01HMJM2V7AGHFJ55VR2GK0DP2D
ts=2024-02-20T18:35:35.53787377Z caller=signals.go:62 level=info msg="=== received SIGINT/SIGTERM ===\n*** exiting"
ts=2024-02-20T18:35:35.538314105Z caller=basic_lifecycler.go:238 level=info msg="ring lifecycler is shutting down" ring=store-gateway
ts=2024-02-20T18:35:35.538904367Z caller=basic_lifecycler.go:272 level=info msg="keeping instance the ring" ring=store-gateway
ts=2024-02-20T18:35:58.04726427Z caller=bucket.go:407 level=info user=TENANT_ID msg="loaded new block" elapsed=2m8.428667103s id=01HE71K9KJ53DD19A8M7CJR0ZJ
ts=2024-02-20T18:35:59.15852666Z caller=bucket.go:405 level=error user=TENANT_ID msg="loading block failed" elapsed=30.545670847s id=01H62WS0AV9QBVH1Z112XX1B68 err="create index header reader: write index header: copy posting offsets: context canceled"
ts=2024-02-20T18:36:01.64761372Z caller=bucket.go:407 level=info user=TENANT_ID msg="loaded new block" elapsed=1m36.716602529s id=01HJ00W1PB4V2D16N9X9QCBCKA
ts=2024-02-20T18:36:06.87322684Z caller=bucket.go:407 level=info user=TENANT_ID msg="loaded new block" elapsed=1m19.362919762s id=01H9PRD4AJQTPH69D1TPH8J35Y
ts=2024-02-20T18:36:07.409658823Z caller=bucket_stores.go:186 level=warn msg="failed to synchronize TSDB blocks" err="context canceled"
ts=2024-02-20T18:36:07.409768038Z caller=mimir.go:870 level=error msg="module failed" module=store-gateway err="starting module store-gateway: context canceled"
ts=2024-02-20T18:36:07.40978213Z caller=memberlist_client.go:720 level=info msg="leaving memberlist cluster"
ts=2024-02-20T18:36:07.409811045Z caller=module_service.go:120 level=info msg="module stopped" module=runtime-config
ts=2024-02-20T18:36:17.436149359Z caller=memberlist_client.go:735 level=warn msg="broadcast messages left in queue" count=4 nodes=438
ts=2024-02-20T18:36:18.078030881Z caller=module_service.go:120 level=info msg="module stopped" module=memberlist-kv
ts=2024-02-20T18:36:18.078137597Z caller=module_service.go:120 level=info msg="module stopped" module=distributor-bi-push-wrapper
ts=2024-02-20T18:36:24.4117936Z caller=server_service.go:55 level=info msg="server stopped"
ts=2024-02-20T18:36:24.411843823Z caller=module_service.go:120 level=info msg="module stopped" module=server
ts=2024-02-20T18:36:24.411900492Z caller=module_service.go:120 level=info msg="module stopped" module=usage-stats
ts=2024-02-20T18:36:24.411910802Z caller=module_service.go:120 level=info msg="module stopped" module=sanity-check
ts=2024-02-20T18:36:24.411939777Z caller=module_service.go:120 level=info msg="module stopped" module=license-manager
ts=2024-02-20T18:36:24.413388804Z caller=module_service.go:120 level=info msg="module stopped" module=activity-tracker
ts=2024-02-20T18:36:24.413423322Z caller=mimir.go:859 level=info msg="Application stopped"
ts=2024-02-20T18:36:24.413536373Z caller=server_util.go:26 level=error original_caller=main.go:266 msg="error running Grafana Enterprise Metrics" err="failed services\ngithub.com/grafana/mimir/pkg/mimir.(*Mimir).Run\n\t/drone/src/vendor/github.com/grafana/mimir/pkg/mimir/mimir.go:913\ngithub.com/grafana/backend-enterprise/pkg/enterprise/mimir/init.(*MimirEnterprise).Run\n\t/drone/src/pkg/enterprise/mimir/init/mimir.go:196\nmain.main\n\t/drone/src/cmd/enterprise-metrics/main.go:263\nruntime.main\n\t$GOROOT/src/runtime/proc.go:267\nruntime.goexit\n\t$GOROOT/src/runtime/asm_amd64.s:1650"
```
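
The sequence above (a few more "loaded new block" lines after the signal, then one "loading block failed ... context canceled", then "failed to synchronize TSDB blocks") is what you'd expect when cancellation only takes effect at each load's next checkpoint. A minimal Go sketch of that propagation follows; it is not Mimir's code, and the block IDs and timings are made up for illustration.

```go
// Sketch only: in-flight loads keep finishing (or fail at their next context
// check) after cancellation, and only then does the overall sync return an error.
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// loadBlock simulates a long block load that only notices cancellation at its
// next checkpoint.
func loadBlock(ctx context.Context, id string, d time.Duration) error {
	select {
	case <-time.After(d):
		return nil
	case <-ctx.Done():
		return fmt.Errorf("copy posting offsets: %w", ctx.Err())
	}
}

// syncBlocks loads blocks concurrently and reports a cancellation error only
// after all in-flight loads have wound down.
func syncBlocks(ctx context.Context, ids []string) error {
	var wg sync.WaitGroup
	errs := make(chan error, len(ids))
	for i, id := range ids {
		wg.Add(1)
		go func(id string, d time.Duration) {
			defer wg.Done()
			if err := loadBlock(ctx, id, d); err != nil {
				fmt.Println("loading block failed:", id, err)
				errs <- err
				return
			}
			fmt.Println("loaded new block:", id)
		}(id, time.Duration(i+1)*time.Second)
	}
	wg.Wait()
	close(errs)
	for err := range errs {
		if errors.Is(err, context.Canceled) {
			return err
		}
	}
	return nil
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		time.Sleep(2500 * time.Millisecond) // stand-in for the SIGTERM arriving mid-sync
		cancel()
	}()
	if err := syncBlocks(ctx, []string{"A", "B", "C", "D"}); err != nil {
		fmt.Println("failed to synchronize TSDB blocks:", err)
	}
}
```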

dimitarvdimitrov commented 8 months ago

The same store-gateway also deleted/unloaded its blocks upon shutdown:

Screenshot 2024-02-20 at 20 04 15

56quarters commented 8 months ago

Weird, reading the logs posted it seems like it shut down after the first SIGTERM, but then started again after about 32s and received another SIGTERM about 1m20s later. I guess it's possible that there's a race condition here, but I can't quite see how.

dimitarvdimitrov commented 8 months ago

I should've made this more clear. The posted logs in the original issue contain two SIGTERMs. The first at 17:21:22.793 is from a clean shutdown. The one at 17:23:28.702 was from the preemptive shutdown due to rescheduling.

The one at 18:35:35.53787377Z in my comment above is a third shutdown, which I triggered with `kubectl delete pod`.

dimitarvdimitrov commented 8 months ago

On a second look, it seems like the store-gateway actually started again at 17:24:11.719.

I wonder if the non-ready state was because it was loading blocks and not because it was stuck in a shutdown loop. This might have been caused by a wiped persistent volume, in which case the store-gateway behaved alright.

dimitarvdimitrov commented 8 months ago

As noted by @narqo privately, the second shutdown in the original logs actually resulted in a new pod with a new container ID. So it seems like the store-gateway shut down properly and started re-downloading blocks. This may be explained by a wiped PVC, but I don't see an easy way this is a bug in Mimir, so I'll close this issue.