Closed: dimitarvdimitrov closed this issue 8 months ago.
After sending another SIGTERM via `kubectl delete pod` to the same pod, the store-gateway logged this and successfully stopped:
```
ts=2024-02-20T18:34:47.510242666Z caller=bucket.go:407 level=info user=TENANT_ID msg="loaded new block" elapsed=1m39.684969953s id=01H94SF3H0G3BWPV58CHC90G2K
ts=2024-02-20T18:35:28.612765576Z caller=bucket.go:407 level=info user=TENANT_ID msg="loaded new block" elapsed=5m53.154192184s id=01HMJM2V7AGHFJ55VR2GK0DP2D
ts=2024-02-20T18:35:35.53787377Z caller=signals.go:62 level=info msg="=== received SIGINT/SIGTERM ===\n*** exiting"
ts=2024-02-20T18:35:35.538314105Z caller=basic_lifecycler.go:238 level=info msg="ring lifecycler is shutting down" ring=store-gateway
ts=2024-02-20T18:35:35.538904367Z caller=basic_lifecycler.go:272 level=info msg="keeping instance the ring" ring=store-gateway
ts=2024-02-20T18:35:58.04726427Z caller=bucket.go:407 level=info user=TENANT_ID msg="loaded new block" elapsed=2m8.428667103s id=01HE71K9KJ53DD19A8M7CJR0ZJ
ts=2024-02-20T18:35:59.15852666Z caller=bucket.go:405 level=error user=TENANT_ID msg="loading block failed" elapsed=30.545670847s id=01H62WS0AV9QBVH1Z112XX1B68 err="create index header reader: write index header: copy posting offsets: context canceled"
ts=2024-02-20T18:36:01.64761372Z caller=bucket.go:407 level=info user=TENANT_ID msg="loaded new block" elapsed=1m36.716602529s id=01HJ00W1PB4V2D16N9X9QCBCKA
ts=2024-02-20T18:36:06.87322684Z caller=bucket.go:407 level=info user=TENANT_ID msg="loaded new block" elapsed=1m19.362919762s id=01H9PRD4AJQTPH69D1TPH8J35Y
ts=2024-02-20T18:36:07.409658823Z caller=bucket_stores.go:186 level=warn msg="failed to synchronize TSDB blocks" err="context canceled"
ts=2024-02-20T18:36:07.409768038Z caller=mimir.go:870 level=error msg="module failed" module=store-gateway err="starting module store-gateway: context canceled"
ts=2024-02-20T18:36:07.40978213Z caller=memberlist_client.go:720 level=info msg="leaving memberlist cluster"
ts=2024-02-20T18:36:07.409811045Z caller=module_service.go:120 level=info msg="module stopped" module=runtime-config
ts=2024-02-20T18:36:17.436149359Z caller=memberlist_client.go:735 level=warn msg="broadcast messages left in queue" count=4 nodes=438
ts=2024-02-20T18:36:18.078030881Z caller=module_service.go:120 level=info msg="module stopped" module=memberlist-kv
ts=2024-02-20T18:36:18.078137597Z caller=module_service.go:120 level=info msg="module stopped" module=distributor-bi-push-wrapper
ts=2024-02-20T18:36:24.4117936Z caller=server_service.go:55 level=info msg="server stopped"
ts=2024-02-20T18:36:24.411843823Z caller=module_service.go:120 level=info msg="module stopped" module=server
ts=2024-02-20T18:36:24.411900492Z caller=module_service.go:120 level=info msg="module stopped" module=usage-stats
ts=2024-02-20T18:36:24.411910802Z caller=module_service.go:120 level=info msg="module stopped" module=sanity-check
ts=2024-02-20T18:36:24.411939777Z caller=module_service.go:120 level=info msg="module stopped" module=license-manager
ts=2024-02-20T18:36:24.413388804Z caller=module_service.go:120 level=info msg="module stopped" module=activity-tracker
ts=2024-02-20T18:36:24.413423322Z caller=mimir.go:859 level=info msg="Application stopped"
ts=2024-02-20T18:36:24.413536373Z caller=server_util.go:26 level=error original_caller=main.go:266 msg="error running Grafana Enterprise Metrics" err="failed services\ngithub.com/grafana/mimir/pkg/mimir.(*Mimir).Run\n\t/drone/src/vendor/github.com/grafana/mimir/pkg/mimir/mimir.go:913\ngithub.com/grafana/backend-enterprise/pkg/enterprise/mimir/init.(*MimirEnterprise).Run\n\t/drone/src/pkg/enterprise/mimir/init/mimir.go:196\nmain.main\n\t/drone/src/cmd/enterprise-metrics/main.go:263\nruntime.main\n\t$GOROOT/src/runtime/proc.go:267\nruntime.goexit\n\t$GOROOT/src/runtime/asm_amd64.s:1650"
```
The same store-gateway also deleted/unloaded its blocks upon the shutdown.
Weird, reading the posted logs it seems like it shut down after the first SIGTERM, but then started again after about 32s, and then received another SIGTERM about 1m20s later. I guess it's possible that there's a race condition here, but I can't quite see how.
I should've made this clearer. The logs posted in the original issue contain two SIGTERMs: the first, at 17:21:22.793, is from a clean shutdown; the one at 17:23:28.702 was from the preemptive shutdown due to rescheduling. The shutdown at 18:35:35.53787377Z in the comment above is a third one, which I triggered with `kubectl delete pod`.
On a second look, it seems like the store-gateway actually started again at 17:24:11.719.
I wonder if the non-ready state was because it was loading blocks and not because it was stuck in a shutdown loop. This might have been caused by a wiped persistent volume, in which case the store-gateway behaved alright.
As noted by @narqo privately, the second shutdown in the original logs actually created a new pod with a new container ID. So it seems like the store-gateway shut down properly and started redownloading blocks. This may be explained by a wiped PVC, but I don't see an easy way this could be a bug in Mimir, so I'll close this issue.
Describe the bug
The store-gateway doesn't stop if it receives a SIGTERM while starting up.
To Reproduce
The store-gateway pod received `SIGINT` about 30s after starting up, while it was still in its startup procedure. The pod was being rescheduled. The pod keeps returning 503 on `HTTP GET /ready` and reports that the `store-gateway` service is still in the `Starting` state.
Expected behavior
The store-gateway recognizes the SIGINT, aborts the startup procedure, and shuts down.
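For illustration only, here is a minimal Go sketch of that expected behavior (not Mimir's actual implementation; the `loadBlocks` loop and block IDs are made up): a SIGINT/SIGTERM cancels a context, and the startup block-loading loop checks that context between blocks so a pending shutdown aborts startup promptly instead of the sync running to completion.

```go
// Minimal sketch of signal-aware startup (not Mimir's actual code).
// The loadBlocks loop and block IDs below are made up for illustration.
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// loadBlocks stands in for the store-gateway's initial block sync.
// It checks the context between blocks so a shutdown signal aborts startup.
func loadBlocks(ctx context.Context, blockIDs []string) error {
	for _, id := range blockIDs {
		select {
		case <-ctx.Done():
			return fmt.Errorf("startup aborted: %w", ctx.Err())
		default:
		}
		time.Sleep(100 * time.Millisecond) // simulate downloading and indexing one block
		fmt.Println("loaded block", id)
	}
	return nil
}

func main() {
	// SIGINT/SIGTERM cancel the context, mirroring the
	// "=== received SIGINT/SIGTERM ===" handling seen in the logs.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()

	if err := loadBlocks(ctx, []string{"block-1", "block-2", "block-3"}); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("startup complete")
}
```

The only point of the sketch is that the cancelled context is noticed between block loads, so shutdown doesn't have to wait for the whole sync to finish.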
Environment
Additional Context
logs.txt (includes one clean shutdown at 17:21:22.793 and one hanging at 17:23:28.702)
Excerpt:
```
ts=2024-02-20T17:23:28.702464128Z caller=signals.go:62 level=info msg="=== received SIGINT/SIGTERM ===\n*** exiting"
ts=2024-02-20T17:23:28.702876938Z caller=basic_lifecycler.go:238 level=info msg="ring lifecycler is shutting down" ring=store-gateway
ts=2024-02-20T17:23:28.703412352Z caller=basic_lifecycler.go:272 level=info msg="keeping instance the ring" ring=store-gateway
ts=2024-02-20T17:23:29.224734713Z caller=spanlogger.go:109 user=TENANT_ID method=indexheader.NewStreamBinaryReader level=info msg="loading from sparse index-header from disk" id=01HHXD7WKWFXMRGGWKW1WK6QRT path=/data/tsdb/TENANT_ID/01HHXD7WKWFXMRGGWKW1WK6QRT/sparse-index-header
ts=2024-02-20T17:23:29.439053103Z caller=spanlogger.go:109 user=TENANT_ID method=indexheader.NewStreamBinaryReader level=info msg="loaded sparse index-header from disk" id=01HHXD7WKWFXMRGGWKW1WK6QRT path=/data/tsdb/TENANT_ID/01HHXD7WKWFXMRGGWKW1WK6QRT/sparse-index-header elapsed=214.323297ms
ts=2024-02-20T17:23:31.041114247Z caller=bucket.go:407 level=info user=TENANT_ID msg="loaded new block" elapsed=6.596829693s id=01H5P2P8WPJCW2Q9SY21E9TGHB
ts=2024-02-20T17:23:33.557610655Z caller=bucket.go:407 level=info user=TENANT_ID msg="loaded new block" elapsed=7.073626918s id=01H71VTNH5G4Q9966JTZ8HG2A8
ts=2024-02-20T17:23:34.284952225Z caller=bucket.go:407 level=info user=TENANT_ID msg="loaded new block" elapsed=5.717122746s id=01HHXD7WKWFXMRGGWKW1WK6QRT
ts=2024-02-20T17:23:34.825689226Z caller=bucket.go:407 level=info user=TENANT_ID msg="loaded new block" elapsed=8.085276076s id=01HC1K75ZFWS1JZG259X7NWVBK
ts=2024-02-20T17:24:03.213297492Z caller=bucket_stores.go:186 level=warn msg="failed to synchronize TSDB blocks" err="context canceled"
ts=2024-02-20T17:24:03.213420377Z caller=mimir.go:870 level=error msg="module failed" module=store-gateway err="starting module store-gateway: context canceled"
ts=2024-02-20T17:24:03.213430039Z caller=module_service.go:120 level=info msg="module stopped" module=runtime-config
ts=2024-02-20T17:24:03.213464897Z caller=memberlist_client.go:720 level=info msg="leaving memberlist cluster"
ts=2024-02-20T17:24:05.1561227Z caller=module_service.go:120 level=info msg="module stopped" module=memberlist-kv
ts=2024-02-20T17:24:05.156199619Z caller=module_service.go:120 level=info msg="module stopped" module=distributor-bi-push-wrapper
ts=2024-02-20T17:24:09.372919495Z caller=server_service.go:55 level=info msg="server stopped"
ts=2024-02-20T17:24:09.372956117Z caller=module_service.go:120 level=info msg="module stopped" module=server
ts=2024-02-20T17:24:09.373006148Z caller=module_service.go:120 level=info msg="module stopped" module=sanity-check
ts=2024-02-20T17:24:09.373016691Z caller=module_service.go:120 level=info msg="module stopped" module=usage-stats
ts=2024-02-20T17:24:09.373031091Z caller=module_service.go:120 level=info msg="module stopped" module=license-manager
ts=2024-02-20T17:24:09.374042889Z caller=module_service.go:120 level=info msg="module stopped" module=activity-tracker
ts=2024-02-20T17:24:09.374087122Z caller=mimir.go:859 level=info msg="Application stopped"
ts=2024-02-20T17:24:09.37417627Z caller=server_util.go:26 level=error original_caller=main.go:266 msg="error running Grafana Enterprise Metrics" err="failed services\ngithub.com/grafana/mimir/pkg/mimir.(*Mimir).Run\n\t/drone/src/vendor/github.com/grafana/mimir/pkg/mimir/mimir.go:913\ngithub.com/grafana/backend-enterprise/pkg/enterprise/mimir/init.(*MimirEnterprise).Run\n\t/drone/src/pkg/enterprise/mimir/init/mimir.go:196\nmain.main\n\t/drone/src/cmd/enterprise-metrics/main.go:263\nruntime.main\n\t$GOROOT/src/runtime/proc.go:267\nruntime.goexit\n\t$GOROOT/src/runtime/asm_amd64.s:1650"
2024/02/20 17:24:11 proto: duplicate proto type registered: httpgrpc.HTTPRequest
2024/02/20 17:24:11 proto: duplicate proto type registered: httpgrpc.HTTPResponse
2024/02/20 17:24:11 proto: duplicate proto type registered: httpgrpc.Header
ts=2024-02-20T17:24:11.719778345Z caller=main.go:262 level=info msg="starting Grafana Enterprise Metrics" version="(version=r276-b5a67b33, branch=weekly-r276, revision=b5a67b33, boring_crypto=false)"
ts=2024-02-20T17:24:11.78816214Z caller=added_modules.go:197 level=info msg="management object storage client not starting, no backend configured"
ts=2024-02-20T17:24:11.792507166Z caller=manager.go:168 level=warn msg="unable to load local license" error="license token file not found: ./license.jwt"
ts=2024-02-20T17:24:11.800950159Z caller=filesystem.go:46 level=info msg="excluding mount points" mounts=^/(dev|proc|sys|var/lib/docker/.+)($|/)
ts=2024-02-20T17:24:11.801047842Z caller=filesystem.go:49 level=info msg="excluding filesystem types" types=^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$
ts=2024-02-20T17:24:11.801628369Z caller=quota.go:299 level=info msg="registered quota collector"
ts=2024-02-20T17:24:11.801747233Z caller=added_modules.go:1075 level=warn msg="unable to load config extensions due to no valid license being found"
ts=2024-02-20T17:24:11.802117924Z caller=server.go:338 level=info msg="server listening on addresses" http=[::]:80 grpc=[::]:9095
ts=2024-02-20T17:24:11.872287439Z caller=added_modules.go:1261 level=warn msg="no admin client configured, instance-level API limits will be disabled"
ts=2024-02-20T17:24:11.873447094Z caller=memberlist_client.go:435 level=info msg="Using memberlist cluster label and node name" cluster_label=prod-us-central-0.mimir-dedicated-30 node=store-gateway-zone-b-24-34eb7a21
ts=2024-02-20T17:24:11.885840395Z caller=memberlist_client.go:541 level=info msg="memberlist fast-join starting" nodes_found=394 to_join=12
ts=2024-02-20T17:24:11.950603157Z caller=remote.go:71 level=info msg="created remote index cache"
ts=2024-02-20T17:24:11.957678754Z caller=module_service.go:82 level=info msg=starting module=sanity-check
ts=2024-02-20T17:24:11.957785104Z caller=sanity_check.go:32 level=info msg="Checking directories read/write access"
ts=2024-02-20T17:24:11.957986081Z caller=module_service.go:82 level=info msg=starting module=usage-stats
ts=2024-02-20T17:24:11.958077717Z caller=module_service.go:82 level=info msg=starting module=distributor-bi-push-wrapper
ts=2024-02-20T17:24:11.958188733Z caller=module_service.go:82 level=info msg=starting module=activity-tracker
ts=2024-02-20T17:24:11.958385885Z caller=module_service.go:82 level=info msg=starting module=license-manager
ts=2024-02-20T17:24:11.958454919Z caller=manager.go:130 level=info msg="license manager up and running"
ts=2024-02-20T17:24:11.991664018Z caller=sanity_check.go:37 level=info msg="Directories read/write access successfully checked"
ts=2024-02-20T17:24:11.991727069Z caller=sanity_check.go:39 level=info msg="Checking object storage config"
ts=2024-02-20T17:24:12.089984236Z caller=reporter.go:132 level=info msg="usage stats reporter initialized" cluster_id=340956c3-9091-4559-a75b-18ecca8eebf3
ts=2024-02-20T17:24:12.095415136Z caller=sanity_check.go:44 level=info msg="Object storage config successfully checked"
ts=2024-02-20T17:24:12.095474867Z caller=module_service.go:82 level=info msg=starting module=server
ts=2024-02-20T17:24:12.097304832Z caller=module_service.go:82 level=info msg=starting module=memberlist-kv
ts=2024-02-20T17:24:12.097340564Z caller=module_service.go:82 level=info msg=starting module=runtime-config
ts=2024-02-20T17:24:12.098685566Z caller=module_service.go:82 level=info msg=starting module=store-gateway
ts=2024-02-20T17:24:12.999901464Z caller=memberlist_client.go:561 level=info msg="memberlist fast-join finished" joined_nodes=12 elapsed_time=1.125417476s
ts=2024-02-20T17:24:12.999964979Z caller=memberlist_client.go:573 level=info phase=startup msg="joining memberlist cluster" join_members=dns+gossip-ring.mimir-dedicated-30.svc.cluster.local.:7946
ts=2024-02-20T17:24:13.000132316Z caller=basic_lifecycler.go:295 level=info msg="instance found in the ring" instance=store-gateway-zone-b-24 ring=store-gateway state=LEAVING tokens=512 registered_at="2023-06-03 01:45:54 +0000 UTC"
ts=2024-02-20T17:24:13.000173702Z caller=basic_lifecycler_delegates.go:69 level=info msg="not loading tokens from file, instance already in the ring"
ts=2024-02-20T17:24:13.018545627Z caller=gateway.go:216 level=info msg="waiting until store-gateway is JOINING in the ring"
ts=2024-02-20T17:24:13.524814713Z caller=gateway.go:220 level=info msg="store-gateway is JOINING in the ring"
ts=2024-02-20T17:24:13.524863817Z caller=gateway.go:230 level=info msg="waiting until store-gateway ring topology is stable" min_waiting=1m0s max_waiting=5m0s
ts=2024-02-20T17:24:47.812260079Z caller=memberlist_client.go:580 level=info phase=startup msg="joining memberlist cluster succeeded" reached_nodes=394 elapsed_time=34.812280636s
ts=2024-02-20T17:25:20.525196144Z caller=gateway.go:234 level=info msg="store-gateway ring topology is stable"
ts=2024-02-20T17:25:20.5252633Z caller=bucket_stores.go:181 level=info msg="synchronizing TSDB blocks for all users"
ts=2024-02-20T17:25:20.526740526Z caller=bucket_stores.go:434 level=info user=TENANT_ID msg="creating user bucket store"
ts=2024-02-20T17:25:20.612647394Z caller=bucket_stores.go:434 level=info user=load-generator-1 msg="creating user bucket store"
```