thanos-io / thanos


Thanos receive fails "no space left on device" #7391

Open · kbajy opened this issue 3 months ago

kbajy commented 3 months ago

Thanos, Prometheus and Golang version used: v0.35.0 and Prometheus v2.48.0

Object Storage Provider: Azure Blob

What happened: The receive pod ran for a couple of days without errors, then it started to crash loop (CrashLoopBackOff). The receive is running on one cluster, while the compactor is running on a different cluster.

All the Thanos store components are using the same storage config (Azure Blob Storage).
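For reference, a minimal sketch of what the shared objstore config file looks like for Azure Blob Storage (field names follow the Thanos Azure client documentation; the account, key, and container values below are placeholders, not the actual values used in this setup):

type: AZURE
config:
  storage_account: "<storage-account-name>"      # placeholder
  storage_account_key: "<storage-account-key>"   # placeholder
  container: "<container-name>"                  # placeholder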

What you expected to happen: The receive in cluster #2 keeps running the same way as the receives in clusters #1 and #3 do.

How to reproduce it (as minimally and precisely as possible):

Full logs to relevant components:

Logs

16062400000 ulid=01HY6R27XEJASRDRJZPCQFH4MM
ts=2024-05-26T03:08:38.106234246Z caller=repair.go:56 level=info component=receive component=multi-tsdb tenant=default-tenant msg="Found healthy block" mint=1716062400012 maxt=1716069600000 ulid=01HY6YXZHGXYNHB0V23SR7HTFR
ts=2024-05-26T03:08:38.106255535Z caller=repair.go:56 level=info component=receive component=multi-tsdb tenant=default-tenant msg="Found healthy block" mint=1716069600029 maxt=1716076800000 ulid=01HY75SPSN64JGPNEVMPH9JY5H
ts=2024-05-26T03:08:38.10627417Z caller=repair.go:56 level=info component=receive component=multi-tsdb tenant=default-tenant msg="Found healthy block" mint=1716076800032 maxt=1716084000000 ulid=01HY7CNDQXW6RZ9RA39029ZX1G
ts=2024-05-26T03:08:38.106915603Z caller=receive.go:601 level=info component=receive msg="shutting down storage"
ts=2024-05-26T03:08:38.106926284Z caller=receive.go:605 level=info component=receive msg="storage is flushed successfully"
ts=2024-05-26T03:08:38.1069309Z caller=receive.go:611 level=info component=receive msg="storage is closed"
ts=2024-05-26T03:08:38.106943423Z caller=http.go:91 level=info component=receive service=http/server component=receive msg="internal server is shutting down" err="opening storage: open /var/thanos/receive/default-tenant/wal/00001125: no space left on device"
ts=2024-05-26T03:08:38.106963196Z caller=receive.go:693 level=info component=receive component=uploader msg="uploading the final cut block before exiting"
ts=2024-05-26T03:08:38.106983989Z caller=receive.go:702 level=info component=receive component=uploader msg="the final cut block was uploaded" uploaded=0
ts=2024-05-26T03:08:38.107007441Z caller=http.go:110 level=info component=receive service=http/server component=receive msg="internal server is shutdown gracefully" err="opening storage: open /var/thanos/receive/default-tenant/wal/00001125: no space left on device"
ts=2024-05-26T03:08:38.107022125Z caller=intrumentation.go:81 level=info component=receive msg="changing probe status" status=not-healthy reason="opening storage: open /var/thanos/receive/default-tenant/wal/00001125: no space left on device"
ts=2024-05-26T03:08:38.107064152Z caller=grpc.go:138 level=info component=receive service=gRPC/server component=receive msg="internal server is shutting down" err="opening storage: open /var/thanos/receive/default-tenant/wal/00001125: no space left on device"
ts=2024-05-26T03:08:38.10708308Z caller=grpc.go:151 level=info component=receive service=gRPC/server component=receive msg="gracefully stopping internal server"
ts=2024-05-26T03:08:38.107113074Z caller=grpc.go:164 level=info component=receive service=gRPC/server component=receive msg="internal server is shutdown gracefully" err="opening storage: open /var/thanos/receive/default-tenant/wal/00001125: no space left on device"
ts=2024-05-26T03:08:38.107129198Z caller=intrumentation.go:81 level=info component=receive msg="changing probe status" status=not-healthy reason="opening storage: open /var/thanos/receive/default-tenant/wal/00001125: no space left on device"
ts=2024-05-26T03:08:38.107211886Z caller=main.go:171 level=error err="open /var/thanos/receive/default-tenant/wal/00001125: no space left on device\nopening storage\nmain.startTSDBAndUpload.func1\n\t/bitnami/blacksmith-sandox/thanos-0.35.0/src/github.com/thanos-io/thanos/cmd/thanos/receive.go:643\ngithub.com/oklog/run.(*Group).Run.func1\n\t/bitnami/blacksmith-sandox/thanos-0.35.0/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38\nruntime.goexit\n\t/opt/bitnami/go/src/runtime/asm_amd64.s:1650\nreceive command failed\nmain.main\n\t/bitnami/blacksmith-sandox/thanos-0.35.0/src/github.com/thanos-io/thanos/cmd/thanos/main.go:171\nruntime.main\n\t/opt/bitnami/go/src/runtime/proc.go:267\nruntime.goexit\n\t/opt/bitnami/go/src/runtime/asm_amd64.s:1650"

Anything else we need to know:

atayfour commented 3 weeks ago

We are facing the same issue with Thanos receivers. It's not yet clear what the cause is.

ts=2024-08-21T17:00:04.446859956Z caller=db.go:1014 level=error component=receive component=multi-tsdb tenant=XXXX msg="compaction failed" err="preallocate: no space left on device"

- receive
    - --log.level=warn
    - --log.format=logfmt
    - --grpc-address=0.0.0.0:10901
    - --http-address=0.0.0.0:10902
    - --remote-write.address=0.0.0.0:19291
    - --receive.replication-factor=2
    - --tsdb.retention=1d
    - --label=receive="true"
    - --objstore.config-file=/config/thanos-store.yml
    - --tsdb.path=/var/thanos/receive
    - --receive.default-tenant-id=default
    - --label=receive_replica="$(NAME)"
    - --receive.local-endpoint=$(NAME).thanos-receive.$(NAMESPACE).svc.cluster.local:10901
    - --receive.hashrings-file=/var/lib/thanos-receive/hashrings.json

We run 3 replicas and mount a 10GB volume. I checked the PVC, and only about 50% of it is used.
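
For context, the receivers' disk is provisioned roughly like this (a sketch of a StatefulSet volumeClaimTemplate, assuming the receivers run as a StatefulSet with one claim per replica; the claim name is illustrative):

volumeClaimTemplates:
  - metadata:
      name: data                      # illustrative name; backs --tsdb.path=/var/thanos/receive
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi               # the 10GB volume mentioned above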

Will decreasing the retention to 12h fix the issue?
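
Spelled out against the receive args above, the change being asked about is just this flag (a sketch; 12h is the proposed value):

    - --tsdb.retention=12h            # proposed; currently --tsdb.retention=1d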