thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Receive: failing WAL repair and never becomes healthy (resulting in CrashLoopBackOff) #7035

Open defreng opened 11 months ago

defreng commented 11 months ago

Thanos, Prometheus and Golang version used:

v0.32.4

Object Storage Provider:

No object store configured; the ingestor is running in receive-only mode.
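
For reference, the receiver is started roughly like this (flag values are illustrative, not our exact manifest); the relevant point is that there is no `--objstore.config` / `--objstore.config-file` flag, so nothing is uploaded and all blocks stay on the local TSDB path:

```
thanos receive \
  --tsdb.path=/var/thanos/receive \
  --label='replica="0"' \
  --grpc-address=0.0.0.0:10901 \
  --http-address=0.0.0.0:10902 \
  --remote-write.address=0.0.0.0:19291
# no --objstore.config / --objstore.config-file: receive-only, data kept in the local TSDB
```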

What happened:

The pod is in a crash loop with the following error message:

```
ts=2024-01-05T12:42:56.87385468Z caller=main.go:161 level=error err="repair corrupted WAL: rename /var/thanos/receive/default-tenant/wal/00000662 /var/thanos/receive/default-tenant/wal/00000662.repair: no such file or directory\nopening storage\nmain.startTSDBAndUpload.func1\n\t/app/cmd/thanos/receive.go:639\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598\nreceive command failed\nmain.main\n\t/app/cmd/thanos/main.go:161\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598"
```
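
The failing step itself is just a rename of the corrupted segment to a `.repair` file, and the message is what `os.Rename` returns when the source segment is already gone. A minimal sketch of that failure shape (paths taken from the log for illustration; this is not the actual Thanos/Prometheus repair code):

```
package main

import (
	"fmt"
	"os"
)

func main() {
	// Renaming a segment file that no longer exists fails with ENOENT,
	// which surfaces as the "no such file or directory" seen in the log above.
	src := "/var/thanos/receive/default-tenant/wal/00000662"        // source segment (missing)
	dst := "/var/thanos/receive/default-tenant/wal/00000662.repair" // repair target
	if err := os.Rename(src, dst); err != nil {
		fmt.Println(err) // rename ...: no such file or directory
	}
}
```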

Full logs to relevant components:

Logs

```
ts=2024-01-05T12:42:35.68938568Z caller=head.go:789 level=info component=receive component=multi-tsdb tenant=default-tenant msg="WBL segment loaded" segment=641 maxSegment=799
ts=2024-01-05T12:42:35.689501023Z caller=head.go:789 level=info component=receive component=multi-tsdb tenant=default-tenant msg="WBL segment loaded" segment=642 maxSegment=799
ts=2024-01-05T12:42:35.689646617Z caller=head.go:789 level=info component=receive component=multi-tsdb tenant=default-tenant msg="WBL segment loaded" segment=643 maxSegment=799
ts=2024-01-05T12:42:35.689769639Z caller=head.go:789 level=info component=receive component=multi-tsdb tenant=default-tenant msg="WBL segment loaded" segment=644 maxSegment=799
ts=2024-01-05T12:42:35.809082104Z caller=head_wal.go:829 level=warn component=receive component=multi-tsdb tenant=default-tenant msg="Unknown series references for ooo WAL replay" samples=12 mmap_markers=3
ts=2024-01-05T12:42:35.809122025Z caller=head.go:789 level=info component=receive component=multi-tsdb tenant=default-tenant msg="WBL segment loaded" segment=645 maxSegment=799
ts=2024-01-05T12:42:35.810426955Z caller=head.go:789 level=info component=receive component=multi-tsdb tenant=default-tenant msg="WBL segment loaded" segment=646 maxSegment=799
ts=2024-01-05T12:42:35.81059298Z caller=head.go:789 level=info component=receive component=multi-tsdb tenant=default-tenant msg="WBL segment loaded" segment=647 maxSegment=799
ts=2024-01-05T12:42:35.810711823Z caller=head.go:789 level=info component=receive component=multi-tsdb tenant=default-tenant msg="WBL segment loaded" segment=648 maxSegment=799
ts=2024-01-05T12:42:35.81632747Z caller=head.go:789 level=info component=receive component=multi-tsdb tenant=default-tenant msg="WBL segment loaded" segment=649 maxSegment=799
ts=2024-01-05T12:42:35.823118456Z caller=head.go:789 level=info component=receive component=multi-tsdb tenant=default-tenant msg="WBL segment loaded" segment=650 maxSegment=799
ts=2024-01-05T12:42:35.973908239Z caller=head.go:789 level=info component=receive component=multi-tsdb tenant=default-tenant msg="WBL segment loaded" segment=651 maxSegment=799
ts=2024-01-05T12:42:35.974684818Z caller=head.go:789 level=info component=receive component=multi-tsdb tenant=default-tenant msg="WBL segment loaded" segment=652 maxSegment=799
ts=2024-01-05T12:42:35.974944004Z caller=head.go:789 level=info component=receive component=multi-tsdb tenant=default-tenant msg="WBL segment loaded" segment=653 maxSegment=799
ts=2024-01-05T12:42:35.975154149Z caller=head.go:789 level=info component=receive component=multi-tsdb tenant=default-tenant msg="WBL segment loaded" segment=654 maxSegment=799
ts=2024-01-05T12:42:35.975317153Z caller=head.go:789 level=info component=receive component=multi-tsdb tenant=default-tenant msg="WBL segment loaded" segment=655 maxSegment=799
ts=2024-01-05T12:42:35.975485987Z caller=head.go:789 level=info component=receive component=multi-tsdb tenant=default-tenant msg="WBL segment loaded" segment=656 maxSegment=799
ts=2024-01-05T12:42:36.084448518Z caller=head.go:789 level=info component=receive component=multi-tsdb tenant=default-tenant msg="WBL segment loaded" segment=657 maxSegment=799
ts=2024-01-05T12:42:36.859455564Z caller=head_wal.go:829 level=warn component=receive component=multi-tsdb tenant=default-tenant msg="Unknown series references for ooo WAL replay" samples=28 mmap_markers=6
ts=2024-01-05T12:42:36.859503566Z caller=head.go:789 level=info component=receive component=multi-tsdb tenant=default-tenant msg="WBL segment loaded" segment=658 maxSegment=799
ts=2024-01-05T12:42:37.003307768Z caller=head_wal.go:829 level=warn component=receive component=multi-tsdb tenant=default-tenant msg="Unknown series references for ooo WAL replay" samples=4 mmap_markers=1
ts=2024-01-05T12:42:37.003350819Z caller=head.go:789 level=info component=receive component=multi-tsdb tenant=default-tenant msg="WBL segment loaded" segment=659 maxSegment=799
ts=2024-01-05T12:42:37.003533893Z caller=head.go:789 level=info component=receive component=multi-tsdb tenant=default-tenant msg="WBL segment loaded" segment=660 maxSegment=799
ts=2024-01-05T12:42:49.169614602Z caller=head_wal.go:829 level=warn component=receive component=multi-tsdb tenant=default-tenant msg="Unknown series references for ooo WAL replay" samples=138 mmap_markers=69
ts=2024-01-05T12:42:49.169673225Z caller=head.go:789 level=info component=receive component=multi-tsdb tenant=default-tenant msg="WBL segment loaded" segment=661 maxSegment=799
ts=2024-01-05T12:42:55.948206844Z caller=db.go:898 level=warn component=receive component=multi-tsdb tenant=default-tenant msg="Encountered WAL read error, attempting repair" err="read records: corruption in segment /var/thanos/receive/default-tenant/wbl/00000662 at 1277952: last record is torn"
ts=2024-01-05T12:42:55.948271905Z caller=wlog.go:426 level=warn component=receive component=multi-tsdb tenant=default-tenant msg="Starting corruption repair" segment=662 offset=1277952
ts=2024-01-05T12:42:55.94847826Z caller=wlog.go:434 level=warn component=receive component=multi-tsdb tenant=default-tenant msg="Deleting all segments newer than corrupted segment" segment=662
ts=2024-01-05T12:42:55.948512231Z caller=wlog.go:456 level=warn component=receive component=multi-tsdb tenant=default-tenant msg="Rewrite corrupted segment" segment=662
ts=2024-01-05T12:42:56.092308843Z caller=head_wal.go:1250 level=info component=receive component=multi-tsdb tenant=default-tenant msg="creating chunk snapshot"
ts=2024-01-05T12:42:56.87307524Z caller=head_wal.go:1255 level=info component=receive component=multi-tsdb tenant=default-tenant msg="chunk snapshot complete" duration=780.709436ms num_series=308093 dir=/var/thanos/receive/default-tenant/chunk_snapshot.000135.0000000000
ts=2024-01-05T12:42:56.873142442Z caller=wlog.go:872 level=error component=receive component=multi-tsdb tenant=default-tenant msg="sync previous segment" err="sync /var/thanos/receive/default-tenant/wal/00000135: file already closed"
ts=2024-01-05T12:42:56.873153142Z caller=wlog.go:875 level=error component=receive component=multi-tsdb tenant=default-tenant msg="close previous segment" err="close /var/thanos/receive/default-tenant/wal/00000135: file already closed"
ts=2024-01-05T12:42:56.873183963Z caller=head_wal.go:1250 level=info component=receive component=multi-tsdb tenant=default-tenant msg="creating chunk snapshot"
ts=2024-01-05T12:42:56.873344116Z caller=head_wal.go:1255 level=info component=receive component=multi-tsdb tenant=default-tenant msg="chunk snapshot complete" duration=145.753µs num_series=0 dir=
ts=2024-01-05T12:42:56.873381937Z caller=receive.go:597 level=info component=receive msg="shutting down storage"
ts=2024-01-05T12:42:56.873394568Z caller=receive.go:601 level=info component=receive msg="storage is flushed successfully"
ts=2024-01-05T12:42:56.873401948Z caller=receive.go:607 level=info component=receive msg="storage is closed"
ts=2024-01-05T12:42:56.873430289Z caller=http.go:91 level=info component=receive service=http/server component=receive msg="internal server is shutting down" err="opening storage: repair corrupted WAL: rename /var/thanos/receive/default-tenant/wal/00000662 /var/thanos/receive/default-tenant/wal/00000662.repair: no such file or directory"
ts=2024-01-05T12:42:56.873589452Z caller=http.go:110 level=info component=receive service=http/server component=receive msg="internal server is shutdown gracefully" err="opening storage: repair corrupted WAL: rename /var/thanos/receive/default-tenant/wal/00000662 /var/thanos/receive/default-tenant/wal/00000662.repair: no such file or directory"
ts=2024-01-05T12:42:56.873610763Z caller=intrumentation.go:81 level=info component=receive msg="changing probe status" status=not-healthy reason="opening storage: repair corrupted WAL: rename /var/thanos/receive/default-tenant/wal/00000662 /var/thanos/receive/default-tenant/wal/00000662.repair: no such file or directory"
ts=2024-01-05T12:42:56.873669214Z caller=grpc.go:138 level=info component=receive service=gRPC/server component=receive msg="internal server is shutting down" err="opening storage: repair corrupted WAL: rename /var/thanos/receive/default-tenant/wal/00000662 /var/thanos/receive/default-tenant/wal/00000662.repair: no such file or directory"
ts=2024-01-05T12:42:56.873690625Z caller=grpc.go:151 level=info component=receive service=gRPC/server component=receive msg="gracefully stopping internal server"
ts=2024-01-05T12:42:56.873747437Z caller=grpc.go:164 level=info component=receive service=gRPC/server component=receive msg="internal server is shutdown gracefully" err="opening storage: repair corrupted WAL: rename /var/thanos/receive/default-tenant/wal/00000662 /var/thanos/receive/default-tenant/wal/00000662.repair: no such file or directory"
ts=2024-01-05T12:42:56.873765586Z caller=intrumentation.go:81 level=info component=receive msg="changing probe status" status=not-healthy reason="opening storage: repair corrupted WAL: rename /var/thanos/receive/default-tenant/wal/00000662 /var/thanos/receive/default-tenant/wal/00000662.repair: no such file or directory"
ts=2024-01-05T12:42:56.87385468Z caller=main.go:161 level=error err="repair corrupted WAL: rename /var/thanos/receive/default-tenant/wal/00000662 /var/thanos/receive/default-tenant/wal/00000662.repair: no such file or directory\nopening storage\nmain.startTSDBAndUpload.func1\n\t/app/cmd/thanos/receive.go:639\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598\nreceive command failed\nmain.main\n\t/app/cmd/thanos/main.go:161\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598"
```

GiedriusS commented 11 months ago

What files are in the wal directory? It sounds like someone deleted some files from it.
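
If they are still on the volume, something like this would list them (pod and namespace names are placeholders), run from any container that mounts the receiver's data volume:

```
kubectl exec -n monitoring thanos-receive-0 -- \
  ls -la /var/thanos/receive/default-tenant/wal /var/thanos/receive/default-tenant/wbl
```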

defreng commented 11 months ago

Unfortunately we don't have the files anymore...

But no files were touched by anything other than the thanos binary. Looking at the full log, it seems there were already some other errors involving this file earlier. Maybe thanos deleted it itself?