thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

sidecar: Do not crash when Object Storage is not accessible #7585

Open ahurtaud opened 1 month ago

ahurtaud commented 1 month ago

Is your proposal related to a problem?

This is also related to the objstore project.

We had a network outage (a DNS failure) that made our storage endpoint unreachable. When the sidecar restarted, it went into a crash loop with:

ts=2024-08-02T08:22:02.642324362Z caller=main.go:145 level=error err="
Get \"https://<redacted>.privatelink.blob.core.windows.net/<container>?restype=container\": dial tcp: lookup <redacted>.privatelink.blob.core.windows.net on xx.xx.xx.xx:53: no such host\ncreate AZURE client\ngithub.com/thanos-io/objstore/client.NewBucket
    /go/pkg/mod/github.com/thanos-io/objstore@v0.0.0-20240309075357-e8336a5fd5f3/client/factory.go:90\nmain.runSidecar
    /app/cmd/thanos/sidecar.go:327\nmain.registerSidecar.func1
    /app/cmd/thanos/sidecar.go:104\nmain.main
    /app/cmd/thanos/main.go:143\nruntime.main
    /usr/local/go/src/runtime/proc.go:267\nruntime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1650\npreparing sidecar command failed\nmain.main
    /app/cmd/thanos/main.go:145\nruntime.main
    /usr/local/go/src/runtime/proc.go:267\nruntime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1650"

Since we use object storage for long-term metrics only, we would like the sidecar to keep serving the Prometheus read path instead of crashing.

Describe the solution you'd like

Could this error become a warning? We would then alert on a failure metric (or similar) instead of relying on the crash.
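To make the request concrete, here is a minimal sketch of what "warning plus metric" could look like. It is not the actual sidecar code: `Bucket`, `initBucket`, `failingFactory` and the `bucketInitFailed` value are hypothetical names used only for illustration, and a plain atomic integer stands in for a real Prometheus gauge.

```go
package main

import (
	"errors"
	"log"
	"sync/atomic"
)

// Bucket is a hypothetical stand-in for the object storage client.
type Bucket interface{ Name() string }

// bucketInitFailed is a hypothetical gauge-like value:
// 1 when the bucket client could not be created, 0 otherwise.
var bucketInitFailed atomic.Int64

// failingFactory simulates an unreachable storage endpoint.
func failingFactory() (Bucket, error) {
	return nil, errors.New("dial tcp: lookup failed: no such host")
}

// initBucket tries to create the bucket client. On failure it logs a
// warning and records the failure instead of returning a fatal error.
// A nil result means uploads are disabled.
func initBucket(newBucket func() (Bucket, error)) Bucket {
	bkt, err := newBucket()
	if err != nil {
		bucketInitFailed.Store(1)
		log.Printf("level=warn msg=\"object storage unavailable, uploads disabled\" err=%q", err)
		return nil
	}
	bucketInitFailed.Store(0)
	return bkt
}

func main() {
	if bkt := initBucket(failingFactory); bkt == nil {
		// The read path would keep being served here (assumption).
		log.Printf("serving read path only, bucket_init_failed=%d", bucketInitFailed.Load())
	}
}
```

An alert could then fire on the failure value staying at 1, rather than on pod restarts.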

Additional context

Thanos v0.35.0, objstore provider: Azure

yeya24 commented 1 month ago

I think it is a valid issue. Help wanted.

amaury-d commented 2 days ago

After a discussion with @MichaHoffmann, we came to realise that sidecar crashing can be useful for some users that rely on it to "detect" when something is wrong (like an uninitialised S3 bucket).

While it was suggested to add a metric to alert on the situation, such situations could still go unnoticed.

I suggest letting the sidecar crash by default and adding an option that allows it to keep serving the Prometheus read path even if the objstore is not working.
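A rough sketch of the opt-in behaviour, to illustrate the idea rather than propose an implementation. `setupBucket`, `unreachableFactory` and the `failOnError` parameter are hypothetical; no such option exists in the sidecar today, and the real bucket client comes from objstore rather than this stand-in interface.

```go
package main

import (
	"errors"
	"fmt"
)

// Bucket is a hypothetical stand-in for the object storage client.
type Bucket interface{ Name() string }

// unreachableFactory simulates an object storage endpoint that cannot be resolved.
func unreachableFactory() (Bucket, error) {
	return nil, errors.New("no such host")
}

// setupBucket creates the bucket client.
// failOnError is a hypothetical option: true keeps the current behaviour
// (the error is returned and the sidecar exits); false logs a warning and
// returns a nil bucket so the caller can skip uploads and keep serving reads.
func setupBucket(newBucket func() (Bucket, error), failOnError bool) (Bucket, error) {
	bkt, err := newBucket()
	if err == nil {
		return bkt, nil
	}
	if failOnError {
		return nil, fmt.Errorf("create bucket client: %w", err)
	}
	fmt.Printf("level=warn msg=\"objstore not available, continuing without uploads\" err=%q\n", err)
	return nil, nil
}

func main() {
	// Default: the error is propagated, matching today's crash.
	if _, err := setupBucket(unreachableFactory, true); err != nil {
		fmt.Println("would exit:", err)
	}
	// Opt-in: no error, nil bucket, read path keeps running.
	bkt, err := setupBucket(unreachableFactory, false)
	fmt.Println("uploads enabled:", bkt != nil, "fatal:", err != nil)
}
```

Keeping the failing behaviour as the default preserves the "detect a misconfigured bucket" use case, while users who only need the read path can opt out.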