thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Thanos Receiver: Error logs "try lock file: open <>.tmp-for-creation/index_tmp_po: no such file or directory" #7746

Open SM-9870 opened 5 days ago

SM-9870 commented 5 days ago

Thanos, Prometheus and Golang version used: Thanos 0.35.1

Object Storage Provider: IBM S3

What happened: I see errors in the Thanos Receiver when its TSDB directory is on a PV/PVC backed by mounted S3 storage. I do not see these errors when I use local node storage. In my setup the same S3 storage is shared by all the pods (Compactor, Receiver, Store Gateway, etc.), but each service has its own folder inside the mounted S3 storage.

In addition to this error, Thanos is not able to upload the metrics to S3 for long-term storage.

ts=2024-09-12T01:43:13.164952074Z caller=db.go:1014 level=error component=receive component=multi-tsdb tenant=default-tenant msg="compaction failed" err="try lock file: open /data/receiver-0/default-tenant/01J7HYBVS79NPA7JJ6292NXCEJ.tmp-for-creation/index_tmp_po: no such file or directory"
ts=2024-09-12T01:44:23.166065172Z caller=db.go:1014 level=error component=receive component=multi-tsdb tenant=default-tenant msg="compaction failed" err="try lock file: open /data/receiver-0/default-tenant/01J7HYE0KDYBKFX92ZACZ2FZ01.tmp-for-creation/index_tmp_po: no such file or directory"

A similar error is also logged in the Store Gateway:

  ts=2024-09-12T01:37:49.186686337Z caller=binary_reader.go:569 level=debug msg="failed to read index-header from disk; recreating" path=/data/storegw/01J7HXTB9PM6MJPYHCJ4AC3DY1/index-header err="try lock file: open /data/storegw/01J7HXTB9PM6MJPYHCJ4AC3DY1/index-header: no such file or directory"

What you expected to happen: No errors in the receiver, and metrics are uploaded to S3.

How to reproduce it (as minimally and precisely as possible): Attach/mount S3 storage to an IKS pod as the Thanos Receiver's TSDB directory.

Full logs to relevant components:

Anything else we need to know:

dosubot[bot] commented 5 days ago

The error "try lock file: open <>.tmp-for-creation/index_tmp_po: no such file or directory" in Thanos Receiver when using IBM S3 storage for mounting PV/PVC is likely due to the eventual consistency model of IBM S3 storage. Thanos operates on object storage APIs, which can be eventually consistent, unlike local filesystems that are strongly consistent. This can lead to intermittent failures in operations like locking files or accessing newly created files.

To mitigate this, you might need to adjust the syncDelay configuration or implement additional retries and checks to handle the eventual consistency of IBM S3 storage [1].

Additionally, there are similar discussions that might provide further insights:

  1. Thanos Receiver: Error logs "try lock file: open <>.tmp-for-creation/index_tmp_po: no such file or directory" [2].
  2. thanos.shipper.json.tmp: read-only file system [3].


SM-9870 commented 2 days ago

It seems like --sync.delay=10m is not valid for Thanos Receiver. It's valid for Compactor.

GiedriusS commented 21 hours ago

Does your filesystem support hardlinks? Or perhaps the filesystem reports success but no hardlinks are actually created?
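
Both cases in the question above (the link call failing outright, and the call reporting success without a real hard link being created) can be checked with a small probe run against the mounted directory. The following is a minimal sketch, not part of Thanos: the function name `probe_hardlinks` and the result strings are hypothetical, and the inode comparison assumes the mount reports stable inode numbers, which some FUSE-based S3 mounts may not do.

```python
import os
import sys
import tempfile


def probe_hardlinks(directory: str) -> str:
    """Hypothetical helper: classify hard link behaviour in `directory`.

    Returns "ok", "unsupported", or "silent-failure".
    """
    # Create a scratch file inside the directory under test.
    src = tempfile.NamedTemporaryFile(dir=directory, prefix="hl-probe-", delete=False)
    src.close()
    dst = src.name + ".link"
    try:
        try:
            os.link(src.name, dst)
        except OSError:
            # The filesystem rejected the call, e.g. "Operation not supported".
            return "unsupported"
        try:
            a, b = os.stat(src.name), os.stat(dst)
        except FileNotFoundError:
            # link() reported success but the new name does not exist.
            return "silent-failure"
        # A real hard link shares device and inode, and the link count is at least 2.
        if (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino) and a.st_nlink >= 2:
            return "ok"
        return "silent-failure"
    finally:
        # Clean up both names; either may be missing depending on the outcome.
        for path in (src.name, dst):
            try:
                os.remove(path)
            except FileNotFoundError:
                pass


if __name__ == "__main__":
    # Usage: python probe.py /data/receiver-0
    print(probe_hardlinks(sys.argv[1] if len(sys.argv) > 1 else "."))
```

Running it once against the receiver's data directory (for example `/data/receiver-0` from the logs above) and once against a local directory would show whether the two mounts behave differently.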

SM-9870 commented 12 hours ago

It's S3 storage mounted as a filesystem. It seems hard links are not supported:

bash-4.4$ ln test1 test2
ln: failed to create hard link 'test2' => 'test1': Operation not supported

I am using S3 storage because I am running Thanos in a multi-zone cluster and S3 storage is available region-wide, whereas NFS/block storage is only supported per zone.

Do you have any suggestions for this case?