We see the same with our large bucket:
0.31.0: msg="successfully synchronized block metadata" duration=21.876800148s duration_ms=21876 cached=38595 returned=28518 partial=4
0.32.2: msg="successfully synchronized block metadata" duration=28m53.396172016s duration_ms=1733396 cached=38014 returned=28518 partial=4
We suspect the changes in #6474, as using a single-threaded Iter API call to list all 8.5e6 objects in the bucket is far slower than parallelizable Exists calls to check 38014 meta.json objects.
I wonder if we can Iter one level deep instead of over all chunks.
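For illustration, here is roughly the difference between the two strategies, sketched against the thanos-io/objstore Bucket interface (the concurrency limit and the exact shape of the recursive-iter option are assumptions for the sketch, not the actual fetcher code):

```go
package main

import (
	"context"
	"fmt"
	"path"

	"github.com/thanos-io/objstore"
	"golang.org/x/sync/errgroup"
)

// Pre-0.32 style, as described above: one shallow, delimiter-based listing of the
// top-level block directories, then cheap, parallel Exists checks on each block's
// meta.json. The listing cost scales with the number of blocks.
func shallowListAndCheck(ctx context.Context, bkt objstore.Bucket) error {
	g, gctx := errgroup.WithContext(ctx)
	g.SetLimit(32) // hypothetical concurrency limit, not the fetcher's real value
	err := bkt.Iter(ctx, "", func(dir string) error {
		g.Go(func() error {
			ok, err := bkt.Exists(gctx, path.Join(dir, "meta.json"))
			if err == nil && !ok {
				fmt.Println("no meta.json (partial block?):", dir)
			}
			return err
		})
		return nil
	})
	if err != nil {
		return err
	}
	return g.Wait()
}

// 0.32 style (#6474): a single recursive Iter pages through every object key in the
// bucket, so its cost scales with the total object count (8.5e6 here), not with the
// number of blocks.
func countAllObjects(ctx context.Context, bkt objstore.Bucket) (int, error) {
	n := 0
	err := bkt.Iter(ctx, "", func(string) error {
		n++
		return nil
	}, objstore.WithRecursiveIter) // option shape may differ between objstore versions
	return n, err
}
```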
@gburek-fastly Which object store provider do you use?
The reason for doing a recursive iter was to reduce costs, and I am also curious if you see a reduction in daily spend for API operations.
We use GCS as an object store.
We have not seen a reduction in daily spend, though, as we have not deployed 0.32.x to production. This regression was discovered in a canary and it wouldn't be great to have long term searches be unavailable for 20m or more as store-gateways recover from OOMs or other failures.
TBF we previously controlled costs a bit by raising --sync-block-duration to 30m. But in this case, the API calls are a cost of doing business while reducing time to restore from failures.
The reason for doing a recursive iter was to reduce costs
There is no cost reduction in v0.32 for GCS: from the bucket's perspective the request rate drops 10%, but bandwidth increases 10x. Also, the compactor cannot process its queue anymore, as it becomes 3x or more slower.
It seems to be a problem not with Iter itself, but with the GCS implementation of it. In the CPU profile I've attached, a lot of time is wasted in Error handling and JSON decoding. Seems like a bug.
I will try out a version with https://github.com/thanos-io/objstore/pull/76 to see if it helps.
I've tried image thanosio/thanos:main-2023-09-14-8bd6264; it is a little bit faster:
ts=2023-09-16T06:53:51.440284946Z caller=store.go:423 level=info msg="initializing bucket store"
ts=2023-09-16T07:04:11.939958956Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=10m20.499150242s duration_ms=620499 cached=9748 returned=5968 partial=1780
ts=2023-09-16T07:06:02.051450162Z caller=store.go:440 level=info msg="bucket store ready" init_duration=12m10.610648657s
ts=2023-09-16T07:06:02.051708319Z caller=intrumentation.go:56 level=info msg="changing probe status" status=ready
but unfortunately it continues to consume CPU/network. Here are three lines for thanos-store v0.31.0 and an orange line for main-2023-09-14-8bd6264:
I'm attaching a new 60s CPU profile, taken at some point after status=ready:
profile.zip
Now the time is spent in Bucket Iter:
I am not sure why the iteration takes so much time; here is what we have in our environment:
ts=2023-09-20T10:22:01.063344663Z caller=fetcher.go:491 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=7m24.882856514s duration_ms=444882 cached=155134 returned=516 partial=3
The store has gone through 155K blocks in object storage in 7m, and from what I see you only have ~5k. I noticed that you have many partial blocks, are those from failed compactions?
The reason why memory and network utilization are constantly high is that the default sync interval is 3m, and it takes longer than that to do the actual sync, so the store is constantly running that loop. You can also bump this value to something like 15m to avoid this behavior.
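For example (this is the store flag already mentioned earlier in the thread; the objstore config path is just a placeholder):
thanos store --objstore.config-file=/etc/thanos/objstore.yml --sync-block-duration=15m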
Taking a deeper look at the profile, it actually shows that updating cuckoo filters is the most intensive operation, and that has been removed in the latest patch release. The list is definitely more expensive now, but maybe you can give the latest release another shot, and we'll see whether we need to do more investigation or potentially roll back.
With 0.32.3 we are seeing pretty close perf to 0.32.0:
ts=2023-09-20T21:34:41.564220246Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=24m48.323615601s duration_ms=1488323 cached=38105 returned=28929 partial=110
24m48 for 38105 blocks
What seems abnormal in v0.32 is the constant incoming network traffic for thanos-store: it is 2 Mbit/s after it becomes ready on v0.32.3. I can't see what has changed from the bucket metrics. There is no way to check requests via tcpdump, as the traffic is TLS. I tried strace, but haven't found any new connections being established:
# strace -f -e trace=network -p 3688968
strace: Process 3688968 attached with 44 threads
[pid 3689008] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=65534} ---
[pid 3689011] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=65534} ---
[pid 3689010] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=65534} ---
[pid 3689011] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=65534} ---
[pid 3689008] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=65534} ---
...
Also tried sysdig, but can't dump the request body there:
sysdig -A proc.pid=3688968
389105 21:19:31.208055300 1 thanos (2293198) > read fd=8(<4t>172.17.0.4:38764->142.250.181.251:443) size=40960
389120 21:19:31.208074140 1 thanos (2293198) > read fd=8(<4t>172.17.0.4:38764->142.250.181.251:443) size=40960
389262 21:19:31.208243741 1 thanos (2293198) > read fd=8(<4t>172.17.0.4:38764->142.250.181.251:443) size=40960
389373 21:19:31.208366342 1 thanos (2293198) > read fd=8(<4t>172.17.0.4:38764->142.250.181.251:443) size=40960
389383 21:19:31.208376262 1 thanos (2293198) > write fd=8(<4t>172.17.0.4:38764->142.250.181.251:443) size=39
...
--log.level=debug also does not show anything interesting. Maybe it is possible to add some GCS request logging there?
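In the meantime, one way to see what the store is actually requesting, absent such a flag, would be to wrap the GCS client's HTTP transport in a logging RoundTripper in a local debug build. A minimal sketch (the bucket name is a placeholder; note that option.WithHTTPClient bypasses the default authenticated transport, so auth has to be layered in separately):

```go
package main

import (
	"context"
	"log"
	"net/http"

	"cloud.google.com/go/storage"
	"google.golang.org/api/option"
)

// loggingTransport prints every outgoing GCS API request before delegating
// to the underlying transport.
type loggingTransport struct{ next http.RoundTripper }

func (t loggingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	log.Printf("GCS request: %s %s", req.Method, req.URL.String())
	return t.next.RoundTrip(req)
}

func main() {
	ctx := context.Background()
	hc := &http.Client{Transport: loggingTransport{next: http.DefaultTransport}}
	client, err := storage.NewClient(ctx, option.WithHTTPClient(hc))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// List a handful of objects so the log shows the paging requests being made.
	it := client.Bucket("<bucket>").Objects(ctx, nil)
	for i := 0; i < 5; i++ {
		if _, err := it.Next(); err != nil {
			break
		}
	}
}
```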
We attempted to upgrade our compact nodes to 0.32.3 and let them run for about 24h. Compaction basically stopped and a backlog of raw uploaded blocks built up.
compact cli args: thanos compact --http-address=127.0.0.1:9090 --log.format=logfmt --wait --wait-interval=1h --data-dir=/data --objstore.config-file=/etc/thanos/objstore_thanos.yml --block-files-concurrency=12 --compact.blocks-fetch-concurrency=12 --block-viewer.global.sync-block-interval=30m --selector.relabel-config-file=/etc/thanos/relabel_config.yml --compact.concurrency=8 --downsample.concurrency=8 --deduplication.replica-label=receive_replica --consistency-delay=12h --delete-delay=12h --web.external-prefix=bucket-web --bucket-web-label=site --retention.resolution-raw=400d --retention.resolution-5m=400d --retention.resolution-1h=400d --compact.cleanup-interval=12h --no-debug.halt-on-error
For us, with our block and data load, Compact and Store are unusable without a CLI arg to enable the Exists strategy, because listing all objects in the bucket takes 20 to 45 minutes on GCS.
I am happy to run debug builds, add logging, generate trace spans or whatever, to show this.
@fpetkovski you mentioned above:
The store has gone through 155K blocks in object storage in 7m
How many objects are in this bucket that make up 155k blocks? What is the backing object store?
Here is an easier way to reproduce the issue:
$ time docker run -it --rm -v sa.json:/sa.json -e GOOGLE_APPLICATION_CREDENTIALS=/sa.json thanosio/thanos:v0.31.0 tools bucket ls --objstore.config='{type: GCS, config: {bucket: <bucket>}}'
level=info ts=2023-11-13T08:16:17.77034421Z caller=factory.go:52 msg="loading bucket configuration"
level=info ts=2023-11-13T08:17:34.399242748Z caller=fetcher.go:478 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=1m16.718396163s duration_ms=76718 cached=9978 returned=9978 partial=2382
...
level=info ts=2023-11-13T08:17:34.455320602Z caller=tools_bucket.go:463 msg="ls done" objects=9978
level=info ts=2023-11-13T08:17:34.455737735Z caller=main.go:164 msg=exiting
real 1m14.275s
user 0m0.093s
sys 0m0.137s
And the new version never finishes (same 5m timeout each time):
$ time docker run -it --rm -v sa.json:/sa.json -e GOOGLE_APPLICATION_CREDENTIALS=/sa.json thanosio/thanos:v0.32.5 tools bucket ls --objstore.config='{type: GCS, config: {bucket: <bucket>}}'
ts=2023-11-13T08:20:21.77540959Z caller=factory.go:53 level=info msg="loading bucket configuration"
ts=2023-11-13T08:25:21.565157156Z caller=main.go:135 level=error err="context deadline exceeded\nBaseFetcher: iter bucket\ngithub.com/thanos-io/thanos/pkg/block.(*BaseFetcher).fetchMetadata\n\t/app/pkg/block/fetcher.go:383\ngithub.com/thanos-io/thanos/pkg/block.(*BaseFetcher).fetch.func2\n\t/app/pkg/block/fetcher.go:456\ngithub.com/golang/groupcache/singleflight.(*Group).Do\n\t/go/pkg/mod/github.com/golang/groupcache@v0.0.0-20210331224755-41bb18bfe9da/singleflight/singleflight.go:56\ngithub.com/thanos-io/thanos/pkg/block.(*BaseFetcher).fetch\n\t/app/pkg/block/fetcher.go:454\ngithub.com/thanos-io/thanos/pkg/block.(*MetaFetcher).Fetch\n\t/app/pkg/block/fetcher.go:514\nmain.registerBucketLs.func1\n\t/app/cmd/thanos/tools_bucket.go:465\nmain.main\n\t/app/cmd/thanos/main.go:133\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:267\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650\npreparing tools bucket ls command failed\nmain.main\n\t/app/cmd/thanos/main.go:135\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:267\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650"
real 5m0.388s
user 0m0.143s
sys 0m0.094s
I wonder how we could reproduce this locally ourselves :thinking:
I think this is a result of the fact that recursive iter is simply slower, and I don't know if we have a good way to fix it. I am fine with reverting the change, but if others have seen benefits, we could maybe make it optional?
By adding a few fmt.Println calls I was able to find out the reason. Our bucket had 2515032 objects for 2020-2021 like /debug/metas/%ULID%.json. Looks like remains of: https://github.com/thanos-io/thanos/issues/3839
After deleting these, bucket ls for v0.32.5 works at the same speed as v0.31, and store timings are back in order:
ts=2023-11-14T09:46:24.344376629Z caller=store.go:435 level=info msg="bucket store ready" init_duration=53.578281039s
ts=2023-11-14T09:46:24.344762123Z caller=intrumentation.go:56 level=info msg="changing probe status" status=ready
ts=2023-11-14T09:46:24.34489548Z caller=grpc.go:131 level=info service=gRPC/server component=store msg="listening for serving gRPC" address=0.0.0.0:10901
ts=2023-11-14T09:46:25.174331359Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=829.850131ms duration_ms=829 cached=10281 returned=6976 partial=2495
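For anyone else hunting for this kind of leftover data, a minimal sketch of the fmt.Println-style check described above, using the GCS Go client directly (the bucket name is a placeholder; credentials come from GOOGLE_APPLICATION_CREDENTIALS):

```go
package main

import (
	"context"
	"fmt"
	"log"
	"strings"

	"cloud.google.com/go/storage"
	"google.golang.org/api/iterator"
)

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Tally objects per top-level prefix; unexpected leftovers such as debug/
	// stand out by their object count compared to regular block directories.
	counts := map[string]int{}
	it := client.Bucket("<bucket>").Objects(ctx, nil) // nil query = flat listing of all objects
	for {
		attrs, err := it.Next()
		if err == iterator.Done {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		prefix := attrs.Name
		if i := strings.Index(attrs.Name, "/"); i >= 0 {
			prefix = attrs.Name[:i+1]
		}
		counts[prefix]++
	}
	for p, n := range counts {
		fmt.Println(p, n)
	}
}
```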
Thanks for reporting back @sepich, I am glad this was resolved for you.
How many objects are in this bucket that make up 155k blocks? What is the backing object store?
@gburek-fastly sorry, just seeing this. We use GCS, but I am not sure what the total object count is. Does the solution from @sepich resolve your problem?
We hit the same issue, I think, and cleaning up objects (especially versioned objects) in the S3 bucket helped in our case. I think we can either revert the change or merge https://github.com/thanos-io/thanos/pull/6902, which provides a way to customize how blocks should be fetched; then we can keep both recursive iteration and the old method using Exists.
After #6902 is merged, I wonder how we should proceed with improving this. Should we add the previous block fetch method back and make the block fetch method configurable? I wonder if we really need to expose such low-level details.
We reduced our daily object storage bill by thousands of dollars per day by using a recursive iterator, so having no escape hatch will put us in a very bad spot. I would say the root cause is the large amount of partial blocks in object storage, or data unrelated to Thanos. For the versioned objects use case, we should probably change the library to only request the latest version. I am also fine with having both methods as options since there are obvious trade-offs.
We still haven't heard from @gburek-fastly whether his issue falls into one of these two categories.
Thank you for the tip wrt the debug meta.jsons. There were a lot of these. We have removed 4e6 debug objects, which has improved our bucket sync times from 24m to 12m.
0.32.3: msg="successfully synchronized block metadata" duration=12m10.960457779s duration_ms=730960 cached=38176 returned=29776 partial=6
This is still, unfortunately, several orders of magnitude worse than 0.31.0, which is at 2s: msg="successfully synchronized block metadata" duration=2.431460715s duration_ms=2431 cached=38158 returned=29776 partial=8
Our 38K blocks, without the debug metas, comprise 4.98e6 objects, and we see a linear relationship between startup times and total bucket object count. As far as I can tell, we see around 10-75 no-meta-json blocks in thanos_blocks_meta_synced and don't have other extraneous objects in the bucket.
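For reference, that per-state breakdown can be read with a query along the lines of sum by (state) (thanos_blocks_meta_synced), assuming the usual state label on that metric.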
For context, we use GCS and have versioned objects.
I think making the fetch method configurable is probably the way to go, as we effectively have two extremes of operation:
The efficiency of using Iter may be primarily dictated by the density of meta.json objects in the bucket. It will of course be expensive to continually HEAD hundreds of thousands of meta.json objects, but in our case, where meta.json objects are barely 1% of the objects in the bucket, the time to list all objects is prohibitive compared with the cost of HEADing a few objects.
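A rough back-of-envelope fits this: at roughly 1,000 keys per list page, recursively listing 4.98e6 objects takes on the order of 5,000 sequential list requests, which at an assumed 100-150 ms per page is roughly 8-12 minutes, about what is reported above; a delimiter-based listing of ~38K block prefixes is only about 40 pages, i.e. a couple of seconds, with HEAD/Exists checks needed only for blocks that are not already cached.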
I think this should be configurable, at least, to allow for our different data environments and give a choice between cost/start up times.
Hello, our Thanos setup is also affected by this issue. The object store is S3 in our case, not GCS.
Startup times of the store component increased from 2-3 seconds to ~40 minutes after the upgrade to v0.32.5. Deleting the /debug/metas/%ULID%.json objects got the startup time down to ~12 minutes, but that is still way higher than in v0.31.x.
Runtime of the compact component also increased from ~03h20m to ~05h15m.
We have not observed any significant increase or decrease in costs of our S3 bucket.
Being able to switch back to the old fetch logic would allow us to upgrade to v0.32.x and newer.
We are in the same situation as people above. Our storegateway startup time increased from a few minutes to more than 1h for large S3 buckets and thus we are stuck on 0.31.
What storage providers are you using? AWS S3 or some other implementation?
What storage providers are you using? AWS S3 or some other implementation?
AWS S3 for us
Is there any chance of getting traction on this to expose the old bucket sync method in some way?
I am already on it https://github.com/thanos-io/thanos/pull/7134
@sepich Can this be closed? #7134 has been merged.
Thanos, Prometheus and Golang version used: v0.32.2
Object Storage Provider: GCS
What happened: Tried to upgrade from v0.31.0 to v0.32.2 and seeing for both compactor and store:
- Ready time increased to 20min (from 1min for v0.31.0)
- constant incoming network traffic from the store container towards the bucket, even after sync of blocks is completed, it is Ready, and there are no query requests coming
Using images available on Docker Hub, the issue is narrowed down to:
- thanosio/thanos:main-2023-07-03-9710935 - ok
- thanosio/thanos:main-2023-07-03-b60452a - bad
So it is https://github.com/thanos-io/thanos/pull/6474/files to blame cc @fpetkovski
Also attaching a pprof CPU profile for 60s, when store becomes ready and there are no queries to it: Archive.zip
What you expected to happen: New version is better than previous ;)
How to reproduce it (as minimally and precisely as possible):
Full logs to relevant components: No errors in logs, just very slow timings