thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

store/compact: bucket fetch is broken in v0.32 vs v0.31 #6707

Closed: sepich closed this issue 5 months ago

sepich commented 1 year ago

Thanos, Prometheus and Golang version used: v0.32.2

Object Storage Provider: GCS

What happened: Tried to upgrade from v0.31.0 to v0.32.2 and saw a significant slowdown in both compactor and store.

Using images available on Docker Hub, the issue is narrowed down to:

- thanosio/thanos:main-2023-07-03-9710935 - ok
- thanosio/thanos:main-2023-07-03-b60452a - bad

So https://github.com/thanos-io/thanos/pull/6474/files is to blame. cc @fpetkovski

Also attaching a 60s pprof CPU profile, taken after the store becomes ready and while there are no queries to it: Archive.zip

What you expected to happen: New version is better than previous ;)

How to reproduce it (as minimally and precisely as possible):

          image: thanosio/thanos:main-2023-07-03-b60452a # main-2023-07-03-9710935 - works
          args:
            - store
            - |
              --objstore.config=type: GCS
              config:
                bucket: <bucketname>

Full logs to relevant components: No errors in the logs, just very slow timings

gburek-fastly commented 1 year ago

We see the same with our large bucket: 0.31.0: msg="successfully synchronized block metadata" duration=21.876800148s duration_ms=21876 cached=38595 returned=28518 partial=4

0.32.2: msg="successfully synchronized block metadata" duration=28m53.396172016s duration_ms=1733396 cached=38014 returned=28518 partial=4

We suspect the changes in #6474, as using a single-threaded Iter API call to list all 8.5e6 objects in the bucket is far slower than parallelizable Exists calls to check 38014 meta.json objects.
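To make the comparison concrete, here is a minimal, illustrative sketch of the two discovery strategies against the thanos-io/objstore Bucket interface. This is not the actual Thanos fetcher code: the function names are invented, and the objstore.WithRecursiveIter option name is assumed from the library version used around #6474.

```go
package blockdiscovery

import (
	"context"
	"path"
	"strings"
	"sync"

	"github.com/thanos-io/objstore"
	"golang.org/x/sync/errgroup"
)

// metasViaRecursiveIter mimics the v0.32-style approach: one sequential,
// recursive listing of every object in the bucket, keeping only meta.json
// paths. Few API calls, but wall time grows with the total object count
// (8.5e6 objects in the bucket described above).
func metasViaRecursiveIter(ctx context.Context, bkt objstore.Bucket) ([]string, error) {
	var metas []string
	err := bkt.Iter(ctx, "", func(name string) error {
		if path.Base(name) == "meta.json" {
			metas = append(metas, name)
		}
		return nil
	}, objstore.WithRecursiveIter)
	return metas, err
}

// metasViaExists mimics the v0.31-style approach: a shallow listing of
// top-level block directories followed by parallel Exists checks for
// <ULID>/meta.json. More API calls, but they parallelize well and are
// independent of how many chunk objects the bucket holds.
func metasViaExists(ctx context.Context, bkt objstore.Bucket, concurrency int) ([]string, error) {
	var dirs []string
	if err := bkt.Iter(ctx, "", func(name string) error {
		dirs = append(dirs, strings.TrimSuffix(name, "/"))
		return nil
	}); err != nil {
		return nil, err
	}

	var (
		mu    sync.Mutex
		metas []string
	)
	g, gctx := errgroup.WithContext(ctx)
	g.SetLimit(concurrency)
	for _, dir := range dirs {
		dir := dir
		g.Go(func() error {
			m := path.Join(dir, "meta.json")
			ok, err := bkt.Exists(gctx, m)
			if err != nil || !ok {
				return err
			}
			mu.Lock()
			metas = append(metas, m)
			mu.Unlock()
			return nil
		})
	}
	if err := g.Wait(); err != nil {
		return nil, err
	}
	return metas, nil
}
```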

mhoffm-aiven commented 1 year ago

I wonder if we can Iter one level deep instead of over all chunks.

fpetkovski commented 1 year ago

@gburek-fastly Which object store provider do you use?

The reason for doing a recursive iter was to reduce costs, and I am also curious if you see a reduction in daily spend for API operations.

gburek-fastly commented 1 year ago

We use GCS as an object store.

We have not seen a reduction in daily spend, though, as we have not deployed 0.32.x to production. This regression was discovered in a canary, and it would not be great to have long-term searches unavailable for 20m or more while store-gateways recover from OOMs or other failures.

TBF we previously controlled costs a bit by raising --sync-block-duration to 30m. But in this case, the API calls are a cost of doing business while reducing time to restore from failures.

sepich commented 1 year ago

The reason for doing a recursive iter was to reduce costs

There is no cost reduction in v0.32 for GCS: from the bucket's perspective the request rate drops by only ~10%, while bandwidth increases 10x. Also, the compactor can no longer keep up with its queue, as it becomes 3x slower or more. It does not seem to be a problem with Iter itself, but with the GCS implementation of it. In the CPU profile I attached, a lot of time is spent in error handling and JSON decoding. Seems like a bug.

fpetkovski commented 1 year ago

I will try out a version with https://github.com/thanos-io/objstore/pull/76 to see if it helps.

sepich commented 1 year ago

I've tried image thanosio/thanos:main-2023-09-14-8bd6264; it is a little bit faster:

ts=2023-09-16T06:53:51.440284946Z caller=store.go:423 level=info msg="initializing bucket store"
ts=2023-09-16T07:04:11.939958956Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=10m20.499150242s duration_ms=620499 cached=9748 returned=5968 partial=1780
ts=2023-09-16T07:06:02.051450162Z caller=store.go:440 level=info msg="bucket store ready" init_duration=12m10.610648657s
ts=2023-09-16T07:06:02.051708319Z caller=intrumentation.go:56 level=info msg="changing probe status" status=ready

but unfortunately it continues to consume CPU/network. Here are three lines for thanos-store v0.31.0 and the orange line for main-2023-09-14-8bd6264:

(screenshot: CPU/network usage graph comparing v0.31.0 with main-2023-09-14-8bd6264)

I'm attaching a new 60s CPU profile, taken at some point after status=ready: profile.zip

Now it is Bucket Iter:

(screenshot: CPU profile, now dominated by Bucket Iter)

fpetkovski commented 1 year ago

I am not sure why the iteration takes so much time, here is what we have in our environment:

ts=2023-09-20T10:22:01.063344663Z caller=fetcher.go:491 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=7m24.882856514s duration_ms=444882 cached=155134 returned=516 partial=3

The store went through 155K blocks in object storage in 7m, and from what I can see you only have ~5K. I also noticed that you have many partial blocks; are those from failed compactions?

The reason memory and network utilization are constantly high is that the default sync interval is 3m and the actual sync takes longer than that, so the store is constantly running that loop. You can bump this value to something like 15m to avoid this behavior.
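For example, using the container spec format from the reproduction above, bumping the interval might look like this (a sketch: --sync-block-duration is the store flag mentioned earlier in this thread, and 15m is just the value suggested here):

```yaml
args:
  - store
  - --sync-block-duration=15m   # default is 3m; raise it so a full sync finishes within one interval
  - |
    --objstore.config=type: GCS
    config:
      bucket: <bucketname>
```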

Taking a deeper look at the profile, it actually shows that updating cuckoo filters is the most intensive operation, and that has been removed in the latest patch release. The listing is definitely more expensive now, but maybe you can give the latest release another shot, and we'll see whether we need to investigate further or potentially roll back.

(screenshot: CPU profile showing cuckoo filter updates as the dominant cost)

gburek-fastly commented 1 year ago

With 0.32.3 we are seeing performance pretty close to 0.32.0:

ts=2023-09-20T21:34:41.564220246Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=24m48.323615601s duration_ms=1488323 cached=38105 returned=28929 partial=110

24m48s for 38105 blocks.

sepich commented 1 year ago

What seems abnormal in v0.32 is the constant incoming network traffic to thanos-store: it stays at 2 Mbit/s after v0.32.3 becomes ready. I can't see what changed from the bucket metrics. There is no way to inspect requests via tcpdump, as the traffic is TLS. I tried strace, but did not find any new connections being established:

# strace -f -e trace=network -p 3688968
strace: Process 3688968 attached with 44 threads
[pid 3689008] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=65534} ---
[pid 3689011] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=65534} ---
[pid 3689010] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=65534} ---
[pid 3689011] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=65534} ---
[pid 3689008] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=65534} ---
...

Also tried sysdig, but can't dump the request body there:

sysdig -A proc.pid=3688968
389105 21:19:31.208055300 1 thanos (2293198) > read fd=8(<4t>172.17.0.4:38764->142.250.181.251:443) size=40960
389120 21:19:31.208074140 1 thanos (2293198) > read fd=8(<4t>172.17.0.4:38764->142.250.181.251:443) size=40960
389262 21:19:31.208243741 1 thanos (2293198) > read fd=8(<4t>172.17.0.4:38764->142.250.181.251:443) size=40960
389373 21:19:31.208366342 1 thanos (2293198) > read fd=8(<4t>172.17.0.4:38764->142.250.181.251:443) size=40960
389383 21:19:31.208376262 1 thanos (2293198) > write fd=8(<4t>172.17.0.4:38764->142.250.181.251:443) size=39
...

--log.level=debug also does not show anything interesting. Maybe it is possible to add some GCS request logging there?

gburek-fastly commented 1 year ago

We attempted to upgrade our compact nodes to 0.32.3 and let them run for about 24h. Compaction basically stopped and a backlog of raw uploaded blocks built up.

compact cli args:

```
thanos compact \
  --http-address=127.0.0.1:9090 \
  --log.format=logfmt \
  --wait --wait-interval=1h \
  --data-dir=/data \
  --objstore.config-file=/etc/thanos/objstore_thanos.yml \
  --block-files-concurrency=12 \
  --compact.blocks-fetch-concurrency=12 \
  --block-viewer.global.sync-block-interval=30m \
  --selector.relabel-config-file=/etc/thanos/relabel_config.yml \
  --compact.concurrency=8 \
  --downsample.concurrency=8 \
  --deduplication.replica-label=receive_replica \
  --consistency-delay=12h \
  --delete-delay=12h \
  --web.external-prefix=bucket-web \
  --bucket-web-label=site \
  --retention.resolution-raw=400d \
  --retention.resolution-5m=400d \
  --retention.resolution-1h=400d \
  --compact.cleanup-interval=12h \
  --no-debug.halt-on-error
```

logs:

```
ts=2023-09-22T22:48:44.472506334Z caller=compact.go:1414 level=info msg="start sync of metas"
ts=2023-09-22T22:53:13.810616213Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=44m37.518213438s duration_ms=2677518 cached=39182 returned=39182 partial=38
ts=2023-09-22T22:53:32.069351219Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=44m55.803947742s duration_ms=2695803 cached=39182 returned=10099 partial=38
ts=2023-09-22T22:58:14.002343971Z caller=runutil.go:100 level=error msg="function failed. Retrying in next tick" err="BaseFetcher: iter bucket: context deadline exceeded"
ts=2023-09-22T22:58:14.032696297Z caller=compact.go:502 level=error msg="retriable error" err="compaction: sync: BaseFetcher: iter bucket: context deadline exceeded"
ts=2023-09-22T22:58:14.032736177Z caller=compact.go:1414 level=info msg="start sync of metas"
ts=2023-09-22T23:03:14.023330533Z caller=runutil.go:100 level=error msg="function failed. Retrying in next tick" err="BaseFetcher: iter bucket: context deadline exceeded"
ts=2023-09-22T23:03:14.023462313Z caller=compact.go:502 level=error msg="retriable error" err="compaction: sync: BaseFetcher: iter bucket: context deadline exceeded"
ts=2023-09-22T23:03:14.023516122Z caller=compact.go:1414 level=info msg="start sync of metas"
ts=2023-09-22T23:08:14.023650429Z caller=runutil.go:100 level=error msg="function failed. Retrying in next tick" err="BaseFetcher: iter bucket: context deadline exceeded"
ts=2023-09-22T23:08:14.023758559Z caller=compact.go:597 level=error msg="retriable error" err="BaseFetcher: iter bucket: context deadline exceeded"
ts=2023-09-22T23:13:14.024714964Z caller=runutil.go:100 level=error msg="function failed. Retrying in next tick" err="BaseFetcher: iter bucket: context deadline exceeded"
ts=2023-09-22T23:13:14.024837164Z caller=compact.go:502 level=error msg="retriable error" err="compaction: sync: BaseFetcher: iter bucket: context deadline exceeded"
ts=2023-09-22T23:18:14.025231553Z caller=runutil.go:100 level=error msg="function failed. Retrying in next tick" err="BaseFetcher: iter bucket: context deadline exceeded"
ts=2023-09-22T23:18:14.025321263Z caller=compact.go:597 level=error msg="retriable error" err="BaseFetcher: iter bucket: context deadline exceeded"
ts=2023-09-22T23:23:14.025944656Z caller=runutil.go:100 level=error msg="function failed. Retrying in next tick" err="BaseFetcher: iter bucket: context deadline exceeded"
ts=2023-09-22T23:23:14.026004887Z caller=compact.go:597 level=error msg="retriable error" err="BaseFetcher: iter bucket: context deadline exceeded"
ts=2023-09-22T23:28:14.026829795Z caller=runutil.go:100 level=error msg="function failed. Retrying in next tick" err="BaseFetcher: iter bucket: context deadline exceeded"
ts=2023-09-22T23:28:14.026921615Z caller=compact.go:597 level=error msg="retriable error" err="BaseFetcher: iter bucket: context deadline exceeded"
ts=2023-09-22T23:33:14.027700345Z caller=runutil.go:100 level=error msg="function failed. Retrying in next tick" err="BaseFetcher: iter bucket: context deadline exceeded"
ts=2023-09-22T23:33:14.027805326Z caller=compact.go:597 level=error msg="retriable error" err="BaseFetcher: iter bucket: context deadline exceeded"
ts=2023-09-22T23:38:14.028712317Z caller=runutil.go:100 level=error msg="function failed. Retrying in next tick" err="BaseFetcher: iter bucket: context deadline exceeded"
ts=2023-09-22T23:38:14.028771218Z caller=compact.go:597 level=error msg="retriable error" err="BaseFetcher: iter bucket: context deadline exceeded"
ts=2023-09-22T23:43:14.029704851Z caller=runutil.go:100 level=error msg="function failed. Retrying in next tick" err="BaseFetcher: iter bucket: context deadline exceeded"
ts=2023-09-22T23:43:14.029820401Z caller=compact.go:597 level=error msg="retriable error" err="BaseFetcher: iter bucket: context deadline exceeded"
ts=2023-09-22T23:48:14.030548316Z caller=runutil.go:100 level=error msg="function failed. Retrying in next tick" err="BaseFetcher: iter bucket: context deadline exceeded"
ts=2023-09-22T23:48:14.030675726Z caller=compact.go:597 level=error msg="retriable error" err="BaseFetcher: iter bucket: context deadline exceeded"
ts=2023-09-22T23:53:14.031283709Z caller=runutil.go:100 level=error msg="function failed. Retrying in next tick" err="BaseFetcher: iter bucket: context deadline exceeded"
ts=2023-09-22T23:53:14.031397409Z caller=compact.go:597 level=error msg="retriable error" err="BaseFetcher: iter bucket: context deadline exceeded"
ts=2023-09-22T23:58:14.031621177Z caller=runutil.go:100 level=error msg="function failed. Retrying in next tick" err="BaseFetcher: iter bucket: context deadline exceeded"
ts=2023-09-22T23:58:14.031724578Z caller=compact.go:597 level=error msg="retriable error" err="BaseFetcher: iter bucket: context deadline exceeded"
ts=2023-09-23T00:02:33.566897678Z caller=compact.go:1414 level=info msg="start sync of metas"
```

For us, with our block and data load, compact and store are unusable without a CLI arg to enable the Exists strategy, since listing all objects in the bucket takes 20 to 45 minutes on GCS.

I am happy to run debug builds, add logging, generate trace spans or whatever, to show this.

gburek-fastly commented 1 year ago

@fpetkovski you mentioned above:

The store has gone through 155K blocks in object storage in 7m

How many objects are in this bucket that make up 155k blocks? What is the backing object store?

sepich commented 12 months ago

Here is an easier way to reproduce the issue:

$ time docker run -it --rm -v sa.json:/sa.json -e GOOGLE_APPLICATION_CREDENTIALS=/sa.json thanosio/thanos:v0.31.0 tools bucket ls --objstore.config='{type: GCS, config: {bucket: <bucket>}}'

level=info ts=2023-11-13T08:16:17.77034421Z caller=factory.go:52 msg="loading bucket configuration"
level=info ts=2023-11-13T08:17:34.399242748Z caller=fetcher.go:478 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=1m16.718396163s duration_ms=76718 cached=9978 returned=9978 partial=2382
...
level=info ts=2023-11-13T08:17:34.455320602Z caller=tools_bucket.go:463 msg="ls done" objects=9978
level=info ts=2023-11-13T08:17:34.455737735Z caller=main.go:164 msg=exiting

real    1m14.275s
user    0m0.093s
sys 0m0.137s

And the new version never finishes (same 5m timeout each time):

$ time docker run -it --rm -v sa.json:/sa.json -e GOOGLE_APPLICATION_CREDENTIALS=/sa.json thanosio/thanos:v0.32.5 tools bucket ls --objstore.config='{type: GCS, config: {bucket: <bucket>}}'
ts=2023-11-13T08:20:21.77540959Z caller=factory.go:53 level=info msg="loading bucket configuration"
ts=2023-11-13T08:25:21.565157156Z caller=main.go:135 level=error err="context deadline exceeded\nBaseFetcher: iter bucket\ngithub.com/thanos-io/thanos/pkg/block.(*BaseFetcher).fetchMetadata\n\t/app/pkg/block/fetcher.go:383\ngithub.com/thanos-io/thanos/pkg/block.(*BaseFetcher).fetch.func2\n\t/app/pkg/block/fetcher.go:456\ngithub.com/golang/groupcache/singleflight.(*Group).Do\n\t/go/pkg/mod/github.com/golang/groupcache@v0.0.0-20210331224755-41bb18bfe9da/singleflight/singleflight.go:56\ngithub.com/thanos-io/thanos/pkg/block.(*BaseFetcher).fetch\n\t/app/pkg/block/fetcher.go:454\ngithub.com/thanos-io/thanos/pkg/block.(*MetaFetcher).Fetch\n\t/app/pkg/block/fetcher.go:514\nmain.registerBucketLs.func1\n\t/app/cmd/thanos/tools_bucket.go:465\nmain.main\n\t/app/cmd/thanos/main.go:133\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:267\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650\npreparing tools bucket ls command failed\nmain.main\n\t/app/cmd/thanos/main.go:135\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:267\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650"

real    5m0.388s
user    0m0.143s
sys 0m0.094s

GiedriusS commented 12 months ago

I wonder how we could reproduce this locally ourselves :thinking:

fpetkovski commented 12 months ago

I think this is a result of the fact that recursive iter is simply slower, and I don't know if we have a good way to fix it. I am fine with reverting the change, but if others have seen benefits, we could maybe make it optional?

sepich commented 12 months ago

By adding a few fmt.Println calls I was able to find the reason. Our bucket had 2515032 objects from 2020-2021 like /debug/metas/%ULID%.json, which look like remains of https://github.com/thanos-io/thanos/issues/3839. After deleting these, bucket ls for v0.32.5 works at the same speed as v0.31, and store timings are back in order:

ts=2023-11-14T09:46:24.344376629Z caller=store.go:435 level=info msg="bucket store ready" init_duration=53.578281039s
ts=2023-11-14T09:46:24.344762123Z caller=intrumentation.go:56 level=info msg="changing probe status" status=ready
ts=2023-11-14T09:46:24.34489548Z caller=grpc.go:131 level=info service=gRPC/server component=store msg="listening for serving gRPC" address=0.0.0.0:10901
ts=2023-11-14T09:46:25.174331359Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=829.850131ms duration_ms=829 cached=10281 returned=6976 partial=2495
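For anyone doing a similar cleanup on GCS, a minimal sketch with gsutil (assuming the leftovers live under debug/metas/ at the bucket root, as in the path above; list first and double-check before deleting anything):

```sh
# inspect what would be removed
gsutil ls "gs://<bucket>/debug/metas/" | head

# delete the leftover debug metas in parallel
gsutil -m rm "gs://<bucket>/debug/metas/**"
```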

fpetkovski commented 12 months ago

Thanks for reporting back @sepich, I am glad this was resolved for you.

fpetkovski commented 12 months ago

How many objects are in this bucket that make up 155k blocks? What is the backing object store?

@gburek-fastly sorry, just seeing this. We use GCS, but I am not sure what the total object count is. Does the solution from @sepich resolve your problem?

yeya24 commented 11 months ago

I think we hit the same issue, and cleaning up objects (especially versioned objects) in the S3 bucket helped in our case. We can either revert the change or merge https://github.com/thanos-io/thanos/pull/6902, which provides a way to customize how blocks are fetched; then we can keep both recursive iteration and the old Exists-based method.

yeya24 commented 11 months ago

After #6902 is merged, I wonder how we should proceed with improving this. Should we add the previous block fetch method back and make the fetch method configurable? I wonder if we really need to expose such low-level details.
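Purely as an illustration of what "making the block fetch method configurable" could look like, here is a hypothetical sketch. This is not the API introduced by #6902; all names below are invented.

```go
package blockdiscovery

import (
	"context"

	"github.com/oklog/ulid"
	"github.com/thanos-io/objstore"
)

// BlockIDLister is a hypothetical seam letting operators choose how block
// ULIDs are discovered in a bucket.
type BlockIDLister interface {
	ListBlockIDs(ctx context.Context, bkt objstore.Bucket) ([]ulid.ULID, error)
}

// RecursiveIterLister would implement the v0.32 strategy (one recursive Iter),
// while ShallowExistsLister would implement the v0.31 strategy (shallow Iter
// plus parallel Exists checks); the fetcher would use whichever is configured.
type (
	RecursiveIterLister struct{}
	ShallowExistsLister struct{ Concurrency int }
)
```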

fpetkovski commented 11 months ago

We reduced our object storage bill by thousands of dollars per day by using a recursive iterator, so having no escape hatch would put us in a very bad spot. I would say the root cause is the large amount of partial blocks in object storage, or data unrelated to Thanos. For the versioned-objects use case, we should probably change the library to only request the latest version. I am also fine with having both methods as options, since there are obvious trade-offs.

We still haven't heard from @gburek-fastly whether his issue falls into one of these two categories.

gburek-fastly commented 11 months ago

Thank you for the tip wrt the debug meta.jsons. There were a lot of these. We have removed 4e6 debug objects, which has improved our bucket sync times from 24m to 12m.

0.32.3: msg="successfully synchronized block metadata" duration=12m10.960457779s duration_ms=730960 cached=38176 returned=29776 partial=6

This is still, unfortunately, orders of magnitude worse (roughly 300x) than 0.31.0, which is at ~2s: msg="successfully synchronized block metadata" duration=2.431460715s duration_ms=2431 cached=38158 returned=29776 partial=8

Our 38K blocks, without the debug metas, comprise 4.98e6 objects, and we see a linear relationship between startup time and total bucket object count. As far as I can tell, we see around 10-75 no-meta-json blocks in thanos_blocks_meta_synced and don't have other extraneous objects in the bucket.

For context, we use GCS and have versioned objects.

I think making the fetch method configurable is probably the way to go, as we effectively have two extremes of operation:

  1. large blocks with hundreds of chunks per block, refreshed infrequently to control costs
  2. many small blocks that are refreshed frequently to help with freshness

The efficiency of using Iter may be primarily dictated by the density of meta.json objects in the bucket. It will of course be expensive to continually HEAD hundreds of thousands of meta.json objects, but in our case, where meta.json objects are barely 1% of the objects in the bucket, the time to list all objects is prohibitive compared with the cost of HEADing a few objects.

I think this should be configurable, at least, to allow for our different data environments and give a choice between cost and startup time.
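To make that trade-off concrete, a rough back-of-envelope comparison using the numbers above (the ~1,000-objects-per-page figure for object listings is an assumption about the default page size, and the call counts are only order-of-magnitude estimates):

```
Recursive Iter:   4.98e6 objects / ~1,000 objects per list page ≈ 5,000 sequential list calls
Exists strategy:  ~38,000 meta.json existence checks, which can be issued with high concurrency
```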

wndhydrnt commented 10 months ago

Hello, our Thanos setup is also affected by this issue. The object store is S3 in our case, not GCS.

Startup times of the store component increased from 2-3 seconds to ~40 minutes after the upgrade to v0.32.5. Deleting /debug/metas/%ULID%.json objects got the startup time down to ~12 minutes, but that is still way higher than in v0.31.x.

Runtime of the compact component also increased from ~03h20m to ~05h15m.

We have not observed any significant increase or decrease in costs of our S3 bucket.

Being able to switch back to the old fetch logic would allow us to upgrade to v0.32.x and newer.

mkrull commented 9 months ago

We are in the same situation as the people above. Our store-gateway startup time increased from a few minutes to more than 1h for large S3 buckets, so we are stuck on 0.31.

GiedriusS commented 9 months ago

What storage providers are you using? AWS S3 or some other implementation?

mkrull commented 9 months ago

What storage providers are you using? AWS S3 or some other implementation?

AWS S3 for us

mkrull commented 9 months ago

Is there any chance of getting traction on this to expose the old bucket sync method in some way?

fpetkovski commented 9 months ago

I am already on it: https://github.com/thanos-io/thanos/pull/7134

mfoldenyi commented 5 months ago

@sepich Can this be closed? #7134 has been merged.