thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io

Compact: Offline deduplication #1014

Closed. smalldirector closed this issue 3 years ago.

smalldirector commented 5 years ago

Currently Thanos provides a dedup function in the Query component; it can merge data on the fly from Prometheus HA pairs or remote object storage. However, this query-time dedup is not efficient, because the replicas' blocks in TSDB/object storage are never actually merged, so the Query API has to load duplicated data on each request.

With the current Prometheus TSDB design, it seems difficult to implement a dedup function that merges blocks on each TSDB node on the Thanos side. However, it should be easy to implement for object storage, since that is one centralized store and we already have a Compactor component running against it.

Offering a dedup function on the object storage side would definitely improve metrics query latency and reduce object storage cost.

Is there any plan to support such a requirement in Thanos? If so, I would like to hear your ideas about this feature.

Thanks.

bwplotka commented 5 years ago

Hi :wave: thanks for raising this. I renamed the title to make it clearer; let me know if this makes sense.

There has always been an idea like this: allow the compactor to deduplicate blocks offline, reducing storage and improving query performance a bit. I am happy to do so; however, we need to ask ourselves whether the current deduplication algorithm is the correct one for everybody.

We have seen some reports claiming that our "penalty-based" deduplication algorithm is not working for all edge cases, e.g. https://github.com/improbable-eng/thanos/issues/981. That's fair, as our algorithm is very basic. The problem with this feature, combined with an imperfect algorithm, is that we would deduplicate data irreversibly (unless we back up those blocks somehow).

Anyway, I think we are happy to add such a feature to the compactor at some point, given we can close the gaps in the current algorithm (and maybe add more tests/test cases).

smalldirector commented 5 years ago

Thanks for the reply @bwplotka. I'm glad to hear that you also want to support this feature in the future. I work on eBay's monitoring infrastructure team, and we are leveraging Thanos to build eBay's high-availability monitoring platform. Thanks for providing such a great framework.

For the offline dedup function, we have already started building it inside the Thanos Compactor component. If needed, we are more than happy to contribute it back to the Thanos community. Please feel free to let me know your thoughts as well.

bwplotka commented 5 years ago

Of course! PRs would be welcome, especially if you have something proven from your prod (:

bwplotka commented 5 years ago

Are you on our slack BTW? (:

smalldirector commented 5 years ago

Which Slack channel are you pointing to here? I am only able to see the announcements channel, and I'm not able to join the thanos-dev channel mentioned here: https://github.com/improbable-eng/thanos/blob/master/CONTRIBUTING.md. BTW, my display name is smalldirector in the announcements channel.

bwplotka commented 5 years ago

I can see you in both the #thanos and #thanos-dev channels (:

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Reamer commented 4 years ago

Remove "stale", that's an interesting feature

stale[bot] commented 4 years ago

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

bwplotka commented 4 years ago

Stale bot, go away for now ;p

We are quite close!

@metalmatze is working on it and I am refactoring TSDB to allow us to do it via TSDB code.

stale[bot] commented 4 years ago

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

ThoreKr commented 4 years ago

Can someone add the label for stalebot to ignore this issue?

bwplotka commented 4 years ago

We blocked blocking the stale bot for a reason: we want to give you an update at least every month :+1: We plan to improve this flow here: https://github.com/thanos-io/thanos/issues/2262

Update: What is on master, and what will be released next week, is naive offline deduplication. This means you can configure the compactor to deduplicate blocks that hold exactly the same data, e.g. those produced by Thanos Receive. This was straightforward to do, so we added it as the first iteration here: https://github.com/thanos-io/thanos/pull/2250

You can read the changelog for how to enable this and what the consequences are.
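For reference, a minimal sketch of what enabling this naive (one-to-one) dedup might look like on the compactor, assuming the flags added in the PR above and the replica external label used by Receive later in this thread; paths and file names are placeholders, and the exact flags should be checked against the changelog of the release you run:

# Sketch only: safe for blocks carrying identical data (e.g. Thanos Receive replication),
# NOT for Prometheus HA pairs at this stage.
thanos compact \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=bucket.yml \
  --compact.enable-vertical-compaction \
  --deduplication.replica-label=replica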

It's important to know that this will NOT work for replicated Prometheuses, because the data is not the same: the timestamps differ. If you apply this solution to such data now, you will malform it; you will still be able to query it, but the data will be unreliable.

To support proper offline dedup we first need to be able to replace the deduplication algorithm in Prometheus. I am adding an interface allowing us to do so here: https://github.com/prometheus/prometheus/pull/7059. It should be ready in the next week or two.

Cheers :beers:

sepich commented 4 years ago

Great news! Unfortunately this does not work for our current setup and historical data. We run 3 thanos-receive instances with --label=replica="$(NAME)". Then in bucket inspect we have:

| 01E4N18DW7663APZ5V0B3D3CY1 | 30-03-2020 05:30:00 | 30-03-2020 06:00:00 | 30m0s    | 39h30m0s   | 545,049    | 50,824,911     | 545,054     | 1          | false       | replica=thanos-receive-0 | 0s         | receive   |
| 01E4N18DW8A4Z0ES2GCXBFEVVH | 30-03-2020 05:30:00 | 30-03-2020 06:00:00 | 30m0s    | 39h30m0s   | 545,049    | 50,824,911     | 545,054     | 1          | false       | replica=thanos-receive-2 | 0s         | receive   |
| 01E4N18DW8AR0CRXN0F4N0F7VY | 30-03-2020 05:30:00 | 30-03-2020 06:00:00 | 30m0s    | 39h30m0s   | 545,049    | 50,824,911     | 545,054     | 1          | false       | replica=thanos-receive-1 | 0s         | receive   |

And thanos-compact (master-2020-03-27-4e5c2f25) cannot process that due to: empty external labels are not allowed for Thanos block

level=info ts=2020-03-30T02:24:48.810475048Z caller=compact.go:441 msg="compact blocks" count=3 mint=1583366400000 maxt=1584576000000 ulid=01E4M5KHECGZQ1BV66E6SQGH78 sources="[01E3V4JCD9Z5BK5B8A5KSYGMV3 01E3WMG1NXY78KB6GKXR594FMS 01E3T3609JKQ5V1TKZ8NYTGE58]" duration=4h13m14.923098503s
level=info ts=2020-03-30T02:24:48.955331391Z caller=compact.go:704 compactionGroup=0@{} compactionGroupKey=0@17241709254077376921 msg="compacted blocks" blocks="[/data/compact/0@17241709254077376921/01E3V4JCD9Z5BK5B8A5KSYGMV3 /data/compact/0@17241709254077376921/01E3WMG1NXY78KB6GKXR594FMS /data/compact/0@17241709254077376921/01E3T3609JKQ5V1TKZ8NYTGE58]" duration=4h13m15.067940785s overlapping_blocks=true
level=info ts=2020-03-30T02:27:09.523306206Z caller=bucket.go:487 msg="synchronizing block metadata"
level=info ts=2020-03-30T02:27:12.627788832Z caller=bucket.go:500 msg="downloaded blocks meta.json" num=101
level=warn ts=2020-03-30T02:29:27.744301881Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="error executing compaction: compaction: group 0@17241709254077376921: upload of 01E4M5KHECGZQ1BV66E6SQGH78 failed: empty external labels are not allowed for Thanos block."
level=info ts=2020-03-30T02:29:27.744689058Z caller=http.go:81 service=http/server component=compact msg="internal server shutdown" err="error executing compaction: compaction: group 0@17241709254077376921: upload of 01E4M5KHECGZQ1BV66E6SQGH78 failed: empty external labels are not allowed for Thanos block."
level=info ts=2020-03-30T02:29:27.744812002Z caller=intrumentation.go:66 msg="changing probe status" status=not-healthy reason="error executing compaction: compaction: group 0@17241709254077376921: upload of 01E4M5KHECGZQ1BV66E6SQGH78 failed: empty external labels are not allowed for Thanos block."
level=error ts=2020-03-30T02:29:27.745248074Z caller=main.go:212 err="group 0@17241709254077376921: upload of 01E4M5KHECGZQ1BV66E6SQGH78 failed: empty external labels are not allowed for Thanos block.\ncompaction\nmain.runCompact.func5\n\t/go/src/github.com/thanos-io/thanos/cmd/thanos/compact.go:377\nmain.runCompact.func6.1\n\t/go/src/github.com/thanos-io/thanos/cmd/thanos/compact.go:428\ngithub.com/thanos-io/thanos/pkg/runutil.Repeat\n\t/go/src/github.com/thanos-io/thanos/pkg/runutil/runutil.go:72\nmain.runCompact.func6\n\t/go/src/github.com/thanos-io/thanos/cmd/thanos/compact.go:427\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.0.0/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357\nerror executing compaction\nmain.runCompact.func6.1\n\t/go/src/github.com/thanos-io/thanos/cmd/thanos/compact.go:455\ngithub.com/thanos-io/thanos/pkg/runutil.Repeat\n\t/go/src/github.com/thanos-io/thanos/pkg/runutil/runutil.go:72\nmain.runCompact.func6\n\t/go/src/github.com/thanos-io/thanos/cmd/thanos/compact.go:427\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.0.0/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357\ncompact command failed\nmain.main\n\t/go/src/github.com/thanos-io/thanos/cmd/thanos/main.go:212\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"

Yes, we can add one more label for new data. But what do we do with our months of historical data awaiting this feature? Is it possible to add some --default.label to thanos-compact to apply when there are no labels left after deduplication? Or could it keep the first label found by --deduplication.replica-label and just remove the others?

bwplotka commented 4 years ago

Oh, yes, that's a bug. Help wanted to fix it (: A --label-if-empty flag might do it, I think... (:

stale[bot] commented 4 years ago

Hello 👋 Looks like there was no activity on this issue for last 30 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

bwplotka commented 4 years ago

Vertical compaction works well, but we still need something suitable for Prometheus replicas. We are chasing the last dedup bugs, so this is blocked for now.

bwplotka commented 4 years ago

Blocked by: https://github.com/thanos-io/thanos/issues/2547

RahulArora31 commented 4 years ago

@bwplotka amazing work on Thanos. I did a PoC in my testing cluster so that we have HA for Prometheus and also decouple data from the local cluster. I am amazed by the architecture. Good job :D

I have a few questions though; I'm sorry if they have already been answered before:

  1. In the Query component you can use the deduplication checkbox to filter data based on replica labels. So does Thanos store the same metric series twice (replica=A and replica=B), or does it merge them into one series while performing operations?
  2. I also tried your Receive approach, as one of our Prometheus instances is in an OpenShift cluster that does not have gRPC enabled at the moment. I noticed that all of a sudden Receive showed (saving WAL locally, failed to upload) for one of the receiver pods.
  3. After a restart the hashing must have changed, which is the expected behaviour, but the memory shot up from 4Gi to 47Gi, eventually getting OOMKilled. Do we need more than 47Gi of memory, or is this some kind of memory leak?

bwplotka commented 4 years ago

I think those questions are unrelated to this issue; it would be better to ask them on Slack instead (:

bwplotka commented 4 years ago

Last blocker for this issue: https://github.com/thanos-io/thanos/issues/981

stale[bot] commented 4 years ago

Hello 👋 Looks like there was no activity on this issue for last 30 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

bwplotka commented 4 years ago

Still to be done - the last deduplication problems were just fixed, so this should be doable.

stale[bot] commented 3 years ago

Hello 👋 Looks like there was no activity on this issue for last 30 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

Reamer commented 3 years ago

Hello @bwplotka, can you please give an update on this feature?

bwplotka commented 3 years ago

:wave:

We are adding new improvements to runtime deduplication so we can safely reuse it here: https://github.com/thanos-io/thanos/issues/2890. This time we are improving the penalty algorithm. There might be one tricky part around counter resets: there is no metric type information in Prometheus blocks yet, so we are discussing how we can add that thanks to the recent block metadata changes (cc @pracucci @brian-brazil). If not, we will need to guess it from the data, which needs more testing.

In the meantime we could potentially allow deduplication with backup, so you would still back up blocks to some remote location (another cold storage bucket) and we could revert things if needed. Without a backup I would not be confident allowing offline dedup for our Thanos users right now, before we handle those missing bits (:
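As a rough sketch of that backup idea, assuming the bucket replicate subcommand available in recent Thanos releases (verify the exact flags for your version), one could copy blocks into a separate cold-storage bucket before any irreversible dedup; bucket.yml and backup-bucket.yml are placeholder config files:

# Copy blocks from the primary bucket into a dedicated backup bucket (sketch only)
thanos tools bucket replicate \
  --objstore.config-file=bucket.yml \
  --objstore-to.config-file=backup-bucket.yml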

brian-brazil commented 3 years ago

There are some initial discussions about type metadata for remote write via the WAL; getting it into blocks is a completely different kettle of fish.

bwplotka commented 3 years ago

Also, the chunk iterator work is almost done, which will be needed for dedup during compaction.

Yea agree, super unlikely for now.

stale[bot] commented 3 years ago

Hello 👋 Looks like there was no activity on this issue for last 30 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

Reamer commented 3 years ago

Still important

eightnoteight commented 3 years ago

Hi @bwplotka, I'm interested in this issue. Is there any way I can contribute to it?

Antiarchitect commented 3 years ago

Also very interested. Please update the status.

stale[bot] commented 3 years ago

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

Reamer commented 3 years ago

Still interested.

bwplotka commented 3 years ago

👋🏽

I recently got many questions about offline dedup via DM; please ask them here instead (: Some updates:

We added more docs that hopefully clarify the WHY and HOW: https://thanos.io/tip/components/compact.md/#vertical-compactions

Answering one DM:

I saw in the compactor docs that the compactor supports Vertical Compactions via a hidden flag. Our use-case for wanting to implement that is covered by the use-cases listed right within that doc, specifically as the realistic duplication under the "Offline deduplication of series" bullet point. I'm a bit confused though. Is this something we can use now with the flags --compact.enable-vertical-compaction --deduplication.replica-label="prometheus_replica"? (The prometheus_replica external label is what we use to distinguish between each Prometheus within a Prometheus HA pair.) According to the docs that should be okay; however, I also found this issue you've been pretty active on:

It's NOT OK, unfortunately. Vertical compaction is implemented for one-to-one duplication, which comes from https://thanos.io/tip/components/receive.md/ replication, OR from having, for some reason, duplicated blocks with exactly the same data.

If you use this against Prometheus replicas it will most likely totally mess up your querying experience, as it just concatenates samples together. So in the best case the scrape interval appears as 2x the original; in the worst case it's totally unstable.

The missing part is bringing the deduplication algorithm we have online on the query path to the compaction stage, so we can leverage it there. That algorithm works in 99% of cases, which is fine for the query part, where you can just switch deduplication off. If the 1% happens with offline dedup, you cannot revert it, and that's a problem.
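To make the gap concrete, a sketch of both stages under the usual flag names (prometheus_replica taken from the question above; treat this as illustrative only, not a recommendation):

# Online dedup at query time: penalty-based, works for Prometheus HA pairs today
thanos query --query.replica-label=prometheus_replica ...

# Offline dedup at compaction time: currently one-to-one only, NOT yet safe for HA pairs
thanos compact --compact.enable-vertical-compaction --deduplication.replica-label=prometheus_replica ...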

We are exploring different deduplication algorithms that will make this much more reliable. We also ideally need something for query pushdown: https://docs.google.com/document/d/1ajMPwVJYnedvQ1uJNW2GDBY6Q91rKAw3kgA5fykIRrg/edit# so... help wanted (:

We can try to enable the 99% realistic dedup if we want; help wanted for this 🤗

heyitsmdr commented 3 years ago

Thank you @bwplotka for answering that question and confirming my suspicions. I'm going to look at the doc and related work and see if this is something I can help out with :) We would greatly benefit from a feature like this to dedupe the duplicated data coming from each Prometheus in the HA pair.

stale[bot] commented 3 years ago

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

Reamer commented 3 years ago

This is still needed. @bwplotka, it would be nice if Thanos implemented the 99% solution.

stale[bot] commented 3 years ago

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

dpwolfe commented 3 years ago

Please keep this open.

2nick commented 3 years ago

I took a look at the querier's deduplication and compaction process and have some questions. :)

For example, we have "main" and "replica" instances of Prometheus.

They could start asynchronously, and as a result it's possible to end up with the following blocks in S3:

main: block1 [1am - 3am], block2 [3am - 5am]
replica: block3 [2am - 5am]

And of course, there could be many more such overlapping blocks.

So the first question is: how should the planner and compactor work here?

The only option I see is that the planner makes 2 groups, like:

  1. block1+block3
  2. block2+block3

taking from block3 only the parts which complement the "main" blocks.

But such behavior sounds a bit sophisticated - maybe there are better options/thoughts on this? :)

yeya24 commented 3 years ago

@2nick I am looking at the same thing. It is a little bit sophisticated and I think the grouping approach in #1276 is similar to what you mentioned.

IMO the grouping part of offline deduplication can reuse the existing TSDB planner. We don't need to split into 2 groups in this case; we can do it in 2 iterations:

  1. block1 + block3 -> new block [1am - 5am]
  2. new block + block2 -> result [1am - 5am]

WDYT? But I agree this is still not a great approach: it would cost a lot if the overlap is small, since we would still need to iterate over the whole blocks to compact them.

2nick commented 3 years ago

After some thinking I've decided that it's a "good enough" approach, as it allows us to move forward. :)

Really great that you are on it! Looking forward to offline dedup! :)