thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

tools bucket rewrite: invalid memory address or nil pointer dereference #4298

Open onelife opened 3 years ago

onelife commented 3 years ago

I'm trying the series deletion feature and got the following error.

level=info ts=2021-06-03T04:34:56.077967157Z caller=factory.go:46 msg="loading bucket configuration"
level=info ts=2021-06-03T04:34:56.116875761Z caller=tools_bucket.go:868 msg="downloading block" source=01F6NY6XFBHZSQ159ZYF5FGE61
level=info ts=2021-06-03T04:34:59.814395054Z caller=tools_bucket.go:904 msg="changelog will be available" file=/tmp/thanos-rewrite/01F782EC369F5RPSZQSAZ45CQ5/change.log
level=info ts=2021-06-03T04:34:59.831787434Z caller=tools_bucket.go:919 msg="starting rewrite for block" source=01F6NY6XFBHZSQ159ZYF5FGE61 new=01F782EC369F5RPSZQSAZ45CQ5 toDelete="- matchers: \"{__name__=~\\\"mqtt2tsdb_.*\\\",gateway=\\\"LG01010002012110100\\\"}\""
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x17aba71]

goroutine 101 [running]:
github.com/thanos-io/thanos/pkg/compactv2.(*lazyPopulatableChunk).Bytes(0xc0000a6240, 0x8, 0xc0000d91c8, 0x40db9b)
        /app/pkg/compactv2/chunk_series_set.go:119 +0x31
github.com/prometheus/prometheus/tsdb/chunks.(*Writer).WriteChunks(0xc0000d0960, 0xc0000cc140, 0x5, 0x8, 0xa0, 0xc0000cc140)
        /go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20210421143221-52df5ef7a3be/tsdb/chunks/chunks.go:302 +0x11a
github.com/thanos-io/thanos/pkg/block.(*statsGatheringSeriesWriter).WriteChunks(0xc00003afc0, 0xc0000cc140, 0x5, 0x8, 0x0, 0x0)
        /app/pkg/block/writer.go:172 +0x5f
github.com/thanos-io/thanos/pkg/compactv2.(*Compactor).write(0xc0000d9d68, 0x1fa0228, 0xc000862c00, 0x1f96f50, 0xc0000d0000, 0x1fa0f80, 0xc0000d0050, 0x7f74fb69eff0, 0xc00003afc0, 0x1f6ff20, ...)
        /app/pkg/compactv2/chunk_series_set.go:200 +0x427
github.com/thanos-io/thanos/pkg/compactv2.(*Compactor).WriteSeries(0xc0000d9d68, 0x1fa0228, 0xc000862c00, 0xc0000d9b98, 0x1, 0x1, 0x1fa67b8, 0xc00003afc0, 0x1f6ff20, 0xc000404a80, ...)
        /app/pkg/compactv2/compactor.go:147 +0xb25
main.registerBucketRewrite.func1.1(0x0, 0x0)
        /app/cmd/thanos/tools_bucket.go:920 +0x10f5
github.com/oklog/run.(*Group).Run.func1(0xc0002924e0, 0xc000890a00, 0xc000881b60)
        /go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38 +0x27
created by github.com/oklog/run.(*Group).Run
        /go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:37 +0xbb

The dry run completes without issue. The test was done with Thanos v0.20.2 and S3. The following is the output of tools bucket inspect:

level=info ts=2021-06-03T04:05:41.6528167Z caller=factory.go:46 msg="loading bucket configuration"
level=info ts=2021-06-03T04:05:42.1304465Z caller=fetcher.go:476 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=461.8639ms cached=187 returned=187 partial=0
|            ULID            |        FROM         |        UNTIL        |     RANGE      |   UNTIL-DOWN    | #SERIES |   #SAMPLES    |  #CHUNKS   | COMP-LEVEL | COMP-FAILED |                                                        LABELS                                                        | RESOLUTION |     SOURCE     |
|----------------------------|---------------------|---------------------|----------------|-----------------|---------|---------------|------------|------------|-------------|----------------------------------------------------------------------------------------------------------------------|------------|----------------|
| 01F11XK5AW6BN0V760D9C4Y5T3 | 04-03-2021 00:00:00 | 18-03-2021 00:00:00 | 335h59m59.864s | -               | 86,658  | 22,437,633    | 286,768    | 4          | false       | prometheus=backend/kprom-kube-prometheus-prometheus,prometheus_replica=prometheus-kprom-kube-prometheus-prometheus-0 | 1h0m0s     | compactor      |
| 01F25Z5EGMWKM577FMTYM0718C | 18-03-2021 00:00:00 | 01-04-2021 00:00:00 | 335h59m59.935s | -               | 91,515  | 23,312,170    | 292,946    | 4          | false       | prometheus=backend/kprom-kube-prometheus-prometheus,prometheus_replica=prometheus-kprom-kube-prometheus-prometheus-0 | 1h0m0s     | compactor      |
| 01F3A0S4HM9KYV9J5528SM7NS1 | 01-04-2021 00:00:00 | 15-04-2021 00:00:00 | 335h59m59.938s | -               | 106,308 | 25,558,975    | 321,134    | 4          | false       | prometheus=backend/kprom-kube-prometheus-prometheus,prometheus_replica=prometheus-kprom-kube-prometheus-prometheus-0 | 1h0m0s     | compactor      |
| 01F4E20BTK668NA3HBNRXN7AKB | 15-04-2021 00:00:00 | 29-04-2021 00:00:00 | 335h59m59.943s | -295h59m59.943s | 100,537 | 3,289,387,764 | 27,425,425 | 4          | false       | prometheus=backend/kprom-kube-prometheus-prometheus,prometheus_replica=prometheus-kprom-kube-prometheus-prometheus-0 | 0s         | compactor      |
| 01F4E24RDA6E8WA289YC2KVN8Z | 15-04-2021 00:00:00 | 29-04-2021 00:00:00 | 335h59m59.943s | -95h59m59.943s  | 100,537 | 326,586,223   | 2,366,044  | 4          | false       | prometheus=backend/kprom-kube-prometheus-prometheus,prometheus_replica=prometheus-kprom-kube-prometheus-prometheus-0 | 5m0s       | compactor      |
| 01F4E2BZ0Y3XADYFHXEPE0VCN7 | 15-04-2021 00:00:00 | 29-04-2021 00:00:00 | 335h59m59.943s | -               | 100,537 | 27,251,580    | 342,084    | 4          | false       | prometheus=backend/kprom-kube-prometheus-prometheus,prometheus_replica=prometheus-kprom-kube-prometheus-prometheus-0 | 1h0m0s     | compactor      |
| 01F5J3JRDFXSFVKJNE3T3JWTW0 | 29-04-2021 00:00:00 | 13-05-2021 00:00:00 | 335h59m59.896s | -295h59m59.896s | 162,678 | 3,312,859,543 | 27,657,855 | 4          | false       | prometheus=backend/kprom-kube-prometheus-prometheus,prometheus_replica=prometheus-kprom-kube-prometheus-prometheus-0 | 0s         | compactor      |
| 01F5J3QKAMVCD9VGRXD1AHK8DS | 29-04-2021 00:00:00 | 13-05-2021 00:00:00 | 335h59m59.896s | -95h59m59.896s  | 162,678 | 330,299,454   | 2,463,624  | 4          | false       | prometheus=backend/kprom-kube-prometheus-prometheus,prometheus_replica=prometheus-kprom-kube-prometheus-prometheus-0 | 5m0s       | compactor      |
| 01F5J3ZA3JJZ0H7K4G0N0DCYBG | 29-04-2021 00:00:00 | 13-05-2021 00:00:00 | 335h59m59.896s | -               | 162,678 | 27,804,831    | 406,632    | 4          | false       | prometheus=backend/kprom-kube-prometheus-prometheus,prometheus_replica=prometheus-kprom-kube-prometheus-prometheus-0 | 1h0m0s     | compactor      |
| 01F5Q8ANWKCJQA7FE34JW6MPT5 | 13-05-2021 00:00:00 | 15-05-2021 00:00:00 | 47h59m59.942s  | 192h0m0.058s    | 84,767  | 41,735,306    | 370,524    | 3          | false       | prometheus=backend/kprom-kube-prometheus-prometheus,prometheus_replica=prometheus-kprom-kube-prometheus-prometheus-0 | 5m0s       | compactor      |
| 01F6S0DGVDNJA6FQE3JBXFQ7ZS | 13-05-2021 00:00:00 | 15-05-2021 00:00:00 | 47h59m59.942s  | -7h59m59.942s   | 84,736  | 418,305,020   | 3,489,467  | 3          | false       | prometheus=backend/kprom-kube-prometheus-prometheus,prometheus_replica=prometheus-kprom-kube-prometheus-prometheus-0 | 0s         | bucket.rewrite |
| 01F6P52S252GDCNEE6MS6FT1FP | 13-05-2021 00:00:00 | 27-05-2021 00:00:00 | 335h59m59.942s | -295h59m59.942s | 384,795 | 3,841,904,413 | 32,154,121 | 4          | false       | prometheus=backend/kprom-kube-prometheus-prometheus,prometheus_replica=prometheus-kprom-kube-prometheus-prometheus-0 | 0s         | compactor      |
| 01F5WD45R5QXQM0HBT98D2Z2AE | 15-05-2021 00:00:00 | 17-05-2021 00:00:00 | 47h59m59.865s  | 192h0m0.135s    | 76,958  | 44,303,448    | 384,616    | 3          | false       | prometheus=backend/kprom-kube-prometheus-prometheus,prometheus_replica=prometheus-kprom-kube-prometheus-prometheus-0 | 5m0s       | compactor      |
| 01F61RSC2JKFKAPKSEHVYB0839 | 17-05-2021 00:00:00 | 19-05-2021 00:00:00 | 47h59m59.97s   | 192h0m0.03s     | 118,707 | 45,051,844    | 412,647    | 3          | false       | prometheus=backend/kprom-kube-prometheus-prometheus,prometheus_replica=prometheus-kprom-kube-prometheus-prometheus-0 | 5m0s       | compactor      |
| 01F66PPS2E60WH5JTQ0FDFKP7H | 19-05-2021 02:51:34 | 21-05-2021 00:00:00 | 45h8m25.33s    | 194h51m34.67s   | 320,499 | 55,308,708    | 576,107    | 3          | false       | prometheus=backend/kprom-kube-prometheus-prometheus,prometheus_replica=prometheus-kprom-kube-prometheus-prometheus-0 | 5m0s       | compactor      |
| 01F6BMMC13E9Y97Z9MKDPAK9H5 | 21-05-2021 00:00:00 | 23-05-2021 00:00:00 | 48h0m0s        | 192h0m0s        | 114,131 | 65,161,317    | 565,736    | 3          | false       | prometheus=backend/kprom-kube-prometheus-prometheus,prometheus_replica=prometheus-kprom-kube-prometheus-prometheus-0 | 5m0s       | compactor      |
| 01F6H092T470QK29QPJFRRXYTG | 23-05-2021 00:00:00 | 25-05-2021 00:00:00 | 47h59m59.995s  | 192h0m0.005s    | 115,335 | 65,352,232    | 568,464    | 3          | false       | prometheus=backend/kprom-kube-prometheus-prometheus,prometheus_replica=prometheus-kprom-kube-prometheus-prometheus-0 | 5m0s       | compactor      |
| 01F6NY6XFBHZSQ159ZYF5FGE61 | 25-05-2021 00:00:00 | 27-05-2021 00:00:00 | 48h0m0s        | 192h0m0s        | 121,740 | 65,901,813    | 577,159    | 3          | false       | prometheus=backend/kprom-kube-prometheus-prometheus,prometheus_replica=prometheus-kprom-kube-prometheus-prometheus-0 | 5m0s       | compactor      |
| 01F6Q0FVT6X0HYATRCS7DY878M | 27-05-2021 00:00:00 | 27-05-2021 08:00:00 | 7h59m59.989s   | 32h0m0.011s     | 116,982 | 110,876,255   | 925,500    | 2          | false       | prometheus=backend/kprom-kube-prometheus-prometheus,prometheus_replica=prometheus-kprom-kube-prometheus-prometheus-0 | 0s         | compactor      |
...

I tried several (but not all) blocks with source==compactor; only the rewrite of block 01F6Q0FVT6X0HYATRCS7DY878M completed without error.
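Comparing the report with the inspect table: the block that rewrote cleanly (01F6Q0FVT6X0HYATRCS7DY878M) has RESOLUTION 0s, while the failing source block 01F6NY6XFBHZSQ159ZYF5FGE61 has 5m0s. This suggests downsampling is the trigger. The minimal Go sketch below selects only raw (0s) blocks from parsed inspect rows. The inspectRow type and rawBlocks helper are hypothetical and not part of the Thanos API. It assumes the RESOLUTION column has already been extracted as a Go duration string.

```go
package main

import (
	"fmt"
	"time"
)

// inspectRow holds the two columns of the inspect table that matter here.
// Hypothetical type, not part of the Thanos API.
type inspectRow struct {
	ULID       string
	Resolution string // e.g. "0s", "5m0s", "1h0m0s"
}

// rawBlocks returns the ULIDs of blocks whose resolution is zero, i.e. blocks
// that have not been downsampled.
func rawBlocks(rows []inspectRow) ([]string, error) {
	var out []string
	for _, r := range rows {
		res, err := time.ParseDuration(r.Resolution)
		if err != nil {
			return nil, fmt.Errorf("block %s: bad resolution %q: %w", r.ULID, r.Resolution, err)
		}
		if res == 0 {
			out = append(out, r.ULID)
		}
	}
	return out, nil
}

func main() {
	rows := []inspectRow{
		{"01F6NY6XFBHZSQ159ZYF5FGE61", "5m0s"},   // failed in the report
		{"01F6Q0FVT6X0HYATRCS7DY878M", "0s"},     // rewrote without error
		{"01F5J3ZA3JJZ0H7K4G0N0DCYBG", "1h0m0s"}, // 1h downsampled block
	}
	ids, err := rawBlocks(rows)
	if err != nil {
		panic(err)
	}
	fmt.Println(ids) // [01F6Q0FVT6X0HYATRCS7DY878M]
}
```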

yeya24 commented 3 years ago

Thanks for reporting this issue. The bucket rewrite tool currently works only for blocks that have not been downsampled (res=0).

We need to mention it in the docs.

onelife commented 3 years ago

Hi @yeya24, thanks for the reply!

Before closing the issue, could you share the roadmap for this feature? Are there plans to support downsampled blocks, or to support deleting series within a specified time range?

yeya24 commented 3 years ago

@bwplotka for more input. We definitely want to support it, but supporting deletion sounds tricky to me, since you can delete only part of a series by specifying time ranges.

For the new rewrite relabel command this is easier, as it operates on whole series.
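One plausible reading of why partial time-range deletion is harder for downsampled data: deleting raw samples in an interval is an exact filter, but a downsampled value aggregates a whole window, so a window that only partially overlaps the deleted interval mixes deleted and kept samples whose raw values are no longer available. The Go sketch below illustrates this under simplifying assumptions. The sample and aggWindow types are hypothetical; windows are modeled as half-open [Start, End) ranges carrying only Count and Sum, which is not Thanos's actual downsampled chunk encoding.

```go
package main

import "fmt"

// sample is a raw (res=0) sample. Hypothetical type.
type sample struct {
	T int64   // timestamp in milliseconds
	V float64 // sample value
}

// aggWindow is a hypothetical downsampled aggregate over [Start, End).
type aggWindow struct {
	Start, End int64 // half-open window in milliseconds
	Count      int
	Sum        float64
}

// deleteRaw drops raw samples with mint <= T <= maxt. This is exact.
func deleteRaw(in []sample, mint, maxt int64) []sample {
	out := make([]sample, 0, len(in))
	for _, s := range in {
		if s.T < mint || s.T > maxt {
			out = append(out, s)
		}
	}
	return out
}

// deleteAgg keeps windows that do not overlap [mint, maxt], drops windows
// fully inside it, and reports partially overlapping windows as ambiguous:
// their Count and Sum cannot be corrected without the original raw samples.
func deleteAgg(in []aggWindow, mint, maxt int64) (kept, ambiguous []aggWindow) {
	for _, w := range in {
		last := w.End - 1 // last millisecond covered by the half-open window
		switch {
		case last < mint || w.Start > maxt:
			kept = append(kept, w) // no overlap
		case w.Start >= mint && last <= maxt:
			// fully covered by the deletion interval: safe to drop
		default:
			ambiguous = append(ambiguous, w)
		}
	}
	return kept, ambiguous
}

func main() {
	raw := []sample{{0, 1}, {60_000, 2}, {120_000, 3}, {180_000, 4}}
	fmt.Println(len(deleteRaw(raw, 60_000, 120_000))) // 2

	windows := []aggWindow{
		{Start: 0, End: 300_000, Count: 5, Sum: 15},
		{Start: 300_000, End: 600_000, Count: 5, Sum: 40},
	}
	kept, ambiguous := deleteAgg(windows, 60_000, 120_000)
	fmt.Println(len(kept), len(ambiguous)) // 1 1
}
```

Relabeling does not hit this problem because it keeps or rewrites every sample of a series, so no aggregate window ever has to be split.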

stale[bot] commented 3 years ago

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

yeya24 commented 3 years ago

Not stale.

markmsmith commented 3 years ago

Still needed.

bobykus31 commented 2 years ago

If it is not possible to remove metrics from downsampled data, is it even possible to recreate downsampled data from raw metrics where unwanted metrics are deleted?

markmsmith commented 2 years ago

As far as I know, this is still needed.

mortaelth commented 2 years ago

Confirming, still needed, just ran into the issue :)

mortaelth commented 2 years ago

Still there :(

iceman91176 commented 2 years ago

Regarding @bobykus31 question

If it is not possible to remove metrics from downsampled data, is it even possible to recreate downsampled data from raw metrics where unwanted metrics are deleted?

Although this might not be efficient, it might be a workaround. How does the compactor handle such a rewritten block? Will it just be processed because, from the compactor's view, it is a new block? Will it be ignored?

Maybe @yeya24 can answer this ?

markmsmith commented 2 years ago

I think this is still needed

mo4islona commented 1 year ago

I faced the same issue. Are there any plans to fix it?

matej-g commented 1 year ago

There is still some work in progress on this, so it's not forgotten; see this recently closed PR: https://github.com/thanos-io/thanos/pull/5725

lasermoth commented 1 year ago

Any update on this issue? Still needed.

BouchaaraAdil commented 1 year ago

Still there:

level=info ts=2023-04-19T13:44:27.718117168Z caller=tools_bucket.go:1227 msg="starting rewrite for block" source=xxxxxxxxxxx new=01GYCW4T9RDJB2J8SVHTKHAS43 toDelete="- matchers: '{__name__=\"container_memory_usage_bytes\", cluster=\"live-k8s\", service=\"kube-test-stack-kubelet\"}'\n\n" toRelabel=
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x1ecbb32]

goroutine 113 [running]:
github.com/thanos-io/thanos/pkg/compactv2.(*lazyPopulatableChunk).Bytes(0xc000815470?)
    /app/pkg/compactv2/chunk_series_set.go:126 +0x32
github.com/prometheus/prometheus/tsdb/chunks.(*Writer).WriteChunks(0xc000a6aff0, {0xc00023e480, 0x1, 0x1})

paprickar commented 1 year ago

Do I understand correctly that, in order to delete a metric:

This should then shrink my S3 costs, correct?

lasermoth commented 1 year ago

@yeya24 You mentioned in https://github.com/thanos-io/thanos/pull/5725#issuecomment-1262465994 that you had an old branch to handle downsampled blocks with bucket rewrite. Is that still valid, or would implementing this need to be revisited?

W-Hamra commented 5 months ago

Sadly, the issue still happens: the same error occurs when dealing with downsampled data. Perhaps, for the time being, the tool could print a simple error message saying so instead of segfaulting?