thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

thanos-query segv panic on labels regexp #7676

Open gberche-orange opened 2 months ago

gberche-orange commented 2 months ago

Thanos, Prometheus and Golang version used:

Object Storage Provider: Scality

What happened:

thanos-query panics with a SIGSEGV, apparently in label regexp matching according to the stack trace.

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

I am not yet able to identify the query that triggers this behavior.

Full logs to relevant components:

Logs

```
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x78 pc=0x941726]

goroutine 10921 [running]:
github.com/prometheus/prometheus/model/labels.(*FastRegexMatcher).MatchString(...)
    /bitnami/blacksmith-sandox/thanos-0.36.1/pkg/mod/github.com/prometheus/prometheus@v0.52.2-0.20240614130246-4c1e71fa0b3d/model/labels/regexp.go:306
github.com/prometheus/prometheus/model/labels.(*Matcher).Matches(0x25b92a0?, {0x7ffc0c37c45c?, 0xc00114e6c8?})
    /bitnami/blacksmith-sandox/thanos-0.36.1/pkg/mod/github.com/prometheus/prometheus@v0.52.2-0.20240614130246-4c1e71fa0b3d/model/labels/matcher.go:115 +0xa6
github.com/thanos-io/thanos/pkg/store.LabelSetsMatch({0xc000f26118, 0x1, 0x407d86?}, {0xc000ae1748?, 0x1, 0x7f4cabf9e3e8?})
    /bitnami/blacksmith-sandox/thanos-0.36.1/src/github.com/thanos-io/thanos/pkg/store/proxy.go:668 +0x16d
github.com/thanos-io/thanos/pkg/store.storeMatchDebugMetadata({0x38f2670?, 0xc000d8cb00?}, {0xc0017cd560?, 0x2, 0x2})
    /bitnami/blacksmith-sandox/thanos-0.36.1/src/github.com/thanos-io/thanos/pkg/store/proxy.go:651 +0x17a
github.com/thanos-io/thanos/pkg/store.storeMatches({0x38db378?, 0xc0017cd860?}, {0x38f2670, 0xc000d8cb00}, 0x1919ea62bc0, 0x1919edd1a40, {0xc000f26150, 0x1, 0x1})
    /bitnami/blacksmith-sandox/thanos-0.36.1/src/github.com/thanos-io/thanos/pkg/store/proxy.go:627 +0x1fa
github.com/thanos-io/thanos/pkg/store.(*ProxyStore).matchingStores(0xc0009f7cc0, {0x38db378, 0xc0017cd860}, 0x2?, 0x0?, {0xc000f26150, 0x1, 0x1})
    /bitnami/blacksmith-sandox/thanos-0.36.1/src/github.com/thanos-io/thanos/pkg/store/proxy.go:595 +0x16e
github.com/thanos-io/thanos/pkg/store.(*ProxyStore).Series(0xc0009f7cc0, 0xc000bc4960, {0x38e7ef0, 0xc0016a4720})
    /bitnami/blacksmith-sandox/thanos-0.36.1/src/github.com/thanos-io/thanos/pkg/store/proxy.go:320 +0x5f8
github.com/thanos-io/thanos/pkg/query.(*querier).selectFn(0xc000bc48c0, {0x38db378, 0xc0017cd740}, 0xc000a52380, {0xc000f260f8?, 0x0?, 0x0?})
    /bitnami/blacksmith-sandox/thanos-0.36.1/src/github.com/thanos-io/thanos/pkg/query/querier.go:346 +0x328
github.com/thanos-io/thanos/pkg/query.(*querier).Select.func1()
    /bitnami/blacksmith-sandox/thanos-0.36.1/src/github.com/thanos-io/thanos/pkg/query/querier.go:293 +0x21d
created by github.com/thanos-io/thanos/pkg/query.(*querier).Select in goroutine 10108
    /bitnami/blacksmith-sandox/thanos-0.36.1/src/github.com/thanos-io/thanos/pkg/query/querier.go:277 +0x5a9
```

Anything else we need to know:

dosubot[bot] commented 2 months ago

Hey there, @gberche-orange! I'm here to help you with any bugs, questions, or contributions you might have. Let's see if we can tackle this issue together!

I found a similar closed issue: receive: Query failure on Seg fault, last updated on January 27, 2023, which was referenced by several other discussions and pull requests.

The suggested solution from the duplicate issue discussion points to a possible culprit related to gogo/protobuf#651. The recommended fix is to move away from the unmaintained gogo/protobuf library [1].

Additionally, the LabelSetsMatch function in proxy.go is involved in the stack trace you provided. Here is its implementation:

```go
// LabelSetsMatch returns false if all label-set do not match the matchers (aka: OR is between all label-sets).
func LabelSetsMatch(matchers []*labels.Matcher, lset ...labels.Labels) bool {
    if len(lset) == 0 {
        return true
    }

    for _, ls := range lset {
        notMatched := false
        for _, m := range matchers {
            if lv := ls.Get(m.Name); ls.Has(m.Name) && !m.Matches(lv) {
                notMatched = true
                break
            }
        }
        if !notMatched {
            return true
        }
    }
    return false
}
```

This function checks if any of the provided label sets match the given matchers [2].
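To illustrate the OR-across-label-sets semantics, here is a simplified, self-contained sketch of the same logic; the `matcher` type and plain maps below are toy stand-ins for the Prometheus `labels.Matcher` and `labels.Labels` types, not the real API:

```go
package main

import "fmt"

// matcher is a toy stand-in for labels.Matcher: it compares a label
// name against one expected value.
type matcher struct {
	name, value string
}

func (m matcher) matches(v string) bool { return v == m.value }

// labelSetsMatch mirrors the shape of store.LabelSetsMatch: a label set
// matches when no matcher contradicts a label it actually carries, and
// the label sets are OR-ed together.
func labelSetsMatch(matchers []matcher, lsets ...map[string]string) bool {
	if len(lsets) == 0 {
		return true
	}
	for _, ls := range lsets {
		notMatched := false
		for _, m := range matchers {
			if lv, ok := ls[m.name]; ok && !m.matches(lv) {
				notMatched = true
				break
			}
		}
		if !notMatched {
			return true
		}
	}
	return false
}

func main() {
	matchers := []matcher{{name: "cluster", value: "eu1"}}
	// The first set contradicts the matcher but the second one matches,
	// so the OR semantics make the overall result true.
	fmt.Println(labelSetsMatch(matchers,
		map[string]string{"cluster": "us1"},
		map[string]string{"cluster": "eu1"}))
}
```

Note that a label set which simply lacks the matched label name also counts as a match, which is why external-label matching is permissive by default.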

To address the panic, you might want to investigate the FastRegexMatcher and Matcher implementations in the Prometheus labels package, as the issue seems to be related to a nil pointer dereference in the FastRegexMatcher.
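For context, this crash pattern in Go arises when a method is invoked on a struct whose inner pointer field is nil: the call itself succeeds, but the first field access through the nil pointer triggers the SIGSEGV. The following is a minimal sketch of that failure mode with invented toy types, not the actual Prometheus code:

```go
package main

import "fmt"

// inner and toyRegexMatcher are hypothetical stand-ins: matchString
// dereferences the inner pointer, so a nil re field crashes at that access.
type inner struct{ prefix string }

type toyRegexMatcher struct{ re *inner }

func (m *toyRegexMatcher) matchString(s string) bool {
	// If m.re is nil, this field access is a nil pointer dereference,
	// producing the same class of runtime error as in the stack trace.
	return len(s) >= len(m.re.prefix) && s[:len(m.re.prefix)] == m.re.prefix
}

func main() {
	m := &toyRegexMatcher{} // re left nil, e.g. after an incomplete initialization
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r)
		}
	}()
	m.matchString("foo")
}
```

If a `Matcher` ever reaches `Matches` with an uninitialized regexp matcher, the same dereference happens inside `FastRegexMatcher.MatchString`, which is consistent with the `addr=0x78` (small offset from nil) in the signal line above.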

To continue talking to Dosu, mention @dosu.

GiedriusS commented 2 months ago

Do you have any components that talk to the Query component directly through the StoreAPI?

MichaHoffmann commented 2 months ago

There was a bug in the distributed query engine where we didn't propagate matchers correctly. I think that should be fixed in this but just to make sure... Is this the distributed engine? If not does it happen for the Prometheus engine too?

gberche-orange commented 2 months ago

Thanks for your responses

Do you have any components that talk to the Query component directly through the StoreAPI?

No, our thanos instance is only queried by Grafana through the query front-end AFAIK

Is this the distributed engine? If not does it happen for the Prometheus engine too?

No, we don't use the distributed engine (--query.engine=distributed is not set on the query pod), since we were having issues with it, as mentioned in https://github.com/thanos-io/thanos/issues/7328.

chris-barbour-as commented 2 weeks ago

Possible that this is related to the problem I'm experiencing in #7844?

Does reverting to Thanos 0.35.1 make the problem go away?

gberche-orange commented 3 days ago

Possible that this is related to the problem I'm experiencing in #7844?

Does reverting to Thanos 0.35.1 make the problem go away?

Thanks @chris-barbour-as for the heads up! Yes, reverting to Thanos 0.35.1 (through bitnami helm chart version https://artifacthub.io/packages/helm/bitnami/thanos/15.7.15) resolved the issue for me.