grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0
4.02k stars 509 forks

Intermittent panic in queriers while issuing a native histogram query #8931

Open rishabhkumar92 opened 1 month ago

rishabhkumar92 commented 1 month ago

Describe the bug

Parser error in queriers while issuing a native histogram query.

To Reproduce

Steps to reproduce the behavior:

  1. Start Mimir 2.12.0
  2. Issue a native histogram query.

Example query: (avg_over_time(histogram_quantile(0.95, sum(rate(services_platform_service_response{mode="xyz",tier=~"perf-01",success="true"}[1m])))[2h11m2s:]) - avg_over_time(histogram_quantile(0.95, sum(rate(services_platform_service_response{mode="xyz",tier=~"perf-02",success="true"}[1m])))[2h11m2s:])) / avg_over_time(histogram_quantile(0.95, sum(rate(services_platform_service_response{mode="xyz",tier=~"perf-02",success="true"}[1m])))[2h11m2s:]) * 100

  3. Queriers fail intermittently with the parser panic error:

{"caller":"engine.go:1045","err":"runtime error: index out of range [106] with length 106","expr":"(avg_over_time(histogram_quantile(0.95, sum(rate(services_platform_service_response{mode=\"xyz\,tier=~\"perf-01\",success=\"true\"}[1m])))[2h11m2s:]) - avg_over_time(histogram_quantile(0.95, sum(rate(services_platform_service_response{mode=\"xyz\,tier=~\"perf-02\",success=\"true\"}[1m])))[2h11m2s:])) / avg_over_time(histogram_quantile(0.95, sum(rate(services_platform_service_response{mode=\"xyz\,tier=~\"perf-02\",success=\"true\"}[1m])))[2h11m2s:]) * 100","level":"error","msg":"runtime panic in parser","stacktrace":"goroutine 1481902038 [running]:\ngithub.com/prometheus/prometheus/promql.(*evaluator).recover(0x402739c3c0, {0x2bb9280, 0x4026f77240}, 0x402e1a3ed8, 0x402e1a3ef0)\n\t/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/promql/engine.go:1043 +0x21c\npanic({0x219bfe0?, 0x402f935ad0?})\n\t/usr/local/go/src/runtime/panic.go:914 +0x218\ngithub.com/prometheus/prometheus/model/histogram.addBuckets(0x288bc750?, 0x255bba08cf8c979d, 0x1, {0x404ec3eaa0?, 0x0?, 0x0?}, {0x40288f4700?, 0x160425c?, 0x4027f38f30?}, {0x408ebfa500, ...}, ...)\n\t/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/model/histogram/float_histogram.go:1084 +0xa38\ngithub.com/prometheus/prometheus/model/histogram.(*FloatHistogram).Sub(0x40288bc750, 0x4027f38bd0)\n\t/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/model/histogram/float_histogram.go:356 +0x390\ngithub.com/prometheus/prometheus/promql.histogramRate({0x4023365700, 0x1f2ec?, 0x236f2f0?}, 0x1, {0x401dfdbb0a, 0x33}, {0xffffac64ba90?, 0x1ae103f?})\n\t/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/promql/functions.go:214 +0x558\ngithub.com/prometheus/prometheus/promql.extrapolatedRate({0x40274262d0?, 0x15f92c0?, 0x2b89ef0?}, {0x401db236b0, 0x10?, 0x4023365730?}, 0x4027d7c3f0, 0x1, 0x1)\n\t/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/promql/functions.go:98 
+0x1b4\ngithub.com/prometheus/prometheus/promql.funcRate({0x40274262d0?, 0x40210d17c0?, 0x4028413400?}, {0x401db236b0?, 0x0?, 0x0?}, 0x0?)\n\t/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/promql/functions.go:240 +0x2c\ngithub.com/prometheus/prometheus/promql.(*evaluator).eval(0x402739c420, {0x2bb92c0?, 0x402706f7d0?})\n\t/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/promql/engine.go:1513 +0x34fc\ngithub.com/prometheus/prometheus/promql.(*evaluator).rangeEval(0x402739c420, 0x402e19b0b0, 0x402e19b080, {0x402e19b010, 0x2, 0x0?})\n\t/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/promql/engine.go:1122 +0x164\ngithub.com/prometheus/prometheus/promql.(*evaluator).eval(0x402739c420, {0x2bb9240?, 0x402739c060?})\n\t/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/promql/engine.go:1359 +0x40e4\ngithub.com/prometheus/prometheus/promql.(*evaluator).rangeEval(0x402739c420, 0x0, 0x402e19c790, {0x4024dc2720, 0x2, 0x0?})\n\t/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/promql/engine.go:1122 +0x164\ngithub.com/prometheus/prometheus/promql.(*evaluator).eval(0x402739c420, {0x2bb92c0?, 0x402706f800?})\n\t/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/promql/engine.go:1415 +0x1f0c\ngithub.com/prometheus/prometheus/promql.(*evaluator).eval(0x402739c3c0, {0x2bb9300?, 0x401d19b5e0?})\n\t/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/promql/engine.go:1779 +0xf34\ngithub.com/prometheus/prometheus/promql.(*evaluator).evalSubquery(0x402739c3c0, 0x401d19b5e0)\n\t/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/promql/engine.go:1298 +0x90\ngithub.com/prometheus/prometheus/promql.(*evaluator).eval(0x402739c3c0, {0x2bb92c0?, 0x402706f830?})\n\t/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/promql/engine.go:1402 +0x3bf8\ngithub.com/prometheus/prometheus/promql.(*evaluator).rangeEval(0x402739c3c0, 0x402e19fbb0, 0x402e19fc38, {0x402e19fd20, 0x2, 
0x0?})\n\t/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/promql/engine.go:1122 +0x164\ngithub.com/prometheus/prometheus/promql.(*evaluator).eval(0x402739c3c0, {0x2bb9280?, 0x4026f77140?})\n\t/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/promql/engine.go:1660 +0x748\ngithub.com/prometheus/prometheus/promql.(*evaluator).eval(0x402739c3c0, {0x2bb9340?, 0x4024dc2860?})\n\t/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/promql/engine.go:1614 +0x360\ngithub.com/prometheus/prometheus/promql.(*evaluator).rangeEval(0x402739c3c0, 0x402e1a2200, 0x402e1a2288, {0x402e1a2370, 0x2, 0x0?})\n\t/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/promql/engine.go:1122 +0x164\ngithub.com/prometheus/prometheus/promql.(*evaluator).eval(0x402739c3c0, {0x2bb9280?, 0x4026f771c0?})\n\t/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/promql/engine.go:1660 +0x748\ngithub.com/prometheus/prometheus/promql.(*evaluator).rangeEval(0x402739c3c0, 0x0, 0x402e1a3a10, {0x402e1a3af0, 0x2, 0x0?})\n\t/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/promql/engine.go:1122 +0x164\ngithub.com/prometheus/prometheus/promql.(*evaluator).eval(0x402739c3c0, {0x2bb9280?, 0x4026f77240?})\n\t/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/promql/engine.go:1666 +0x9a8\ngithub.com/prometheus/prometheus/promql.(*evaluator).Eval(0x1912b6960a0?, {0x2bb9280?, 0x4026f77240?})\n\t/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/promql/engine.go:1060 +0x80\ngithub.com/prometheus/prometheus/promql.(*Engine).execEvalStmt(0x4002abb900, {0x2bb5de8, 0x402706fec0}, 0x4003cbe310, 0x401d19b400)\n\t/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/promql/engine.go:713 +0x3f4\ngithub.com/prometheus/prometheus/promql.(*Engine).exec(0x4002abb900, {0x2bb5de8, 0x402706fec0}, 0x4003cbe310)\n\t/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/promql/engine.go:651 +0x288\ngithub.com/prometheus/prometheus/promql.(*query).Exec(0x4003cbe310, {0x2bb5de8, 
0x402706fbc0})\n\t/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/promql/engine.go:239 +0x154\ngithub.com/prometheus/prometheus/web/api/v1.(*API).query(0x400079b8c0, 0x400e7cf100)\n\t/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/web/api/v1/api.go:463 +0x65c\ngithub.com/prometheus/prometheus/web/api/v1.(*API).Register.(*API).Register.func2.func3(0x2bb1da0?)\n\t/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/web/api/v1/api.go:356 +0xe0\ngithub.com/prometheus/prometheus/web/api/v1.(*API).Register.func1.1({0x2bb1da0, 0x4024dc2600}, 0x401db23600?)\n\t/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/web/api/v1/api.go:331 +0x64\nnet/http.HandlerFunc.ServeHTTP(0x2bb1b60?, {0x2bb1da0?, 0x4024dc2600?}, 0x1b04?)\n\t/usr/local/go/src/net/http/server.go:2136 +0x38\ngithub.com/prometheus/prometheus/util/httputil.CompressionHandler.ServeHTTP({{0x2b95640?, 0x4002b34480?}}, {0x2bb1b60?, 0x4026f77000?}, 0x401fef0a98?)\n\t/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/util/httputil/compression.go:91 +0x6c\ngithub.com/prometheus/common/route.(*Router).handle.func1({0x2bb1b60, 0x4026f77000}, 0x400e7cf000, {0x0, 0x0, 0x40078a7130?})\n\t/__w/mimir/mimir/vendor/github.com/prometheus/common/route/route.go:83 +0x1b0\ngithub.com/julienschmidt/httprouter.(*Router).ServeHTTP(0x4002aeb5c0, {0x2bb1b60, 0x4026f77000}, 0x400e7cf000)\n\t/__w/mimir/mimir/vendor/github.com/julienschmidt/httprouter/router.go:387 +0x6f8\ngithub.com/prometheus/common/route.(*Router).ServeHTTP(0x4051b3ed78?, {0x2bb1b60?, 0x4026f77000?}, 0x3a3f80?)\n\t/__w/mimir/mimir/vendor/github.com/prometheus/common/route/route.go:126 +0x28\ngithub.com/grafana/mimir/pkg/api.NewQuerierHandler.(*RequestsMiddleware).Wrap.func11({0x2bb1b60, 0x4026f77000}, 0x0?)\n\t/__w/mimir/mimir/pkg/usagestats/middleware.go:25 +0x68\nnet/http.HandlerFunc.ServeHTTP(0x0?, {0x2bb1b60?, 0x4026f77000?}, 0x18?)\n\t/usr/local/go/src/net/http/server.go:2136 
+0x38\ngithub.com/grafana/mimir/pkg/api.NewQuerierHandler.ConsistencyMiddleware.func9.1({0x2bb1b60, 0x4026f77000}, 0x400e7cf000)\n\t/__w/mimir/mimir/pkg/querier/api/consistency.go:58 +0xa4\nnet/http.HandlerFunc.ServeHTTP(0x18?, {0x2bb1b60?, 0x4026f77000?}, 0x8a8870?)\n\t/usr/local/go/src/net/http/server.go:2136 +0x38\ngithub.com/grafana/dskit/middleware.Instrument.Wrap-fm.Instrument.Wrap.func1.2({0x2bb1b60?, 0x4026f77000?})\n\t/__w/mimir/mimir/vendor/github.com/grafana/dskit/middleware/instrument.go:74 +0x40\ngithub.com/felixge/httpsnoop.(*Metrics).CaptureMetrics(0x401231aa98, {0x2bb1b60, 0x4026f76fc0}, 0x401fef0fc8)\n\t/__w/mimir/mimir/vendor/github.com/felixge/httpsnoop/capture_metrics.go:84 +0x184\ngithub.com/felixge/httpsnoop.CaptureMetricsFn({0x2bb1b60, 0x4026f76fc0}, 0x4000da4a20?)\n\t/__w/mimir/mimir/vendor/github.com/felixge/httpsnoop/capture_metrics.go:39 +0x50\ngithub.com/grafana/dskit/middleware.Instrument.Wrap-fm.Instrument.Wrap.func1({0x2bb1b60, 0x4026f76fc0}, 0x400e7cf000)\n\t/__w/mimir/mimir/vendor/github.com/grafana/dskit/middleware/instrument.go:73 +0x218\nnet/http.HandlerFunc.ServeHTTP(0x400e7cef00?, {0x2bb1b60?, 0x4026f76fc0?}, 0x1f0e8?)\n\t/usr/local/go/src/net/http/server.go:2136 +0x38\ngithub.com/gorilla/mux.(*Router).ServeHTTP(0x4002a8c540, {0x2bb1b60, 0x4026f76fc0}, 0x400e7cee00)\n\t/__w/mimir/mimir/vendor/github.com/gorilla/mux/mux.go:212 +0x194\ngithub.com/grafana/mimir/pkg/api.NewQuerierHandler.WallTimeMiddleware.Wrap.func26({0x2bb1b60, 0x4026f76fc0}, 0xffff65068080?)\n\t/__w/mimir/mimir/pkg/querier/stats/time_middleware.go:30 +0x74\nnet/http.HandlerFunc.ServeHTTP(0x0?, {0x2bb1b60?, 0x4026f76fc0?}, 0x3a577c?)\n\t/usr/local/go/src/net/http/server.go:2136 +0x38\ngithub.com/grafana/mimir/pkg/api.(*API).newRoute.ConsistencyMiddleware.func1.1({0x2bb1b60, 0x4026f76fc0}, 0x400e7cee00)\n\t/__w/mimir/mimir/pkg/querier/api/consistency.go:58 +0xa4\nnet/http.HandlerFunc.ServeHTTP(0x400e7ced00?, {0x2bb1b60?, 0x4026f76fc0?}, 
0x89172c?)\n\t/usr/local/go/src/net/http/server.go:2136 +0x38\ngithub.com/grafana/mimir/pkg/api.New.newTenantValidationMiddleware.func1.1({0x2bb1b60, 0x4026f76fc0}, 0x4051b3f3a8?)\n\t/__w/mimir/mimir/pkg/api/tenant.go:43 +0x134\nnet/http.HandlerFunc.ServeHTTP(0x400e7cec00?, {0x2bb1b60?, 0x4026f76fc0?}, 0x402706f3b0?)\n\t/usr/local/go/src/net/http/server.go:2136 +0x38\ngithub.com/grafana/dskit/middleware.glob..func2.1({0x2bb1b60, 0x4026f76fc0}, 0x400e7cec00)\n\t/__w/mimir/mimir/vendor/github.com/grafana/dskit/middleware/http_auth.go:21 +0x100\nnet/http.HandlerFunc.ServeHTTP(0x0?, {0x2bb1b60?, 0x4026f76fc0?}, 0x4?)\n\t/usr/local/go/src/net/http/server.go:2136 +0x38\ngithub.com/grafana/mimir/pkg/util/gziphandler.GzipHandlerWithOpts.func1.1({0x2bb1b60, 0x4026f76fc0}, 0x4002acfe00?)\n\t/__w/mimir/mimir/pkg/util/gziphandler/gzip.go:366 +0x220\nnet/http.HandlerFunc.ServeHTTP(0x400e7ceb00?, {0x2bb1b60?, 0x4026f76fc0?}, 0x16bb04?)\n\t/usr/local/go/src/net/http/server.go:2136 +0x38\ngithub.com/gorilla/mux.(*Router).ServeHTTP(0x4000167680, {0x2bb1b60, 0x4026f76fc0}, 0x400e7cea00)\n\t/__w/mimir/mimir/vendor/github.com/gorilla/mux/mux.go:212 +0x194\ngithub.com/grafana/dskit/middleware.(*Instrument).Wrap.Instrument.Wrap.func1.2({0x2bb1b60?, 0x4026f76fc0?})\n\t/__w/mimir/mimir/vendor/github.com/grafana/dskit/middleware/instrument.go:74 +0x40\ngithub.com/felixge/httpsnoop.(*Metrics).CaptureMetrics(0x401231aa50, {0xffff6500f3d0, 0x402739c000}, 0x4051b3f898)\n\t/__w/mimir/mimir/vendor/github.com/felixge/httpsnoop/capture_metrics.go:84 +0x184\ngithub.com/felixge/httpsnoop.CaptureMetricsFn({0xffff6500f3d0, 0x402739c000}, 0x4000da4170?)\n\t/__w/mimir/mimir/vendor/github.com/felixge/httpsnoop/capture_metrics.go:39 +0x50\ngithub.com/grafana/dskit/middleware.(*Instrument).Wrap.Instrument.Wrap.func1({0xffff6500f3d0, 0x402739c000}, 0x400e7cea00)\n\t/__w/mimir/mimir/vendor/github.com/grafana/dskit/middleware/instrument.go:73 +0x218\nnet/http.HandlerFunc.ServeHTTP(0x2bad680?, 
{0xffff6500f3d0?, 0x402739c000?}, 0x402706f260?)\n\t/usr/local/go/src/net/http/server.go:2136 +0x38\ngithub.com/grafana/dskit/middleware.(*Log).Wrap.Log.Wrap.func1({0x2bad680, 0x402706f200}, 0x400e7cea00)\n\t/__w/mimir/mimir/vendor/github.com/grafana/dskit/middleware/logging.go:88 +0x200\nnet/http.HandlerFunc.ServeHTTP(0xf8?, {0x2bad680?, 0x402706f200?}, 0x893d4c?)\n\t/usr/local/go/src/net/http/server.go:2136 +0x38\ngithub.com/opentracing-contrib/go-stdlib/nethttp.MiddlewareFunc.func5({0x2bad5c0?, 0x4026f76e80}, 0x400e7ce900)\n\t/__w/mimir/mimir/vendor/github.com/opentracing-contrib/go-stdlib/nethttp/server.go:159 +0x3e8\nnet/http.HandlerFunc.ServeHTTP(0x2bb5de8?, {0x2bad5c0?, 0x4026f76e80?}, 0x1f0e8?)\n\t/usr/local/go/src/net/http/server.go:2136 +0x38\ngithub.com/grafana/dskit/httpgrpc/server.Server.Handle({{0x2b95640?, 0x40007f4bc0?}, 0x0?}, {0x2bb5de8?, 0x402706ef90?}, 0x402df80?)\n\t/__w/mimir/mimir/vendor/github.com/grafana/dskit/httpgrpc/server/server.go:68 +0xdc\ngithub.com/grafana/mimir/pkg/querier/worker.(*frontendProcessor).runRequest(0x4002ae8ec0, {0x2bb5e20?, 0x40166ec320?}, 0x822a4?, 0x1, 0x802f, 0x4024dc2540)\n\t/__w/mimir/mimir/pkg/querier/worker/frontend_processor.go:160 +0x160\ncreated by github.com/grafana/mimir/pkg/querier/worker.(*frontendProcessor).process in goroutine 3224070\n\t/__w/mimir/mimir/pkg/querier/worker/frontend_processor.go:122 +0x288\n","ts":"2024-08-07T20:46:14.976037471Z"}
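
The log entry above is a single JSON object whose stacktrace field carries the whole Go trace with newlines escaped as \n, which makes it hard to read. A minimal sketch for listing the frames, assuming the line is valid JSON: the logEntry type, the frames helper, and the shortened sample line are all hypothetical and only mirror the fields visible above. Note that the expr field as pasted above contains \"xyz\, sequences, which are not valid JSON escapes, so the line may need repairing before it parses.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// logEntry models only the fields visible in the log line above (hypothetical type).
type logEntry struct {
	Caller     string `json:"caller"`
	Err        string `json:"err"`
	Msg        string `json:"msg"`
	Stacktrace string `json:"stacktrace"`
}

// frames decodes one JSON log line and returns the function-call lines of its
// stack trace. It skips the "goroutine N [running]:" header and the
// tab-indented file:line rows that follow each call.
func frames(line string) ([]string, error) {
	var e logEntry
	if err := json.Unmarshal([]byte(line), &e); err != nil {
		return nil, err
	}
	var out []string
	for _, l := range strings.Split(e.Stacktrace, "\n") {
		if l == "" || strings.HasPrefix(l, "\t") || strings.HasPrefix(l, "goroutine ") {
			continue
		}
		out = append(out, l)
	}
	return out, nil
}

// sample is a shortened, made-up log line with the same shape as the real one.
const sample = `{"caller":"engine.go:1045","err":"runtime error: index out of range [106] with length 106","msg":"runtime panic in parser","stacktrace":"goroutine 1 [running]:\nexample.addBuckets(...)\n\t/src/float_histogram.go:1084 +0xa38\nexample.histogramRate(...)\n\t/src/functions.go:214 +0x558\n"}`

func main() {
	fs, err := frames(sample)
	if err != nil {
		fmt.Println("cannot decode log line:", err)
		return
	}
	for i, f := range fs {
		fmt.Printf("%2d %s\n", i, f)
	}
}
```

Run against the real entry, the topmost frames after recover and panic are addBuckets, FloatHistogram.Sub, and histogramRate.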

Expected behavior

Environment

Additional Context

I was not able to find a related parser panic in recent reports or fixes.

charleskorn commented 1 month ago

The "runtime panic in parser" message is a bit misleading here: this is a panic during query evaluation, not parsing.
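
To illustrate how an evaluation-time panic can end up logged under a parser label, here is a minimal sketch. It is not the Prometheus code: subBuckets and evalWithRecover are hypothetical stand-ins, and the slice lengths are chosen only to reproduce the index in the report. The sketch assumes a deferred recover around evaluation that wraps any runtime error in a fixed message.

```go
package main

import (
	"fmt"
	"runtime"
)

// subBuckets is a hypothetical stand-in for evaluation-time bucket arithmetic.
// It assumes a has at least as many buckets as b; when that assumption is
// violated it indexes past the end of a and panics.
func subBuckets(a, b []float64) {
	for i := range b {
		a[i] -= b[i] // panics once i == len(a)
	}
}

// evalWithRecover mimics an evaluator-level recover that turns a panic into an
// error. The fixed "parser" label is applied regardless of where the panic
// actually came from, which is what makes the message misleading.
func evalWithRecover(eval func()) (err error) {
	defer func() {
		r := recover()
		if r == nil {
			return
		}
		if re, ok := r.(runtime.Error); ok {
			err = fmt.Errorf("runtime panic in parser: %w", re)
			return
		}
		err = fmt.Errorf("unexpected panic: %v", r)
	}()
	eval()
	return nil
}

func main() {
	a := make([]float64, 106)
	b := make([]float64, 107) // one bucket more than a
	err := evalWithRecover(func() { subBuckets(a, b) })
	fmt.Println(err)
}
```

Nothing in this sketch parses a query, yet the resulting error still carries the parser label along with the same index-out-of-range text seen in the log.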

I suspect this is caused by the same underlying issue as https://github.com/grafana/mimir/issues/8889.

@krajorama could you take a look at this?

charleskorn commented 1 month ago

https://github.com/prometheus/prometheus/pull/14621 will make the error message less misleading.

krajorama commented 1 month ago

I agree with @charleskorn that it might be the same issue. Mimir 2.12 includes Prometheus optimizations that had a couple of issues :( We've fixed three in the last two weeks. Next week's weekly Mimir release, r303, should contain all fixes so far. Would you be able to test it?

rishabhkumar92 commented 1 month ago

Yeah, I can test that next week. Also, @krajorama, is there any possibility of backporting this change to 2.12 (preferred) or 2.13?

rishabhkumar92 commented 1 month ago

Bumping this up. @krajorama, please let me know whether backporting will be possible, and when r303 will be released.

krajorama commented 1 month ago

Hi,

We build a new weekly release every Monday, and r303 is already out: https://hub.docker.com/layers/grafana/mimir/r303-7a16e72/images/sha256-9e38992f6e83c7c68153e9095364f265168dfe7bb5c16ee6de91368b6646f4a4?context=repo

We usually do not backport fixes, only security patches. I'll ask the PM.

rishabhkumar92 commented 1 month ago

Sure. Bumping this again: would it be possible to backport this to 2.13? That would make migration easier and avoid having to deal with unreleased changes.

krajorama commented 1 week ago

Hi, my apologies, but ultimately this was not prioritized, and we didn't have the bandwidth to test and release a patch version due to other projects and the upcoming PromCon next week.