thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

index out of range [-1] #2645

Closed. maulikjs closed this issue 4 years ago.

maulikjs commented 4 years ago

Thanos Version used: v0.12.2

Object Storage Provider: Ceph S3

What happened: When I run queries against historic dates, for certain dates we get a response unexpected error: runtime error: index out of range [-1]

{'field': 'subscribed_cores', 'date': datetime.date(2019, 12, 27), 'runtime': 1.5100421905517578, 'text': '{"status":"error","errorType":"execution","error":"unexpected error: runtime error: index out of range [-1]"}\n', 'value': None, 'status': 'error', 'error': 'unexpected error: runtime error: index out of range [-1]'}

Other than that we also sometimes get an empty response if not the error but never the actual value.

{'field': 'subscribed_cores', 'date': datetime.date(2019, 12, 27), 'runtime': 1.227376699447632, 'text': '{"status":"success","data":{"resultType":"vector","result":[]}}\n', 'value': None, 'status': 'success', 'error': None}
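For context, responses like the two above are what a small script gets back when it evaluates an instant query at midnight of each historic date via the querier's Prometheus-compatible /api/v1/query endpoint. A minimal sketch of that kind of loop (the querier address and the query expression are assumptions, not the exact script used here):

import datetime

import requests

THANOS_QUERY = "http://thanos-query:9090"  # assumption: querier HTTP address
QUERY = "sum(subscribed_cores)"            # assumption: placeholder expression

def query_at(date: datetime.date) -> dict:
    # Evaluate the instant query at midnight UTC of the given historic date.
    ts = datetime.datetime(date.year, date.month, date.day,
                           tzinfo=datetime.timezone.utc).timestamp()
    resp = requests.get(f"{THANOS_QUERY}/api/v1/query",
                        params={"query": QUERY, "time": ts})
    body = resp.json()
    return {
        "field": "subscribed_cores",
        "date": date,
        "text": resp.text,
        "status": body.get("status"),
        "error": body.get("error"),
    }

print(query_at(datetime.date(2019, 12, 27)))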

What you expected to happen: Get the value for that date.

CC: @andmatt

GiedriusS commented 4 years ago

Is there some kind of stack trace printed with the error, perhaps in the Thanos Query logs or somewhere else? Where is this output printed in general? These don't look like our log format. Do you have a way of reproducing it? :thinking:

It's probably impossible to tell where the (possible) error is without a stack trace.

maulikjs commented 4 years ago

Hey @GiedriusS, I was running queries against the API when I got these responses. I also went to the UI manually and ran the query for that date, and got the same output in an error box. Let me see if I can replicate it and get a screenshot and the pod logs.

maulikjs commented 4 years ago

[screenshot: the same index out of range [-1] error shown as an error box in the Thanos Query UI]

maulikjs commented 4 years ago
level=info ts=2020-05-22T19:13:37.505197904Z caller=frontend.go:166 msg="slow query" org_id=fake url="http://telemeter-lts.datahub.redhat.com/api/v1/query?query=round%28avg_over_time%28sum%28sum+by+%28_id%29+%28max+without+%28prometheus%2Creceive%2Cinstance%29+%28+cluster%3Acapacity_cpu_cores%3Asum%29%29+%2B+on+%28_id%29+group_left%28_blah%29+%280+%2A+%28id_version_ebs_account_internal%3Acluster_subscribed%7B%7D+%3D%3D+1%29%29%29%5B1d%3A30m%5D%29%29&time=1587945600.0 " time_taken=6.143809981s
level=debug ts=2020-05-22T19:13:37.505330807Z caller=logging.go:44 traceID=18f9107c5c008580 msg="GET /api/v1/query?query=round%28avg_over_time%28sum%28sum+by+%28_id%29+%28max+without+%28prometheus%2Creceive%2Cinstance%29+%28+cluster%3Acapacity_cpu_cores%3Asum%29%29+%2B+on+%28_id%29+group_left%28_blah%29+%280+%2A+%28id_version_ebs_account_internal%3Acluster_subscribed%7B%7D+%3D%3D+1%29%29%29%5B1d%3A30m%5D%29%29&time=1587945600.0 (422) 6.143992196s"
level=debug ts=2020-05-22T19:13:39.081287152Z caller=logging.go:44 traceID=1ff8e57bf643c559 msg="GET /api/v1/query?query=round%28avg_over_time%28sum%28sum+by+%28_id%29+%28max+without+%28prometheus%2Creceive%2Cinstance%29+%28+cluster%3Acapacity_cpu_cores%3Asum%29%29+%2B+on+%28_id%29+group_left%28_blah%29+%280+%2A+%28id_version_ebs_account_internal%3Acluster_subscribed%7B%7D+%3D%3D+1%29%29%29%5B1d%3A30m%5D%29%29&time=1588118400.0 (422) 4.621643791s"
level=debug ts=2020-05-22T19:13:44.465299862Z caller=logging.go:44 traceID=43ff1398f57ec32c msg="GET /api/v1/query?query=round%28avg_over_time%28sum%28sum+by+%28_id%29+%28max+without+%28prometheus%2Creceive%2Cinstance%29+%28+cluster%3Acapacity_cpu_cores%3Asum%29%29+%2B+on+%28_id%29+group_left%28_blah%29+%280+%2A+%28id_version_ebs_account_internal%3Acluster_subscribed%7B%7D+%3D%3D+1%29%29%29%5B1d%3A30m%5D%29%29&time=1588636800.0 (422) 2.583028696s"

These are the logs from the cortex-query-frontend for those queries.
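For readability, the URL-decoded query from these log lines is the subquery below; the time= parameters are Unix timestamps for the historic dates being queried (midnight UTC):

round(avg_over_time(sum(sum by (_id) (max without (prometheus,receive,instance) ( cluster:capacity_cpu_cores:sum)) + on (_id) group_left(_blah) (0 * (id_version_ebs_account_internal:cluster_subscribed{} == 1)))[1d:30m]))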

maulikjs commented 4 years ago

And this is the log from the associated thanos-query pod


level=error ts=2020-05-22T19:48:11.719475257Z caller=engine.go:806 msg="runtime panic in parser" err="runtime error: index out of range [-1]" stacktrace="goroutine 2719879 [running]:\ngithub.com/prometheus/prometheus/promql.(*evaluator).recover(0xc0098f2da0, 0xc0098f2b98)\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200110114423-1e64d757f711/promql/engine.go:804 +0xd8\npanic(0x1d695e0, 0xc0096dc360)\n\t/usr/local/go/src/runtime/panic.go:679 +0x1b2\ngithub.com/thanos-io/thanos/pkg/query.(*dedupSeriesSet).peekLset(0xc00584e060, 0x22ce0e0, 0xc0044dc1e0, 0x33b8480)\n\t/go/src/github.com/thanos-io/thanos/pkg/query/iter.go:372 +0x171\ngithub.com/thanos-io/thanos/pkg/query.(*dedupSeriesSet).next(0xc00584e060, 0xc009926400)\n\t/go/src/github.com/thanos-io/thanos/pkg/query/iter.go:388 +0x8a\ngithub.com/thanos-io/thanos/pkg/query.(*dedupSeriesSet).Next(0xc00584e060, 0x22d48e0)\n\t/go/src/github.com/thanos-io/thanos/pkg/query/iter.go:359 +0xaf\ngithub.com/prometheus/prometheus/promql.expandSeriesSet(0x2306760, 0xc0098e8d20, 0x22f1f20, 0xc00584e060, 0x16ef655, 0xc0098e8d20, 0x0, 0x0, 0x0)\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200110114423-1e64d757f711/promql/engine.go:758 +0x5b\ngithub.com/prometheus/prometheus/promql.checkForSeriesSetExpansion(0x2306760, 0xc0098e8d20, 0x22f1a20, 0xc0023056e0)\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200110114423-1e64d757f711/promql/engine.go:747 +0x159\ngithub.com/prometheus/prometheus/promql.(*evaluator).eval(0xc0098f1588, 0x22f1a20, 0xc0023056e0, 0xc0098e8ed0, 0x0)\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200110114423-1e64d757f711/promql/engine.go:1262 +0x1d7b\ngithub.com/prometheus/prometheus/promql.(*evaluator).rangeEval(0xc0098f1588, 0xc0098ee420, 0xc0098ee538, 0x2, 0x2, 0x0, 0x0, 0x0)\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200110114423-1e64d757f711/promql/engine.go:888 +0x1408\ngithub.com/prometheus/prometheus/promql.(*evaluator).eval(0xc0098f1588, 0x22f17e0, 0xc000d68a50, 0xc0098e8e70, 0x0)\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200110114423-1e64d757f711/promql/engine.go:1029 +0x3396\ngithub.com/prometheus/prometheus/promql.(*evaluator).rangeEval(0xc0098f1588, 0xc0098ef170, 0xc0098ef288, 0x2, 0x2, 0x0, 0x0, 0x0)\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200110114423-1e64d757f711/promql/engine.go:888 +0x1408\ngithub.com/prometheus/prometheus/promql.(*evaluator).eval(0xc0098f1588, 0x22f17e0, 0xc000d68a00, 0xc0098e8e10, 0x0)\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200110114423-1e64d757f711/promql/engine.go:1029 +0x3396\ngithub.com/prometheus/prometheus/promql.(*evaluator).rangeEval(0xc0098f1588, 0xc0098efe60, 0xc0098eff18, 0x2, 0x2, 0x0, 0x0, 0x0)\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200110114423-1e64d757f711/promql/engine.go:888 +0x1408\ngithub.com/prometheus/prometheus/promql.(*evaluator).eval(0xc0098f1588, 0x22f1820, 0xc000e64200, 0xc0098e8db0, 0x0)\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200110114423-1e64d757f711/promql/engine.go:1240 +0x26d8\ngithub.com/prometheus/prometheus/promql.(*evaluator).rangeEval(0xc0098f1588, 0xc0098f0c10, 0xc0098f0d28, 0x2, 0x2, 0x0, 0x445ebe, 0xc000ac8480)\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200110114423-1e64d757f711/promql/engine.go:888 +0x1408\ngithub.com/prometheus/prometheus/promql.(*evaluator).eval(0xc0002f3588, 0x22f17e0, 0xc000d68aa0, 0x15, 
0x0)\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200110114423-1e64d757f711/promql/engine.go:1029 +0x3396\ngithub.com/prometheus/prometheus/promql.(*evaluator).eval(0xc0098f2da0, 0x22f19a0, 0xc000acd440, 0xc000ac8580, 0xc000ac8501)\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200110114423-1e64d757f711/promql/engine.go:1323 +0x1cf8\ngithub.com/prometheus/prometheus/promql.(*evaluator).evalSubquery(0xc0002f4da0, 0xc000acd440, 0x1ee2a73)\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200110114423-1e64d757f711/promql/engine.go:1001 +0x59\ngithub.com/prometheus/prometheus/promql.(*evaluator).eval(0xc0098f2da0, 0x22f1860, 0xc001d6a820, 0xc0098ea6a0, 0xc004da81c0)\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200110114423-1e64d757f711/promql/engine.go:1064 +0x1969\ngithub.com/prometheus/prometheus/promql.(*evaluator).rangeEval(0xc0098f2da0, 0xc0098f2730, 0xc000b24560, 0x1, 0x1, 0x0, 0xc000d68230, 0xc0002f45b8)\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200110114423-1e64d757f711/promql/engine.go:888 +0x1408\ngithub.com/prometheus/prometheus/promql.(*evaluator).eval(0xc0002f4da0, 0x22f1860, 0xc001d6a860, 0xf, 0xc000d68640)\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200110114423-1e64d757f711/promql/engine.go:1070 +0x18cc\ngithub.com/prometheus/prometheus/promql.(*evaluator).Eval(0xc0002f4da0, 0x22f1860, 0xc001d6a860, 0x0, 0x0, 0x0, 0x0)\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200110114423-1e64d757f711/promql/engine.go:815 +0x99\ngithub.com/prometheus/prometheus/promql.(*Engine).execEvalStmt(0xc0003625a0, 0x2306760, 0xc000acd5c0, 0xc0023057a0, 0xc000d68af0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200110114423-1e64d757f711/promql/engine.go:531 +0x43d\ngithub.com/prometheus/prometheus/promql.(*Engine).exec(0xc0003625a0, 0x2306760, 0xc000acd5c0, 0xc0023057a0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200110114423-1e64d757f711/promql/engine.go:485 +0x669\ngithub.com/prometheus/prometheus/promql.(*query).Exec(0xc0023057a0, 0x2306760, 0xc000acd350, 0x0)\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200110114423-1e64d757f711/promql/engine.go:199 +0xb4\ngithub.com/thanos-io/thanos/pkg/query/api.(*API).query(0xc0003632c0, 0xc00126e800, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)\n\t/go/src/github.com/thanos-io/thanos/pkg/query/api/v1.go:304 +0x4f1\ngithub.com/thanos-io/thanos/pkg/query/api.(*API).Register.func1.1(0x22fe820, 0xc0003d72d0, 0xc00126e800)\n\t/go/src/github.com/thanos-io/thanos/pkg/query/api/v1.go:145 +0x55\nnet/http.HandlerFunc.ServeHTTP(0xc000b858a0, 0x22fe820, 0xc0003d72d0, 0xc00126e800)\n\t/usr/local/go/src/net/http/server.go:2007 +0x44\ngithub.com/NYTimes/gziphandler.GzipHandlerWithOpts.func1.1(0x7fd6b3f406c0, 0xc000d68960, 0xc00126e800)\n\t/go/pkg/mod/github.com/!n!y!times/gziphandler@v1.1.1/gzip.go:336 +0x23f\nnet/http.HandlerFunc.ServeHTTP(0xc000a78270, 0x7fd6b3f406c0, 0xc000d68960, 0xc00126e800)\n\t/usr/local/go/src/net/http/server.go:2007 +0x44\ngithub.com/thanos-io/thanos/pkg/tracing.HTTPMiddleware.func1(0x7fd6b3f406c0, 0xc000d68960, 0xc00126e700)\n\t/go/src/github.com/thanos-io/thanos/pkg/tracing/http.go:46 +0x549\nnet/http.HandlerFunc.ServeHTTP(0xc0003f7d10, 0x7fd6b3f406c0, 0xc000d68960, 0xc00126e700)\n\t/usr/local/go/src/net/http/server.go:2007 +0x44\ngithub.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerResponseSize.func1(0x7fd6b3f406c0, 0xc000d68910, 
0xc00126e700)\n\t/go/pkg/mod/github.com/prometheus/client_golang@v1.5.0/prometheus/promhttp/instrument_server.go:196 +0xe9\nnet/http.HandlerFunc.ServeHTTP(0xc000a785a0, 0x7fd6b3f406c0, 0xc000d68910, 0xc00126e700)\n\t/usr/local/go/src/net/http/server.go:2007 +0x44\ngithub.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerCounter.func1(0x7fd6b3f406c0, 0xc000d688c0, 0xc00126e700)\n\t/go/pkg/mod/github.com/prometheus/client_golang@v1.5.0/prometheus/promhttp/instrument_server.go:100 +0xda\nnet/http.HandlerFunc.ServeHTTP(0xc000a78780, 0x7fd6b3f406c0, 0xc000d688c0, 0xc00126e700)\n\t/usr/local/go/src/net/http/server.go:2007 +0x44\ngithub.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerRequestSize.func1(0x7fd6b3f406c0, 0xc000d68870, 0xc00126e700)\n\t/go/pkg/mod/github.com/prometheus/client_golang@v1.5.0/prometheus/promhttp/instrument_server.go:163 +0xe9\nnet/http.HandlerFunc.ServeHTTP(0xc000a789f0, 0x7fd6b3f406c0, 0xc000d68870, 0xc00126e700)\n\t/usr/local/go/src/net/http/server.go:2007 +0x44\ngithub.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerDuration.func1(0x22fe420, 0xc004f2a000, 0xc00126e700)\n\t/go/pkg/mod/github.com/prometheus/client_golang@v1.5.0/prometheus/promhttp/instrument_server.go:68 +0x11c\ngithub.com/prometheus/common/route.(*Router).handle.func1(0x22fe420, 0xc004f2a000, 0xc00126e600, 0x0, 0x0, 0x0)\n\t/go/pkg/mod/github.com/prometheus/common@v0.9.1/route/route.go:83 +0x296\ngithub.com/julienschmidt/httprouter.(*Router).ServeHTTP(0xc000362660, 0x22fe420, 0xc004f2a000, 0xc00126e600)\n\t/go/pkg/mod/github.com/julienschmidt/httprouter@v1.3.0/router.go:387 +0xa3d\ngithub.com/prometheus/common/route.(*Router).ServeHTTP(0xc000c28960, 0x22fe420, 0xc004f2a000, 0xc00126e600)\n\t/go/pkg/mod/github.com/prometheus/common@v0.9.1/route/route.go:121 +0x4c\nnet/http.(*ServeMux).ServeHTTP(0xc000236940, 0x22fe420, 0xc004f2a000, 0xc00126e600)\n\t/usr/local/go/src/net/http/server.go:2387 +0x1bd\nnet/http.serverHandler.ServeHTTP(0xc0001d40e0, 0x22fe420, 0xc004f2a000, 0xc00126e600)\n\t/usr/local/go/src/net/http/server.go:2802 +0xa4\nnet/http.(*conn).serve(0xc00016c3c0, 0x23066a0, 0xc000e640c0)\n\t/usr/local/go/src/net/http/server.go:1890 +0x875\ncreated by net/http.(*Server).Serve\n\t/usr/local/go/src/net/http/server.go:2927 +0x38e\n"
maulikjs commented 4 years ago

I couldn't find any error logs on the thanos-store pods. Hope this helps :)

GiedriusS commented 4 years ago

That's definitely very useful! Would it be possible for you to check this with v0.13.0-rc.0 to see if it is still reproducible?

ipyker commented 4 years ago

> That's definitely very useful! Would it be possible for you to check this with v0.13.0-rc.0 to see if it is still reproducible?

I also ran into the index out of range [-1] error; I tried reducing the number of --query.replica-label parameters on thanos-query.

sevagh commented 4 years ago

I did the same as ipyker: I added a new --query.replica-label=foo, restarted the Thanos Query, and this started occurring.

Luckily, when I remove --query.replica-label=foo, it goes away. Is it a problem if we pick a replica label that does not exist on all our metrics? The label foo is only on some metrics, not all.
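For reference, the setup described in the last two comments corresponds to running the querier with several deduplication labels, something like the following (the first two replica-label names and the store address are illustrative; foo is the label from the comment above):

thanos query \
  --store=<store-api-address> \
  --query.replica-label=prometheus_replica \
  --query.replica-label=receive_replica \
  --query.replica-label=foo

The last label is the one that only exists on some series, which is the situation sevagh describes.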

bwplotka commented 4 years ago

That is super helpful feedback, we will take a closer look tomorrow, thanks! (:


maulikjs commented 4 years ago

Hey @bwplotka, do we know what's going on yet?

Would setting up a query pod with none of the --query.replica-label flags potentially solve our problem?

bwplotka commented 4 years ago

On it, now

bwplotka commented 4 years ago

OK, got it. The problem is that the Querier has problems with more than 2 replica labels. Fixing.

Thanks for spotting this and quickly describing the issue :heart:

sevagh commented 4 years ago

That makes sense - I have 2 replica labels, and encountered this when adding a 3rd. Thanks for the quick turnaround.

bwplotka commented 4 years ago

Silly missed edge case fixed here: https://github.com/thanos-io/thanos/pull/2728

sevagh commented 4 years ago

Will this make it into 0.13?

bwplotka commented 4 years ago

Not really, but it is available on master right now (we build images for master). We run the current master in production now, so I would recommend upgrading at least the querier if you are interested in this fix. Also, the next release is pretty soon (3 weeks?).

TL;DR: the bug was a bit more complex. It was panicking for all requests that return series with fewer labels than, or the same number of labels as, the number of replica labels (:
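For intuition only, here is a rough Python illustration of that edge case. It is not the actual Go code (which, per the stack trace above, is around pkg/query/iter.go); it is just a sketch of how assuming that every series has more labels than there are replica labels leads to a negative index, which in Go panics as index out of range [-1]:

def drop_replica_labels(label_names, replica_labels):
    # Illustration only, NOT the actual Thanos code. If deduplication assumes
    # every series carries more labels than there are configured replica
    # labels, the index of the last "real" label goes negative for short
    # label sets. Indexing a Go slice with -1 panics with
    # "index out of range [-1]"; here we raise to mimic that behaviour.
    last_real = len(label_names) - len(replica_labels) - 1
    if last_real < 0:
        raise IndexError(f"index out of range [{last_real}]")
    return label_names[: last_real + 1]

# A series with no more labels than there are replica labels hits the edge case:
drop_replica_labels(["_id", "prometheus", "receive"],
                    ["prometheus", "receive", "foo"])
# -> IndexError: index out of range [-1]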
