grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Error "failed to transfer chunks to another instance" #1983

Closed: reidlai closed this issue 4 years ago

reidlai commented 4 years ago

Describe the bug
I have a single-pod Loki StatefulSet running in Kubernetes, and everything worked fine until I tried to load 3000 MQTT payloads into my backend services. After my service had been running for over 30 minutes, I got the error "failed to transfer chunks to another instance" and the Loki pod went into "CrashLoopBackOff".

To Reproduce
I cannot share a Docker setup to reproduce the error.

Expected behavior
The Loki pod should not crash in Kubernetes with status "CrashLoopBackOff".

Environment:
Infrastructure: Kubernetes StatefulSet

Screenshots, Promtail config, or terminal output

$> kubectl logs --follow pod/loki-0
level=info ts=2020-04-25T06:55:50.901631126Z caller=table_manager.go:444 msg="creating table" table=index_2576
level=info ts=2020-04-25T06:55:50.901640392Z caller=table_manager.go:444 msg="creating table" table=index_2596
level=info ts=2020-04-25T06:55:50.901646809Z caller=table_manager.go:444 msg="creating table" table=index_2533
level=info ts=2020-04-25T06:55:50.904564011Z caller=table_manager.go:444 msg="creating table" table=index_2609
level=info ts=2020-04-25T06:55:50.904644477Z caller=table_manager.go:444 msg="creating table" table=index_2524
level=info ts=2020-04-25T06:55:50.90465942Z caller=table_manager.go:444 msg="creating table" table=index_2534
level=info ts=2020-04-25T06:55:50.904677816Z caller=table_manager.go:444 msg="creating table" table=index_2542
level=info ts=2020-04-25T06:55:50.904706722Z caller=table_manager.go:444 msg="creating table" table=index_2589
level=info ts=2020-04-25T06:55:50.904745392Z caller=table_manager.go:444 msg="creating table" table=index_2539
level=info ts=2020-04-25T06:55:50.904762598Z caller=table_manager.go:444 msg="creating table" table=index_2544
level=info ts=2020-04-25T06:55:50.904800151Z caller=table_manager.go:444 msg="creating table" table=index_2559
level=info ts=2020-04-25T06:55:50.904821259Z caller=table_manager.go:444 msg="creating table" table=index_2574
level=info ts=2020-04-25T06:55:50.904830138Z caller=table_manager.go:444 msg="creating table" table=index_2562
level=info ts=2020-04-25T06:55:50.904881375Z caller=table_manager.go:444 msg="creating table" table=index_2577
level=info ts=2020-04-25T06:55:50.90490025Z caller=table_manager.go:444 msg="creating table" table=index_2579
level=info ts=2020-04-25T06:55:50.904913381Z caller=table_manager.go:444 msg="creating table" table=index_2587
level=warn ts=2020-04-25T06:56:22.066994544Z caller=pool.go:182 msg="removing ingester failing healthcheck" addr=127.0.0.1:9095 reason="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
level=warn ts=2020-04-25T06:56:37.806248557Z caller=pool.go:182 msg="removing ingester failing healthcheck" addr=127.0.0.1:9095 reason="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
level=warn ts=2020-04-25T06:56:39.489737491Z caller=logging.go:49 traceID=1e84516838bcc25a msg="POST /loki/api/v1/push (500) 2.754052713s Response: \"rpc error: code = Canceled desc = grpc: the client connection is closing\n\" ws: false; Content-Length: 3288; Content-Type: application/x-protobuf; User-Agent: Go-http-client/1.1; "
level=info ts=2020-04-25T06:56:59.68314271Z caller=signals.go:55 msg="=== received SIGINT/SIGTERM ===\n*** exiting"
level=info ts=2020-04-25T06:56:59.748347187Z caller=module_service.go:93 msg="module stopped" module=table-manager
level=info ts=2020-04-25T06:56:59.834159821Z caller=module_service.go:93 msg="module stopped" module=distributor
level=info ts=2020-04-25T06:57:00.40858658Z caller=lifecycler.go:411 msg="lifecycler loop() exited gracefully" ring=ingester
level=info ts=2020-04-25T06:57:01.748967581Z caller=lifecycler.go:689 msg="changing instance state from" old_state=ACTIVE new_state=LEAVING ring=ingester
level=info ts=2020-04-25T06:57:04.771570295Z caller=module_service.go:93 msg="module stopped" module=ring
level=info ts=2020-04-25T06:57:07.162351788Z caller=module_service.go:93 msg="module stopped" module=runtime-config
level=error ts=2020-04-25T06:57:07.998608943Z caller=transfer.go:191 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2020-04-25T06:57:08.897851371Z caller=lifecycler.go:730 msg="failed to transfer chunks to another instance" ring=ingester err="terminated after 1 retries"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x111aa6f]

goroutine 430 [running]:
github.com/grafana/loki/pkg/chunkenc.(*Facade).Size(0xc0018e80c0, 0x3c2c640)
    :1 +0x2f
github.com/cortexproject/cortex/pkg/chunk/storage.metricsChunkClient.PutChunks(0x3c2cac0, 0xc00000c9a0, 0xc00027e9f8, 0xc00027ea00, 0xc00027ea08, 0xc00027ea10, 0x3c2c640, 0xc0018f03f0, 0xc0012f1300, 0x1, ...)
    /src/loki/vendor/github.com/cortexproject/cortex/pkg/chunk/storage/metrics.go:71 +0x205
github.com/cortexproject/cortex/pkg/chunk.(*seriesStore).PutOne(0xc0004642c0, 0x3c2c640, 0xc0018f03f0, 0x171b01e7d36, 0x171b01f9285, 0x989d0aca61fcc7e3, 0x34edc0e, 0x4, 0x171b01e7d36, 0x171b01f9285, ...)
    /src/loki/vendor/github.com/cortexproject/cortex/pkg/chunk/series_store.go:438 +0x672
github.com/cortexproject/cortex/pkg/chunk.compositeStore.Put.func1(0x171b01e7d36, 0x171b01f9285, 0x3c76aa0, 0xc0004642c0, 0x44615c, 0x50f8bb)
    /src/loki/vendor/github.com/cortexproject/cortex/pkg/chunk/composite_store.go:95 +0xd3
github.com/cortexproject/cortex/pkg/chunk.compositeStore.forStores(0xc0002df400, 0x1, 0x1, 0x171b01e7d36, 0x171b01f9285, 0xc000c8b838, 0x1a0, 0x72e)
    /src/loki/vendor/github.com/cortexproject/cortex/pkg/chunk/composite_store.go:231 +0x1e0
github.com/cortexproject/cortex/pkg/chunk.compositeStore.Put(0xc0002df400, 0x1, 0x1, 0x3c2c600, 0xc0018f2120, 0xc0012f0080, 0x1, 0x1, 0x4a91d0, 0xc00019fb30)
    /src/loki/vendor/github.com/cortexproject/cortex/pkg/chunk/composite_store.go:94 +0x14b
github.com/grafana/loki/pkg/ingester.(*Ingester).flushChunks(0xc00013ae00, 0x3c2c600, 0xc0018f2120, 0x989d0aca61fcc7e3, 0xc0001e7440, 0x9, 0x9, 0xc0000d20c0, 0x1, 0x1, ...)
    /src/loki/pkg/ingester/flush.go:337 +0x82d
github.com/grafana/loki/pkg/ingester.(*Ingester).flushUserSeries(0xc00013ae00, 0x34edc0e, 0x4, 0x989d0aca61fcc7e3, 0x1, 0x0, 0x0)
    /src/loki/pkg/ingester/flush.go:224 +0x274
github.com/grafana/loki/pkg/ingester.(*Ingester).flushLoop(0xc00013ae00, 0x3)
    /src/loki/pkg/ingester/flush.go:196 +0x304
created by github.com/grafana/loki/pkg/ingester.(*Ingester).starting
    /src/loki/pkg/ingester/ingester.go:168 +0x68
slim-bean commented 4 years ago

What version of Loki is this?

A couple of things are happening here, including a panic, which is bad!

First, something told Loki to shut down. I'm not sure why, but it would be something in Kubernetes sending the SIGTERM signal to Loki. How are you deploying? Are you using the Helm chart from the Loki repo?

Second, if you are running just one instance, you should set the parameter max_transfer_retries to 0 (see here).
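For reference, this is roughly what that looks like in the Loki config file (a minimal sketch assuming the ingester block of a single-binary config; only the max_transfer_retries line is the actual change):

    ingester:
      max_transfer_retries: 0   # single instance: there is no other ingester to hand chunks to

With transfers disabled, the ingester flushes its in-memory chunks to the store on shutdown instead of looking for a pending ingester to transfer them to.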

However, there still should not be a panic!

So I'm not sure what is telling Loki to shut down (autoscaling in k8s? the readiness handler not being ready? are you using a liveness check?).
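For what it's worth, a few standard kubectl commands should show what actually terminated the pod (the pod name loki-0 is taken from the log output above):

    kubectl describe pod loki-0                  # check Last State, exit code and restart reason
    kubectl logs --previous pod/loki-0           # logs from the previous (crashed) container
    kubectl get events --sort-by=.lastTimestamp  # recent events, e.g. failed liveness probes or evictions

If the events show a failing liveness probe, that would explain the SIGTERM.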

But if you could let us know what version of Loki you're running and how you are deploying it, that would help us figure out that panic.

Thanks!

slim-bean commented 4 years ago

@gouthamve I think this is likely a master build of Loki, which is pulling in a very recent Cortex, and that panic might be of interest to you; I think you recently added the metricsChunkClient.

onesbasket commented 4 years ago

I have the same issue. I get the error with latest and master-e90e3b7, but master-7573078 works fine, so I assume the cause is somewhere in the newer versions.

onesbasket commented 4 years ago

Whoops, I forgot to mention that this is the Docker image.
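For anyone hitting this, pinning the image to the build reported as working above is a possible stopgap until the regression is tracked down (the tag is the one from that comment; the config path shown is just the image default and may need adjusting for your setup):

    docker pull grafana/loki:master-7573078
    docker run grafana/loki:master-7573078 -config.file=/etc/loki/local-config.yaml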