jaegertracing / jaeger-clickhouse

Jaeger ClickHouse storage plugin implementation
Apache License 2.0

Got plugin error "transport: error while dialing: dial unix /tmp/plugin" #99

Open bocharovf opened 2 years ago

bocharovf commented 2 years ago

Describe the bug: Got an error in the Jaeger UI with the ClickHouse gRPC plugin when searching for traces: HTTP Error: plugin error: rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial unix /tmp/plugin2381205323: connect: connection refused"

It seems to happen during long searches.

ClickHouse itself is up and running. Restarting Jaeger Query fixes the problem temporarily (until the next long search).

To Reproduce Steps to reproduce the behavior:

  1. Ingest ~70 million rows into jaeger_index_local
  2. Search for traces
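To confirm the first step, here is a minimal sketch of checking the row count before searching; it assumes a ClickHouse instance reachable on the default native port 9000 and uses the clickhouse-go database/sql driver, so adjust the DSN to your setup:

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/ClickHouse/clickhouse-go" // registers the "clickhouse" database/sql driver
)

func main() {
	// Assumed DSN; point it at the ClickHouse instance Jaeger writes to.
	db, err := sql.Open("clickhouse", "tcp://127.0.0.1:9000")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Verify the index table really holds the ~70 million rows before reproducing the search.
	var rows uint64
	if err := db.QueryRowContext(ctx, "SELECT count() FROM jaeger_index_local").Scan(&rows); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("jaeger_index_local rows: %d\n", rows)
}
```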

Expected behavior Traces are found

Screenshots: (screenshot of the error attached to the issue)

Version (please complete the following information):

What troubleshooting steps did you try?

Additional context: attached log file jaeger-query-clickhouse-5ff64c9dbc-h7jr4.log

bocharovf commented 2 years ago

Just measured: error occurs after 16.7s of trace searching.

In the ClickHouse error log:

2021.11.27 11:10:51.904396 [ 239 ] {5407e9ea-5b0a-4307-b4ba-645e44462eeb} <Error> TCPHandler: Code: 210, e.displayText() = DB::Exception: Connection reset by peer, while writing to socket (_10.X.X.X_:61097), Stack trace: ...
2021.11.27 11:10:51.904562 [ 239 ] {5407e9ea-5b0a-4307-b4ba-645e44462eeb} <Warning> TCPHandler: Client has gone away.
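These two messages are what ClickHouse logs when the query client gives up and tears down the connection partway through a long query. Below is a minimal sketch, assuming a local ClickHouse and the clickhouse-go driver, that reproduces the same "Client has gone away" pattern by putting a client-side deadline (roughly matching the observed 16.7 s) on a deliberately slow query; it is not the plugin's actual code or configuration:

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/ClickHouse/clickhouse-go" // registers the "clickhouse" database/sql driver
)

func main() {
	db, err := sql.Open("clickhouse", "tcp://127.0.0.1:9000") // assumed local instance
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Client-side deadline roughly matching the ~16.7s observed before the failure.
	ctx, cancel := context.WithTimeout(context.Background(), 16*time.Second)
	defer cancel()

	// An aggregation over the unbounded system.numbers table never finishes, so the
	// deadline expires first; the connection is then dropped on the client side and
	// ClickHouse typically logs "Connection reset by peer ... Client has gone away".
	rows, err := db.QueryContext(ctx, "SELECT count() FROM system.numbers")
	if rows != nil {
		rows.Close()
	}
	log.Printf("query ended with: %v", err) // expect a deadline / connection error
}
```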

bocharovf commented 2 years ago

Related to #100. It seems that huge traces (100k+ spans) lead to a timeout. The ability to limit the number of fetched spans could partially help.
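As an illustration of that mitigation, here is a hypothetical sketch of capping the per-trace span fetch with a LIMIT; the table and column names (jaeger_spans_local, traceID, model) and the maxSpans parameter are assumptions for the sketch, not the plugin's actual query:

```go
package sketch

import (
	"context"
	"database/sql"
	"fmt"
)

// fetchTraceSpans reads at most maxSpans serialized spans for one trace instead of
// an unbounded result set. Whether LIMIT alone is enough depends on how the plugin
// assembles traces, so treat this purely as a sketch of the idea.
func fetchTraceSpans(ctx context.Context, db *sql.DB, traceID string, maxSpans int) ([][]byte, error) {
	query := fmt.Sprintf("SELECT model FROM jaeger_spans_local WHERE traceID = ? LIMIT %d", maxSpans)
	rows, err := db.QueryContext(ctx, query, traceID)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var spans [][]byte
	for rows.Next() {
		var serialized []byte
		if err := rows.Scan(&serialized); err != nil {
			return nil, err
		}
		spans = append(spans, serialized)
	}
	return spans, rows.Err()
}
```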

nickbp commented 2 years ago

From the comments, it sounds like this is caused by (and resolved by) #100.

bocharovf commented 2 years ago

Unfortunately, #100 helped only partially. There are still errors on our test server. I suspect the plugin handles query timeouts or connection drops with ClickHouse incorrectly. After the error occurs, the plugin stops working entirely and all further requests fail, so this is a severe problem for us.
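If the hypothesis about mishandled timeouts or connection drops is right, one way to keep a single failed query from breaking every later request would be to health-check the pool and let database/sql re-establish connections after a failure. This is only an illustrative sketch with assumed names and deadlines, not the plugin's actual code:

```go
package sketch

import (
	"context"
	"database/sql"
	"time"
)

// queryWithRecovery runs a query with a deadline and, on failure, pings the pool so
// that database/sql discards any dead connection and dials a fresh one. The 30s and
// 5s deadlines are arbitrary values for the sketch.
func queryWithRecovery(db *sql.DB, query string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	if _, err := db.ExecContext(ctx, query); err != nil {
		// Use a fresh, short deadline for the health check: the query context may
		// already be expired after a long-running search.
		pingCtx, pingCancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer pingCancel()
		if pingErr := db.PingContext(pingCtx); pingErr != nil {
			// ClickHouse itself is unreachable; surface that instead of the query error.
			return pingErr
		}
		// The pool is healthy again; report the original query error and move on,
		// rather than leaving the storage backend in a permanently broken state.
		return err
	}
	return nil
}
```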

vosmax commented 2 years ago

I've faced this issue as well. I believe it's related to resources. When querying a huge amount of data, Jaeger starts to consume memory avalanche-like (see the attached memory graph).

Sometimes an OOMKill doesn't occur: when the pod almost hits the memory request value, the plugin process is simply killed and the memory is flushed. As a result, the pod is not restarted and the plugin stops working. In the attached memory graph you can clearly see the pod almost hit the 1Gb request, then the memory was flushed, the error transport: error while dialing: dial unix /tmp/plugin... connection refused appeared, and 2022-02-11T23:05:17.578Z [DEBUG] plugin process exited: path=/plugin/jaeger-clickhouse pid=13 error="signal: killed" showed up in the pod log. I reproduced this error within several minutes (blue graph).


2022-02-11T23:05:17.578Z [DEBUG] plugin process exited: path=/plugin/jaeger-clickhouse pid=13 error="signal: killed"
{"level":"info","ts":1644609917.5783212,"caller":"channelz/logging.go:50","msg":"[core]Subchannel Connectivity change to IDLE","system":"grpc","grpc_log":true}
2022-02-11T23:05:17.592Z [DEBUG] stdio: received EOF, stopping recv loop: err="rpc error: code = Canceled desc = context canceled"
{"level":"info","ts":1644609917.5928702,"caller":"grpclog/component.go:71","msg":"[core]pickfirstBalancer: UpdateSubConnState: 0xc0004da840, {IDLE <nil>}","system":"grpc","grpc_log":true}
{"level":"info","ts":1644609917.5929146,"caller":"channelz/logging.go:50","msg":"[core]Channel Connectivity change to IDLE","system":"grpc","grpc_log":true}
{"level":"info","ts":1644609917.5929787,"caller":"grpclog/component.go:71","msg":"[transport]transport: loopyWriter.run returning. connection error: desc = \"transport is closing\"","system":"grpc","grpc_log":true}
{"level":"info","ts":1644610105.3780131,"caller":"channelz/logging.go:50","msg":"[core]Subchannel Connectivity change to CONNECTING","system":"grpc","grpc_log":true}
{"level":"info","ts":1644610105.378047,"caller":"channelz/logging.go:50","msg":"[core]Subchannel picks a new address \"unused\" to connect","system":"grpc","grpc_log":true}
{"level":"warn","ts":1644610105.3781164,"caller":"channelz/logging.go:75","msg":"[core]grpc: addrConn.createTransport failed to connect to {unused unused <nil>  0 }. Err: connection error: desc = \"transport: error while dialing: dial unix /tmp/plugin2834833615: connect: connection refused\"","system":"grpc","grpc_log":true}
{"level":"info","ts":1644610105.3781302,"caller":"channelz/logging.go:50","msg":"[core]Subchannel Connectivity change to TRANSIENT_FAILURE","system":"grpc","grpc_log":true}
{"level":"info","ts":1644610105.3781457,"caller":"grpclog/component.go:71","msg":"[core]pickfirstBalancer: UpdateSubConnState: 0xc0004da840, {CONNECTING <nil>}","system":"grpc","grpc_log":true}
{"level":"info","ts":1644610105.3781548,"caller":"channelz/logging.go:50","msg":"[core]Channel Connectivity change to CONNECTING","system":"grpc","grpc_log":true}
{"level":"info","ts":1644610105.3781884,"caller":"grpclog/component.go:71","msg":"[core]pickfirstBalancer: UpdateSubConnState: 0xc0004da840, {TRANSIENT_FAILURE connection error: desc = \"transport: error while dialing: dial unix /tmp/plugin2834833615: connect: connection refused\"}","system":"grpc","grpc_log":true}
{"level":"info","ts":1644610105.3781965,"caller":"channelz/logging.go:50","msg":"[core]Channel Connectivity change to TRANSIENT_FAILURE","system":"grpc","grpc_log":true}
{"level":"error","ts":1644610105.378254,"caller":"app/http_handler.go:487","msg":"HTTP handler, Internal Server Error","error":"plugin error: rpc error: code = Unavailable desc = connection error: desc = \"transport: error while dialing: dial unix /tmp/plugin2834833615: connect: connection refused\"","stacktrace":"github.com/jaegertracing/jaeger/cmd/query/app.(*APIHandler).handleError\n\tgithub.com/jaegertracing/jaeger/cmd/query/app/http_handler.go:487\ngithub.com/jaegertracing/jaeger/cmd/query/app.(*APIHandler).getOperationsLegacy\n\tgithub.com/jaegertracing/jaeger/cmd/query/app/http_handler.go:180\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2047\ngithub.com/opentracing-contrib/go-stdlib/nethttp.MiddlewareFunc.func5\n\tgithub.com/opentracing-contrib/go-stdlib@v1.0.0/nethttp/server.go:154\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2047\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2047\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\tgithub.com/gorilla/mux@v1.8.0/mux.go:210\ngithub.com/jaegertracing/jaeger/cmd/query/app.additionalHeadersHandler.func1\n\tgithub.com/jaegertracing/jaeger/cmd/query/app/additional_headers_handler.go:28\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2047\ngithub.com/gorilla/handlers.CompressHandlerLevel.func1\n\tgithub.com/gorilla/handlers@v1.5.1/compress.go:141\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2047\ngithub.com/gorilla/handlers.recoveryHandler.ServeHTTP\n\tgithub.com/gorilla/handlers@v1.5.1/recovery.go:78\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2879\nnet/http.(*conn).serve\n\tnet/http/server.go:1930"}
{"level":"info","ts":1644610106.3785865,"caller":"channelz/logging.go:50","msg":"[core]Subchannel Connectivity change to IDLE","system":"grpc","grpc_log":true}
{"level":"info","ts":1644610106.3786347,"caller":"grpclog/component.go:71","msg":"[core]pickfirstBalancer: UpdateSubConnState: 0xc0004da840, {IDLE connection error: desc = \"transport: error while dialing: dial unix /tmp/plugin2834833615: connect: connection refused\"}","system":"grpc","grpc_log":true}``` 
leizhag commented 2 years ago

Starting from v1.33.0, Jaeger Query exits when the plugin crashes. This can serve as a workaround until the reason for the plugin crash is found.