Open bocharovf opened 2 years ago
Just measured: the error occurs after 16.7s of trace searching.
In the ClickHouse error log:
```
2021.11.27 11:10:51.904396 [ 239 ] {5407e9ea-5b0a-4307-b4ba-645e44462eeb} <Error> TCPHandler: Code: 210, e.displayText() = DB::Exception: Connection reset by peer, while writing to socket (_10.X.X.X_:61097), Stack trace: ...
2021.11.27 11:10:51.904562 [ 239 ] {5407e9ea-5b0a-4307-b4ba-645e44462eeb} <Warning> TCPHandler: Client has gone away.
```
Related to #100. It seems that huge traces (100k+ spans) lead to a timeout. The ability to limit the number of fetched spans could partially help; see the sketch below.
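As a rough illustration of that idea, here is a minimal Go sketch of capping fetched spans. The `jaeger_spans` table with `traceID`/`model` columns follows the plugin's default schema, but the `fetchSpans` function and the `maxSpansPerTrace` constant are hypothetical, not existing plugin code:

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
)

// Illustrative cap, not a real plugin option.
const maxSpansPerTrace = 10000

// fetchSpans reads at most maxSpansPerTrace serialized span models for one
// trace, so a 100k+ span trace cannot blow past the query timeout or the
// plugin's memory while the result set is materialized.
func fetchSpans(ctx context.Context, db *sql.DB, traceID string) ([][]byte, error) {
	rows, err := db.QueryContext(ctx,
		"SELECT model FROM jaeger_spans WHERE traceID = ? LIMIT ?",
		traceID, maxSpansPerTrace)
	if err != nil {
		return nil, fmt.Errorf("query spans: %w", err)
	}
	defer rows.Close()

	var spans [][]byte
	for rows.Next() {
		var model []byte
		if err := rows.Scan(&model); err != nil {
			return nil, err
		}
		spans = append(spans, model)
	}
	return spans, rows.Err()
}
```

Truncating at the query level keeps both ClickHouse and the plugin from materializing the full trace, at the cost of returning an incomplete one.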
From the comments it sounds like this is caused by (and resolved by) #100.
Unfortunately #100 helped only partially; there are still errors on our test server. I suspect we handle query timeouts and dropped ClickHouse connections incorrectly: once the error occurs, the plugin stops working entirely and all further requests fail. That makes this a severe problem for us.
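For concreteness, a minimal sketch of the kind of defensive handling this comment suggests is missing, assuming `database/sql` with a ClickHouse driver; `runQuery`, `handle`, and `queryTimeout` are illustrative names, not actual plugin code:

```go
package main

import (
	"context"
	"database/sql"
	"time"
)

const queryTimeout = 30 * time.Second // illustrative, not a real plugin option

// runQuery bounds each query with a timeout and, on failure, pings the pool
// so a dropped ClickHouse connection is detected and replaced instead of
// poisoning every subsequent request.
func runQuery(db *sql.DB, handle func(*sql.Rows) error, query string, args ...interface{}) error {
	ctx, cancel := context.WithTimeout(context.Background(), queryTimeout)
	defer cancel()

	rows, err := db.QueryContext(ctx, query, args...)
	if err != nil {
		// Use a fresh context: the query context may already be expired.
		pingCtx, pingCancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer pingCancel()
		_ = db.PingContext(pingCtx) // lets database/sql discard a broken pooled connection
		return err
	}
	defer rows.Close()
	return handle(rows) // rows must be fully consumed before ctx is cancelled
}
```

The point being: a single "connection reset by peer" should cost one failed request, not every request after it.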
I've faced this issue as well. I believe it's related to resources: when querying a huge amount of data, Jaeger starts to consume memory avalanche-like (see the graph).
Sometimes the OOMKill doesn't occur: when the pod almost hits the memory request value, the plugin process is simply killed and the memory is freed. As a result, the pod is not restarted, yet the plugin no longer works.
I attached a memory graph where you can clearly see the pod almost hitting its 1 GB request; the memory was then freed, the error `transport: error while dialing: dial unix /tmp/plugin... connection refused` appeared, and `2022-02-11T23:05:17.578Z [DEBUG] plugin process exited: path=/plugin/jaeger-clickhouse pid=13 error="signal: killed"` showed up in the pod log (a possible mitigation is sketched after the log below).
I reproduced this error within several minutes (blue graph):
```
2022-02-11T23:05:17.578Z [DEBUG] plugin process exited: path=/plugin/jaeger-clickhouse pid=13 error="signal: killed"
{"level":"info","ts":1644609917.5783212,"caller":"channelz/logging.go:50","msg":"[core]Subchannel Connectivity change to IDLE","system":"grpc","grpc_log":true}
2022-02-11T23:05:17.592Z [DEBUG] stdio: received EOF, stopping recv loop: err="rpc error: code = Canceled desc = context canceled"
{"level":"info","ts":1644609917.5928702,"caller":"grpclog/component.go:71","msg":"[core]pickfirstBalancer: UpdateSubConnState: 0xc0004da840, {IDLE <nil>}","system":"grpc","grpc_log":true}
{"level":"info","ts":1644609917.5929146,"caller":"channelz/logging.go:50","msg":"[core]Channel Connectivity change to IDLE","system":"grpc","grpc_log":true}
{"level":"info","ts":1644609917.5929787,"caller":"grpclog/component.go:71","msg":"[transport]transport: loopyWriter.run returning. connection error: desc = \"transport is closing\"","system":"grpc","grpc_log":true}
{"level":"info","ts":1644610105.3780131,"caller":"channelz/logging.go:50","msg":"[core]Subchannel Connectivity change to CONNECTING","system":"grpc","grpc_log":true}
{"level":"info","ts":1644610105.378047,"caller":"channelz/logging.go:50","msg":"[core]Subchannel picks a new address \"unused\" to connect","system":"grpc","grpc_log":true}
{"level":"warn","ts":1644610105.3781164,"caller":"channelz/logging.go:75","msg":"[core]grpc: addrConn.createTransport failed to connect to {unused unused <nil> 0 }. Err: connection error: desc = \"transport: error while dialing: dial unix /tmp/plugin2834833615: connect: connection refused\"","system":"grpc","grpc_log":true}
{"level":"info","ts":1644610105.3781302,"caller":"channelz/logging.go:50","msg":"[core]Subchannel Connectivity change to TRANSIENT_FAILURE","system":"grpc","grpc_log":true}
{"level":"info","ts":1644610105.3781457,"caller":"grpclog/component.go:71","msg":"[core]pickfirstBalancer: UpdateSubConnState: 0xc0004da840, {CONNECTING <nil>}","system":"grpc","grpc_log":true}
{"level":"info","ts":1644610105.3781548,"caller":"channelz/logging.go:50","msg":"[core]Channel Connectivity change to CONNECTING","system":"grpc","grpc_log":true}
{"level":"info","ts":1644610105.3781884,"caller":"grpclog/component.go:71","msg":"[core]pickfirstBalancer: UpdateSubConnState: 0xc0004da840, {TRANSIENT_FAILURE connection error: desc = \"transport: error while dialing: dial unix /tmp/plugin2834833615: connect: connection refused\"}","system":"grpc","grpc_log":true}
{"level":"info","ts":1644610105.3781965,"caller":"channelz/logging.go:50","msg":"[core]Channel Connectivity change to TRANSIENT_FAILURE","system":"grpc","grpc_log":true}
{"level":"error","ts":1644610105.378254,"caller":"app/http_handler.go:487","msg":"HTTP handler, Internal Server Error","error":"plugin error: rpc error: code = Unavailable desc = connection error: desc = \"transport: error while dialing: dial unix /tmp/plugin2834833615: connect: connection refused\"","stacktrace":"github.com/jaegertracing/jaeger/cmd/query/app.(*APIHandler).handleError\n\tgithub.com/jaegertracing/jaeger/cmd/query/app/http_handler.go:487\ngithub.com/jaegertracing/jaeger/cmd/query/app.(*APIHandler).getOperationsLegacy\n\tgithub.com/jaegertracing/jaeger/cmd/query/app/http_handler.go:180\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2047\ngithub.com/opentracing-contrib/go-stdlib/nethttp.MiddlewareFunc.func5\n\tgithub.com/opentracing-contrib/go-stdlib@v1.0.0/nethttp/server.go:154\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2047\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2047\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\tgithub.com/gorilla/mux@v1.8.0/mux.go:210\ngithub.com/jaegertracing/jaeger/cmd/query/app.additionalHeadersHandler.func1\n\tgithub.com/jaegertracing/jaeger/cmd/query/app/additional_headers_handler.go:28\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2047\ngithub.com/gorilla/handlers.CompressHandlerLevel.func1\n\tgithub.com/gorilla/handlers@v1.5.1/compress.go:141\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2047\ngithub.com/gorilla/handlers.recoveryHandler.ServeHTTP\n\tgithub.com/gorilla/handlers@v1.5.1/recovery.go:78\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2879\nnet/http.(*conn).serve\n\tnet/http/server.go:1930"}
{"level":"info","ts":1644610106.3785865,"caller":"channelz/logging.go:50","msg":"[core]Subchannel Connectivity change to IDLE","system":"grpc","grpc_log":true}
{"level":"info","ts":1644610106.3786347,"caller":"grpclog/component.go:71","msg":"[core]pickfirstBalancer: UpdateSubConnState: 0xc0004da840, {IDLE connection error: desc = \"transport: error while dialing: dial unix /tmp/plugin2834833615: connect: connection refused\"}","system":"grpc","grpc_log":true}```
**Describe the bug**
Got an error in the Jaeger UI with the ClickHouse gRPC plugin when searching for traces: `HTTP Error: plugin error: rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial unix /tmp/plugin2381205323: connect: connection refused"`

It seems to happen after long-running trace searches. ClickHouse is up and running. Restarting Jaeger Query fixes the problem temporarily (until the next long search).
**To Reproduce**
Steps to reproduce the behavior:

**Expected behavior**
Traces are found.

**Screenshots**

**Version (please complete the following information):**

**What troubleshooting steps did you try?**

**Additional context**
jaeger-query-clickhouse-5ff64c9dbc-h7jr4.log