jaegertracing / jaeger-operator

Jaeger Operator for Kubernetes simplifies deploying and running Jaeger on Kubernetes.
https://www.jaegertracing.io/docs/latest/operator/
Apache License 2.0

DNS errors in k8s #788

Closed: sonant closed this issue 4 years ago

sonant commented 4 years ago

I have a standard deployment in k8s, as described at https://www.jaegertracing.io/docs/1.13/operator/, and I keep seeing these messages:

    {"level":"info","ts":1574763937.5043714,"caller":"dns/dns_resolver.go:286","msg":"grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.jaeger-collector-headless.observability.svc.cluster.local on 172.20.0.10:53: no such host.\n","system":"grpc","grpc_log":true}
    {"level":"info","ts":1574763937.5120873,"caller":"dns/dns_resolver.go:311","msg":"grpc: failed dns TXT record lookup due to lookup _grpc_config.jaeger-collector-headless.observability.svc.cluster.local on 172.20.0.10:53: no such host.\n","system":"grpc","grpc_log":true}
    {"level":"info","ts":1574763937.5121202,"caller":"grpc/resolver_conn_wrapper.go:219","msg":"ccResolverWrapper: got new service config: ","system":"grpc","grpc_log":true}
    {"level":"info","ts":1574763937.5121377,"caller":"grpc/resolver_conn_wrapper.go:222","msg":"ccResolverWrapper: error parsing service config: no JSON service config provided","system":"grpc","grpc_log":true}
    {"level":"info","ts":1574763937.5121775,"caller":"grpc/resolver_conn_wrapper.go:205","msg":"ccResolverWrapper: sending new addresses to cc: [{10.0.100.4:14250 0 <nil>}]","system":"grpc","grpc_log":true}
    {"level":"info","ts":1574763937.5122035,"caller":"base/balancer.go:83","msg":"base.baseBalancer: got new ClientConn state: {{[{10.0.100.4:14250 0 <nil>}] <nil>} <nil>}","system":"grpc","grpc_log":true}

This port is not present in the headless svc: _grpclb._tcp.jaeger-collector-headless.observability.svc.cluster.local. DNS in k8s works fine. What should I do? Thx!

treyhyde commented 4 years ago

I was just trying to debug the same thing. I can't figure out where grpclb is coming from; the only service port defined is grpc, there is no lb port. I'm using 1.15.1.

sonant commented 4 years ago

These are info messages from the jaeger-agent that runs as a sidecar.

jpkrohling commented 4 years ago

Most of the messages are innocuous; it's just gRPC attempting to find a source of information for its client-side load balancing. When no data is found in those SRV/TXT records, the backend services are discovered by looking at the A records for the name jaeger-collector-headless.observability.svc.cluster.local.

In short: it's all fine and no error happened.
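For illustration, here is a minimal sketch (my own example, not Jaeger code) of the three lookups the gRPC dns resolver performs for a target like dns:///jaeger-collector-headless.observability.svc.cluster.local:14250; only the last one needs to succeed:

    // Sketch of the lookups behind the log messages above; the grpclb SRV and
    // service-config TXT records are optional, so "no such host" there is expected.
    package main

    import (
        "fmt"
        "net"
    )

    func main() {
        host := "jaeger-collector-headless.observability.svc.cluster.local"

        // SRV lookup for the optional grpclb balancer: _grpclb._tcp.<host>
        if _, srvs, err := net.LookupSRV("grpclb", "tcp", host); err != nil {
            fmt.Println("no grpclb SRV records (harmless):", err)
        } else {
            fmt.Println("grpclb SRV records:", srvs)
        }

        // TXT lookup for the optional service config: _grpc_config.<host>
        if txts, err := net.LookupTXT("_grpc_config." + host); err != nil {
            fmt.Println("no service-config TXT records (harmless):", err)
        } else {
            fmt.Println("service config:", txts)
        }

        // A/AAAA lookup: this is what actually yields the collector addresses.
        ips, err := net.LookupIP(host)
        if err != nil {
            fmt.Println("A record lookup failed:", err)
            return
        }
        fmt.Println("collector addresses:", ips)
    }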

sonant commented 4 years ago

Yes, but there are so many of these messages. Can I change --reporter.grpc.host-port=dns:///jaeger-collector-headless.observability:14250 to --reporter.grpc.host-port=jaeger-collector-headless.observability:14250 in the operator? Or maybe change the log level for these messages?

jpkrohling commented 4 years ago

We should definitely try to suppress those messages, as you are not the first one to get confused about that.

I'm not sure your change to remove the protocol part would work, though. The headless service is used only as a DNS entry that holds the A records. The dns protocol part in the address is a hint to the gRPC client to fetch the list of collectors and load balance among them, even for long-lived HTTP/2 connections.

A non-headless service will report only one IP to the client and will act as a proxy. Typically that load balancing works fine and seamlessly, but given that the connections are long-lived, the balancing will not really be based on "load" (kube-proxy isn't that smart yet, as far as I know).

I'm not 100% sure what will happen if you use the headless service without dns as part of the address. I think the gRPC client will just pick one IP from the pool and use it for a long-lived connection, which is effectively the same as the situation above, just without kube-proxy.
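To make the difference concrete, here is a rough sketch assuming the gRPC-Go client library (this is not jaeger-agent's actual reporter code):

    // The two dial targets discussed above, side by side.
    package main

    import (
        "log"

        "google.golang.org/grpc"
    )

    func main() {
        // dns:///...: the dns resolver returns every A record of the headless
        // service, and round_robin spreads RPCs across all collector pods.
        balanced, err := grpc.Dial(
            "dns:///jaeger-collector-headless.observability:14250",
            grpc.WithInsecure(),
            grpc.WithDefaultServiceConfig(`{"loadBalancingPolicy":"round_robin"}`),
        )
        if err != nil {
            log.Fatal(err)
        }
        defer balanced.Close()

        // No scheme: gRPC falls back to the passthrough resolver, so the agent
        // ends up with a single long-lived connection to whichever collector IP
        // the name resolves to when the connection is established.
        single, err := grpc.Dial(
            "jaeger-collector-headless.observability:14250",
            grpc.WithInsecure(),
        )
        if err != nil {
            log.Fatal(err)
        }
        defer single.Close()
    }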

sonant commented 4 years ago

These messages are not a problem by themselves, but not when they come every 30 seconds. I changed --reporter.grpc.host-port, and these messages are no longer present.

Below is my full log from the agent startup (after the change).

{"level":"info","ts":1574845572.5909839,"caller":"flags/service.go:115","msg":"Mounting metrics handler on admin server","route":"/metrics"} {"level":"info","ts":1574845572.5969596,"caller":"flags/admin.go:108","msg":"Mounting health check on admin server","route":"/"} {"level":"info","ts":1574845572.5970294,"caller":"flags/admin.go:114","msg":"Starting admin HTTP server","http-port":14271} {"level":"info","ts":1574845572.5970523,"caller":"flags/admin.go:100","msg":"Admin server started","http-port":14271,"health-status":"unavailable"} {"level":"info","ts":1574845572.5981328,"caller":"grpc/builder.go:65","msg":"Agent requested insecure grpc connection to collector(s)"} {"level":"info","ts":1574845572.5992768,"caller":"grpc/clientconn.go:245","msg":"parsed scheme: \"\"","system":"grpc","grpc_log":true} {"level":"info","ts":1574845572.6018035,"caller":"grpc/clientconn.go:251","msg":"scheme \"\" not registered, fallback to default scheme","system":"grpc","grpc_log":true} {"level":"info","ts":1574845572.6018248,"caller":"grpc/resolver_conn_wrapper.go:178","msg":"ccResolverWrapper: sending update to cc: {[{jaeger-collector-headless.observability:14250 0 <nil>}] <nil>}","system":"grpc","grpc_log":true} {"level":"info","ts":1574845572.6018553,"caller":"grpc/clientconn.go:659","msg":"ClientConn switching balancer to \"round_robin\"","system":"grpc","grpc_log":true} {"level":"info","ts":1574845572.602664,"caller":"base/balancer.go:83","msg":"base.baseBalancer: got new ClientConn state: {{[{jaeger-collector-headless.observability:14250 0 <nil>}] <nil>} <nil>}","system":"grpc","grpc_log":true} {"level":"info","ts":1574845572.603373,"caller":"base/balancer.go:140","msg":"base.baseBalancer: handle SubConn state change: 0xc00020e210, CONNECTING","system":"grpc","grpc_log":true} {"level":"info","ts":1574845572.6063142,"caller":"agent/main.go:75","msg":"Starting agent"} {"level":"info","ts":1574845572.6065211,"caller":"healthcheck/handler.go:128","msg":"Health Check state change","status":"ready"} {"level":"info","ts":1574845572.6066046,"caller":"app/agent.go:69","msg":"Starting jaeger-agent HTTP server","http-port":5778} {"level":"info","ts":1574845572.6133926,"caller":"base/balancer.go:140","msg":"base.baseBalancer: handle SubConn state change: 0xc00020e210, READY","system":"grpc","grpc_log":true} {"level":"info","ts":1574845572.6136549,"caller":"roundrobin/roundrobin.go:50","msg":"roundrobinPicker: newPicker called with readySCs: map[{jaeger-collector-headless.observability:14250 0 <nil>}:0xc00020e210]","system":"grpc","grpc_log":true}

jpkrohling commented 4 years ago

> These messages are not a problem by themselves, but not when they come every 30 seconds. I changed --reporter.grpc.host-port, and these messages are no longer present.

It should definitely not be happening every 30s. It should happen only once, during the bootstrap. While your change would still work and not cause any harm if you have only one collector replica, I would expect load balancing to not work if you are using multiple collector replicas.

jpkrohling commented 4 years ago

@jkandasa, @kevinearls could you please try to reproduce what @sonant is facing?

@sonant could you share more info about your Kubernetes environment (platform, versions, ...)?

sonant commented 4 years ago

    kubectl version
    Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.2", GitCommit:"66049e3b21efe110454d67df4fa62b08ea79a19b", GitTreeState:"clean", BuildDate:"2019-05-16T16:23:09Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.8-eks-b7174d", GitCommit:"b7174db5ee0e30c94a0b9899c20ac980c0850fc8", GitTreeState:"clean", BuildDate:"2019-10-18T17:56:01Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}

AWS EKS cluster

jpkrohling commented 4 years ago

> AWS EKS cluster

This might be the key, then: perhaps they kill long-lived connections?

sonant commented 4 years ago

This is the sidecar describe output; should these be TCP or UDP ports?

jaeger-agent:
    Container ID:  docker://ac817996cb5b8d08c942eb06ce2af77ed9b9546c854448deafb0fc7643227f18
    Image:         jaegertracing/jaeger-agent:1.15.1
    Image ID:      docker-pullable://jaegertracing/jaeger-agent@sha256:2303ed275c29506e683fdfd32c499427fa9ad92b9ec0879404b9b00c6ecaa50e
    Ports:         5775/TCP, 5778/TCP, 6831/TCP, 6832/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP, 0/TCP
    Args:
      --jaeger.tags=cluster=undefined,deployment.name=demo-metricswriter,pod.namespace=demo,pod.name=${POD_NAME:},container.name=demo-metricswriter
      --reporter.grpc.host-port=jaeger-collector-headless.observability:14250
      --reporter.type=grpc

jpkrohling commented 4 years ago

Not sure I understand your question. The ports you listed are for incoming connections from your application; they aren't relevant for the outgoing connection from the sidecar to the collector.

sonant commented 4 years ago

Hello! This is a second question. I'm talking about this part:

    Ports: []corev1.ContainerPort{
        {
            ContainerPort: zkCompactTrft,
            Name:          "zk-compact-trft",
        },
        {
            ContainerPort: configRest,
            Name:          "config-rest",
        },
        {
            ContainerPort: jgCompactTrft,
            Name:          "jg-compact-trft",
        },
        {
            ContainerPort: jgBinaryTrft,
            Name:          "jg-binary-trft",
        },

The port protocol types are now NOT present in the injector, so by default everything is TCP. I think that's not right for the agent sidecar; shouldn't it be UDP for some ports, or am I wrong?

In this file we are using port protocol types: https://github.com/jaegertracing/jaeger-operator/blob/master/pkg/deployment/agent.go
In this one we are not: https://github.com/jaegertracing/jaeger-operator/blob/v1.9.0/pkg/inject/sidecar.go

Is this the right logic? Thx!
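For illustration, a minimal sketch (hypothetical example, assuming the k8s.io/api/core/v1 types and the default jaeger-agent port numbers) of what setting the protocol explicitly could look like:

    // Hypothetical example, not the operator's actual code: container ports for
    // the agent sidecar with the protocol stated explicitly instead of relying
    // on the TCP default.
    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
    )

    var agentPorts = []corev1.ContainerPort{
        {ContainerPort: 5775, Name: "zk-compact-trft", Protocol: corev1.ProtocolUDP},
        {ContainerPort: 5778, Name: "config-rest", Protocol: corev1.ProtocolTCP},
        {ContainerPort: 6831, Name: "jg-compact-trft", Protocol: corev1.ProtocolUDP},
        {ContainerPort: 6832, Name: "jg-binary-trft", Protocol: corev1.ProtocolUDP},
    }

    func main() {
        for _, p := range agentPorts {
            fmt.Printf("%s %d/%s\n", p.Name, p.ContainerPort, p.Protocol)
        }
    }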

jpkrohling commented 4 years ago

Yes, see #773.

jpkrohling commented 4 years ago

@sonant, to continue investigating this issue, I need you to confirm whether AWS EKS is killing long-lived connections after some time. You can confirm this with a simple application connected to another via gRPC.
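For example (a minimal probe of my own, not an official test), a client like the one below keeps a single gRPC connection open with keepalive pings and logs every connectivity state change; READY flipping to IDLE or TRANSIENT_FAILURE after a fixed interval would suggest connections are being cut:

    // Minimal long-lived-connection probe; point it at any gRPC endpoint,
    // e.g. the collector's 14250 port.
    package main

    import (
        "context"
        "log"
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/keepalive"
    )

    func main() {
        conn, err := grpc.Dial(
            "jaeger-collector-headless.observability:14250",
            grpc.WithInsecure(),
            // HTTP/2 keepalive pings so a silently dropped connection is noticed
            // even when no RPCs are in flight.
            grpc.WithKeepaliveParams(keepalive.ClientParameters{
                Time:                30 * time.Second,
                Timeout:             10 * time.Second,
                PermitWithoutStream: true,
            }),
        )
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()
        conn.Connect() // leave the lazy IDLE state and establish the connection

        // Log every connectivity state change over time.
        for {
            state := conn.GetState()
            log.Printf("connectivity state: %v", state)
            conn.WaitForStateChange(context.Background(), state)
        }
    }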

stale[bot] commented 4 years ago

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.