jaegertracing / jaeger

CNCF Jaeger, a Distributed Tracing Platform
https://www.jaegertracing.io/
Apache License 2.0
jaeger agents unable to send data to collectors #1995

Closed Alexc0007 closed 4 years ago

Alexc0007 commented 4 years ago

Requirement - what kind of business use case are you trying to solve?

Hi everyone, I'm trying to set up Jaeger to trace networking on my Kubernetes cluster. My cluster is EKS (AWS), managed by Ocean (Spotinst). I installed Jaeger as a production build, with the agents deployed using the DaemonSet strategy, from the Helm chart: https://jaegertracing.github.io/helm-charts/

The issue I'm facing is that my agents are unable to send data to the collectors. This affects both the DaemonSet agents and an agent I tried to set up as a sidecar to a pod.

The agent configuration looks like this:

Name:           jaeger-agent-fm5sb
Namespace:      observability
Priority:       0
Node:           IP
Start Time:     Tue, 31 Dec 2019 06:59:16 +0000
Labels:         app.kubernetes.io/component=agent
                app.kubernetes.io/instance=jaeger
                app.kubernetes.io/name=jaeger
                controller-revision-hash=7bdb8fcfb5
                pod-template-generation=1
Annotations:    kubernetes.io/psp: eks.privileged
Status:         Running
IP:             IP
IPs:            <none>
Controlled By:  DaemonSet/jaeger-agent
Containers:
  jaeger-agent:
    Container ID:   docker://5fcc545050ab7fe13c64879e9e0471767d5196fd9f01cab071e46bff1652c4f1
    Image:          jaegertracing/jaeger-agent:1.16.0
    Image ID:       docker-pullable://jaegertracing/jaeger-agent@sha256:2de3e2324880c16c150ab56df19fe5c320c29164e2ea4a4cb15f543ac40cc3f6
    Ports:          5775/UDP, 6831/UDP, 6832/UDP, 5778/TCP, 14271/TCP
    Host Ports:     0/UDP, 0/UDP, 0/UDP, 0/TCP, 0/TCP
    State:          Running
      Started:      Tue, 31 Dec 2019 06:59:18 +0000
    Ready:          True
    Restart Count:  0
    Liveness:       http-get http://:admin/ delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:admin/ delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      REPORTER_GRPC_HOST_PORT:  jaeger-collector:14250
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from jaeger-agent-token-r245b (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  jaeger-agent-token-r245b:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  jaeger-agent-token-r245b
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/pid-pressure:NoSchedule
                 node.kubernetes.io/unreachable:NoExecute
                 node.kubernetes.io/unschedulable:NoSchedule
Events:          <none>

The error I see in the agent logs looks like this:

{"level":"info","ts":1577790697.5854564,"caller":"base/balancer.go:140","msg":"base.baseBalancer: handle SubConn state change: 0xc000145e70, CONNECTING","system":"grpc","grpc_log":true}
{"level":"info","ts":1577790697.5855157,"caller":"roundrobin/roundrobin.go:50","msg":"roundrobinPicker: newPicker called with readySCs: map[]","system":"grpc","grpc_log":true}
{"level":"info","ts":1577790727.589917,"caller":"grpc/clientconn.go:1283","msg":"grpc: addrConn.createTransport failed to connect to {jaeger-collector:14250 0  <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp: lookup jaeger-collector on 10.100.0.10:53: read udp 172.19.104.108:60679->10.100.0.10:53: i/o timeout\". Reconnecting...","system":"grpc","grpc_log":true}
{"level":"info","ts":1577790727.5899866,"caller":"base/balancer.go:140","msg":"base.baseBalancer: handle SubConn state change: 0xc000145e70, TRANSIENT_FAILURE","system":"grpc","grpc_log":true}

If anyone has an idea why this might be happening and what the cause is, I'd be happy for an explanation. Thanks in advance.
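To make the failure in the log above easier to read, here is a small Python sketch that pulls the DNS lookup details out of gRPC log lines like the createTransport entry. The function name failed_lookups and the regular expression are assumptions written for this log format only; they are not part of Jaeger or gRPC.

```python
import json
import re

# Matches the "lookup <name> on <dns-server>:<port>" fragment that appears in
# Go DNS error messages such as the one in the agent log above.
LOOKUP_RE = re.compile(r"lookup (\S+) on (\d+\.\d+\.\d+\.\d+:\d+)")


def failed_lookups(log_lines):
    """Return (hostname, dns_server) pairs for DNS lookups mentioned in log messages."""
    results = []
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(entry, dict):
            continue
        match = LOOKUP_RE.search(entry.get("msg", ""))
        if match:
            results.append(match.groups())
    return results


# The createTransport line from the agent log, copied verbatim.
sample = r'{"level":"info","ts":1577790727.589917,"caller":"grpc/clientconn.go:1283","msg":"grpc: addrConn.createTransport failed to connect to {jaeger-collector:14250 0  <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp: lookup jaeger-collector on 10.100.0.10:53: read udp 172.19.104.108:60679->10.100.0.10:53: i/o timeout\". Reconnecting...","system":"grpc","grpc_log":true}'

print(failed_lookups([sample]))
```

For this log, the result shows the agent asking the DNS server at 10.100.0.10:53 for the unqualified name jaeger-collector and timing out, which points at name resolution rather than at the collector itself.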

jpkrohling commented 4 years ago

Could you try to qualify the Jaeger Collector's hostname with the namespace it's deployed in? For example, jaeger-collector.default:14250. Ideally, you'd use the DNS scheme with a headless service, so that your agent connections get load balanced across the collectors: dns:///jaeger-collector-headless.default:14250
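The two target forms suggested here can be sketched as a small Python helper. This is only an illustration: the function name collector_target and its parameters are hypothetical, and the "-headless" suffix simply mirrors the service name used in the example above. Port 14250 is the one from the agent's REPORTER_GRPC_HOST_PORT setting.

```python
def collector_target(service, namespace, port=14250, headless=False):
    """Build a namespace-qualified gRPC target for the Jaeger collector.

    With headless=True, the target uses the "dns:///" scheme and the
    "<service>-headless" name, matching the load-balanced form suggested
    in the comment above.
    """
    if headless:
        return f"dns:///{service}-headless.{namespace}:{port}"
    return f"{service}.{namespace}:{port}"


# Plain qualified name, then the headless variant, for the "observability"
# namespace shown in the pod description.
print(collector_target("jaeger-collector", "observability"))
print(collector_target("jaeger-collector", "observability", headless=True))
```

The resulting string would go into the agent's REPORTER_GRPC_HOST_PORT environment variable, replacing the unqualified jaeger-collector:14250.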

Alexc0007 commented 4 years ago

Hi, thank you for trying to help! I'll try that and report back on what happens.

Alexc0007 commented 4 years ago

@jpkrohling, I followed your advice and added the collector's namespace to the deployment. The errors seem to have stopped. However, the only traces I see in Jaeger are still from the jaeger-query service. Why don't I see traces from other services? What am I missing here?

jpkrohling commented 4 years ago

Hard to say without further information. Have you checked the troubleshooting guide? https://www.jaegertracing.io/docs/1.16/troubleshooting/

Alexc0007 commented 4 years ago

Hi, the troubleshooting guide isn't informative enough. When I look at my agent logs, they don't seem to contain anything, and neither do the collector logs.

This is all the info I see in the agent's log:

2020/01/21 09:28:35 maxprocs: Leaving GOMAXPROCS=2: CPU quota undefined
{"level":"info","ts":1579598915.8331347,"caller":"flags/service.go:115","msg":"Mounting metrics handler on admin server","route":"/metrics"}
{"level":"info","ts":1579598915.8337948,"caller":"flags/admin.go:108","msg":"Mounting health check on admin server","route":"/"}
{"level":"info","ts":1579598915.8340962,"caller":"flags/admin.go:114","msg":"Starting admin HTTP server","http-port":14271}
{"level":"info","ts":1579598915.8341193,"caller":"flags/admin.go:100","msg":"Admin server started","http-port":14271,"health-status":"unavailable"}
{"level":"warn","ts":1579598915.834641,"caller":"tchannel/flags.go:72","msg":"Using deprecated configuration","option":"collector.host-port"}
{"level":"info","ts":1579598915.8363159,"caller":"grpc/builder.go:65","msg":"Agent requested insecure grpc connection to collector(s)"}
{"level":"info","ts":1579598915.8363705,"caller":"grpc/clientconn.go:245","msg":"parsed scheme: \"\"","system":"grpc","grpc_log":true}
{"level":"info","ts":1579598915.8364089,"caller":"grpc/clientconn.go:251","msg":"scheme \"\" not registered, fallback to default scheme","system":"grpc","grpc_log":true}
{"level":"info","ts":1579598915.8364532,"caller":"grpc/resolver_conn_wrapper.go:178","msg":"ccResolverWrapper: sending update to cc: {[{jaeger-collector.observability:14250 0  <nil>}] <nil>}","system":"grpc","grpc_log":true}
{"level":"info","ts":1579598915.836476,"caller":"grpc/clientconn.go:659","msg":"ClientConn switching balancer to \"round_robin\"","system":"grpc","grpc_log":true}
{"level":"info","ts":1579598915.8365076,"caller":"base/balancer.go:83","msg":"base.baseBalancer: got new ClientConn state: {{[{jaeger-collector.observability:14250 0  <nil>}] <nil>} <nil>}","system":"grpc","grpc_log":true}
{"level":"info","ts":1579598915.8428907,"caller":"base/balancer.go:140","msg":"base.baseBalancer: handle SubConn state change: 0xc000216eb0, CONNECTING","system":"grpc","grpc_log":true}
{"level":"info","ts":1579598915.847945,"caller":"agent/main.go:75","msg":"Starting agent"}
{"level":"info","ts":1579598915.8480184,"caller":"healthcheck/handler.go:128","msg":"Health Check state change","status":"ready"}
{"level":"info","ts":1579598915.8483636,"caller":"app/agent.go:69","msg":"Starting jaeger-agent HTTP server","http-port":5778}
{"level":"info","ts":1579598925.8471503,"caller":"base/balancer.go:140","msg":"base.baseBalancer: handle SubConn state change: 0xc000216eb0, READY","system":"grpc","grpc_log":true}
{"level":"info","ts":1579598925.8472536,"caller":"roundrobin/roundrobin.go:50","msg":"roundrobinPicker: newPicker called with readySCs: map[{jaeger-collector.observability:14250 0  <nil>}:0xc000216eb0]","system":"grpc","grpc_log":true}

The only traces I do see in the query UI are traces of jaeger-query. It seems like there is a sidecar agent running in the jaeger-query pod.

I also tried to copy this sidecar agent to a different service, but it doesn't report anything. In its logs, I see the same output as I reported above.

What am I missing?

jpkrohling commented 4 years ago

I also tried to copy this sidecar agent to a different service, but it doesn't report anything. In its logs, I see the same output as I reported above.

How are your applications instrumented? Can you turn on the client logs as well (see "Use the logging reporter" in the troubleshooting guide)? If you can't see spans being reported by your application, that might explain why the agent isn't logging the span batches it's receiving ;-)

Alexc0007 commented 4 years ago

After reading a bit more about Jaeger, I'll ask just to be sure: do the clients have to be implemented inside my applications? Can't the agents just collect data? If that is the case, then I have no clients, only agents deployed.

Another question, in case the above is true: even if an agent is deployed as a sidecar, can't it serve as a client?

jpkrohling commented 4 years ago

After reading a bit more about Jaeger, I'll ask just to be sure...

You might want to check the OpenTracing tutorial to have a better understanding of how it all fits together. The same basic principles apply to most of the modern distributed tracing tools, including Jaeger.

Do the clients have to be implemented inside my applications?

Typically, yes. You'd instrument your code with an API like OpenTracing, and plug in a concrete client (aka tracer), which will send this data "somewhere". In Jaeger's case, this "somewhere" is typically the local agent.

You can use some libraries that automatically instrument frameworks and platforms, such as Quarkus, Spring Boot, and so on.

Can't the agents just collect data?

Jaeger Agents are "passive", in that they just receive data that is reported by the clients.
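The division of labor described here can be illustrated with a toy sketch in Python, using only the standard library. It is not the Jaeger protocol: real clients encode spans as Thrift and send them to the agent's UDP ports (such as 6831/UDP in the pod description earlier), while this sketch sends a JSON payload to a stand-in listener on an ephemeral loopback port. The names start_toy_agent and report_span are hypothetical. The point is only that the "agent" side does nothing until the application side sends it a span.

```python
import json
import socket


def start_toy_agent():
    """Open a UDP socket that passively waits for spans (stand-in for an agent)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", 0))  # ephemeral port, loopback only
    sock.settimeout(2.0)         # fail fast instead of blocking forever
    return sock


def report_span(agent_addr, service, operation, duration_ms):
    """Stand-in for an instrumented application: actively push one span."""
    payload = json.dumps(
        {"service": service, "operation": operation, "duration_ms": duration_ms}
    ).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as client:
        client.sendto(payload, agent_addr)


agent = start_toy_agent()
agent_addr = agent.getsockname()

# Without this call the agent would receive nothing, which is the situation
# in this thread: agents deployed, but no instrumented applications.
report_span(agent_addr, "checkout", "GET /cart", 12.5)

data, _ = agent.recvfrom(65535)
received = json.loads(data)
agent.close()
print(received)
```

In a real deployment, the role of report_span is played by an application instrumented with a tracing client library. Deploying the agent as a DaemonSet or sidecar only provides the listener.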

yurishkuro commented 4 years ago

Assuming this is resolved and closing. Please reopen if needed.