kubewharf / kelemetry

Global control plane tracing for Kubernetes
Apache License 2.0

kelemetry-jaeger-query-1 keep restarting in dev deployment #138

Closed: calvinxu closed this issue 1 year ago

calvinxu commented 1 year ago

Steps to reproduce

  1. check out 0.2.2 tag from kelemetry repo
  2. make stack
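
For reference, the two steps above as shell commands (a rough sketch; the clone step is assumed, adjust to however you already fetch the repo):

git clone https://github.com/kubewharf/kelemetry.git
cd kelemetry
git checkout 0.2.2   # the tag mentioned in step 1
make stack           # assumed to bring up the dev.docker-compose.yaml shown below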

Expected behavior

Docker containers are up and running

Actual behavior

kelemetry-jaeger-query-1 keeps restarting with the following logs

..{"level":"info","ts":1689923838.4368327,"caller":"grpclog/component.go:71","msg":"[core]pickfirstBalancer: UpdateSubConnState: 0xc000202018, {IDLE connection error: desc = \"transport: Error while dialing dial tcp 172.17.0.1:17271: connect: connection refused\"}","system":"grpc","grpc_log":true}
{"level":"info","ts":1689923838.4368534,"caller":"channelz/funcs.go:340","msg":"[core][Channel #1] Channel Connectivity change to IDLE","system":"grpc","grpc_log":true}
{"level":"info","ts":1689923838.4369116,"caller":"channelz/funcs.go:340","msg":"[core][Channel #1 SubChannel #2] Subchannel Connectivity change to CONNECTING","system":"grpc","grpc_log":true}
{"level":"info","ts":1689923838.4369738,"caller":"channelz/funcs.go:340","msg":"[core][Channel #1 SubChannel #2] Subchannel picks a new address \"172.17.0.1:17271\" to connect","system":"grpc","grpc_log":true}
{"level":"info","ts":1689923838.4370942,"caller":"grpclog/component.go:71","msg":"[core]pickfirstBalancer: UpdateSubConnState: 0xc000202018, {CONNECTING <nil>}","system":"grpc","grpc_log":true}
{"level":"info","ts":1689923838.437123,"caller":"channelz/funcs.go:340","msg":"[core][Channel #1] Channel Connectivity change to CONNECTING","system":"grpc","grpc_log":true}
{"level":"info","ts":1689923838.4372501,"caller":"grpclog/component.go:71","msg":"[core]Creating new client transport to \"{\\n  \\\"Addr\\\": \\\"172.17.0.1:17271\\\",\\n  \\\"ServerName\\\": \\\"172.17.0.1:17271\\\",\\n  \\\"Attributes\\\": null,\\n  \\\"BalancerAttributes\\\": null,\\n  \\\"Type\\\": 0,\\n  \\\"Metadata\\\": null\\n}\": connection error: desc = \"transport: Error while dialing dial tcp 172.17.0.1:17271: connect: connection refused\"","system":"grpc","grpc_log":true}
{"level":"warn","ts":1689923838.4372795,"caller":"channelz/funcs.go:342","msg":"[core][Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {\n  \"Addr\": \"172.17.0.1:17271\",\n  \"ServerName\": \"172.17.0.1:17271\",\n  \"Attributes\": null,\n  \"BalancerAttributes\": null,\n  \"Type\": 0,\n  \"Metadata\": null\n}. Err: connection error: desc = \"transport: Error while dialing dial tcp 172.17.0.1:17271: connect: connection refused\"","system":"grpc","grpc_log":true}
{"level":"info","ts":1689923838.437295,"caller":"channelz/funcs.go:340","msg":"[core][Channel #1 SubChannel #2] Subchannel Connectivity change to TRANSIENT_FAILURE","system":"grpc","grpc_log":true}
{"level":"info","ts":1689923838.437333,"caller":"grpclog/component.go:71","msg":"[core]pickfirstBalancer: UpdateSubConnState: 0xc000202018, {TRANSIENT_FAILURE connection error: desc = \"transport: Error while dialing dial tcp 172.17.0.1:17271: connect: connection refused\"}","system":"grpc","grpc_log":true}
{"level":"info","ts":1689923838.4373517,"caller":"channelz/funcs.go:340","msg":"[core][Channel #1] Channel Connectivity change to TRANSIENT_FAILURE","system":"grpc","grpc_log":true}
{"level":"info","ts":1689923840.6481597,"caller":"channelz/funcs.go:340","msg":"[core][Channel #1] Channel Connectivity change to SHUTDOWN","system":"grpc","grpc_log":true}
{"level":"info","ts":1689923840.6482139,"caller":"channelz/funcs.go:340","msg":"[core][Channel #1 SubChannel #2] Subchannel Connectivity change to SHUTDOWN","system":"grpc","grpc_log":true}
{"level":"info","ts":1689923840.6482327,"caller":"channelz/funcs.go:340","msg":"[core][Channel #1 SubChannel #2] Subchannel deleted","system":"grpc","grpc_log":true}
{"level":"info","ts":1689923840.6482399,"caller":"channelz/funcs.go:340","msg":"[core][Channel #1] Channel deleted","system":"grpc","grpc_log":true}
{"level":"fatal","ts":1689923840.6482666,"caller":"./main.go:107","msg":"Failed to init storage factory","error":"grpc-plugin builder failed to create a store: error connecting to remote storage: context deadline exceeded","stacktrace":"main.main.func1\n\t./main.go:107\ngithub.com/spf13/cobra.(*Command).execute\n\tgithub.com/spf13/cobra@v1.6.1/command.go:916\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tgithub.com/spf13/cobra@v1.6.1/command.go:1044\ngithub.com/spf13/cobra.(*Command).Execute\n\tgithub.com/spf13/cobra@v1.6.1/command.go:968\nmain.main\n\t./main.go:170\nruntime.main\n\truntime/proc.go:250"}
...
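
The repeated connection-refused lines show jaeger-query dialing 172.17.0.1:17271 (the default Docker bridge gateway, i.e. typically the host) until the dial deadline expires and the process exits, which is what makes the container restart. A minimal check from the host, assuming the port from the log above (adjust if your configuration points elsewhere):

docker ps --filter name=jaeger-query   # confirm the container is restart-looping
ss -tlnp | grep 17271                  # is anything on the host listening on the storage plugin port?
nc -vz 172.17.0.1 17271                # the exact dial the container is attempting (OpenBSD netcat syntax)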

dev.docker-compose.yaml

# cat dev.docker-compose.yaml

# docker-compose setup for development setup.
# Use quickstart.docker-compose.yaml if you just want to try out Kelemetry.
# Use the helm chart if you want to deploy in production.
version: "2.2"
services:
  # ETCD cache storage, only required if etcd cache is used
  etcd:
    image: quay.io/coreos/etcd:v3.2
    entrypoint: [etcd]
    command:
      - -name=main
      - -advertise-client-urls=http://etcd:2479
      - -listen-client-urls=http://0.0.0.0:2479
      - -initial-advertise-peer-urls=http://etcd:2380
      - -listen-peer-urls=http://0.0.0.0:2380
      - -initial-cluster-state=new
      - -initial-cluster=main=http://etcd:2380
      - -initial-cluster-token=etcd-cluster-1
      - -data-dir=/var/run/etcd/default.etcd
    volumes:
      - etcd:/var/run/etcd/
    ports:
      - 2479:2379
    restart: always
  # Web frontend for trace view.
  jaeger-query:
    image: jaegertracing/jaeger-query:1.42
    environment:
      SPAN_STORAGE_TYPE: grpc-plugin
      GRPC_STORAGE_SERVER: host.docker.internal:17272 # run on host directly
    ports:
      - 0.0.0.0:16686:16686
    restart: always
  # OTLP collector that writes to Badger
  jaeger-collector:
    image: jaegertracing/jaeger-collector:1.42
    environment:
      COLLECTOR_OTLP_ENABLED: "true"
      SPAN_STORAGE_TYPE: grpc-plugin
      GRPC_STORAGE_SERVER: remote-badger:17271
    ports:
      - 0.0.0.0:4317:4317
    restart: always
  # Backend badger storage
  # Feel free to override environment.SPAN_STORAGE_TYPE to other storages given the proper configuration.
  remote-badger:
    image: jaegertracing/jaeger-remote-storage:1.42
    environment:
      SPAN_STORAGE_TYPE: badger
      BADGER_EPHEMERAL: "false"
      BADGER_DIRECTORY_KEY: /mnt/badger/key
      BADGER_DIRECTORY_VALUE: /mnt/badger/data
    ports:
      - 127.0.0.1:17272:17271
    volumes:
      - badger:/mnt/badger

  # Web frontend for raw trace database view.
  jaeger-query-raw:
    image: jaegertracing/jaeger-query:1.42
    environment:
      SPAN_STORAGE_TYPE: grpc-plugin
      GRPC_STORAGE_SERVER: remote-badger:17271
    ports:
      - 0.0.0.0:26686:16686
    restart: always

volumes:
  etcd: {}
  badger: {}
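
If it is unclear which storage address each container actually received (jaeger-query dials the host, while jaeger-collector and jaeger-query-raw dial remote-badger inside the compose network), a small inspection sketch, assuming the container name from the issue title:

docker compose -f dev.docker-compose.yaml config   # render the effective configuration
docker compose -f dev.docker-compose.yaml ps       # which services are up vs. restarting
docker inspect kelemetry-jaeger-query-1 --format '{{range .Config.Env}}{{println .}}{{end}}' | grep GRPC_STORAGE_SERVER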

Kelemetry version

0.2.2

Environment

k8s: 1.24.15, Jaeger: 1.42

SOF3 commented 1 year ago

The query server depends on the Kelemetry storage plugin server: it can't start until it successfully connects to the storage plugin server, which runs on the host when you run make. It should be fixed once you start running the dev instance.
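
A quick way to see this dependency in action (a sketch; file and service names taken from the compose file above): watch the query container while the host-side plugin is down, then start the dev instance and the crash loop should stop.

docker compose -f dev.docker-compose.yaml ps jaeger-query        # stays in a restart loop until the plugin is reachable
docker compose -f dev.docker-compose.yaml logs -f jaeger-query   # the connection-refused lines stop once the plugin listens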

calvinxu commented 1 year ago

the query server depends on the kelemetry storage plugin server. it can't start until it successfully connects to the storage plugin server, which is run in the host by running make. it should be fixed as long as you start running the dev instance.

-- So based on the current setup for dev, how do I work around it? Or if I want to deploy using Docker, is that supported?

SOF3 commented 1 year ago
  • make stack is for developers who are developing Kelemetry and want to test the functionality of Kelemetry on a test server whilst running the Kelemetry binary in terminal foreground. It works normally as long as you run make in your terminal.
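
Put differently, the dev flow this comment implies (a sketch, not official docs) keeps the containers and the binary running side by side:

# terminal 1: start the docker-compose dev stack (jaeger-query, jaeger-collector, remote-badger, etcd)
make stack

# terminal 2: build and run the Kelemetry binary in the foreground;
# this process serves the storage plugin endpoint that the jaeger-query container keeps dialing
make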

run "make" and start kelemetry, any advice on below logs?

INFO[0254] error retrieving resource lock default/kelemetry-diff-writer-leader-election-0: Get "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/default/leases/kelemetry-diff-writer-leader-election-0": dial tcp 127.0.0.1:6443: connect: connection refused  mod=kube
INFO[0254] error retrieving resource lock default/kelemetry-diff-writer-leader-election-0: Get "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/default/leases/kelemetry-diff-writer-leader-election-0": dial tcp 127.0.0.1:6443: connect: connection refused  mod=kube
INFO[0256] error retrieving resource lock default/kelemetry-diff-writer-leader-election-0: Get "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/default/leases/kelemetry-diff-writer-leader-election-0": dial tcp 127.0.0.1:6443: connect: connection refused  mod=kube
INFO[0258] error retrieving resource lock default/kelemetry-diff-writer-leader-election-0: Get "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/default/leases/kelemetry-diff-writer-leader-election-0": dial tcp 127.0.0.1:6443: connect: connection refused  mod=kube
INFO[0260] error retrieving resource lock default/kelemetry-diff-writer-leader-election-0: Get "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/default/leases/kelemetry-diff-writer-leader-election-0": dial tcp 127.0.0.1:6443: connect: connection refused  mod=kube
INFO[0260] error retrieving resource lock default/kelemetry-diff-writer-leader-election-0: Get "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/default/leases/kelemetry-diff-writer-leader-election-0": dial tcp 127.0.0.1:6443: connect: connection refused  mod=kube
INFO[0264] error retrieving resource lock default/kelemetry-diff-writer-leader-election-0: Get "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/default/leases/kelemetry-diff-writer-leader-election-0": dial tcp 127.0.0.1:6443: connect: connection refused  mod=kube
INFO[0264] error retrieving resource lock default/kelemetry-diff-writer-leader-election-0: Get "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/default/leases/kelemetry-diff-writer-leader-election-0": dial tcp 127.0.0.1:6443: connect: connection refused  mod=kube
INFO[0265] pkg/mod/k8s.io/client-go@v0.27.3/tools/cache/reflector.go:231: failed to list *v1.ConfigMap: Get "https://127.0.0.1:6443/api/v1/namespaces/default/configmaps?fieldSelector=metadata.name%3Dkelemetry-filter-config&limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused  mod=kube
INFO[0265] pkg/mod/k8s.io/client-go@v0.27.3/tools/cache/reflector.go:231: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: Get "https://127.0.0.1:6443/api/v1/namespaces/default/configmaps?fieldSelector=metadata.name%3Dkelemetry-filter-config&limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused  mod=kube
INFO[0268] error retrieving resource lock default/kelemetry-diff-writer-leader-election-0: Get "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/default/leases/kelemetry-diff-writer-leader-election-0": dial tcp 127.0.0.1:6443: connect: connection refused  mod=kube
INFO[0268] error retrieving resource lock default/kelemetry-diff-writer-leader-election-0: Get "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/default/leases/kelemetry-diff-writer-leader-election-0": dial tcp 127.0.0.1:6443: connect: connection refused  mod=kube

SOF3 commented 1 year ago

Did you make kind before make stack?
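
For context, the connection-refused errors against 127.0.0.1:6443 above are what you would see when no local API server exists yet. A hedged way to check that the kind cluster is actually there (cluster name tracetest inferred from the kind-tracetest context mentioned in the next comment):

make kind                                        # assumed to create the local kind cluster for the dev setup
kind get clusters                                # should list tracetest
kubectl cluster-info --context kind-tracetest    # should reach the kind API server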

SOF3 commented 1 year ago

You may also need to export KUBECONFIG to the kubeconfig for the new kind cluster (kind-tracetest).
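
A minimal sketch of that export, using kind's standard kubeconfig commands:

kind export kubeconfig --name tracetest               # merge into ~/.kube/config and switch the current context
# or keep it in a separate file:
kind get kubeconfig --name tracetest > /tmp/kind-tracetest.kubeconfig
export KUBECONFIG=/tmp/kind-tracetest.kubeconfig
kubectl config current-context                        # should print kind-tracetest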

calvinxu commented 1 year ago

I ran make and exported KUBECONFIG; Kelemetry started and runs normally, but it still has the same issue tracing the audit log.

SOF3 commented 1 year ago

I can't reproduce this, and it is working correctly in the CI as well. Could you provide more detailed steps to reproduce this?

calvinxu commented 1 year ago

It seems I mixed something up in my environment, so please ignore my previous comments that got mixed up with "Audit log is missed in the tracing data". Just a quick question: is the issue "Audit log is missed in the tracing data" (https://github.com/kubewharf/kelemetry/issues/133), where the Jaeger UI only shows events and no audit logs appear in the logs list of each span, currently tracked by https://github.com/kubewharf/kelemetry/issues/127 and targeted to be fixed by this PR? Thanks.

SOF3 commented 1 year ago

The issue debugged in #133 is tracked in #127, but there is no fix yet because I haven't debugged the cause. I am on vacation right now and will have a look next week.