Closed: weeco closed this issue 4 years ago.
Please, check the troubleshooting guide: https://www.jaegertracing.io/docs/1.14/troubleshooting/
There is nothing in that guide about debugging the Query UI failing to find spans and traces in Elasticsearch. Most of the guide focuses on the write path, which works for me.
Sorry, looks like I read your question a bit too fast. To be honest, I have not seen this before. The only thing I could offer right now is to double-check your connection settings, comparing the one from Query with the one from the Collector.
You could also post the logs from your collector and query, perhaps we can spot something that you haven't...
I believe the Elasticsearch configuration is fine, and the query service can connect to Elasticsearch. With a wrong config, the query service logs this on startup:
{"level":"info","ts":1572364318.4869714,"caller":"flags/service.go:115","msg":"Mounting metrics handler on admin server","route":"/metrics"}
{"level":"info","ts":1572364318.4872856,"caller":"flags/admin.go:108","msg":"Mounting health check on admin server","route":"/"}
{"level":"info","ts":1572364318.487388,"caller":"flags/admin.go:114","msg":"Starting admin HTTP server","http-port":16687}
{"level":"info","ts":1572364318.487408,"caller":"flags/admin.go:100","msg":"Admin server started","http-port":16687,"health-status":"unavailable"}
{"level":"fatal","ts":1572364324.5567038,"caller":"query/main.go:88","msg":"Failed to init storage factory","error":"failed to create primary Elasticsearch client: Head https://logging-es-http.elastic-system.svc:9201: context deadline exceeded","errorVerbose":"Head https://logging-es-http.elastic-system.svc:9201: context deadline exceeded\nfailed to create primary Elasticsearch client\ngithub.com/jaegertracing/jaeger/plugin/storage/es.(*Factory).Initialize\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/plugin/storage/es/factory.go:83\ngithub.com/jaegertracing/jaeger/plugin/storage.(*Factory).Initialize\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/plugin/storage/factory.go:108\nmain.main.func1\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/query/main.go:87\ngithub.com/jaegertracing/jaeger/vendor/github.com/spf13/cobra.(*Command).execute\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/vendor/github.com/spf13/cobra/command.go:762\ngithub.com/jaegertracing/jaeger/vendor/github.com/spf13/cobra.(*Command).ExecuteC\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/vendor/github.com/spf13/cobra/command.go:852\ngithub.com/jaegertracing/jaeger/vendor/github.com/spf13/cobra.(*Command).Execute\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/vendor/github.com/spf13/cobra/command.go:800\nmain.main\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/query/main.go:130\nruntime.main\n\t/home/travis/.gimme/versions/go1.12.1.linux.amd64/src/runtime/proc.go:200\nruntime.goexit\n\t/home/travis/.gimme/versions/go1.12.1.linux.amd64/src/runtime/asm_amd64.s:1337","stacktrace":"main.main.func1\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/query/main.go:88\ngithub.com/jaegertracing/jaeger/vendor/github.com/spf13/cobra.(*Command).execute\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/vendor/github.com/spf13/cobra/command.go:762\ngithub.com/jaegertracing/jaeger/vendor/github.com/spf1
3/cobra.(*Command).ExecuteC\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/vendor/github.com/spf13/cobra/command.go:852\ngithub.com/jaegertracing/jaeger/vendor/github.com/spf13/cobra.(*Command).Execute\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/vendor/github.com/spf13/cobra/command.go:800\nmain.main\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/query/main.go:130\nruntime.main\n\t/home/travis/.gimme/versions/go1.12.1.linux.amd64/src/runtime/proc.go:200"}
With the correct Elasticsearch settings the log looks fine:
kubectl logs -f jaeger-query-74db5fd5c5-l2cr6
2019/10/29 16:03:37 maxprocs: Updating GOMAXPROCS=1: determined from CPU quota
{"level":"info","ts":1572365017.8689969,"caller":"flags/service.go:115","msg":"Mounting metrics handler on admin server","route":"/metrics"}
{"level":"info","ts":1572365017.8692174,"caller":"flags/admin.go:108","msg":"Mounting health check on admin server","route":"/"}
{"level":"info","ts":1572365017.8693018,"caller":"flags/admin.go:114","msg":"Starting admin HTTP server","http-port":16687}
{"level":"info","ts":1572365017.869334,"caller":"flags/admin.go:100","msg":"Admin server started","http-port":16687,"health-status":"unavailable"}
{"level":"info","ts":1572365017.9112318,"caller":"config/config.go:172","msg":"Elasticsearch detected","version":7}
{"level":"info","ts":1572365017.9125092,"caller":"healthcheck/handler.go:130","msg":"Health Check state change","status":"ready"}
{"level":"info","ts":1572365017.91256,"caller":"app/server.go:135","msg":"Starting CMUX server","port":16686}
{"level":"info","ts":1572365017.9125957,"caller":"app/server.go:112","msg":"Starting HTTP server","port":16686}
{"level":"info","ts":1572365017.9126284,"caller":"app/server.go:125","msg":"Starting GRPC server","port":16686}
I tend to believe that the query UI is for some reason not able to see/query the data in Elasticsearch (maybe because I am using ES v7?).
Collector logs:
kubectl logs -f jaeger-collector-556598c676-8qs6c
2019/10/29 12:42:47 maxprocs: Updating GOMAXPROCS=2: determined from CPU quota
{"level":"info","ts":1572352967.9401166,"caller":"flags/service.go:115","msg":"Mounting metrics handler on admin server","route":"/metrics"}
{"level":"info","ts":1572352967.94039,"caller":"flags/admin.go:108","msg":"Mounting health check on admin server","route":"/"}
{"level":"info","ts":1572352967.9405358,"caller":"flags/admin.go:114","msg":"Starting admin HTTP server","http-port":14269}
{"level":"info","ts":1572352967.9405482,"caller":"flags/admin.go:100","msg":"Admin server started","http-port":14269,"health-status":"unavailable"}
{"level":"info","ts":1572352968.021824,"caller":"config/config.go:172","msg":"Elasticsearch detected","version":7}
{"level":"info","ts":1572352968.6172075,"caller":"static/strategy_store.go:79","msg":"No sampling strategies provided, using defaults"}
{"level":"info","ts":1572352968.6176662,"caller":"collector/main.go:128","msg":"Starting jaeger-collector TChannel server","port":14267}
{"level":"info","ts":1572352968.6177654,"caller":"grpcserver/grpc_server.go:102","msg":"Starting jaeger-collector gRPC server","grpc-port":"14250"}
{"level":"info","ts":1572352968.617974,"caller":"collector/main.go:147","msg":"Starting jaeger-collector HTTP server","http-port":14268}
{"level":"info","ts":1572352968.618001,"caller":"healthcheck/handler.go:130","msg":"Health Check state change","status":"ready"}
{"level":"info","ts":1572352968.6600127,"caller":"collector/main.go:242","msg":"Listening for Zipkin HTTP traffic","zipkin.http-port":9411}
Kubernetes deployment of query component:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: jaeger
    component: query
  name: jaeger-query
  namespace: jaeger
spec:
  progressDeadlineSeconds: 600
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: jaeger
      component: query
  strategy:
    type: Recreate
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "16687"
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        app: jaeger
        component: query
        namespace: default
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - jaeger
                - key: component
                  operator: In
                  values:
                  - query
              topologyKey: failure-domain.beta.kubernetes.io/zone
            weight: 100
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - jaeger
                - key: component
                  operator: In
                  values:
                  - query
              topologyKey: kubernetes.io/hostname
            weight: 20
      automountServiceAccountToken: false
      containers:
      - env:
        - name: QUERY_BASE_PATH
          value: /
        - name: SPAN_STORAGE_TYPE
          value: elasticsearch
        - name: ES_SERVER_URLS
          value: https://logging-es-http.elastic-system.svc:9200
        - name: LOG_LEVEL
          value: debug
        - name: ES_TLS_CA
          value: /etc/jaeger/elasticsearch-certs/ca.crt
        image: jaegertracing/jaeger-query:1.14.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: health
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: query
        ports:
        - containerPort: 16686
          name: ui
          protocol: TCP
        - containerPort: 16687
          name: health
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /ready
            port: health
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: "1"
            memory: 1Gi
          requests:
            cpu: 100m
            memory: 500Mi
        volumeMounts:
        - mountPath: /etc/jaeger/elasticsearch-certs
          name: elasticsearch-certs
      dnsPolicy: ClusterFirst
      nodeSelector:
        cloud.google.com/gke-nodepool: default-v2
      securityContext: {}
      volumes:
      - name: elasticsearch-certs
        secret:
          defaultMode: 420
          optional: false
          secretName: logging-es-http-certs-public
@kevinearls, @pavolloffay, does it ring a bell?
@jpkrohling Not that I've seen, sorry. I haven't ever used ES v7 though, so maybe that has something to do with it.
If the Jaeger services started up cleanly, that proves the connection to Elasticsearch is fine.
The issue seems to be in reporting spans.
@weeco how do you report spans to Jaeger?
@pavolloffay Jaeger traces are reported by Cortex v0.3.0 (https://github.com/cortexproject/cortex) and I run everything in Kubernetes. I have a daemonset for the jaeger agents, and a deployment for the jaeger collector and query.
Just in case you've missed it: I ensured that spans and traces actually land in Elasticsearch (see Screenshot).
I could share the deployment manifests of all involved components (jaeger agent daemonset, collector and query deployments) if you think this could help.
Side note: I had Jaeger working (using pretty much the same manifests) with Elasticsearch 6.8 and Jaeger v1.12.
It indeed seems like a problem on the query side.
I have tried ES 7.4 with all-in-one 1.14 and I was able to see traces from jaeger-query (it traces itself).
docker run --rm -it -e SPAN_STORAGE_TYPE=elasticsearch -e ES_SERVER_URLS=http://elasticsearch:9200 --link elasticsearch -p 16686:16686 jaegertracing/all-in-one:1.14.0
docker run -it --rm -e "ES_JAVA_OPTS=-Xms2g -Xmx2g" -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" --name=elasticsearch docker.elastic.co/elasticsearch/elasticsearch-oss:7.4.1
You can also try to fetch a trace from the query REST API, e.g. http://<HOST>:16686/api/traces/566ee6fdd7645562
I picked a trace id from Kibana and tried a REST call against the query service; it returned a 404:
URL: http://localhost:16686/api/traces/369af4970a2ad8fd
{"data":null,"total":0,"limit":0,"offset":0,"errors":[{"code":404,"msg":"trace not found"}]}
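If you want to script this check, the response body can be interpreted mechanically. Here is a minimal sketch; the `trace_found` helper is hypothetical (not part of Jaeger), and it only distinguishes the "trace not found" error shape shown above from a successful lookup:

```python
import json

def trace_found(body: str) -> bool:
    """Interpret a jaeger-query /api/traces/<id> response body.

    Returns True when trace data came back, False on the 404
    'trace not found' error shape. Hypothetical helper for scripting.
    """
    payload = json.loads(body)
    if payload.get("data"):          # non-empty data array => trace exists
        return True
    errors = payload.get("errors") or []
    if any(e.get("code") == 404 for e in errors):
        return False
    raise ValueError(f"unexpected response: {body!r}")

# The 404 body returned for the trace id picked from Kibana:
resp = '{"data":null,"total":0,"limit":0,"offset":0,"errors":[{"code":404,"msg":"trace not found"}]}'
print(trace_found(resp))  # False
```

A False result here while the same trace id is visible in Kibana is a strong hint that query is searching different indices than the collector wrote to.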
I am not sure why it cannot find the data in Elasticsearch. Is it maybe something with the indices in Elasticsearch?
Could you please paste here indices?
jaeger-span-write settings:
{
  "settings": {
    "index": {
      "mapping": {
        "nested_fields": {
          "limit": "50"
        }
      },
      "number_of_shards": "5",
      "provided_name": "jaeger-span-write",
      "creation_date": "1572280321087",
      "requests": {
        "cache": {
          "enable": "true"
        }
      },
      "number_of_replicas": "1",
      "uuid": "Mrs1XgSqQHa4rMiIIY0-AA",
      "version": {
        "created": "7040099"
      }
    }
  },
  "defaults": {
    "index": {
      "flush_after_merge": "512mb",
      "max_inner_result_window": "100",
      "unassigned": {
        "node_left": {
          "delayed_timeout": "1m"
        }
      },
      "max_terms_count": "65536",
      "lifecycle": {
        "name": "",
        "rollover_alias": "",
        "indexing_complete": "false"
      },
      "routing_partition_size": "1",
      "force_memory_term_dictionary": "false",
      "max_docvalue_fields_search": "100",
      "merge": {
        "scheduler": {
          "max_thread_count": "1",
          "auto_throttle": "true",
          "max_merge_count": "6"
        },
        "policy": {
          "reclaim_deletes_weight": "2.0",
          "floor_segment": "2mb",
          "max_merge_at_once_explicit": "30",
          "max_merge_at_once": "10",
          "max_merged_segment": "5gb",
          "expunge_deletes_allowed": "10.0",
          "segments_per_tier": "10.0",
          "deletes_pct_allowed": "33.0"
        }
      },
      "max_refresh_listeners": "1000",
      "max_regex_length": "1000",
      "load_fixed_bitset_filters_eagerly": "true",
      "number_of_routing_shards": "1",
      "write": {
        "wait_for_active_shards": "1"
      },
      "verified_before_close": "false",
      "mapping": {
        "coerce": "false",
        "nested_objects": {
          "limit": "10000"
        },
        "depth": {
          "limit": "20"
        },
        "ignore_malformed": "false",
        "field_name_length": {
          "limit": "9223372036854775807"
        },
        "total_fields": {
          "limit": "1000"
        }
      },
      "source_only": "false",
      "soft_deletes": {
        "enabled": "false",
        "retention": {
          "operations": "0"
        },
        "retention_lease": {
          "period": "12h"
        }
      },
      "max_script_fields": "32",
      "query": {
        "default_field": [
          "*"
        ],
        "parse": {
          "allow_unmapped_fields": "true"
        }
      },
      "format": "0",
      "frozen": "false",
      "sort": {
        "missing": [],
        "mode": [],
        "field": [],
        "order": []
      },
      "priority": "1",
      "codec": "default",
      "max_rescore_window": "10000",
      "max_adjacency_matrix_filters": "100",
      "analyze": {
        "max_token_count": "10000"
      },
      "gc_deletes": "60s",
      "optimize_auto_generated_id": "true",
      "max_ngram_diff": "1",
      "translog": {
        "generation_threshold_size": "64mb",
        "flush_threshold_size": "512mb",
        "sync_interval": "5s",
        "retention": {
          "size": "512MB",
          "age": "12h"
        },
        "durability": "REQUEST"
      },
      "auto_expand_replicas": "false",
      "mapper": {
        "dynamic": "true"
      },
      "data_path": "",
      "highlight": {
        "max_analyzed_offset": "1000000"
      },
      "routing": {
        "rebalance": {
          "enable": "all"
        },
        "allocation": {
          "enable": "all",
          "total_shards_per_node": "-1"
        }
      },
      "search": {
        "slowlog": {
          "level": "TRACE",
          "threshold": {
            "fetch": {
              "warn": "-1",
              "trace": "-1",
              "debug": "-1",
              "info": "-1"
            },
            "query": {
              "warn": "-1",
              "trace": "-1",
              "debug": "-1",
              "info": "-1"
            }
          }
        },
        "idle": {
          "after": "30s"
        },
        "throttled": "false"
      },
      "fielddata": {
        "cache": "node"
      },
      "default_pipeline": "_none",
      "max_slices_per_scroll": "1024",
      "shard": {
        "check_on_startup": "false"
      },
      "xpack": {
        "watcher": {
          "template": {
            "version": ""
          }
        },
        "version": "",
        "ccr": {
          "following_index": "false"
        }
      },
      "percolator": {
        "map_unmapped_fields_as_text": "false"
      },
      "allocation": {
        "max_retries": "5"
      },
      "refresh_interval": "1s",
      "indexing": {
        "slowlog": {
          "reformat": "true",
          "threshold": {
            "index": {
              "warn": "-1",
              "trace": "-1",
              "debug": "-1",
              "info": "-1"
            }
          },
          "source": "1000",
          "level": "TRACE"
        }
      },
      "compound_format": "0.1",
      "blocks": {
        "metadata": "false",
        "read": "false",
        "read_only_allow_delete": "false",
        "read_only": "false",
        "write": "false"
      },
      "max_result_window": "10000",
      "store": {
        "stats_refresh_interval": "10s",
        "type": "",
        "fs": {
          "fs_lock": "native"
        },
        "preload": []
      },
      "queries": {
        "cache": {
          "enabled": "true"
        }
      },
      "warmer": {
        "enabled": "true"
      },
      "max_shingle_diff": "3",
      "query_string": {
        "lenient": "false"
      }
    }
  }
}
jaeger-service-write settings:
{
  "settings": {
    "index": {
      "mapping": {
        "nested_fields": {
          "limit": "50"
        }
      },
      "number_of_shards": "5",
      "provided_name": "jaeger-service-write",
      "creation_date": "1572280321400",
      "requests": {
        "cache": {
          "enable": "true"
        }
      },
      "number_of_replicas": "1",
      "uuid": "_JVNNgJBT6i5vgnTpb0V9g",
      "version": {
        "created": "7040099"
      }
    }
  },
  "defaults": {
    "index": {
      "flush_after_merge": "512mb",
      "max_inner_result_window": "100",
      "unassigned": {
        "node_left": {
          "delayed_timeout": "1m"
        }
      },
      "max_terms_count": "65536",
      "lifecycle": {
        "name": "",
        "rollover_alias": "",
        "indexing_complete": "false"
      },
      "routing_partition_size": "1",
      "force_memory_term_dictionary": "false",
      "max_docvalue_fields_search": "100",
      "merge": {
        "scheduler": {
          "max_thread_count": "1",
          "auto_throttle": "true",
          "max_merge_count": "6"
        },
        "policy": {
          "reclaim_deletes_weight": "2.0",
          "floor_segment": "2mb",
          "max_merge_at_once_explicit": "30",
          "max_merge_at_once": "10",
          "max_merged_segment": "5gb",
          "expunge_deletes_allowed": "10.0",
          "segments_per_tier": "10.0",
          "deletes_pct_allowed": "33.0"
        }
      },
      "max_refresh_listeners": "1000",
      "max_regex_length": "1000",
      "load_fixed_bitset_filters_eagerly": "true",
      "number_of_routing_shards": "1",
      "write": {
        "wait_for_active_shards": "1"
      },
      "verified_before_close": "false",
      "mapping": {
        "coerce": "false",
        "nested_objects": {
          "limit": "10000"
        },
        "depth": {
          "limit": "20"
        },
        "ignore_malformed": "false",
        "field_name_length": {
          "limit": "9223372036854775807"
        },
        "total_fields": {
          "limit": "1000"
        }
      },
      "source_only": "false",
      "soft_deletes": {
        "enabled": "false",
        "retention": {
          "operations": "0"
        },
        "retention_lease": {
          "period": "12h"
        }
      },
      "max_script_fields": "32",
      "query": {
        "default_field": [
          "*"
        ],
        "parse": {
          "allow_unmapped_fields": "true"
        }
      },
      "format": "0",
      "frozen": "false",
      "sort": {
        "missing": [],
        "mode": [],
        "field": [],
        "order": []
      },
      "priority": "1",
      "codec": "default",
      "max_rescore_window": "10000",
      "max_adjacency_matrix_filters": "100",
      "analyze": {
        "max_token_count": "10000"
      },
      "gc_deletes": "60s",
      "optimize_auto_generated_id": "true",
      "max_ngram_diff": "1",
      "translog": {
        "generation_threshold_size": "64mb",
        "flush_threshold_size": "512mb",
        "sync_interval": "5s",
        "retention": {
          "size": "512MB",
          "age": "12h"
        },
        "durability": "REQUEST"
      },
      "auto_expand_replicas": "false",
      "mapper": {
        "dynamic": "true"
      },
      "data_path": "",
      "highlight": {
        "max_analyzed_offset": "1000000"
      },
      "routing": {
        "rebalance": {
          "enable": "all"
        },
        "allocation": {
          "enable": "all",
          "total_shards_per_node": "-1"
        }
      },
      "search": {
        "slowlog": {
          "level": "TRACE",
          "threshold": {
            "fetch": {
              "warn": "-1",
              "trace": "-1",
              "debug": "-1",
              "info": "-1"
            },
            "query": {
              "warn": "-1",
              "trace": "-1",
              "debug": "-1",
              "info": "-1"
            }
          }
        },
        "idle": {
          "after": "30s"
        },
        "throttled": "false"
      },
      "fielddata": {
        "cache": "node"
      },
      "default_pipeline": "_none",
      "max_slices_per_scroll": "1024",
      "shard": {
        "check_on_startup": "false"
      },
      "xpack": {
        "watcher": {
          "template": {
            "version": ""
          }
        },
        "version": "",
        "ccr": {
          "following_index": "false"
        }
      },
      "percolator": {
        "map_unmapped_fields_as_text": "false"
      },
      "allocation": {
        "max_retries": "5"
      },
      "refresh_interval": "1s",
      "indexing": {
        "slowlog": {
          "reformat": "true",
          "threshold": {
            "index": {
              "warn": "-1",
              "trace": "-1",
              "debug": "-1",
              "info": "-1"
            }
          },
          "source": "1000",
          "level": "TRACE"
        }
      },
      "compound_format": "0.1",
      "blocks": {
        "metadata": "false",
        "read": "false",
        "read_only_allow_delete": "false",
        "read_only": "false",
        "write": "false"
      },
      "max_result_window": "10000",
      "store": {
        "stats_refresh_interval": "10s",
        "type": "",
        "fs": {
          "fs_lock": "native"
        },
        "preload": []
      },
      "queries": {
        "cache": {
          "enabled": "true"
        }
      },
      "warmer": {
        "enabled": "true"
      },
      "max_shingle_diff": "3",
      "query_string": {
        "lenient": "false"
      }
    }
  }
}
Your collector is probably configured to use rollover (--es.use-aliases=true), whereas query is looking for daily indices. You have to use either daily indices or rollover in both components.
Note that running with rollover indices requires a cron job and an initialization step to use it properly.
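The mismatch shows up directly in the index names: with the default daily layout, both collector and query derive names from the UTC date, while with --es.use-aliases=true they address *-write/*-read aliases. That would also explain why the settings above show provided_name: jaeger-span-write rather than a dated index (Elasticsearch auto-created a concrete index named after the write alias because the rollover init step never ran). A rough sketch of the two naming schemes (the exact formats here follow Jaeger's ES storage conventions but should be verified against your indices):

```python
from datetime import date

# Daily layout (Jaeger default): the index name carries the UTC date.
def daily_index(prefix: str, day: date) -> str:
    return f"{prefix}-{day:%Y-%m-%d}"

# Rollover layout (--es.use-aliases=true): components go through aliases
# that the rollover init job must create up front.
def write_alias(prefix: str) -> str:
    return f"{prefix}-write"

def read_alias(prefix: str) -> str:
    return f"{prefix}-read"

print(daily_index("jaeger-span", date(2019, 10, 29)))  # jaeger-span-2019-10-29
print(write_alias("jaeger-span"))                      # jaeger-span-write
```

Comparing this against the output of Elasticsearch's `_cat/indices/jaeger-*` endpoint shows immediately which scheme your cluster actually contains.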
@pavolloffay You were absolutely right; I am sorry for wasting your time. That was not very clear to me, though; hopefully not too many users will run into the same issue.
You can refer to this blog post on configuring it properly: https://medium.com/jaegertracing/using-elasticsearch-rollover-to-manage-indices-8b3d0c77915d
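For the record, rollover mode needs a one-off init step (which creates the initial index and the read/write aliases) plus a periodic rollover job, and both collector and query must then run with --es.use-aliases=true. Based on the blog post above, a Kubernetes sketch could look roughly like this; treat the jaeger-es-rollover arguments, the CONDITIONS value, and the schedule as assumptions to verify against the post and your cluster (TLS/cert options are omitted):

```
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: jaeger-es-rollover
  namespace: jaeger
spec:
  schedule: "*/30 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: rollover
            image: jaegertracing/jaeger-es-rollover:latest
            # "init <url>" must be run once beforehand (e.g. as a plain Job)
            # to create the initial indices and the read/write aliases.
            args: ["rollover", "https://logging-es-http.elastic-system.svc:9200"]
            env:
            - name: CONDITIONS
              value: '{"max_age": "24h"}'
```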
Any open questions to address?
I have deployed Jaeger 1.14 (agents, collector and Query UI as separate services) and use Elasticsearch 7.4 as the backend. I ensured that Jaeger traces and spans land in Elasticsearch, and I can also query them in Kibana.
Unfortunately, I cannot see any services/spans in the Jaeger Query UI, nor do I see any error/warn/debug log messages in the query service that would help me figure out why. Can you point me in the right direction to figure out why I don't see any spans/traces in the Query UI even though I have plenty of data in Elasticsearch?