jaegertracing / spark-dependencies

Spark job for dependency links
http://jaegertracing.io/
Apache License 2.0

[Bug]: Jaeger spark job failing to generate dependency graph with opensearch 2.x as storage backend #129

Closed bharatbandu1 closed 5 months ago

bharatbandu1 commented 1 year ago

What happened?

The jaeger-spark job pod for dependency graph generation is failing with a CrashLoopBackOff error. It looks like the job uses `scan` as the search_type, which is no longer supported. We are using OpenSearch 2.0 as the storage backend. The valid values are query_then_fetch and dfs_query_then_fetch.

Steps to reproduce

  1. Install the Jaeger chart in a Kubernetes cluster (version 1.18) using https://jaegertracing.github.io/helm-charts

Expected behavior

The spark job should run to completion and the dependency graph should be visible in the Jaeger UI.

Relevant log output

22/12/02 13:24:52 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: org.elasticsearch.hadoop.rest.EsHadoopRemoteException: illegal_argument_exception: No search type for [scan]
{"query":{"range":{"startTimeMillis":{"gte":"now-now-1h"}}}}
        at org.elasticsearch.hadoop.rest.RestClient.checkResponse(RestClient.java:469)
        at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:426)
        at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:408)
        at org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:311)
        at org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:93)
        at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:61)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2027)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2048)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2067)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2092)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
        at org.apache.spark.api.java.JavaRDDLike$class.collect(JavaRDDLike.scala:361)
        at org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
        at io.jaegertracing.spark.dependencies.DependenciesSparkHelper.derive(DependenciesSparkHelper.java:44)
        at io.jaegertracing.spark.dependencies.elastic.ElasticsearchDependenciesJob.run(ElasticsearchDependenciesJob.java:237)
        at io.jaegertracing.spark.dependencies.elastic.ElasticsearchDependenciesJob.run(ElasticsearchDependenciesJob.java:212)
        at io.jaegertracing.spark.dependencies.DependenciesSparkJob.run(DependenciesSparkJob.java:54)
        at io.jaegertracing.spark.dependencies.DependenciesSparkJob.main(DependenciesSparkJob.java:40)
Caused by: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: org.elasticsearch.hadoop.rest.EsHadoopRemoteException: illegal_argument_exception: No search type for [scan]
{"query":{"range":{"startTimeMillis":{"gte":"now-now-1h"}}}}
        at org.elasticsearch.hadoop.rest.RestClient.checkResponse(RestClient.java:469)
        at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:426)
        at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:408)
        at org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:311)
        at org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:93)
        at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:61)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Screenshot

No response

Additional context

No response

Jaeger backend version

jaeger-collector:1.37.0

SDK

opentelemetry-javaagent-all

Pipeline

OTEL java SDK --> otel-collector --> jaeger-collector --> opensearch 2.x

Storage backend

opensearch 2.x

Operating system

centos 7

Deployment model

kubernetes bare metal

Deployment configs

USER-SUPPLIED VALUES:
agent:
  affinity: {}
  annotations: {}
  cmdlineParams: {}
  daemonset:
    updateStrategy: {}
    useHostPort: false
  dnsPolicy: ClusterFirst
  enabled: false
  extraConfigmapMounts: []
  extraEnv: []
  extraSecretMounts: []
  image: jaegertracing/jaeger-agent
  imagePullSecrets: []
  nodeSelector: {}
  podAnnotations: {}
  podLabels: {}
  podSecurityContext: {}
  priorityClassName: ""
  pullPolicy: IfNotPresent
  resources: {}
  securityContext: {}
  service:
    annotations: {}
    binaryPort: 6832
    compactPort: 6831
    loadBalancerSourceRanges: []
    samplingPort: 5778
    type: ClusterIP
    zipkinThriftPort: 5775
  serviceAccount:
    annotations: {}
    automountServiceAccountToken: false
    create: false
    name: null
  serviceMonitor:
    additionalLabels: {}
    enabled: false
  tolerations: []
  useHostNetwork: false
cassandra:
  config:
    cluster_name: jaeger
    dc_name: dc1
    endpoint_snitch: GossipingPropertyFileSnitch
    rack_name: rack1
    seed_size: 1
  persistence:
    enabled: false
collector:
  affinity: {}
  annotations: {}
  autoscaling:
    enabled: false
    maxReplicas: 10
    minReplicas: 2
  cmdlineParams: {}
  dnsPolicy: ClusterFirst
  enabled: true
  extraConfigmapMounts: []
  extraSecretMounts: []
  image: jaegertracing/jaeger-collector
  imagePullSecrets: []
  ingress:
    annotations: {}
    enabled: false
  nodeSelector: {}
  podAnnotations: {}
  podLabels: {}
  podSecurityContext: {}
  priorityClassName: ""
  pullPolicy: IfNotPresent
  replicaCount: 1
  resources: {}
  securityContext: {}
  service:
    annotations: {}
    grpc:
      port: 14250
    http:
      port: 14268
    loadBalancerSourceRanges: []
    type: ClusterIP
    zipkin: {}
  serviceAccount:
    annotations: {}
    automountServiceAccountToken: false
    create: true
    name: null
  serviceMonitor:
    additionalLabels: {}
    enabled: false
  tolerations: []
elasticsearch: {}
esIndexCleaner:
  affinity: {}
  annotations: {}
  cmdlineParams: {}
  concurrencyPolicy: Forbid
  enabled: false
  extraConfigmapMounts: []
  extraEnv: []
  extraSecretMounts: []
  failedJobsHistoryLimit: 3
  image: jaegertracing/jaeger-es-index-cleaner
  imagePullSecrets: []
  nodeSelector: {}
  numberOfDays: 7
  podAnnotations: {}
  podLabels: {}
  podSecurityContext:
    runAsUser: 1000
  pullPolicy: Always
  resources: {}
  schedule: 55 23 * * *
  securityContext:
    runAsUser: 1000
  serviceAccount:
    automountServiceAccountToken: false
    create: true
    name: null
  successfulJobsHistoryLimit: 3
  tag: latest
  tolerations: []
esLookback:
  affinity: {}
  annotations: {}
  cmdlineParams: {}
  concurrencyPolicy: Forbid
  enabled: false
  extraConfigmapMounts: []
  extraEnv:
  - name: UNIT
    value: days
  - name: UNIT_COUNT
    value: "7"
  extraSecretMounts: []
  failedJobsHistoryLimit: 3
  image: jaegertracing/jaeger-es-rollover
  imagePullSecrets: []
  nodeSelector: {}
  podAnnotations: {}
  podLabels: {}
  podSecurityContext:
    runAsUser: 1000
  pullPolicy: Always
  resources: {}
  schedule: 5 0 * * *
  securityContext: {}
  serviceAccount:
    automountServiceAccountToken: false
    create: true
    name: null
  successfulJobsHistoryLimit: 3
  tag: latest
  tolerations: []
esRollover:
  affinity: {}
  annotations: {}
  cmdlineParams: {}
  concurrencyPolicy: Forbid
  enabled: false
  extraConfigmapMounts: []
  extraEnv:
  - name: CONDITIONS
    value: '{"max_age": "1d"}'
  extraSecretMounts: []
  failedJobsHistoryLimit: 3
  image: jaegertracing/jaeger-es-rollover
  imagePullSecrets: []
  initHook:
    annotations: {}
    extraEnv: []
    podAnnotations: {}
    podLabels: {}
    ttlSecondsAfterFinished: 120
  nodeSelector: {}
  podAnnotations: {}
  podLabels: {}
  podSecurityContext:
    runAsUser: 1000
  pullPolicy: Always
  resources: {}
  schedule: 10 0 * * *
  securityContext: {}
  serviceAccount:
    automountServiceAccountToken: false
    create: true
    name: null
  successfulJobsHistoryLimit: 3
  tag: latest
  tolerations: []
extraObjects: []
fullnameOverride: ""
hotrod:
  affinity: {}
  enabled: false
  image:
    pullPolicy: Always
    pullSecrets: []
    repository: jaegertracing/example-hotrod
  ingress:
    annotations: {}
    enabled: false
    hosts:
    - chart-example.local
    tls: null
  nodeSelector: {}
  podSecurityContext: {}
  replicaCount: 1
  resources: {}
  securityContext: {}
  service:
    annotations: {}
    loadBalancerSourceRanges: []
    name: hotrod
    port: 80
    type: ClusterIP
  serviceAccount:
    automountServiceAccountToken: false
    create: true
    name: null
  tolerations: []
  tracing:
    host: null
    port: 6831
ingester:
  affinity: {}
  annotations: {}
  autoscaling:
    enabled: false
    maxReplicas: 10
    minReplicas: 2
  cmdlineParams: {}
  dnsPolicy: ClusterFirst
  enabled: false
  extraConfigmapMounts: []
  extraSecretMounts: []
  image: jaegertracing/jaeger-ingester
  imagePullSecrets: []
  nodeSelector: {}
  podAnnotations: {}
  podLabels: {}
  podSecurityContext: {}
  pullPolicy: IfNotPresent
  replicaCount: 1
  resources: {}
  securityContext: {}
  service:
    annotations: {}
    loadBalancerSourceRanges: []
    type: ClusterIP
  serviceAccount:
    automountServiceAccountToken: false
    create: true
    name: null
  serviceMonitor:
    additionalLabels: {}
    enabled: false
  tolerations: []
kafka:
  autoCreateTopicsEnable: true
  replicaCount: 1
  zookeeper:
    replicaCount: 1
    serviceAccount:
      create: true
nameOverride: ""
provisionDataStore:
  cassandra: false
  elasticsearch: false
  kafka: false
query:
  affinity: {}
  agentSidecar:
    enabled: true
  annotations: {}
  cmdlineParams: {}
  dnsPolicy: ClusterFirst
  enabled: true
  extraConfigmapMounts: []
  extraEnv: []
  extraVolumes: []
  image: jaegertracing/jaeger-query
  imagePullSecrets: []
  ingress:
    annotations: {}
    enabled: false
    health:
      exposed: false
  nodeSelector: {}
  oAuthSidecar:
    args: []
    containerPort: 4180
    enabled: false
    extraConfigmapMounts: []
    extraEnv: []
    extraSecretMounts: []
    image: quay.io/oauth2-proxy/oauth2-proxy:v7.1.0
    pullPolicy: IfNotPresent
  podAnnotations: {}
  podLabels: {}
  podSecurityContext: {}
  priorityClassName: ""
  pullPolicy: IfNotPresent
  replicaCount: 1
  resources: {}
  securityContext: {}
  service:
    annotations: {}
    loadBalancerSourceRanges: []
    port: 80
    type: ClusterIP
  serviceAccount:
    annotations: {}
    automountServiceAccountToken: false
    create: true
    name: null
  serviceMonitor:
    additionalLabels: {}
    enabled: false
  sidecars: []
  tolerations: []
schema:
  activeDeadlineSeconds: 300
  annotations: {}
  extraEnv: []
  image: jaegertracing/jaeger-cassandra-schema
  imagePullSecrets: []
  podAnnotations: {}
  podLabels: {}
  podSecurityContext: {}
  pullPolicy: IfNotPresent
  resources: {}
  securityContext: {}
  serviceAccount:
    automountServiceAccountToken: true
    create: true
    name: null
spark:
  affinity: {}
  annotations: {}
  cmdlineParams: {}

  concurrencyPolicy: Forbid
  enabled: true
  extraConfigmapMounts:
    - name: jaeger-tls
      mountPath: /tls
      subPath: ""
      configMap: jaeger-tls
      readOnly: true
  extraEnv:
  - name: ES_SSL_NO_VERIFY
    value: "true"
  - name: VERIFY_CERTS
    value: "false"
  - name: "JAVA_OPTS"
    value: "-Djavax.net.ssl.trustStore=/tls/trust.store -Djavax.net.ssl.trustStorePassword=xxxx"
  - name: "ES_TIME_RANGE"
    value: "now-1h"
  - name: "ES_VERSION"
    value: "7"

  extraSecretMounts: []
  path: /etc/pki/java/cacerts
  hostPath: /cacert.pem
  failedJobsHistoryLimit: 5
  image: jaegertracing/spark-dependencies
  imagePullSecrets: []
  nodeSelector: {}
  podAnnotations: {}
  podLabels: {}
  pullPolicy: Always
  resources: {}
  schedule: 10,20,25,30,40,50,0 * * * *
  serviceAccount:
    automountServiceAccountToken: false
    create: true
    name: null
  successfulJobsHistoryLimit: 5
  tag: latest
  tolerations: []
storage:
  cassandra:
    cmdlineParams: {}
    existingSecret: jaeger-elassandra-auth
    extraEnv: []
    host: elassandra.elassandra.svc.cluster.local
    keyspace: jaeger
    password: null
    port: 9042
    tls:
      enabled: false
      secretName: cassandra-tls-secret
    usePassword: false
    user: jaeger
  elasticsearch:
    cmdlineParams: {es.version=7, es.tls.enabled=true, es.tls.skip-host-verify=true, es.create-index-templates=false}
    extraEnv: []
    host: xxxxx
    nodesWanOnly: true
    password: admin
    port: 9200
    scheme: https
    usePassword: true
    user: admin
  kafka:
    authentication: none
    brokers:
    - kafka:9092
    extraEnv: []
    topic: jaeger-test
  type: elasticsearch
tag: ""
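A side observation on the values above: the failing query in the log reads `{"gte":"now-now-1h"}`, while `ES_TIME_RANGE` is set to `now-1h`. That suggests the job prepends `now-` to whatever the variable holds, so a plain duration such as `1h` may be what is expected. A minimal sketch of that assembly (hypothetical helper, not the job's actual code):

```python
import json

def build_range_query(time_range: str) -> str:
    # The spark job restricts spans by startTimeMillis. If the job itself
    # prefixes "now-", then setting ES_TIME_RANGE to "now-1h" yields the
    # malformed "now-now-1h" seen in the log, while "1h" yields "now-1h".
    query = {"query": {"range": {"startTimeMillis": {"gte": f"now-{time_range}"}}}}
    return json.dumps(query, separators=(",", ":"))

print(build_range_query("1h"))
# {"query":{"range":{"startTimeMillis":{"gte":"now-1h"}}}}
```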
lauferism commented 1 year ago

@bharatbandu1 - I upgraded today to the latest opensearch, version 2.5 and it is suddenly working.

yurishkuro commented 5 months ago

The helm chart may be using the wrong image; the official images are on ghcr.io.
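For reference, pointing the chart's spark job at the official registry would be a values override roughly like this (the exact repository path and tag under ghcr.io should be verified against the project's published packages):

```yaml
spark:
  enabled: true
  # assumed path; check ghcr.io/jaegertracing for the published package name
  image: ghcr.io/jaegertracing/spark-dependencies/spark-dependencies
  tag: latest
```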