jaegertracing / jaeger-operator

Jaeger Operator for Kubernetes simplifies deploying and running Jaeger on Kubernetes.
https://www.jaegertracing.io/docs/latest/operator/
Apache License 2.0

[Bug]: Cassandra spark-dependencies seems to be broken #2508

Open iblancasa opened 5 months ago

iblancasa commented 5 months ago

What happened?

When running the cassandra-spark E2E test, the pod from the spark job fails:

k logs test-spark-deps-spark-dependencies-28508319-z4cmv
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/app/jaeger-spark-dependencies-0.0.1-SNAPSHOT.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Exception in thread "main" java.io.IOException: Failed to open native connection to Cassandra at {10.244.0.19}:9042
    at com.datastax.spark.connector.cql.CassandraConnector$.createSession(CassandraConnector.scala:168)
    at com.datastax.spark.connector.cql.CassandraConnector$.$anonfun$sessionCache$1(CassandraConnector.scala:154)
    at com.datastax.spark.connector.cql.RefCountedCache.createNewValueAndKeys(RefCountedCache.scala:32)
    at com.datastax.spark.connector.cql.RefCountedCache.syncAcquire(RefCountedCache.scala:69)
    at com.datastax.spark.connector.cql.RefCountedCache.acquire(RefCountedCache.scala:57)
    at com.datastax.spark.connector.cql.CassandraConnector.openSession(CassandraConnector.scala:79)
    at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:111)
    at com.datastax.spark.connector.cql.CassandraConnector.withClusterDo(CassandraConnector.scala:122)
    at com.datastax.spark.connector.cql.Schema$.fromCassandra(Schema.scala:332)
    at com.datastax.spark.connector.cql.Schema$.tableFromCassandra(Schema.scala:352)
    at com.datastax.spark.connector.rdd.CassandraTableRowReaderProvider.tableDef(CassandraTableRowReaderProvider.scala:50)
    at com.datastax.spark.connector.rdd.CassandraTableRowReaderProvider.tableDef$(CassandraTableRowReaderProvider.scala:50)
    at com.datastax.spark.connector.rdd.CassandraTableScanRDD.tableDef$lzycompute(CassandraTableScanRDD.scala:63)
    at com.datastax.spark.connector.rdd.CassandraTableScanRDD.tableDef(CassandraTableScanRDD.scala:63)
    at com.datastax.spark.connector.rdd.CassandraTableRowReaderProvider.verify(CassandraTableRowReaderProvider.scala:137)
    at com.datastax.spark.connector.rdd.CassandraTableRowReaderProvider.verify$(CassandraTableRowReaderProvider.scala:136)
    at com.datastax.spark.connector.rdd.CassandraTableScanRDD.verify(CassandraTableScanRDD.scala:63)
    at com.datastax.spark.connector.rdd.CassandraTableScanRDD.getPartitions(CassandraTableScanRDD.scala:263)
    at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:294)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:290)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:294)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:290)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:294)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:290)
    at org.apache.spark.Partitioner$.$anonfun$defaultPartitioner$4(Partitioner.scala:78)
    at org.apache.spark.Partitioner$.$anonfun$defaultPartitioner$4$adapted(Partitioner.scala:78)
    at scala.collection.immutable.List.map(List.scala:293)
    at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:78)
    at org.apache.spark.rdd.PairRDDFunctions.$anonfun$groupByKey$6(PairRDDFunctions.scala:636)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:410)
    at org.apache.spark.rdd.PairRDDFunctions.groupByKey(PairRDDFunctions.scala:636)
    at org.apache.spark.api.java.JavaPairRDD.groupByKey(JavaPairRDD.scala:561)
    at io.jaegertracing.spark.dependencies.cassandra.CassandraDependenciesJob.run(CassandraDependenciesJob.java:169)
    at io.jaegertracing.spark.dependencies.DependenciesSparkJob.run(DependenciesSparkJob.java:60)
    at io.jaegertracing.spark.dependencies.DependenciesSparkJob.main(DependenciesSparkJob.java:40)
Caused by: java.lang.NoClassDefFoundError: com/codahale/metrics/JmxReporter
    at com.datastax.driver.core.Metrics.<init>(Metrics.java:146)
    at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:1501)
    at com.datastax.driver.core.Cluster.getMetadata(Cluster.java:451)
    at com.datastax.spark.connector.cql.CassandraConnector$.createSession(CassandraConnector.scala:161)
    ... 41 more
Caused by: java.lang.ClassNotFoundException: com.codahale.metrics.JmxReporter
    at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(Unknown Source)
    at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(Unknown Source)
    at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
    ... 45 more
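The `NoClassDefFoundError` looks like the known Dropwizard Metrics incompatibility: in Metrics 3.x the class lives at `com.codahale.metrics.JmxReporter` (which Cassandra driver 3.x expects), while Metrics 4.x moved it to `com.codahale.metrics.jmx.JmxReporter` in the separate `metrics-jmx` artifact. A minimal sketch to probe which location is present on a given classpath (the two class names are the only assumption; on a bare JDK both are absent):

```java
// Probes both known locations of JmxReporter to see which, if any,
// is on the classpath. Metrics 3.x shipped com.codahale.metrics.JmxReporter;
// Metrics 4.x relocated it to com.codahale.metrics.jmx.JmxReporter.
public class JmxReporterCheck {
    public static void main(String[] args) {
        String[] candidates = {
            "com.codahale.metrics.JmxReporter",     // Metrics 3.x location
            "com.codahale.metrics.jmx.JmxReporter"  // Metrics 4.x location
        };
        for (String name : candidates) {
            try {
                Class.forName(name);
                System.out.println("found:   " + name);
            } catch (ClassNotFoundException e) {
                System.out.println("missing: " + name);
            }
        }
    }
}
```

Running this inside the spark-dependencies image would show whether the rebuilt `latest` image pulled in Metrics 4.x and dropped the 3.x class the driver needs.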

Steps to reproduce

Run the cassandra-spark E2E test.

Expected behavior

The spark-dependencies job completes successfully.

Relevant log output

No response

Screenshot

No response

Additional context

No response

Jaeger backend version

No response

SDK

No response

Pipeline

No response

Storage backend

No response

Operating system

No response

Deployment model

No response

Deployment configs

No response

iblancasa commented 5 months ago

It seems something changed recently in the image, and that is breaking the operator integration.

rriverak commented 5 months ago

Same issue here; we use jaeger-operator v1.49.0.

The last successful spark job ran 20 days ago:

NAME                                                 COMPLETIONS   DURATION   AGE
jaeger-operator-jaeger-cassandra-schema-job          1/1           37s        601d
jaeger-operator-jaeger-spark-dependencies-28491835   1/1           32s        22d
jaeger-operator-jaeger-spark-dependencies-28493275   1/1           36s        21d
jaeger-operator-jaeger-spark-dependencies-28494715   1/1           37s        20d
jaeger-operator-jaeger-spark-dependencies-28523515   0/1           12h        12h

The operator's spark job specifies no tag on the image, so "latest" is used as a fallback.

  Containers:
   jaeger-operator-jaeger-spark-dependencies:
    Image:      ghcr.io/jaegertracing/spark-dependencies/spark-dependencies
    Port:       <none>
    Host Port:  <none>

The image that is used can be found here:

https://github.com/jaegertracing/spark-dependencies/pkgs/container/spark-dependencies%2Fspark-dependencies/versions?filters%5Bversion_type%5D=tagged


Sadly there is only one tag, which was overwritten 20 days ago; that makes any workaround impossible 😔. Any ideas? In my opinion, at least the old image should be provided.

iblancasa commented 5 months ago

We might open an issue in that repository.

rriverak commented 5 months ago

As a workaround, we pin spark-dependencies to the old untagged image sha256:683963b95bafb0721f3261a49c368c7bdce4ddcb04a23116c45068d254c5ec11.

We use the Helm values of the jaeger-operator to override the Docker image of the dependencies in the storage section:

jaeger:
  create: true
  spec:
    strategy: production
    storage:
      type: cassandra
      options:
        cassandra:
          servers: xxx
          keyspace: jaeger
          username: xxx
          password: xxx
      dependencies:
        image: ghcr.io/jaegertracing/spark-dependencies/spark-dependencies@sha256:683963b95bafb0721f3261a49c368c7bdce4ddcb04a23116c45068d254c5ec11

However, the current image is broken and should be fixed. In my opinion, the jaeger-operator itself should also pin its own dependencies to avoid this kind of error in production.
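For operator installs that don't go through Helm, the same digest pin can be set directly on the Jaeger custom resource via `spec.storage.dependencies.image` (a sketch; the Cassandra server, keyspace, and resource name are placeholders, and the digest is the one quoted above):

```yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger            # placeholder name
spec:
  strategy: production
  storage:
    type: cassandra
    options:
      cassandra:
        servers: cassandra.default.svc   # placeholder
        keyspace: jaeger
    dependencies:
      enabled: true
      # Pin by digest so a rebuilt "latest" cannot break the cron job.
      image: ghcr.io/jaegertracing/spark-dependencies/spark-dependencies@sha256:683963b95bafb0721f3261a49c368c7bdce4ddcb04a23116c45068d254c5ec11
```

Pinning by digest rather than tag guarantees the job keeps running the exact image that is known to work, regardless of what the registry's tags point to.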

iblancasa commented 5 months ago

@rriverak would you like to send a PR?

rriverak commented 5 months ago

@iblancasa I'm not sure. We can solve this on several levels... which solution are we looking for? Then I could provide a PR accordingly.

What is our path?

1. Fix spark-dependencies: repair the image and switch to proper versioning.
2. Pin the image version in the jaeger-operator itself.
3. Pin the image via the Helm values (the workaround above).

I would be happy if spark-dependencies shows initiative here, fixes the problems with the image, and switches to proper versioning. If this does not happen, then one of the remaining two solutions must do the job.

iblancasa commented 5 months ago

I would prefer the "fix spark-dependencies" option, and after that, setting the version in the jaeger-operator. The third one is not a real solution, since a lot of people are not using Helm to install the operator.