jaegertracing / spark-dependencies

Spark job for dependency links
http://jaegertracing.io/
Apache License 2.0

Job doesn't work with OpenSearch #110

Open armanzor opened 3 years ago

armanzor commented 3 years ago

Describe the bug
The dependencies graph on the "System architecture" tab in the Jaeger UI cannot be generated with OpenSearch 1.1.0 as the backend. With Elasticsearch 7.10.1 as the backend, the same setup works without any problems.

To Reproduce
Steps to reproduce the behavior:

  1. Run opensearch container in Docker: docker run --detach --name opensearch -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" -e "plugins.security.disabled=true" opensearchproject/opensearch:1.1.0
  2. Run Jaeger: docker run --detach --name jaeger --link opensearch --env SPAN_STORAGE_TYPE=opensearch --env ES_SERVER_URLS=http://opensearch:9200 -p 6831:6831/udp -p 6832:6832/udp -p 5778:5778 -p 16686:16686 -p 14268:14268 -p 14250:14250 jaegertracing/all-in-one:1.27 --es.num-replicas=1 --es.num-shards=1
  3. Run example app: docker run --detach --name hotrod --link jaeger -p8080-8083:8080-8083 -e JAEGER_AGENT_HOST="jaeger" jaegertracing/example-hotrod:1.27 all
  4. Click the buttons in the HotRod UI to generate trace data in the database
  5. Run the dependencies job: docker run --rm --link opensearch --env STORAGE=elasticsearch --env ES_NODES=http://opensearch:9200 jaegertracing/spark-dependencies; it exits with the error shown below

Expected behavior
The job stores its result in the jaeger-dependencies-YYYY-MM-DD index and the dependency graph shows up in the Jaeger UI.
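For reference, a quick way to check the expected outcome once the job completes (a hedged sketch: the index name follows the jaeger-dependencies-YYYY-MM-DD pattern from the job log below, and the service names are illustrative HotRod examples):

$ curl -s "http://localhost:9200/jaeger-dependencies-2021-11-03/_search?pretty"
# Each hit should roughly be a dependency link such as
#   {"parent": "frontend", "child": "driver", "callCount": 14}
# (field names per the Jaeger dependency-link model; values are illustrative).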

Screenshots

$ docker run --rm --link opensearch --env STORAGE=elasticsearch --env ES_NODES=http://opensearch:9200 jaegertracing/spark-dependencies
21/11/03 12:27:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/11/03 12:27:36 INFO ElasticsearchDependenciesJob: Running Dependencies job for 2021-11-03T00:00Z, reading from jaeger-span-2021-11-03 index, result storing to jaeger-dependencies-2021-11-03
Exception in thread "main" org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: invalid map received dynamic_templates=[{span_tags_map={path_match=tag.*, mapping={ignore_above=256, type=keyword}}}, {process_tags_map={path_match=process.tag.*, mapping={ignore_above=256, type=keyword}}}]
        at org.elasticsearch.hadoop.serialization.dto.mapping.FieldParser.parseField(FieldParser.java:146)
        at org.elasticsearch.hadoop.serialization.dto.mapping.FieldParser.parseMapping(FieldParser.java:88)
        at org.elasticsearch.hadoop.serialization.dto.mapping.FieldParser.parseIndexMappings(FieldParser.java:69)
        at org.elasticsearch.hadoop.serialization.dto.mapping.FieldParser.parseMappings(FieldParser.java:40)
        at org.elasticsearch.hadoop.rest.RestClient.getMappings(RestClient.java:321)
        at org.elasticsearch.hadoop.rest.RestClient.getMappings(RestClient.java:307)
        at org.elasticsearch.hadoop.rest.RestRepository.getMappings(RestRepository.java:293)
        at org.elasticsearch.hadoop.rest.RestService.findPartitions(RestService.java:252)
        at org.elasticsearch.spark.rdd.AbstractEsRDD.esPartitions$lzycompute(AbstractEsRDD.scala:79)
        at org.elasticsearch.spark.rdd.AbstractEsRDD.esPartitions(AbstractEsRDD.scala:78)
        at org.elasticsearch.spark.rdd.AbstractEsRDD.getPartitions(AbstractEsRDD.scala:48)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
        at org.apache.spark.Partitioner$$anonfun$4.apply(Partitioner.scala:75)
        at org.apache.spark.Partitioner$$anonfun$4.apply(Partitioner.scala:75)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.immutable.List.map(List.scala:285)
        at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:75)
        at org.apache.spark.rdd.RDD$$anonfun$groupBy$1.apply(RDD.scala:691)
        at org.apache.spark.rdd.RDD$$anonfun$groupBy$1.apply(RDD.scala:691)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.RDD.groupBy(RDD.scala:690)
        at org.apache.spark.api.java.JavaRDDLike$class.groupBy(JavaRDDLike.scala:243)
        at org.apache.spark.api.java.AbstractJavaRDDLike.groupBy(JavaRDDLike.scala:45)
        at io.jaegertracing.spark.dependencies.elastic.ElasticsearchDependenciesJob.run(ElasticsearchDependenciesJob.java:236)
        at io.jaegertracing.spark.dependencies.elastic.ElasticsearchDependenciesJob.run(ElasticsearchDependenciesJob.java:212)
        at io.jaegertracing.spark.dependencies.DependenciesSparkJob.run(DependenciesSparkJob.java:54)
        at io.jaegertracing.spark.dependencies.DependenciesSparkJob.main(DependenciesSparkJob.java:40)
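For context, the dynamic_templates block quoted in the exception comes from the span index mapping, which can be inspected directly (hedged sketch, using the index name from the log above):

$ curl -s "http://localhost:9200/jaeger-span-2021-11-03/_mapping?pretty"
# The response includes the section that the bundled elasticsearch-spark
# connector's FieldParser rejects as an "invalid map", roughly:
#   "dynamic_templates": [
#     {"span_tags_map":    {"path_match": "tag.*",
#                           "mapping": {"type": "keyword", "ignore_above": 256}}},
#     {"process_tags_map": {"path_match": "process.tag.*",
#                           "mapping": {"type": "keyword", "ignore_above": 256}}}
#   ]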

Version (please complete the following information): OpenSearch 1.1.0, Jaeger all-in-one 1.27, jaegertracing/spark-dependencies latest (as used in the reproduction steps above).

Jakob3xD commented 2 years ago

Is there any update on this issue? Jaeger-spark is the last system blocking me from upgrading to OpenSearch.

pavolloffay commented 2 years ago

No updates; I am not planning to work on this, at least for now.

If somebody has free cycles to contribute this feature I am happy to review and approve.

shnurok672 commented 2 years ago

Hi fellas, I've faced the same issue and got it working by updating the elasticsearch-spark-20_2.12 dependency to version 7.16.3. However, that version of the connector checks the server's Elasticsearch version, which can be bypassed with a compatibility setting in OpenSearch: https://opensearch.org/docs/latest/clients/agents-and-ingestion-tools/index/. With a self-built container and that workaround setting in OpenSearch, it finally works. Since this is only a workaround, I can't provide an MR. In my opinion, the proper solution is to wait until Spark gets an OpenSearch client, because OpenSearch will (theoretically) not stay compatible with Elasticsearch. A sketch of the workaround setting follows.
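A minimal sketch of that OpenSearch-side workaround, assuming the compatibility.override_main_response_version setting described on the linked docs page (it makes OpenSearch 1.x report an Elasticsearch 7.10.2 version so the connector's version check passes):

$ docker run --detach --name opensearch -p 9200:9200 -p 9600:9600 \
    -e "discovery.type=single-node" \
    -e "plugins.security.disabled=true" \
    -e "compatibility.override_main_response_version=true" \
    opensearchproject/opensearch:1.1.0

# ...or toggle it on a running cluster via the settings API:
$ curl -X PUT "http://localhost:9200/_cluster/settings" \
    -H 'Content-Type: application/json' \
    -d '{"persistent": {"compatibility.override_main_response_version": true}}'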

tronda commented 1 year ago

Hey. Just like @shnurok672, I've made a custom version where I've updated the ElasticSearch-Spark dependency to the latest version and adjusted the OpenSearch settings, and then the dependency job works. We are getting a lot of warnings in the logs while running:

23/06/06 23:00:08 WARN RestClient: Could not verify server is Elasticsearch! Invalid main action response body format [tag].
23/06/06 23:00:08 WARN RestClient: Could not verify server is Elasticsearch! Invalid main action response body format [build_flavor].
23/06/06 23:00:08 WARN RestClient: Could not verify server is Elasticsearch! ES-Hadoop will require server validation when connecting to an Elasticsearch cluster if that Elasticsearch cluster is v7.14 and up.

I see that the OpenSearch project has released its fork of the ElasticSearch-Spark library. I guess this could be used to create an OpenSearch version of the Spark job. I've tried to get the tests in this project to work, but without success. Contributing to this project is a bit difficult since the tests are complex to get into. A rough sketch of the self-built approach follows.
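A rough sketch of what such a self-built image could look like (hedged: the module path, connector coordinates, and build steps are assumptions based on the opensearch-project/opensearch-hadoop repository, not a verified recipe):

$ git clone https://github.com/jaegertracing/spark-dependencies && cd spark-dependencies
# In the Elasticsearch module's pom.xml (assumed to be
# jaeger-spark-dependencies-elasticsearch/pom.xml), swap the
# elasticsearch-spark-20_2.12 dependency for the OpenSearch fork, e.g.
# org.opensearch.client:opensearch-spark-30_2.12 (assumed coordinates),
# then rebuild the jar and the image (assuming a root Dockerfile):
$ mvn clean package -DskipTests
$ docker build -t spark-dependencies-opensearch .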

tronda commented 11 months ago

We have a fork where I have updated the dependency on the ElasticSearch client library, and we run a custom build based on these changes: https://github.com/DIPSAS/spark-dependencies