@frittentheke would you like to submit a PR?
@pavolloffay Sure. What should that PR contain then?
I suppose the change to the es storage module (https://github.com/jaegertracing/jaeger/blob/master/plugin/storage/es/dependencystore/storage.go) to read and write to that single index. The query already uses the timestamp field (https://github.com/jaegertracing/jaeger/blob/master/plugin/storage/es/dependencystore/storage.go#L111), so that would not even need changing. That would then be fully transparent to the UI, right? I could certainly also throw together a little PR for the Spark job again (https://github.com/jaegertracing/spark-dependencies/pull/86) to keep compatibility.
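To sketch what I mean (a minimal illustration using the olivere/elastic client Jaeger already wraps - the index name, document shape, and function signatures here are my assumptions, not the actual storage.go code):

```go
package main

import (
	"context"
	"time"

	"github.com/olivere/elastic/v7"
)

// dependencyDoc roughly mirrors what the Spark job writes today: one document
// per day, carrying the timestamp of that day plus the dependency links.
type dependencyDoc struct {
	Timestamp    time.Time        `json:"timestamp"`
	Dependencies []dependencyLink `json:"dependencies"`
}

type dependencyLink struct {
	Parent    string `json:"parent"`
	Child     string `json:"child"`
	CallCount uint64 `json:"callCount"`
}

// A single index instead of one jaeger-dependencies-YYYY-MM-DD index per day.
const dependenciesIndex = "jaeger-dependencies"

// writeDependencies stores one day's dependency links in the single index.
func writeDependencies(ctx context.Context, client *elastic.Client, doc dependencyDoc) error {
	_, err := client.Index().
		Index(dependenciesIndex).
		BodyJson(doc).
		Do(ctx)
	return err
}

// readDependencies selects documents by the timestamp field, just like the
// existing reader already does, so the change stays transparent to the UI.
func readDependencies(ctx context.Context, client *elastic.Client, end time.Time, lookback time.Duration) (*elastic.SearchResult, error) {
	query := elastic.NewRangeQuery("timestamp").
		Gte(end.Add(-lookback)).
		Lte(end)
	return client.Search(dependenciesIndex).Query(query).Size(1000).Do(ctx)
}
```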
Maybe a topic for a separate issue, but if I may ask: What are your plans going forward regarding producing those dependencies? Currently the Spark job uses JavaEsSpark.esJsonRDD, which has none of the optimizations that DataFrames and their pushdown (https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html#spark-pushdown) would provide. So apart from the plain query I added in the PR https://github.com/jaegertracing/spark-dependencies/pull/86, all docs are fetched and instantiated into full Span objects, even though not all fields of the spans are required for the dependency extraction. This causes many gigabytes of data to be transferred and a massive memory footprint, as well as churn on the JVM running the job.
Also, the write to the dependency storage is not done via the API but directly to Elasticsearch - hence the issue of having to "fix" both ends of the equation.
While all of Jaeger is Golang, running Java code and then also using the Spark framework seems a bit overly complex - at least where ElasticSearch is concerned. See my comments regarding using the ES terms API (https://github.com/jaegertracing/spark-dependencies/issues/68#issuecomment-597644484) to keep all of the heavy lifting within the ElasticSearch cluster, with only minuscule amounts of data having to be transferred.
But even keeping the current approach - using plain Golang and an Elasticsearch client to iterate over the data would at least keep Jaeger components similar.
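Purely as a sketch of that idea - a plain Go loop over a span index that fetches only the few fields needed for the extraction (the span field names follow the Jaeger ES mapping; the function, index handling, and output format are made up for illustration):

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"io"

	"github.com/olivere/elastic/v7"
)

// spanRef is the minimal slice of a Jaeger span needed to derive dependencies.
type spanRef struct {
	SpanID  string `json:"spanID"`
	Process struct {
		ServiceName string `json:"serviceName"`
	} `json:"process"`
	References []struct {
		SpanID string `json:"spanID"`
	} `json:"references"`
}

// buildDependencies scrolls over a span index, fetching only spanID, the
// service name and the references, and counts parent -> child calls.
func buildDependencies(ctx context.Context, client *elastic.Client, spanIndex string) (map[string]uint64, error) {
	fsc := elastic.NewFetchSourceContext(true).
		Include("spanID", "process.serviceName", "references")

	serviceBySpan := map[string]string{} // spanID -> serviceName
	type edge struct{ parentSpanID, childSvc string }
	var edges []edge

	scroll := client.Scroll(spanIndex).FetchSourceContext(fsc).Size(1000)
	for {
		res, err := scroll.Do(ctx)
		if err == io.EOF {
			break // all pages consumed
		}
		if err != nil {
			return nil, err
		}
		for _, hit := range res.Hits.Hits {
			var s spanRef
			if err := json.Unmarshal(hit.Source, &s); err != nil {
				continue
			}
			serviceBySpan[s.SpanID] = s.Process.ServiceName
			for _, ref := range s.References {
				edges = append(edges, edge{parentSpanID: ref.SpanID, childSvc: s.Process.ServiceName})
			}
		}
	}

	// Resolve the parent span IDs to service names and count the edges.
	counts := map[string]uint64{}
	for _, e := range edges {
		if parentSvc, ok := serviceBySpan[e.parentSpanID]; ok && parentSvc != e.childSvc {
			counts[fmt.Sprintf("%s -> %s", parentSvc, e.childSvc)]++
		}
	}
	return counts, nil
}
```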
The UI does not have to be changed. We just need to change the writer (the writer is not used though) and reader. The dependency storage impl should use the same index names as the span storage impl: IIRC it is jaeger-span-read and jaeger-span-write. The index cleaner and rollover scripts will also have to be changed to support rollover.
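For reference, the underlying call those scripts would have to drive for a dependencies write alias is just the rollover API; the actual scripts are Python, so this Go sketch (with an assumed alias name and conditions) is only to show the mechanics:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// rolloverDependencies asks Elasticsearch to roll the write alias over to a
// new backing index once the current one matches the conditions. The alias
// name and the conditions below are examples only.
func rolloverDependencies(esURL string) error {
	body := []byte(`{
	  "conditions": {
	    "max_age":  "30d",
	    "max_docs": 1000
	  }
	}`)
	resp, err := http.Post(esURL+"/jaeger-dependencies-write/_rollover", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("rollover failed: %s", resp.Status)
	}
	return nil
}
```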
Maybe a topic for a separate issue, but if I may ask: What are your plans forward regarding producing those dependencies?
Any improvements to ES query from the spark dependencies job are welcome. Please create a separate issue.
But even keeping the current approach - using plain Golang and an Elasticsearch client to iterate over the data would at least keep Jaeger components similar.
There are no plans to rewrite the current jobs in Golang. The data aggregation jobs are memory heavy, and in prod systems with a lot of data they might require running a Spark/Flink cluster. The plan was to provide more aggregation jobs, hence frameworks like Spark are useful.
The UI does not have to be changed. We just need to change the writer (the writer is not used though) and reader. The dependency storage impl should use the same index names as the span storage impl: IIRC it is jaeger-span-read and jaeger-span-write. The index cleaner and rollover scripts will also have to be changed to support rollover.
I was actually not suggesting/implying to use rollover for storing dependencies, just a single index. There are so few documents holding dependencies (currently it's one per day) that it makes no sense to roll over.
But thinking about it: Using rollover in conjunction with ILM (ElasticSearch Index Lifecycle Management) might make sense just for the much easier housekeeping. Then no external job would be required to delete old indices / data; ElasticSearch simply rolls and expires indices to your liking, fully transparent to the application. We run this setup for the spans / services with great success.
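As an illustration of that setup - an ILM policy that rolls the write index over and later deletes the old backing indices, so no external cleanup job is needed. The policy name and the thresholds are just examples, not something Jaeger ships:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// putDependenciesILMPolicy installs a lifecycle policy: roll the write index
// over every 30 days, delete backing indices after a year. Policy name and
// thresholds are examples only.
func putDependenciesILMPolicy(esURL string) error {
	policy := []byte(`{
	  "policy": {
	    "phases": {
	      "hot":    { "actions": { "rollover": { "max_age": "30d" } } },
	      "delete": { "min_age": "365d", "actions": { "delete": {} } }
	    }
	  }
	}`)
	req, err := http.NewRequest(http.MethodPut, esURL+"/_ilm/policy/jaeger-dependencies-policy", bytes.NewReader(policy))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("ILM policy creation failed: %s", resp.Status)
	}
	return nil
}
```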
Maybe a topic for a separate issue, but if I may ask: What are your plans forward regarding producing those dependencies?
Any improvements to ES query from the spark dependencies job are welcome. Please create a separate issue.
See https://github.com/jaegertracing/spark-dependencies/issues/88
@pavolloffay I just pushed a PR: https://github.com/jaegertracing/jaeger/pull/2144 If you happen to like that one - I added the write alias to the Spark job in my PR https://github.com/jaegertracing/spark-dependencies/pull/86 as well .. see: https://github.com/jaegertracing/spark-dependencies/pull/86/commits/ec4c28a298957d62a175f9e03d1321e1a79f1ec8
Slightly off-topic question: is ES ILM free to use? It's marked as an x-pack feature, which is a paid extension: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-lifecycle-management.html
I was actually not suggesting/implying to use rollover for storing dependencies, just a single index. There are so few documents holding dependencies (currently it's one per day) that it makes no sense to roll over.
I am not sure how feasible it would be given the index can last for year(s) and there is no way to remove old documents from it.
Slightly off-topic question: is ES ILM free to use? It's marked as an x-pack feature, which is a paid extension: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-lifecycle-management.html
Yes @pavolloffay, it is included in the free tier (no cost) - see https://www.elastic.co/subscriptions. But with its smart rules on when to do a rollover and when to shrink or delete indices it really is great to not have to run external jobs (like the Curator). Even Jaeger currently "has to" provide the housekeeping for the ElasticSearch storage, even though I believe the Curator (https://github.com/elastic/curator) with a bit of config could be a good replacement and free you from maintaining esCleaner.py and esRollover.py (https://github.com/jaegertracing/jaeger/tree/master/plugin/storage/es) altogether.
Scripts esCleaner.py and esRollover.py are using curator under the hood. But instead of using the curator's configuration files we use the programmatic API. We could not use just the config files because we needed to perform more actions which were not possible with the config files.
any news?
@AhHa45 yes. I refactored my change to add ES alias / rollover support to Jaeger - check out: https://github.com/jaegertracing/jaeger/pull/2144
Requirement - what kind of business use case are you trying to solve?
Using ElasticSearch as storage, and using it most efficiently.
Problem - what in Jaeger blocks you from solving the requirement?
Currently the dependencies (System Architecture in the UI) are created "per day" and stored in a dedicated ElasticSearch index per day (see: https://github.com/jaegertracing/spark-dependencies/blob/master/jaeger-spark-dependencies-elasticsearch/src/main/java/io/jaegertracing/spark/dependencies/elastic/ElasticsearchDependenciesJob.java#L203).
The number of indices (actually the number of shards, but they are closely related) one uses to store data in ElasticSearch should be kept low, as they are not "free" (see https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster).
So especially when looking at the Jaeger span and service indices - for which Jaeger learned to use the rollover API in order to keep the number of shards low - creating a new index for each day of dependencies and then putting only a single document into that index seems a little excessive.
Proposal - what do you suggest to solve the problem or improve the existing situation?
A coordinated switch in Jaeger as well as in the aforementioned external (Spark) job that creates the dependencies, to simply store them within a single index with a field marking which day they belong to.
As for housekeeping: It's one doc per day ... so even if one never deletes any documents, that index would not explode in size. But if required / intended, this could be done in the Spark job as well, as in "keep for x days" and then delete docs with a timestamp older than that retention period.
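A sketch of how that housekeeping could look if done as a simple delete-by-query (whether from the Spark job or any small helper); the index name, field, and retention are assumptions:

```go
package main

import (
	"context"
	"time"

	"github.com/olivere/elastic/v7"
)

// purgeOldDependencies deletes dependency documents older than the retention
// period from the single jaeger-dependencies index. With one document per day
// this touches only a handful of docs.
func purgeOldDependencies(ctx context.Context, client *elastic.Client, retention time.Duration) error {
	cutoff := time.Now().Add(-retention)
	_, err := client.DeleteByQuery("jaeger-dependencies").
		Query(elastic.NewRangeQuery("timestamp").Lt(cutoff)).
		Do(ctx)
	return err
}
```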
Any open questions to address
-