openzipkin / zipkin

Zipkin is a distributed tracing system
https://zipkin.io/
Apache License 2.0

Support hourly Elasticsearch indexing #2369

Open libeilin opened 5 years ago

libeilin commented 5 years ago

elasticSearch

codefromthecrypt commented 5 years ago

This will not work out of the box, as some other logic would need to change. We can leave this open to see whether it is popular or not.

libeilin commented 5 years ago

OK, thanks for your reply. We ran into some problems when compiling this ourselves, which is why we are looking for your help here.

If this feature is implemented, I hope you can release it as soon as possible. At present, one day's data volume is too large, and ES query speed cannot keep up with it.

codefromthecrypt commented 5 years ago

@openzipkin/elasticsearch any interest on this?

xeraa commented 5 years ago

1. I assume this can't be easily fixed with alias trickery, right? 2019-01-01 pointing to 2019-01-01-00, and you switch that every hour. As long as the alias points to a single index you can write to it; pointing to multiple indices makes it read-only.
2. Probably the better approach long-term is a rollover index, where you can specify a certain age, number of docs, or size. I'd generally go for size so you have a very even distribution of data per shard (otherwise weekends might be oversharded and a peak during the week undersharded). Also note that we will very soon have Index Lifecycle Management (ILM) built into Elasticsearch and Kibana, which will make managing rollover indices and deleting old data much simpler. Though it's under the (free) Basic license and not Apache2 — not sure if that is acceptable for use in Zipkin then.
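
A rough sketch of that rollover approach against the plain Elasticsearch API, assuming a hypothetical write alias `zipkin-span-write` that already points to the current index (the alias name and thresholds are illustrative, not anything Zipkin configures today):

```
POST /zipkin-span-write/_rollover
{
  "conditions": {
    "max_age": "1d",
    "max_size": "50gb"
  }
}
```

When any condition is met, Elasticsearch creates the next index in the series and repoints the alias at it, so writers never need to know the concrete index name.
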
codefromthecrypt commented 5 years ago

@xeraa do you know which version rollover index was added? I agree the core issue here is size.

xeraa commented 5 years ago

@adriancole 6.6 (the current version): https://www.elastic.co/guide/en/elasticsearch/reference/6.6/index-lifecycle-management.html

You can fully manage it through the Elasticsearch API, but Kibana also provides a UI for it. And as I said: not open source, but free to use (Basic license).
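
For reference, an ILM policy along these lines looks roughly like the following (the policy name and thresholds are made up for illustration; Zipkin does not create this today). The hot phase rolls the write index over, and the delete phase drops indices once they age out:

```
PUT _ilm/policy/zipkin-span-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "1d" }
        }
      },
      "delete": {
        "min_age": "7d",
        "actions": { "delete": {} }
      }
    }
  }
}
```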

codefromthecrypt commented 5 years ago

@libeilin before we experiment with a non-OSS feature, can you comment on whether rollover indexing is desirable? Maintaining features has a cost, especially with non-OSS distributions (it affects how we do testing), so we want to make sure there is user buy-in.

It is also possible for us to explore hourly indexes regardless.

codefromthecrypt commented 5 years ago

email related to this thread on our dev list https://lists.apache.org/thread.html/73c2efa69e3ff0a519c6b6c2f5e551159c34902c29df01b2703e9126@%3Cdev.zipkin.apache.org%3E

untergeek commented 5 years ago

There's always Elastic Curator if you want to use rollover but are on OSS Elasticsearch (no Basic license). It's open source and requires no license.

codefromthecrypt commented 5 years ago

@untergeek thanks for the pointer. I think you are pointing to this specifically, right? https://www.elastic.co/guide/en/elasticsearch/client/curator/5.6/ex_rollover.html

To elaborate on this approach, we'd need some more details about what it takes in practice in terms of curator config vs index template config, any extra processes curator needs to run, and what, if anything, the aliasing implies for reads or writes. I wonder if someone already has this setup with a Zipkin site (or anything that uses daily indexes and rollover with no client call changes needed).
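
To make the division of labour concrete: Curator runs as a separate scheduled process (for example from cron), with a rollover action that names the write alias and the conditions, while an index template ensures every index the rollover creates picks up the same settings and mappings. A sketch of such a template, with illustrative names and settings (Zipkin's real template also carries mappings and more):

```
PUT _template/zipkin-span-template
{
  "index_patterns": ["zipkin-span-*"],
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}
```

On the aliasing question: writes would go through the single write alias, while searches could keep targeting the wider `zipkin-span-*` pattern, so the query side would not need to know which concrete index is currently active.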

singhabhinav03 commented 5 years ago

We recently started using Zipkin for OpenTracing. In our company the requirement is for monthly or weekly Zipkin indexes. It would be great if you could add this support.

xeraa commented 5 years ago

Just as an idea: maybe this is going a bit too deep down the rabbit hole for one datastore, and it would make more sense to leave that part to Curator or ILM (by documenting the right configurations to be used)? The various use cases around time-based index patterns, rollover, deletion of data, etc. are kind of solved externally already.

codefromthecrypt commented 5 years ago

Just as an idea: maybe this is going a bit too deep down the rabbit hole for one datastore, and it would make more sense to leave that part to Curator or ILM (by documenting the right configurations to be used)? The various use cases around time-based index patterns, rollover, deletion of data, etc. are kind of solved externally already.

Yes, curator is how people handle this today, and many can't store months of trace data either :P We currently mention using curator for index management, but possibly someone can come up with an example: https://github.com/apache/incubator-zipkin/blob/8e4ada890c1b4f0f21babaf1a2315af128aeb4f4/zipkin-storage/elasticsearch/README.md#indexes

shakuzen commented 5 years ago

In our company the requirement is for monthly or weekly Zipkin indexes. It would be great if you could add this support.

@singhabhinav03 could you elaborate on what you're trying to achieve that you cannot currently? The original request is to have finer-grained indexes than daily, because the data volume in one day is too large. Weekly or monthly indexes are only likely usable with relatively small amounts of tracing data.

codefromthecrypt commented 5 years ago

I think this issue got stuck as we were worried about how to address varied granularity. @narayaruna opened #2767 which doesn't imply varied granularity.

If we limit this to hourly indexes, anyone can still use curator or similar to rescale these to daily, weekly, or monthly... correct? cc @openzipkin/elasticsearch

xeraa commented 5 years ago

If we limit this to hourly indexes, anyone can still use curator or similar to rescale these to daily, weekly, or monthly... correct?

Not sure I'm reading this correctly, but combining hourly indices into a daily one (merging 24 indices) isn't easily possible — that would require a reindex (where you use a script to change the _index field).

My concern with hourly indices is that this will be a lot of shards. Just using 1 primary and 1 replica you'll end up with 48 shards for a single day. Our recommendation is to have less than 20 shards per GB of heap and each shard should be around 10 to 50GB in size. I can see how this works out for some heavy users, but it will be a bad choice for many others.

IMO a combination of rollover and write index alias would be the more generic solution that gives users fewer chances for bad configurations.

Do you have like a sample app where I could add the right config to show how this works? Might be easier than discussing it.
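
For illustration, the "rollover plus write index alias" combination boils down to bootstrapping the first index in the series with a write alias, and then letting rollover (or ILM) create the rest. Names here are hypothetical:

```
PUT /zipkin-span-000001
{
  "aliases": {
    "zipkin-span-write": { "is_write_index": true }
  }
}
```

Zipkin would write to `zipkin-span-write` and read from `zipkin-span-*`; each rollover moves the `is_write_index` flag to the newly created index.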

codefromthecrypt commented 5 years ago

@xeraa so I think the concern from @narayaruna is that with TB-scale indexes, searches, even with our cherry-picked indexing, require bumping read timeouts to 60s... so this is more about the query side than the write side, iiuc.

codefromthecrypt commented 5 years ago

So the thinking is... for data sets that naturally fit the heap-per-shard guidance at hourly granularity or less, putting that data in hourly indexes should make more sense than daily. The query side could be better optimized by this: instead of requesting a day index for a search, it could request an hourly one, without any special features...

am I missing something? (ps thanks for mentioning where hourly does not make sense! possibly we can add a check to warn when the config doesn't make sense)

xeraa commented 5 years ago

Yes, if you are looking at a short timeframe (like 1h). I'm not sure what the common access pattern is to be honest.

On the other hand, if you have a filter on the timeframe and access it frequently enough, then that will be cached and should also be pretty fast. I couldn't say how much of a win to expect (it depends on many factors, including the access pattern — timeframe and frequency).

codefromthecrypt commented 5 years ago

Literally, the default lookback is 1 hour, and currently the query built from it will hit a full day's index, or possibly 2 days' worth if just past midnight. This is probably why Nara mentions this: hourly indexes lower the default blast radius to at most 2 hours' worth of data if just past the hour.

[attached screenshot: Screenshot 2019-08-22 at 8 24 28 AM]
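
To make the index-targeting point concrete (index names and the date separator are illustrative; the real prefix and separator depend on configuration):

```
# today: a 1 hour lookback still fans out to at least one whole daily index
GET /zipkin-span-2019-08-22/_search

# with hourly indexes, the same lookback touches at most two hourly ones
GET /zipkin-span-2019-08-22-07,zipkin-span-2019-08-22-08/_search
```
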
codefromthecrypt commented 5 years ago

At any rate, we could put a branch up and see how it goes. If it isn't helpful we wouldn't do it, but for some sites this could be an easy-to-reason-about, low-tech option to speed some things up.

Ack on the reindexing thing if someone needs to re-scale data. We can put more notes in the readme with the knowledge gained here, regardless of whether the change is implemented.

codefromthecrypt commented 5 years ago

PS I opened this because I think I was the one who came up with the hour search default :) https://github.com/openzipkin/zipkin/issues/2772

xeraa commented 5 years ago

Sounds good on trying it out on a branch.

On the re-scaling: rather than reindexing indices together, you could have an index template with 3 primary shards (just as an example, for spreading ingestion over 3 nodes), but once the index is read-only you could shrink it down to a single primary shard. That should be the better pattern: more parallelization at first, then reducing the number of shards later on. And this is just a question of the index template plus Elastic Curator / ILM / ... — it would probably just need a little documentation on the Zipkin side.
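
A rough sketch of that shrink step against the Elasticsearch API (index and node names are hypothetical): first make the index read-only and colocate a full copy of its shards on one node, then shrink it into a new single-shard index:

```
PUT /zipkin-span-2019-08-22/_settings
{
  "settings": {
    "index.blocks.write": true,
    "index.routing.allocation.require._name": "shrink-node-1"
  }
}

POST /zipkin-span-2019-08-22/_shrink/zipkin-span-2019-08-22-shrunk
{
  "settings": {
    "index.number_of_shards": 1,
    "index.routing.allocation.require._name": null
  }
}
```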

nitishgoyal13 commented 3 years ago

We too are facing a similar issue. Our daily indices are growing to trillions of spans per daily index, resulting in slow queries. @libeilin @codefromthecrypt were you able to figure out any workaround for this? We are stuck here with our ES queries timing out.

nitishgoyal13 commented 2 years ago

Was anyone able to find a workaround for this? Are there any MRs that add support for hourly indices? Any help on the above would be really appreciated.

xeraa commented 2 years ago

If Zipkin can write to an alias (without any date math), then you could set that up with ILM (https://www.elastic.co/guide/en/elasticsearch/reference/current/overview-index-lifecycle-management.html) in the background. That this is part of Zipkin is probably for historic reasons, from when Elasticsearch lacked any such features, but things have luckily changed by now.
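
On newer Elasticsearch, wiring that up is mostly a matter of pointing an index template at an ILM policy and a rollover alias. A minimal sketch with hypothetical names, assuming Zipkin writes to the alias rather than computing dated index names:

```
PUT _index_template/zipkin-span-template
{
  "index_patterns": ["zipkin-span-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "zipkin-span-policy",
      "index.lifecycle.rollover_alias": "zipkin-span-write"
    }
  }
}
```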

Delirante commented 2 years ago

+1

rogierslag commented 5 months ago

If Zipkin can write to an alias (without any date math), then you could set that up with ILM (https://www.elastic.co/guide/en/elasticsearch/reference/current/overview-index-lifecycle-management.html) in the background. That this is part of Zipkin is probably for historic reasons, from when Elasticsearch lacked any such features, but things have luckily changed by now.

I'm also very interested in such a feature! When storing lots of traces with Zipkin, ILM features such as shrinking or force merging become problematic due to the size of the index. If we could write to an alias instead, we could automatically roll over the indices every 50GB, seriously reducing the batch size when shrinking and force merging.