openzipkin / zipkin

Zipkin is a distributed tracing system
https://zipkin.io/
Apache License 2.0

Support for a single global index #3711

Closed Tigwin closed 9 months ago

Tigwin commented 9 months ago

I have a special case: getting more resources in AWS is problematic because of budget constraints, but we have no issue with resources for log storage. I have an index on a very large ES cluster that I can export logs to, but I can't get new indexes created. Since it's a company cluster, they can handle expiration of data too.

Getting funding for new AWS resources is very difficult. In fact, I previously had a project that was repeatedly denied AWS resources even though it could save the company millions in costs if it was allowed to continue. It took three attempts and director-level overrides to get the $300/mo in additional AWS costs approved.

So I asked the lead dev if we could modify the zipkin code dealing with the index, but that isn't desirable as the company doesn't like to maintain third-party software. At this point I either need a global index feature, have to fight for AWS resources, or have to find another project similar to zipkin (even though we have already integrated zipkin).

I know it's a long shot, just posting to see if this is perhaps a useful feature for anyone else.

codefromthecrypt commented 9 months ago

Thanks @Tigwin for the context!

Specifically, we are talking about an option to go from this:

To just a global index like this:

To recap from chat, the main reasons we have daily index patterns are to:

Technically, I think the work is likely not too bad compared with trying to support arbitrary patterns, but users would need to accept the risk that if they don't expire data, routine queries for recent traces may become unusable and the dependencies job may take forever, or need a cluster, to complete. It might help to describe how you plan to expire data, in order to help others.

Implementation would require changes here and also in the zipkin-dependencies repo to make sure it works, as I think there are assumptions about index patterns there.
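To illustrate why an unexpired global index makes routine queries heavier, here is a minimal sketch of which indexes a lookback query would touch under each layout. It assumes an index prefix of `zipkin` and a daily layout of `zipkin-span-yyyy-MM-dd`; the global name `zipkin-span` is hypothetical, since no such option exists today.

```python
from datetime import date, timedelta


def daily_indices(prefix: str, end: date, lookback_days: int) -> list[str]:
    """Index names a query touches when there is one span index per day."""
    return [
        f"{prefix}-span-{(end - timedelta(days=offset)).isoformat()}"
        for offset in range(lookback_days)
    ]


def global_index(prefix: str) -> list[str]:
    """Hypothetical single index: every query scans it, whatever the lookback."""
    return [f"{prefix}-span"]


if __name__ == "__main__":
    # A three-day lookback only needs three daily indexes...
    print(daily_indices("zipkin", date(2024, 1, 3), 3))
    # ...but always needs the whole global index, however old its data is.
    print(global_index("zipkin"))
```

With daily indexes the cost of a query is bounded by the lookback window. With a single index it grows with everything that was never expired.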

In Zipkin, we usually follow a "rule of three" where we block on three sites' interest for tricky features. Ping back when there are at least three (👍) on this and no other contributors saying "please don't". You can self-thumbs up for simpler counting ;)

FWIW, I'm ok with this, but do think we need the feature to have more popularity prior to starting the work.

cc @openzipkin/elasticsearch

xeraa commented 9 months ago

How do you expire data? _delete_by_query? That is really expensive (in comparison to dropping a daily or timeframe based index) and something we (strongly) don't recommend.
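As a rough sketch of the two expiration styles being compared, the snippet below only builds the requests and does not send them. The index names are hypothetical, and the `timestamp_millis` field is an assumption about how span documents are stored.

```python
import json


def drop_daily_index(index: str) -> tuple[str, str, None]:
    """Expire one day of data by deleting its whole index (cheap)."""
    return ("DELETE", f"/{index}", None)


def delete_by_query(index: str, cutoff_millis: int) -> tuple[str, str, str]:
    """Expire old documents inside a shared index (expensive: each matching
    document is deleted individually)."""
    # Assumption: spans carry a `timestamp_millis` field to filter on.
    body = {"query": {"range": {"timestamp_millis": {"lt": cutoff_millis}}}}
    return ("POST", f"/{index}/_delete_by_query", json.dumps(body))


if __name__ == "__main__":
    print(drop_daily_index("zipkin-span-2024-01-01"))
    print(delete_by_query("zipkin-span", 1704067200000))
```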

I'm a bit cautious here because this might fix your very specific problem but is a bad idea for pretty much everyone else.

codefromthecrypt commented 9 months ago

I'm a bit cautious here because this might fix your very specific problem but is a bad idea for pretty much everyone else.

ack on cars driving towards cliff. That said, if there was a way to do a global index it certainly wouldn't be by default or recommended, and I'll add this reason to the list if something happens ;)

Regardless, I'm also very curious about how expiration could be handled. Maybe truncate the entire index daily, or drop it and have zipkin recreate it? Something like this might be OK if you don't strictly depend on trace data, such as for QA sites, or if the setup is short-lived or secondary for other reasons.

codefromthecrypt commented 9 months ago

Another anecdote: over the last few years a lot of instrumentation has both stopped sampling and started creating a lot of data per trace, whereas zipkin was initially designed for sampled data and not a lot of it.

Depending on the site, throughput and tracing policy, there could be no dire concern. It is a function of span count, cardinality and size, which is all about how the site is set up. One site can collect more data in a day than another might collect in a year.

reta commented 9 months ago

@Tigwin I think you could easily get what you want with aliases:

PS: (If an index alias points to one index and is_write_index isn’t set, the index automatically acts as the write index. )
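A minimal sketch of what such an alias request could look like, using the Elasticsearch `_aliases` endpoint. This is one possible reading of the suggestion: the existing index name `company-logs` and the alias name are hypothetical, and the request is only built here, not sent.

```python
import json


def add_alias_request(index: str, alias: str) -> tuple[str, str, str]:
    """Build a request that makes `alias` point at the existing `index`.

    `is_write_index` is left unset: with a single target index, writes to
    the alias go to that index, as the PS above notes.
    """
    actions = {"actions": [{"add": {"index": index, "alias": alias}}]}
    return ("POST", "/_aliases", json.dumps(actions))


if __name__ == "__main__":
    # Hypothetical names: the company-provided index and the name zipkin writes to.
    print(add_alias_request("company-logs", "zipkin-span-2024-01-01"))
```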

Tigwin commented 9 months ago

@Tigwin I think you could easily get what you want with aliases:

Unfortunately I don't have administrative access to the cluster. I can just write to the index they've created for me.

reta commented 9 months ago

@Tigwin I think you could easily get what you want with aliases:

Unfortunately I don't have administrative access to the cluster. I can just write to the index they've created for me.

No need for admin access, only the manage index privilege (https://www.elastic.co/guide/en/elasticsearch/reference/current/security-privileges.html#privileges-list-indices), in case you can negotiate that part.

Tigwin commented 9 months ago

@reta thank you, I'll check to see if that is possible.

I have one other question: is it possible for zipkin to hand off the values it would have written to ES to a local file or a buffer/variable? My system already writes to a log file (which is exported to ES via fluentd). If zipkin could hand off its values to the parent app, I could combine its log with the application log and avoid having zipkin deal with ES altogether.

E.g., currently the app writes all details known about a user request to the app log, including a thread_id. We can then use the thread_id in zipkin to pull the backtrace for further details. If I can get the zipkin data directly, I can include both in the fluentd log upload. That resolves my index problem, and I won't have to join the two anymore since they'd already be combined.

Thanks for all the help everyone, I don't think I've ever seen a project this responsive.

codefromthecrypt commented 9 months ago

is it possible for zipkin to hand off the values it would have written to ES to a local file or a buffer/variable

In general, it would be a custom StorageComponent which does something different when consuming spans (e.g. holding a buffer or something else). @jeqo has made some amazing tools for forking data:

https://github.com/openzipkin-contrib/zipkin-storage-forwarder https://github.com/openzipkin-contrib/zipkin-storage-kafka

as well there is/was https://github.com/ExpediaGroup/pitchfork by @worldtiki, which is currently dormant, but as zipkin itself proved, dormancy can be resolved if forces align

none of these are log focused at the moment, but might help jog ideas
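To make the buffer idea a bit more concrete, here is a language-neutral sketch in Python. It is not Zipkin's Java StorageComponent API: `SpanLogBuffer` and its methods are hypothetical, and spans are assumed to arrive as plain dicts. It accepts batches of spans and flushes them as JSON lines, which a log shipper such as fluentd could then pick up.

```python
import io
import json
from typing import TextIO


class SpanLogBuffer:
    """Hypothetical span consumer that buffers spans instead of storing them."""

    def __init__(self) -> None:
        self._spans: list[dict] = []

    def accept(self, spans: list[dict]) -> None:
        """Hold a batch of spans in memory until the next flush."""
        self._spans.extend(spans)

    def flush(self, out: TextIO) -> int:
        """Write buffered spans as one JSON object per line; return the count."""
        count = len(self._spans)
        for span in self._spans:
            out.write(json.dumps(span, sort_keys=True) + "\n")
        self._spans.clear()
        return count


if __name__ == "__main__":
    buffer = SpanLogBuffer()
    buffer.accept([{"traceId": "463ac35c9f6413ad", "id": "463ac35c9f6413ad", "name": "get /users"}])
    sink = io.StringIO()  # stands in for the application log file
    buffer.flush(sink)
    print(sink.getvalue(), end="")
```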

codefromthecrypt commented 9 months ago

@Tigwin sorry I forgot.. while it would be a bit hacky to scrape it, you technically can see the data written to elasticsearch by setting ES_HTTP_LOGGING=body like so https://github.com/openzipkin/zipkin/tree/master/zipkin-server#elasticsearch-storage. Basically this logs to the console the api requests made to elasticsearch.

Tigwin commented 9 months ago

@Tigwin I think you could easily get what you want with aliases:

PS: (If an index alias points to one index and is_write_index isn’t set, the index automatically acts as the write index. )

@reta

After talking it over, it sounds like this fixes my problem without requiring any code changes, which is perfect. And the alias is a one-time thing, so it doesn't require elevated privileges for us either.

Will zipkin run into any issues when it tries to create a new daily index that matches the alias?

reta commented 9 months ago

Will zipkin run into any issues when it tries to create a new daily index that matches the alias?

That may be an issue, https://opster.com/es-errors/an-index-or-data-stream-exists-with-the-same-name-as-the-alias/

xeraa commented 9 months ago

The "Elastic way" of managing the (write) alias, creating new indices, deleting old data,... would be Index Lifecycle Management (ILM). That would need to be set up once by the operators of the cluster but after that Zipkin would only need read / write permissions to an index pattern.
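As a rough illustration of the one-time setup the cluster operators would do, the sketch below builds an ILM policy request that rolls over to a new index periodically and deletes old ones. The policy name and the retention values are hypothetical, and the request is only constructed, not sent.

```python
import json


def ilm_policy_request(name: str, rollover_age: str, delete_after: str) -> tuple[str, str, str]:
    """Build a PUT request for an ILM policy with hot and delete phases."""
    policy = {
        "policy": {
            "phases": {
                # Roll the write alias over to a fresh index once this one is old enough.
                "hot": {"actions": {"rollover": {"max_age": rollover_age}}},
                # Delete whole indexes some time after they were rolled over.
                "delete": {"min_age": delete_after, "actions": {"delete": {}}},
            }
        }
    }
    return ("PUT", f"/_ilm/policy/{name}", json.dumps(policy))


if __name__ == "__main__":
    # Hypothetical values: daily rollover, one week of retention.
    print(ilm_policy_request("zipkin-spans", "1d", "7d"))
```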

If you have a conflict in naming, you could use the _clone API, which is like a hardlink. Then you could delete the old index name and replace it with an alias. You'll see _reindex as a common suggestion but that is a much, much heavier operation that I'd avoid if you can.

codefromthecrypt commented 9 months ago

PS: sorry, I suggested a logging option which stopped working recently; I opened issue #3712 for it. I was going to suggest that if you want to see whether something works, you can try it on our examples. You still can, except that ES_HTTP_LOGGING won't show you the requests. Even so, playing around with the server may be easier than wondering "what if X?": https://github.com/openzipkin/zipkin/tree/master/docker/examples#elasticsearch

worldtiki commented 9 months ago

In general, it would be a custom StorageComponent which does something different when consuming spans (e.g. holding a buffer or something else) @jeqo has made some amazing tools for forking data

https://github.com/openzipkin-contrib/zipkin-storage-forwarder https://github.com/openzipkin-contrib/zipkin-storage-kafka

as well there is/was https://github.com/ExpediaGroup/pitchfork by @worldtiki, which is currently dormant, but as zipkin itself proved, dormancy can be resolved if forces align

none of these are log focused at the moment, but might help jog ideas

Just replying to this specific comment about pitchfork.

The project is indeed dormant as one of the primary goals was to allow a dual Zipkin/Haystack installation, and since Haystack was abandoned there weren't many reasons to keep this one going.

I'm happy to revive pitchfork if there's interest in a similar dual system with Zipkin and something else (although it does seem like a niche thing).

codefromthecrypt commented 9 months ago

@Tigwin any summary here?