vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Support index lifecycle management for Elasticsearch sink #7745

Open jszwedko opened 3 years ago

jszwedko commented 3 years ago

Support the same options that logstash does for managing index lifecycles: https://github.com/logstash-plugins/logstash-output-elasticsearch/blob/master/docs/index.asciidoc#index-lifecycle-management

spencergilbert commented 3 years ago

I've found that the ILM configuration in logstash is pretty bad, IMHO - It might be better to focus on data streams support?

jszwedko commented 3 years ago

@spencergilbert we plan to do both, but we can focus on datastreams first. Do you have thoughts on what an improved version of index lifecycle management might look like? Or do you think it's not even worth it given datastreams? We've had some users ask for ILM support.

spencergilbert commented 3 years ago

I think going all in on data streams is probably better; by my understanding, they handle a lot of the ILM work the client would otherwise need to implement.

Logstash ILM was painful when I used it because you can't/couldn't supply the alias as a template based on log fields.

jdrouet commented 3 years ago

There is a "problem" here: the index name is a template, so without a given event we cannot know which indexes will be written to. This implies that we would have to upsert the ILM policy and template definitions when we receive an event. In that case, each time we receive events we would have to make 3 calls: 1 to create the ILM policy, 1 to create the template, and 1 to push the metrics. This would most likely kill the performance of the sink. We could add a cache at the Vector level, but that becomes tricky when running several instances of Vector in parallel, and it increases memory usage. Now, if we look at another sink, Clickhouse, it needs a migration to work. Maybe Elasticsearch would need a migration to work as well.
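To make the cache idea concrete, here is a minimal sketch in Python (not Vector code). Everything in it is hypothetical: `SetupCache`, `render_index`, and the callbacks are illustrative names, the callbacks only record which call would have been made instead of talking to Elasticsearch, and Python's `str.format` stands in for the sink's real index template syntax.

```python
from typing import Callable, Dict, List, Set, Tuple


class SetupCache:
    """Remembers which rendered index names already had their ILM policy
    and index template upserted (hypothetical helper, not a Vector API)."""

    def __init__(
        self,
        ensure_policy: Callable[[str], None],
        ensure_template: Callable[[str], None],
    ) -> None:
        self._ensure_policy = ensure_policy
        self._ensure_template = ensure_template
        self._ready: Set[str] = set()  # grows with the number of distinct indexes

    def prepare(self, index: str) -> None:
        # Only the first event for a given index pays for the two setup calls.
        if index in self._ready:
            return
        self._ensure_policy(index)
        self._ensure_template(index)
        self._ready.add(index)


def render_index(template: str, event: Dict[str, str]) -> str:
    # Stand-in for template rendering: the index is only known per event.
    return template.format(**event)


# Record the calls that would be issued instead of performing them.
calls: List[Tuple[str, str]] = []
cache = SetupCache(
    ensure_policy=lambda index: calls.append(("policy", index)),
    ensure_template=lambda index: calls.append(("template", index)),
)

events = [{"app": "web"}, {"app": "web"}, {"app": "db"}]
for event in events:
    index = render_index("logs-{app}", event)
    cache.prepare(index)
    calls.append(("bulk", index))
```

With these three events the sketch records 7 calls instead of 9, because the second `web` event skips the setup calls. The cache is per process, so several Vector instances running in parallel would each repeat the setup once per index, and the set grows with every distinct index name, which is the memory concern mentioned above.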

jszwedko commented 3 years ago

After some discussion we've decided to punt on this for now given we have added datastreams support which seems to be the ordained path for getting observability data into Elasticsearch and handles index lifecycle management for you. We'll leave this open to collect additional use-cases for ILM though.

antgel commented 2 years ago

Hi, hopefully this isn't too OT - after reading the above, looks like we need to get into data streams. Can somebody explain how data streams "handle index lifecycle management for you"? According to the docs it's still necessary to set up ILM. What have I missed?

spencergilbert commented 2 years ago

> Hi, hopefully this isn't too OT - after reading the above, looks like we need to get into data streams. Can somebody explain how data streams "handle index lifecycle management for you"? According to the docs it's still necessary to set up ILM. What have I missed?

I think particularly data streams avoid the hassle described here: https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started-index-lifecycle-management.html#manage-time-series-data-without-data-streams

pgvishnuram commented 1 year ago

@jszwedko @spencergilbert - is support for this feature being worked on?

jszwedko commented 1 year ago

> @jszwedko @spencergilbert - is support for this feature being worked on?

Not currently. We'd be happy to review a proposal for it though! Many users seem to have moved onto data streams for telemetry data.

DekelDevunet commented 7 months ago

@spencergilbert @jszwedko how can one decide which data_stream name, index template, or ILM policy Vector is going to use?

From my understanding, I need to create the index template and ILM policy beforehand. But I am not sure how to set up Vector so that it works with my custom ILM policy and index template.