
Document level TTL for Index #7826

Open zoman-jayesh opened 1 year ago

zoman-jayesh commented 1 year ago

Is your feature request related to a problem? Please describe. For many of our indexes, we are required to keep data only for a certain duration, say 5-7 days. Currently, we have to run a query to figure out which data was inserted on day T-7, find the corresponding keys, and then delete them. This is a lengthy and problematic process, especially for k-NN indices. For large indexes this results in a large chunk of data being deleted at once, since daily insertions are high, and it can sometimes lead to cluster downtime.

Describe the solution you'd like Allow setting a document-level TTL property, where each document can be configured to stay active for a specified duration after insertion and then be marked for deletion automatically once that duration elapses. This would let us reuse the same index while avoiding a large delete load on the cluster.

Describe alternatives you've considered Figuring out first which ids/documents were inserted on the day in question, then sending a high number of deletion requests to the index.


andrross commented 1 year ago

@zoman-jayesh Have you looked at data streams? Using data streams to automatically roll over the underlying index, along with an ISM policy to delete the backing indexes at a certain age, is generally the recommended pattern for a use case like this.
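
For context, a minimal sketch of what the data stream setup could look like. The template name `ttl-demo-template` and stream name `ttl-demo` are placeholders, not anything from this thread, and data streams do require a `@timestamp` date field on every document:

```
PUT _index_template/ttl-demo-template
{
  "index_patterns": ["ttl-demo*"],
  "data_stream": {},
  "template": {
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" }
      }
    }
  }
}

PUT _data_stream/ttl-demo
```

Writes then go to the stream name, and rollover creates a new backing index behind it.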

zoman-jayesh commented 1 year ago

@andrross I do not want the index to be deleted. Let me give you an example. I have a k-NN index.

Index created on 1st June:
- 1st June: ids 1 to 100 are inserted
- 2nd June: ids 101 to 200 are inserted
- 3rd June: ids 201 to 300 are inserted
- 4th June: ids 301 to 400 are inserted
- 5th June: ids 401 to 500 are inserted, while ids 1 to 100 are auto-deleted
- 6th June: ids 501 to 600 are inserted, while ids 101 to 200 are auto-deleted

Note this should happen automatically. The index contains both scalar and k-NN fields. I would love to set up an ISM policy, however I was not able to find how to do this without causing downtime or issues. Can you give me an example of the same? Also, I currently don't have a timestamp field in my index.

shwetathareja commented 1 year ago

@zoman-jayesh - have you considered setting up an ISM (Index State Management) job which deletes documents using the delete_by_query API on a periodic basis? This would require a field in the document, like created_date, which can be used to query the documents that are eligible for deletion.
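
A rough sketch of the request such a periodic job would issue, assuming a placeholder index name `my-knn-index`, a `created_date` date field, and a 7-day retention window:

```
POST my-knn-index/_delete_by_query
{
  "query": {
    "range": {
      "created_date": {
        "lt": "now-7d/d"
      }
    }
  }
}
```

Whether it is triggered by ISM or by an external scheduler, the range query keeps the deletion scoped to documents older than the retention window.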

andrross commented 1 year ago

@zoman-jayesh With a data stream, you can roll over to a new backing index each day, then delete the backing indexes that are more than 5 days old. You can search the stream as if it were a single index and it will search all the backing indexes. ISM should be able to automate this for you. The deletions will be very fast because OpenSearch doesn't need to search for data to delete and can just remove the entire backing index for that day. The downside is that deletion happens at day granularity, removing every write for a given day, as opposed to querying for a specific timestamp. Also, this assumes you have a pure append-only use case and in fact want to delete everything in the index older than a specific day.
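
A rough sketch of an ISM policy in that direction, reusing the placeholder `ttl-demo*` pattern from the earlier sketch; the policy name and the 1-day rollover / 5-day delete thresholds are illustrative, not prescriptive:

```
PUT _plugins/_ism/policies/rollover-and-expire
{
  "policy": {
    "description": "Roll over daily, delete backing indexes older than 5 days",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [
          { "rollover": { "min_index_age": "1d" } }
        ],
        "transitions": [
          { "state_name": "delete", "conditions": { "min_index_age": "5d" } }
        ]
      },
      {
        "name": "delete",
        "actions": [
          { "delete": {} }
        ],
        "transitions": []
      }
    ],
    "ism_template": [
      { "index_patterns": ["ttl-demo*"], "priority": 100 }
    ]
  }
}
```

The `ism_template` block attaches the policy automatically to new backing indexes that match the pattern, so each day's rollover picks it up without manual intervention.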

Bukhtawar commented 1 year ago

Expiring individual documents has performance implications and would also be resource intensive, which is something we should avoid.

nandi-github commented 1 year ago

We are hearing similar requirements for compliance reasons, wherein customers are required to delete certain documents after a set amount of time. The aging option should support both setting it at the time of document creation and setting or changing it at a later time.

In terms of implementation, it is up to you to engineer an optimal solution.

The use cases are about meeting compliance requirements such as GDPR.