Transition based on cluster free available space

opensearch-project / index-management

🗃 Automate periodic data operations, such as deleting indices at a certain age or performing a rollover at a certain size

https://opensearch.org/docs/latest/im-plugin/index/

Apache License 2.0

53 stars 113 forks source link

Transition based on cluster free available space #44

Open adityaj1107 opened 3 years ago

adityaj1107 commented 3 years ago

Issue by drock Monday Jul 27, 2020 at 21:04 GMT Originally opened as https://github.com/opendistro-for-elasticsearch/index-management/issues/260

Problem: Currently ISM allows you to transition an index to a new state based on the index's age. This works generally well for managing the overall size of a logging cluster. This works best when you have a consistent amount of volume that is generated at some interval, such as 100GB per day for example. In this case you can calculate the amount of time to keep an index around before deleting it with pretty good accuracy.

If you have a varying amount of volume, or it is largely unpredictable however, this becomes a problem. You cannot calculate with much certainty whether keeping an index around for 1, 5, 15, etc. days will make you eventually run out of space.

Solution Request It would be great if you could define transitions based upon available space. If you are evenly distributing your data across nodes (which you should) then you could instead configure a policy to delete indices when there is less than 10%, 20%, etc. available space. This would have to delete indices in descending order from oldest to newest of course.

This would be especially beneficial to an organization like ours which has an annually seasonal usage pattern. If I define an ISM policy that keeps my disk usage at bay today, it may no longer work well a month from now when my usage and volume is higher.

This leads us to either have to over-provision our cluster for our heaviest usage or constantly monitor the cluster and keep adjusting the ISM policies.

An added bonus would be to be able to do this based on the storage space of different types of nodes. For example, we run a hot/warm workflow. Our warm nodes can hold far more data than the hot nodes. Thus, it would be optimal to be able to transition indices to warm once the hot nodes are getting full and then again to transition to delete once the warm nodes are getting full.

adityaj1107 commented 3 years ago

Comment by dbbaughe Tuesday Aug 11, 2020 at 04:19 GMT

Hey @drock,

Thanks for opening an issue. I'm guessing using the min_size condition will run into a similar issue. Should let you rollover based on your volume and ingestion patterns, but the part you're looking for is a way to adjust how long that index sticks around after that rollover right?

One thing we want to avoid at least right now with the current architecture is having the individual index jobs rely on information from other jobs or have a side effect on those other jobs. And the main problem I'm thinking of is eventually you'll have n number of indices that are all executing/checking their transition conditions and you'll hit this condition and if you're unlucky you'll have them all transition to the next state (and in your case delete themselves presumably). And unfortunately ISM doesn't execute itself on an interval where it can just check which index is the oldest. Each of these indices (jobs) all execute individually with no knowledge of each other.

It definitely seems like a problem we should try to solve though. So in light of that just to open the discussion (while contradicting myself a bit), what about the possibility of using the total size of the indices an alias points to or an index pattern? If you are rolling over indices then you could possibly designate group_a, group_b, group_c, group_d, etc. all with their own reserved space.

e.g. group_a indices can have up to 400GB of primary store size, group_b indices can have up to 200GB of primary store size, etc. and once they go over that limit we can have the oldest index in that group delete itself.

This would offload the issue of each job having to figure out between other jobs who is oldest and with similar conditions. Although if you ever ended up with an alias group where the oldest was not being managed then all of these would get blocked (perhaps until another condition evaluated to true). Theoretically you could apply an alias to every index being managed and use this on that whole group (which could be every index in the cluster if you manage everything).

Perhaps there is an easier way to solve this problem though, will try to think through it some more.

adityaj1107 commented 3 years ago

Comment by drock Tuesday Aug 11, 2020 at 20:06 GMT

@dbbaughe I like your suggestion. I think its actually better than mine of looking at total available space of the cluster because its more flexible. It would allow me to have multiple groups of indices, each with a maximum allocation of a certain percentage of space like logs can take 60% of the space and metrics can take 30%, etc. Given I know the total size of my cluster before hand, I can easily calculate how much 60% is and put that in the policy.

👍

adityaj1107 commented 3 years ago

Comment by gittygoo Monday Nov 23, 2020 at 14:29 GMT

This is something we would be looking for aswell.

Our scenario is pretty similar: we have a cluster which we allocate some space (say 500GB) and if it hits a certain percentage overall (say 80% usage) we would like to start deleting older indexes until the percentage falls below 80% again...

adityaj1107 commented 3 years ago

Comment by tybalex Wednesday Jun 02, 2021 at 19:32 GMT

my current workaround is take snapshot of cold indices and then delete it soon. However it would be better to have this feature implemented.

adaisley commented 2 years ago

We currently use Curator to manage our Elasticsearch indexes. We’ve set up a system to allow us to define a list of index prefixes, with an allowance of hot data, warm data and a potential number of weeks worth of data to keep for each group. We then use bash to programmatically generate these rules for each prefix, essentially giving us rules on “groups” of indexes.

For example, a given ‘prefix group’ is ‘product-prod-’, with 20gb of hot data, 100gb of warm data and 26 weeks to keep. This will then create a rule that will check to see if all of the product-prod-* indexes with a hot tag exceed 20gb, and move indexes if any older indexes exceed that amount. Then it’ll check to see if the warm tagged ‘group’ of indexes exceed 100gb and delete anything that goes past that threshold. Lastly, if there are any indexes in that group over 26 weeks old (by the pattern in the index name), it’ll get rid of them.

So having the ability to manage indexes based on the index pattern as 'a group' would be a highly desirable feature. Unless anyone fancied maintaining an Opensearch-compatible version of Curator haha.

Would also be nice when selecting the "minimum index age" transition to be able to define a timestring pattern to use for the index age. I'm guessing by default it uses the index creation_age, which for our use case might not be the most reliable, as most of the data we grab uses the client's timestamp, meaning we can have indexes created from years ago/years in the future, lol.

ffatghub commented 1 year ago

Hi all,

we use Curator (like @adaisley) and our use case is the same as @aditjind ("if it hits a certain percentage overall - say 80% usage - we would like to start deleting older indexes until the percentage falls below 80% again").

Any news about this feature?

thank you

Mo0rBy commented 8 months ago

Hi all,

Myself and my team also require this sort of functionality, to delete the oldest items from an index when the index reaches a specific size.

For example:

My index reaches 100Gb in size
A new document is added to the index
The ISM policy is utilised and deletes the oldest items until the index has enough space for the new document

I understand that this won't work in the over-simplified way I have described here, just trying to give an idea of what I (and I think others) need for this feature.

Pigueiras commented 1 month ago

@adityaj1107 I guess this feature (or something similar) is still not available, right? I was looking for a way to set a size limit on a data stream or to set a maximum number of rollovers as a way of limiting the size. However, it seems that the only option at the moment is to set a time limit based on either the index age or the rollover age.

sudhanshuagarwal06 commented 3 weeks ago

Hi,

The same issue I was also facing.

To address the challenge of managing index retention with variable or unpredictable data volumes, a potential solution is to implement a script that monitors the total size of the node. When the total size exceeds a predefined threshold, the script can trigger the deletion of older indices. like this...


from opensearchpy import OpenSearch
from datetime import datetime

host = 'host_url'
auth = ('username', 'your_password') 

client = OpenSearch(
        hosts=[host],
        http_auth=auth
    )

INDEX_PATTERN = 'logs*' # log pattern
DISK_USAGE_THRESHOLD = 80 # percentage threshold 

def get_node_disk_usage(client):
    response = client.nodes.stats(metric='fs')
    nodes_state = {}
    count=0
    for node in response['nodes']:
        node_id = node
        disk_usage = response['nodes'][node]['fs']['total']['total_in_bytes'] - response['nodes'][node]['fs']['total']['available_in_bytes']
        disk_usage_percentage = (disk_usage / response['nodes'][node]['fs']['total']['total_in_bytes']) * 100
        nodes_state[count] = {
            'disk_usage': disk_usage_percentage,
            'node_id': node_id,
            'name': response['nodes'][node]['name']
        } 
        count+=1
    return nodes_state

def get_index_creation_date(client, index_name):
    response = client.indices.get(index=index_name)
    creation_date = response[index_name]['settings']['index']['creation_date']
    creation_date = datetime.fromtimestamp(int(creation_date) / 1000)
    return creation_date

def delete_oldest_index(client, index_pattern):
    indices = client.indices.get(index=index_pattern)
    indices = [index for index in indices]
    if not indices:
        print("No index found.")
        return

    oldest_index = min(indices, key=lambda x: get_index_creation_date(client, x))

    try:
        response = client.indices.delete(index=oldest_index)
        print(f"Deleted index {oldest_index}: {response}")
    except Exception as e:
        print(f"Failed to delete index {oldest_index}: {e}")

def main():
    nodes_state = get_node_disk_usage(client)
    for _, node in nodes_state.items():
        if node['name'] != "os-master":
            print(f"Node {node['name']} disk usage: {node['disk_usage']}%")
            if node['disk_usage'] > DISK_USAGE_THRESHOLD:
                print("Now delete old index")
                delete_oldest_index(client, INDEX_PATTERN)
                break

if __name__ == "__main__":
    main()

Pigueiras commented 3 weeks ago

The purpose of using ISM is to avoid custom scripts 😅

sudhanshuagarwal06 commented 3 weeks ago

Yes, I know but I think this feature still not available😅that's why I wrote costome script.