opensearch-project / alerting

📟 Get notified when your data meets certain conditions by setting up monitors, alerts, and notifications
https://opensearch.org/docs/latest/monitoring-plugins/alerting/index/
Apache License 2.0
60 stars 102 forks

[FEATURE] Optimize bucket level monitor querying aliases to query only those indices that can contain relevant docs #1710

Open eirsep opened 1 day ago

eirsep commented 1 day ago

Is your feature request related to a problem?

Bucket-level monitors are periodic jobs that execute an aggregation search query on a set of indices. If the data source is configured as an alias, a bucket-level monitor currently executes its aggregation query against all indices within the alias, even those that fall outside the query's time range. This can lead to significant performance degradation, especially with large numbers of indices or indices residing in colder storage tiers.

Time range of bucket level monitor

If the aggregation search query has a time range filter, it can use the search parameter period_end: the user writes {{period_end}} verbatim in the query, and it is replaced with the time of monitor execution.

Example search query

{
       "size": 0,
       "query": {
         "bool": {
           "filter": [{
             "range": {
               "timestamp": {
                 "from": "{{period_end}}||-1h",
                 "to": "{{period_end}}",
                 "include_lower": true,
                 "include_upper": true,
                 "format": "epoch_millis",
                 "boost": 1
               }
             }
           }],
           "adjust_pure_negative": true,
           "boost": 1
         }
       },
       "aggregations": .....
     }

In this example, the user queries the last hour of data every time the monitor executes, by setting the start of the interval to "from": "{{period_end}}||-1h" and the end of the interval to "to": "{{period_end}}".
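To make the interval concrete, here is a minimal Python sketch of how the two bounds could be resolved once the monitor execution time is known. It is not the plugin's implementation: `resolve_bound` is a hypothetical helper, it handles only a small subset of date math (a single `||+N` or `||-N` offset in minutes, hours, or days), and it works on `datetime` values rather than the epoch milliseconds used in the query.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical helper: supports only "{{period_end}}" optionally followed by
# one "||+N<unit>" or "||-N<unit>" offset, where unit is m, h, or d.
_UNITS = {"m": "minutes", "h": "hours", "d": "days"}
_PATTERN = re.compile(r"^\{\{period_end\}\}(?:\|\|([+-])(\d+)([mhd]))?$")


def resolve_bound(expr: str, period_end: datetime) -> datetime:
    """Resolve a range bound such as '{{period_end}}||-1h' to a timestamp."""
    match = _PATTERN.match(expr)
    if match is None:
        raise ValueError(f"unsupported range expression: {expr!r}")
    sign, amount, unit = match.groups()
    if sign is None:
        # Plain "{{period_end}}" with no offset.
        return period_end
    delta = timedelta(**{_UNITS[unit]: int(amount)})
    return period_end + delta if sign == "+" else period_end - delta


# Monitor executes at 17:00 UTC, so the query covers 16:00 to 17:00.
period_end = datetime(2024, 1, 1, 17, 0, tzinfo=timezone.utc)
interval_start = resolve_bound("{{period_end}}||-1h", period_end)
interval_end = resolve_bound("{{period_end}}", period_end)
```

The resolved start of the interval is the value the proposed optimization would compare against index metadata when deciding which indices of the alias to query.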

This enhancement applies to aliases that roll over and ingest time series data. It proposes optimizing monitor execution by resolving an alias to only those indices that potentially contain data within the query's time range. The optimization will be applied when the aggregation query includes a time range filter using the period_end search parameter.

Benefits:

By limiting the number of indices queried, we can significantly reduce query execution time and improve overall monitor performance.

What solution would you like?

1. Check if the bucket-level monitor data source is an alias.
2. If it is an alias, check whether the query specifies a time frame.
3. If a time frame interval is present, fetch only two types of indices from that alias:
   a. Indices that were created after the start of the time frame interval.
   b. The one index that chronologically precedes the indices fetched in (a).

For example, if the time frame is 1 hour and the current time is 5 pm, the interval starts at 4 pm and ends at 5 pm. We need the indices created after 4 pm, plus the one index that was probably created at, say, 3:30 pm, since it will contain the 4 pm data.

That way we filter out warm indices and other indices that don't have data from that interval.
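The selection rule proposed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `IndexInfo` and `select_candidate_indices` are hypothetical names, creation dates are assumed to be available as epoch milliseconds, and an index created exactly at the start of the interval is treated as "created after" it so that the result errs on the side of including one extra index.

```python
from bisect import bisect_left
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass
class IndexInfo:
    """Hypothetical view of an index behind the alias."""
    name: str
    creation_ms: int  # index creation date in epoch milliseconds


def select_candidate_indices(indices: List[IndexInfo], interval_start_ms: int) -> List[str]:
    """Keep indices created at or after the interval start, plus the one just before."""
    ordered = sorted(indices, key=lambda idx: idx.creation_ms)
    creation_times = [idx.creation_ms for idx in ordered]
    # Position of the first index created at or after the interval start.
    first_after = bisect_left(creation_times, interval_start_ms)
    # Step back one index: it was the write index when the interval began,
    # so it may hold the earliest documents of the interval.
    first_kept = max(first_after - 1, 0)
    return [idx.name for idx in ordered[first_kept:]]


def at(hour: int, minute: int = 0) -> int:
    """Epoch milliseconds for a time of day on an arbitrary example date (UTC)."""
    return int(datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc).timestamp() * 1000)


alias_indices = [
    IndexInfo("logs-000001", at(13, 0)),
    IndexInfo("logs-000002", at(15, 30)),
    IndexInfo("logs-000003", at(16, 20)),
    IndexInfo("logs-000004", at(16, 50)),
]
# Interval is 4 pm to 5 pm: logs-000001 is skipped, the rest are queried.
selected = select_candidate_indices(alias_indices, at(16, 0))
```

If no index was created after the start of the interval, the sketch still returns the most recently created index, which matches the case of an alias that has not rolled over during the interval.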

shwetathareja commented 23 hours ago

@eirsep Thanks for the proposal. If I understand correctly, you are looking for can_match-style behavior that skips shards which don't fall into the primary sort range.

eirsep commented 15 hours ago

@shwetathareja That's right, but can_match is still a pre-filter phase of a search query. If the query targets UltraWarm indices, the UltraWarm nodes would have to download the indices onto the cluster before executing can_match.

In this case I am simply calculating that while resolving indices: time series data would only have monotonically increasing timestamps, so picking the indices by creation date would suffice.

shwetathareja commented 2 hours ago

@eirsep - Index creation times might be misleading if the customer is running a backfill, or if an issue on the user's client or the OpenSearch service side delayed ingestion. It is better to rely on the timestamps of the actual data ingested.

But that's a good point if the lowest timestamp across shards can be populated in an index property once the index is marked read-only.
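As a rough illustration of the data-timestamp alternative raised in this comment, the Python sketch below filters indices by stored timestamp bounds instead of creation date. Everything here is an assumption for discussion: the `min_ts_ms` and `max_ts_ms` fields stand in for a hypothetical index property (the comment mentions only the lowest timestamp; the upper bound is added here so that old indices can be skipped too), and indices without bounds, such as ones that are still writable, are always queried.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IndexTimeBounds:
    """Hypothetical per-index metadata, populated once an index is read-only."""
    name: str
    min_ts_ms: Optional[int] = None  # lowest document timestamp, if recorded
    max_ts_ms: Optional[int] = None  # highest document timestamp, if recorded


def may_contain(index: IndexTimeBounds, start_ms: int, end_ms: int) -> bool:
    """Return True unless the recorded bounds prove the index is out of range."""
    if index.min_ts_ms is None or index.max_ts_ms is None:
        # No bounds recorded (for example, still being written to): query it.
        return True
    # Standard overlap test between [min_ts, max_ts] and [start, end].
    return index.min_ts_ms <= end_ms and index.max_ts_ms >= start_ms


def filter_by_data_timestamps(indices: List[IndexTimeBounds], start_ms: int, end_ms: int) -> List[str]:
    """Names of the indices that may hold documents inside the interval."""
    return [idx.name for idx in indices if may_contain(idx, start_ms, end_ms)]


candidates = filter_by_data_timestamps(
    [
        IndexTimeBounds("logs-000001", min_ts_ms=0, max_ts_ms=999),
        # Created recently by a backfill but holding only old data: skipped.
        IndexTimeBounds("logs-backfill", min_ts_ms=100, max_ts_ms=500),
        IndexTimeBounds("logs-000002", min_ts_ms=1000, max_ts_ms=1999),
        IndexTimeBounds("logs-000003"),  # current write index, no bounds yet
    ],
    start_ms=1500,
    end_ms=2500,
)
```

Unlike the creation-date rule, this check does not depend on when an index was created, so delayed ingestion or backfills do not cause relevant indices to be skipped.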