opensearch-project / alerting

📟 Get notified when your data meets certain conditions by setting up monitors, alerts, and notifications
https://opensearch.org/docs/latest/monitoring-plugins/alerting/index/
Apache License 2.0

Add jvm aware setting and max num docs settings for batching docs for percolate queries #1435

Closed eirsep closed 4 months ago

eirsep commented 5 months ago

With these changes, the number of docs submitted in a single percolate query is no longer set naively per shard or per index. Instead, two settings decide how many docs a doc-level monitor submits per percolate query: a JVM-aware limit based on heap usage, and a maximum number of docs per batch.

Solves the following problems:

IllegalStateException[Failed to run percolate search for sourceIndex [cloudtrail_alias] and queryIndex [.opensearch-sap-cloudtrail-detectors-queries-000001] for 180000 document(s)];
 nested: SearchPhaseExecutionException[SearchTask was cancelled]; 
nested: TaskCancelledException[org.opensearch.core.concurrency.OpenSearchRejectedExecutionException: cancelled task with reason: Cancellation timeout of 5m is expired]; 
nested: OpenSearchRejectedExecutionException[cancelled task with reason: Cancellation timeout of 5m is expired];
"error_message" : "IllegalStateException[Failed to run percolate search for sourceIndex [log-aws-cloudtrail-2023-08] and queryIndex [.opensearch-sap-cloudtrail-detectors-queries-000001] for 10000 document(s)]; 
nested: SearchPhaseExecutionException[all shards failed]; 
nested: [cancelled task with reason: heap usage exceeded [45.9mb >= 9.2mb]]; 
nested: OpenSearchRejectedExecutionException[cancelled task with reason: heap usage exceeded [45.9mb >= 9.2mb]];

Issue #, if available: Optimize doc level monitor performance: Batch docs for percolate query searches based on available memory and cpu #1353

Description of changes:

Log message from an OpenSearch cluster with the settings at 40k docs per batch and 10% of heap as the threshold for breaking a batch and running the percolate query, at an ingestion rate of 250K docs per minute:

Monitor org.opensearch.client.node.NodeClient@1440ce1 PERF_DEBUG: Percolate query time taken millis = 9.4s

Old percolate query latency: 5+ minutes, leading to cancellation. New latency: under 1 minute.
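
For readers unfamiliar with the batching rule described above, here is a rough sketch of the idea in Python. It is not the plugin's code (the alerting plugin is written in Kotlin). The function name `batch_docs` and its parameters are hypothetical, and estimating memory use from per-document byte sizes is an assumption made for illustration. A batch is flushed as soon as either the doc-count cap or the heap-fraction budget is reached.

```python
from typing import Iterator, List, Sequence


def batch_docs(
    doc_sizes_bytes: Sequence[int],
    max_docs_per_batch: int,
    heap_bytes: int,
    heap_fraction: float,
) -> Iterator[List[int]]:
    """Yield batches of document indices for percolate queries.

    Hypothetical illustration: a batch is closed when it holds
    max_docs_per_batch docs, or when its estimated size reaches
    heap_fraction of the heap, whichever happens first.
    """
    budget = heap_bytes * heap_fraction  # byte budget for one batch
    batch: List[int] = []
    batch_bytes = 0
    for index, size in enumerate(doc_sizes_bytes):
        batch.append(index)
        batch_bytes += size
        # Flush on either limit so one batch never grows unbounded.
        if len(batch) >= max_docs_per_batch or batch_bytes >= budget:
            yield batch
            batch, batch_bytes = [], 0
    if batch:  # leftover docs that did not hit either limit
        yield batch
```

With the values quoted in the log note above, a call would look like `batch_docs(sizes, max_docs_per_batch=40_000, heap_bytes=heap, heap_fraction=0.10)`, where `sizes` and `heap` are supplied by the caller.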

opensearch-trigger-bot[bot] commented 4 months ago

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/alerting/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/alerting/backport-2.x
# Create a new branch
git switch --create backport-1435-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 f643454a22b239a283e47c29222e561d238de42e
# Push it to GitHub
git push --set-upstream origin backport-1435-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/alerting/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport-1435-to-2.x.