opensearch-project / security-analytics

Security Analytics enables users to detect security threats in their security event log data. It also allows them to modify or tailor the pre-packaged solution.
Apache License 2.0

[BUG] Do not use the 'search' queue for everything #875

Open mvanderlee opened 6 months ago

mvanderlee commented 6 months ago

Version: 2.11.1

Our cluster was stable on an r5.2xlarge instance, hovering at ~10% CPU usage. Then we enabled Windows detectors, and even an r5.8xlarge isn't enough.

We were experimenting with detectors, but they essentially brought down our entire instance. The main issue boils down to the fact that everything runs in the same 'search' queue: the detector UI is backed by 'search', the detectors themselves are backed by 'search', and so on.

Why is this the worst idea ever? Because the detectors fill up the queue and cause literally millions of searches to be rejected; we observed ~48 million rejections per hour overnight. While that part is a tuning and scaling issue, it also completely killed ingestion (our Spark pipeline kept failing to write to OpenSearch and dropped the data in our DLQ), and all dashboards stopped working since the UI also uses the 'search' queue. So it wasn't just the detectors that were failing. Everything started to fail. We couldn't even stop the detector, because that request kept failing as well.

We have tried tuning the queues, but even a queue size of 100K still fills up, and we're still running into memory issues.
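For anyone trying to confirm the same symptom, per-node rejection counts for the 'search' thread pool can be read from the `_cat/thread_pool` API. The following is a minimal sketch, assuming the JSON output of `GET _cat/thread_pool?format=json&h=node_name,name,active,queue,rejected` has already been fetched and parsed into a list of rows; the function name and the sample values are hypothetical, not taken from our cluster.

```python
from typing import Dict, List


def nodes_with_rejections(rows: List[Dict[str, str]], pool: str = "search") -> Dict[str, int]:
    """Return {node_name: rejected_count} for nodes whose `pool` has rejected tasks.

    `rows` is assumed to be the parsed JSON from the _cat/thread_pool API,
    where every value (including counters) is a string.
    """
    result: Dict[str, int] = {}
    for row in rows:
        if row.get("name") != pool:
            continue  # skip other thread pools such as 'write'
        rejected = int(row.get("rejected", "0"))
        if rejected > 0:
            result[row["node_name"]] = rejected
    return result


# Hypothetical sample rows, for illustration only.
sample = [
    {"node_name": "node-1", "name": "search", "active": "13", "queue": "1000", "rejected": "48123"},
    {"node_name": "node-2", "name": "search", "active": "2", "queue": "0", "rejected": "0"},
    {"node_name": "node-1", "name": "write", "active": "0", "queue": "0", "rejected": "7"},
]

print(nodes_with_rejections(sample))
```

Note that `rejected` is a cumulative counter since the node started, so a rate such as rejections per hour has to be derived from the difference between two samples.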

Management wanted us to try Detectors, hoping we'd no longer have to maintain our own rules engine with Sigma rules. But our own engine does the job with far fewer resources on the exact same data set, and it doesn't affect anything else if it falls behind.

We are no longer moving forward with OS security analytics.

sbcd90 commented 6 months ago

Hi @mvanderlee, we have a bunch of performance fixes that we're planning to release in 2.13. We're aware of the high CPU and high JVM memory pressure issues caused by running security-analytics detectors. These issues should go away once the 2.13 release is out.

sbcd90 commented 6 months ago

Also, one optimization you can already try is configuring the detector with an index alias instead of an index pattern. Here are the steps to do it.

1. ISM Changes

Define Component Template with mappings

PUT /_component_template/test-alias-template458
{"template" : {
  "mappings": {
    "properties": {
      "hello": {
        "type": "text"
      }
    }
  }
}}

Define Index template with the component template

POST /_index_template/test-index-template458
{
  "index_patterns": [
    "test-index458-*"
  ],
  "composed_of": [
    "test-alias-template458"
  ]
}

Create Initial Index

PUT /test-index458-1
{
  "aliases": {
    "test-alias458": {
      "is_write_index": true
    }
  }
}

Index data via the alias

POST /test-alias458/_doc
{
  "hello": "world"
}

Now use the alias test-alias458 to create the detector.
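Before pointing the detector at the alias, it can help to confirm that exactly one backing index is marked as the write index. A minimal sketch, assuming the parsed JSON body of `GET /_alias/test-alias458` is passed in; the helper name `write_indices` and the sample responses are illustrative and not part of the plugin.

```python
from typing import Any, Dict, List


def write_indices(alias_response: Dict[str, Any], alias: str) -> List[str]:
    """Return the sorted names of indices flagged as the write index for `alias`.

    `alias_response` is assumed to be the parsed body of GET /_alias/<alias>,
    which maps each index name to its alias definitions.
    """
    return sorted(
        index
        for index, body in alias_response.items()
        if body.get("aliases", {}).get(alias, {}).get("is_write_index") is True
    )


# Response shape expected after the "Create Initial Index" step above.
sample_response = {
    "test-index458-1": {"aliases": {"test-alias458": {"is_write_index": True}}},
}

# Hypothetical state after a later rollover: the older index is no longer the write index.
rolled_over = {
    "test-index458-1": {"aliases": {"test-alias458": {"is_write_index": False}}},
    "test-index458-2": {"aliases": {"test-alias458": {"is_write_index": True}}},
}

print(write_indices(sample_response, "test-alias458"))
print(write_indices(rolled_over, "test-alias458"))
```

If the list comes back empty or with more than one entry, writes through the alias are likely to fail or land somewhere unexpected, so it is worth fixing the alias before creating the detector.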

mvanderlee commented 6 months ago

@sbcd90 glad to hear it. Until then, can you confirm whether rejected tasks mean that events are not being analyzed by the detector, and thus will not trigger alerts?

mvanderlee commented 6 months ago

And we already have aliases, but they don't show up as options in the Data source dropdown. We'll try just entering it manually. It'd be great if it could show aliases in the UI and preferably prioritize them.

amsiglan commented 6 months ago

@mvanderlee we're already working on showing aliases in the dropdown; it should be available in 2.13.