mvanderlee opened 6 months ago
hi @mvanderlee , we have a bunch of performance fixes we're planning to release for 2.13. We're aware of the high CPU & high JVM memory pressure issues caused by running security-analytics detectors. These issues should go away once the 2.13 release is out.
Also, one optimization you can already try is using an index alias to configure a detector instead of an index pattern. Here are the steps to do it.
1. ISM Changes
Define Component Template with mappings
PUT /_component_template/test-alias-template458
{
  "template": {
    "mappings": {
      "properties": {
        "hello": {
          "type": "text"
        }
      }
    }
  }
}
Define Index template with the component template
PUT /_index_template/test-index-template458
{
"index_patterns": [
"test-index458-*"
],
"composed_of": [
"test-alias-template458"
]
}
Create Initial Index
PUT /test-index458-1
{
"aliases": {
"test-alias458": {
"is_write_index": true
}
}
}
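The "ISM Changes" heading presumably refers to a rollover policy that keeps advancing the write index behind the alias; the thread doesn't show one, but a minimal sketch might look like the following. The policy id `test-rollover-policy458` and the thresholds are illustrative, not from the thread:

```json
PUT /_plugins/_ism/policies/test-rollover-policy458
{
  "policy": {
    "description": "Roll over the indices behind test-alias458",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [
          {
            "rollover": {
              "min_index_age": "1d",
              "min_size": "30gb"
            }
          }
        ],
        "transitions": []
      }
    ],
    "ism_template": [
      {
        "index_patterns": ["test-index458-*"],
        "priority": 100
      }
    ]
  }
}
```

Note that ISM rollover also needs to know which alias to advance, via the `plugins.index_state_management.rollover_alias` index setting (typically set in the index template).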
Index data via the alias
POST /test-alias458/_doc
{
"hello": "world"
}
Use the alias test-alias458 to create the detector now.
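For completeness, creating a detector against the alias via the Security Analytics REST API might look roughly like this; the detector name, schedule, and empty rule lists below are placeholders, and the exact body shape can vary by version:

```json
POST /_plugins/_security_analytics/detectors
{
  "name": "windows-detector",
  "detector_type": "windows",
  "enabled": true,
  "schedule": {
    "period": {
      "interval": 1,
      "unit": "MINUTES"
    }
  },
  "inputs": [
    {
      "detector_input": {
        "description": "Windows detector reading through the alias",
        "indices": ["test-alias458"],
        "custom_rules": [],
        "pre_packaged_rules": []
      }
    }
  ],
  "triggers": []
}
```

The key point is that `indices` references the alias rather than an index pattern like `test-index458-*`.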
@sbcd90 glad to hear it. Until then, can you confirm whether rejected tasks mean that events are not being analyzed by the detector, and thus will not be alerted upon?
We already have aliases, but they don't show up as options in the Data source dropdown, so we'll try entering one manually.
It'd be great if the UI could show aliases, and preferably prioritize them.
@mvanderlee we're already working on showing the aliases in the dropdown; it should be available in 2.13
v 2.11.1
Our cluster was stable on an r5.2xlarge instance, hovering at ~10% CPU usage. Then we enabled the Windows detectors, and now even an r5.8xlarge isn't enough.
We were experimenting with detectors, but they essentially brought down our entire instance. The main issue boils down to the fact that everything runs through the same 'search' thread pool: the detector UI is backed by 'search', the detectors themselves are backed by 'search', and so on.
Why is this the worst idea ever? Because as the detectors fill up the queue, literally millions of searches get rejected; we observed ~48 million rejections per hour overnight. While this is partly a tuning and scaling issue, it also completely killed ingestion (our Spark pipeline kept failing to write to OpenSearch and dropped events into our DLQ), and all dashboards stopped working since the UI also uses the 'search' queue. So it wasn't just detectors that were failing; everything started to fail. We couldn't even stop the detector, because that request kept failing as well.
We have tried tuning the queues, but even a queue size of 100K still fills up, and we're still running into memory issues.
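For anyone debugging similar symptoms: the thread-pool cat API shows how full the search queue is on each node and how many requests have been rejected (the `rejected` counter is cumulative since node start). I'm assuming this is how numbers like the rejection counts above would be observed, but the command itself is standard:

```
GET /_cat/thread_pool/search?v&h=node_name,active,queue,rejected,completed
```

The queue size itself is the static node setting `thread_pool.search.queue_size` in opensearch.yml, which requires a node restart to change.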
Management wanted us to try detectors, hoping we'd no longer have to maintain our own Sigma-based rules engine. But our engine can do the job with far fewer resources on the exact same data set, and it doesn't affect anything else if it falls behind.
We are no longer moving forward with OS security analytics.