opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[RFC] OpenSearch Events Correlation Engine #6779

Open sbcd90 opened 1 year ago

sbcd90 commented 1 year ago

Problem Statement

OpenSearch is a scalable, flexible, and extensible open-source software suite for search, analytics, and observability applications licensed under Apache 2.0. OpenSearch includes a data store and search engine where customers can store their business, operational, and security data from a variety of sources & run search queries on them.

Because customer infrastructure events, such as security and observability events, span multiple indices and data streams, strong correlation across these indices (or data streams) helps customers identify patterns and explore the relationships between events occurring across different systems in their infrastructure.

Definitions

Events Correlation Engine

The Correlation Engine is an Events Knowledge Graph that can be used to identify and store connected event data spanning multiple indices or data streams. It also helps generate insights by correlating recent and historical data based on time windows provided by the client.

The Events Correlation Engine helps customers correlate events across log sources: customers define their own Correlation Rules exactly once, and the engine then generates correlations between events from different log sources automatically.

Dimensions of Correlation

Time Window

Time Window is the most basic Dimension of Correlation that a user can define. If no other dimension is provided, the Correlation Engine would show all possible correlations across all indices within the specified time window.
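
As a minimal sketch of this dimension (not the engine's actual implementation), the following Python selects the events from any index whose timestamps fall within a given window around a reference event. The `Event` type and the `correlate_by_time_window` function are hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    """Hypothetical minimal event: an id, its source index, and a timestamp."""
    event_id: str
    index: str
    timestamp: datetime


def correlate_by_time_window(reference, events, window):
    """Return events (other than the reference) within +/- window of the reference."""
    return [
        e for e in events
        if e.event_id != reference.event_id
        and abs(e.timestamp - reference.timestamp) <= window
    ]


t0 = datetime(2023, 3, 24, 12, 0, 0)
events = [
    Event("a", "ad_logs", t0),
    Event("b", "windows", t0 + timedelta(minutes=4)),
    Event("c", "vpc_flow", t0 - timedelta(minutes=9)),
    Event("d", "app_logs", t0 + timedelta(minutes=30)),
]
nearby = correlate_by_time_window(events[0], events, timedelta(minutes=10))
print([e.event_id for e in nearby])
```

With only a time window, every event close enough in time is treated as correlated, which is why the additional dimensions described next are needed to narrow the results.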

Source Events Indices/DataStreams

While Time Window is an important dimension of correlation, users also need to provide the source events indices (or data streams) on which Correlation Rules can be defined; these act as an additional dimension of correlation.

Query Language for Correlation Rules

The most granular level of correlation supported by the Correlation Engine is using correlation rules or queries over the source events indices or datastreams. These rules allow the Correlation Engine to eliminate false positives & present to the user a list of highly accurate correlated search results.

One popular choice for defining Correlation Rules is the Event Query Language (EQL) from Elasticsearch. EQL supports the Elastic Common Schema (ECS) today.

Here is a sample EQL-based Correlation Query.

{
  "query": """
      [network where src_addr == "4.5.6.7" and severity_id == -1]
      [ad_ldap where ResultType == 50126]
      [windows where host.hostname == "EC2AMAZ*"]
      [others_application where StatusCode == 403]
      [s3 where aws.cloudtrail.eventName == "ReplicateObject"]
  """
}

High-Level Design

There are 2 high-level components in the design of the Events Correlation Engine.

Correlation Query Service

This sub-system manages the lifecycle of the Correlation Rules created by users. Users can create, read, update, or delete rules using the REST APIs provided by this layer.

The language for defining Correlation Rules is not yet finalized. EQL is one example of a language for defining Correlation Rules.

Correlation Service

The internals of the Correlation Engine are composed of 4 major components.

[Figure: Correlation Engine internal components (Screenshot 2023-03-21)]

Use Cases

Security Analytics Correlation Engine for correlating security events

Security Analytics is an open-source solution for security operations in OpenSearch. Security Analytics’ threat detection engine converts the detection rules into executable OpenSearch queries which are then matched against the logs or events ingested by the user to generate findings. The trigger condition filters are further applied on the findings to generate alerts.

Today in Security Analytics, the generated findings belong to individual log types & there is no way to automatically correlate between them. Users need to browse through the findings generated for individual log categories & then identify patterns manually.

The Security Analytics Correlation Engine provides an approach to solve this issue by allowing the customers to define the correlation metadata across log categories exactly once & then generating correlations between findings from different log categories automatically.

Here is the link to the RFC

praveensameneni commented 1 year ago

Thank you for the proposal @sbcd90. I think correlation is a great primitive that can be leveraged across different use cases: security threat detection, observability, and generic log analytics correlation. Would love to get feedback from the community.

cc: @nknize , @dblock , @CEHENKLE

getsaurabh02 commented 1 year ago

@sbcd90 Thanks for creating the RFC. This is a great start.

Adding some more thoughts to the What, How and Why part of a generic Correlation Engine Framework that we are thinking here:

What is Event Correlation: Event correlation automates the process of analyzing the findings (patterns) from documents that come from various log sources, to detect incidents and problems with deeper insight. Event correlation helps identify relationships between events that have just occurred and their previous instances. It also helps identify relationships with other log sources based on common field vectors (such as IP), which would otherwise have gone unnoticed.

How does Event Correlation work: Writing correlation rules/queries repeatedly is a time-consuming process for operators and requires a deep understanding of how logs are related and structured. The solution should provide the ability to automate some of these steps with a background job, so that relational insights can be generated and persisted once the data is available, using the pre-authored rules or criteria:

Why Event Correlation - Use-Cases and Examples
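
To illustrate the idea of relating log sources through a common field (such as IP), here is a small Python sketch. It is only an illustration under assumed data shapes: `correlate_on_field`, the `ip` field name, and the sample events are hypothetical and not part of the proposal.

```python
from collections import defaultdict


def correlate_on_field(events, field):
    """Group events by a shared field value; keep groups spanning 2+ log sources."""
    groups = defaultdict(list)
    for event in events:
        value = event.get(field)
        if value is not None:
            groups[value].append(event)
    return {
        value: members
        for value, members in groups.items()
        if len({m["index"] for m in members}) > 1
    }


events = [
    {"id": "1", "index": "vpc_flow", "ip": "4.5.6.7"},
    {"id": "2", "index": "app_logs", "ip": "4.5.6.7"},
    {"id": "3", "index": "windows", "ip": "10.0.0.1"},
    {"id": "4", "index": "ad_logs"},  # no ip field, so it cannot be related this way
]
correlated = correlate_on_field(events, "ip")
print(sorted(correlated))
```

A background job could run a grouping like this as data arrives and persist the resulting relationships, so that operators do not have to rewrite the same queries.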

cc: @nknize @dblock

nknize commented 1 year ago

This is great! 💯 agree this should be a core feature. Let's progress not perfection this. A few initial questions / clarifications:

  1. Any opposition to starting as a plugin? We can always decide to move it as a module later if that's the direction the community wants to take it.
  2. What's the initial user facing API? The description above mentions EQL which is for threat modeling. Since this engine will support more use cases can we start w/ a simple DSL API and follow on w/ other languages on a per use case basis? What's that DSL API look like?
  3. What's the surface area change to :server or any other modules or libraries? Depending on this we can decide to forgo feature flagging and just use @opensearch.experimental tags since plugins are optional and not installed by default. No need to shove behind another gate.

sbcd90 commented 1 year ago

Hi @nknize, thanks a lot for reviewing the RFC. Here are my answers.

  1. I agree with the proposal of starting it as a plugin & then moving it as a module later. @getsaurabh02 any thoughts on this?

  2. I have added the API doc with this response below. The API doc has 2 sections: the first section is for the core plugin & the second section is specific to the security-analytics UI mockups. We intend to extend the Correlation Engine plugin in core to security-analytics & add security-analytics specific APIs in it.

  3. There are no changes required in the :server or any other modules or libraries. We can then just use @opensearch.experimental tags.

Here is the API doc.

Core Plugin APIs

Create Correlation Rule for an index/data stream

POST /_plugins/_correlation/rules

Request Body: 
{
  "correlate": [
    {
      "index": "vpc_flow",
      "query": "dstaddr:4.5.6.7 or dstaddr:4.5.6.6"
    },
    {
      "index": "windows",
      "query": "winlog.event_data.SubjectDomainName:NTAUTHORI*"
    },
    {
      "index": "ad_logs",
      "query": "ResultType:50126"
    },
    {
      "index": "app_logs",
      "query": "endpoint:/customer_records.txt"
    }
  ]
}
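
As a rough sketch of how a client might assemble this request body, the Python below builds the `correlate` array from index/query pairs and serializes it. The `build_correlation_rule` helper and its "at least two pairs" check are assumptions made for illustration; the RFC does not specify validation rules, and no HTTP call is made here.

```python
import json


def build_correlation_rule(queries):
    """Build a rule body shaped like the proposed request: one entry per index/query pair."""
    # Assumption: correlating needs at least two sources; the RFC does not state this.
    if len(queries) < 2:
        raise ValueError("a correlation rule needs at least two index/query pairs")
    return {"correlate": [{"index": index, "query": query} for index, query in queries]}


rule = build_correlation_rule([
    ("vpc_flow", "dstaddr:4.5.6.7 or dstaddr:4.5.6.6"),
    ("ad_logs", "ResultType:50126"),
])
payload = json.dumps(rule, indent=2)
print(payload)
```

The resulting `payload` would be sent as the body of `POST /_plugins/_correlation/rules` as proposed above.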

List Correlations for an event stored in an index/data stream

GET /_plugins/_correlation/events?event_id=425dce0b-f5ee-4889-b0c0-7d15669f0871
&index=ad_logs&nearby_events=20&time_window=10m

Response:

{
  "events": [
    {
      "event": "5c661104-aaa9-484b-a91f-9cad4ae6d5f5",
      "index": "app_logs",
      "score": 0.000015182109564193524
    },
    {
      "event": "2485b623-6573-42f4-a055-9b927e38a65f",
      "index": "ad_logs",
      "score": 0.000001615897872397909
    },
    {
      "event": "051e00ad-5996-4c41-be20-f992451d1331",
      "index": "windows",
      "score": 0.000016230604160227813
    },
    {
      "event": "f11ca8a3-50d7-4074-a951-51439aa9e67b",
      "index": "s3",
      "score": 0.000001759401811796124
    },
    {
      "event": "9b86980e-5fb7-4c5a-bd1b-879a1e3baf12",
      "index": "vpc_flow",
      "score": 0.0000016306962606904563
    },
    {
      "event": "e7dea5a1-164f-48f9-880e-4ba33e508713",
      "index": "vpc_flow",
      "score": 0.00001632626481296029
    }
  ]
}
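
The sketch below shows one way a client could build the query string for this call and order the returned events by score. It uses only the Python standard library and a shortened copy of the sample response. Treating a higher `score` as a stronger correlation is an assumption; the RFC does not define how the score should be interpreted.

```python
import json
from urllib.parse import urlencode

# Query parameters taken from the sample request above.
params = {
    "event_id": "425dce0b-f5ee-4889-b0c0-7d15669f0871",
    "index": "ad_logs",
    "nearby_events": 20,
    "time_window": "10m",
}
url = "/_plugins/_correlation/events?" + urlencode(params)

# Shortened copy of the sample response.
response = json.loads("""
{
  "events": [
    {"event": "5c661104-aaa9-484b-a91f-9cad4ae6d5f5", "index": "app_logs", "score": 0.000015182109564193524},
    {"event": "2485b623-6573-42f4-a055-9b927e38a65f", "index": "ad_logs", "score": 0.000001615897872397909},
    {"event": "051e00ad-5996-4c41-be20-f992451d1331", "index": "windows", "score": 0.000016230604160227813}
  ]
}
""")

# Assumption: higher score means stronger correlation.
ranked = sorted(response["events"], key=lambda e: e["score"], reverse=True)
print(url)
print([e["index"] for e in ranked])
```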

Security Analytics Plugin APIs

Create Correlation Rules between Log Types

POST /_plugins/_security_analytics/correlation/rules

Request Body:
{
  "correlate": [
    {
      "index": "vpc_flow",
      "query": "dstaddr:4.5.6.7 or dstaddr:4.5.6.6",
      "category": "network"
    },
    {
      "index": "windows",
      "query": "winlog.event_data.SubjectDomainName:NTAUTHORI*",
      "category": "windows"
    },
    {
      "index": "ad_logs",
      "query": "ResultType:50126",
      "category": "ad_ldap"
    },
    {
      "index": "app_logs",
      "query": "endpoint:/customer_records.txt",
      "category": "others_application"
    }
  ]
}

List all findings & their correlations within a time window

POST /_plugins/_security_analytics/correlate/findings/_search?
time_window_start=2023-03-24T00:00:00Z&time_window_end=2023-03-26T00:00:00Z

Request Body:
{
    "from" : 20,
    "size": 10,
    "query": {
        "match": {
            "logType": "windows"
        }
    }
}

Response:
{
  "findings": [
    {
      "finding": "5c661104-aaa9-484b-a91f-9cad4ae6d5f5",
      "detector_type": "others_application",
      "correlated_findings": [{
        "finding": "5c661104-aaa9-484b-a91f-9cad4ae6d5f5",
        "detector_type": "ad_ldap"
      }, {
        "finding": "f11ca8a3-50d7-4074-a951-51439aa9e67b",
        "detector_type": "s3"
      }]
    },
    {
      "finding": "f11ca8a3-50d7-4074-a951-51439aa9e67b",
      "detector_type": "s3",
      "correlated_findings": [{
        "finding": "5c661104-aaa9-484b-a91f-9cad4ae6d5f5",
        "detector_type": "others_application"
      }]
    }
  ]
}
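
Since the Correlation Engine is described as a knowledge graph, a client could turn a response of this shape into an adjacency map for display or traversal. The following is only a sketch: `build_correlation_graph` is a hypothetical helper, and the finding ids are shortened placeholders rather than real data.

```python
from collections import defaultdict


def build_correlation_graph(response):
    """Turn a findings response into an undirected adjacency map of finding ids."""
    graph = defaultdict(set)
    for finding in response["findings"]:
        source = finding["finding"]
        for related in finding.get("correlated_findings", []):
            target = related["finding"]
            if target == source:
                continue  # ignore self-references
            graph[source].add(target)
            graph[target].add(source)
    return dict(graph)


# Hypothetical, shortened ids; the shape mirrors the sample response.
response = {
    "findings": [
        {
            "finding": "f1",
            "detector_type": "others_application",
            "correlated_findings": [
                {"finding": "f2", "detector_type": "ad_ldap"},
                {"finding": "f3", "detector_type": "s3"},
            ],
        },
        {
            "finding": "f3",
            "detector_type": "s3",
            "correlated_findings": [
                {"finding": "f1", "detector_type": "others_application"},
            ],
        },
    ]
}
graph = build_correlation_graph(response)
print({k: sorted(v) for k, v in sorted(graph.items())})
```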

List correlations for a finding belonging to a log type

GET /_plugins/_security_analytics/findings/correlate?finding=425dce0b-f5ee-4889-b0c0-7d15669f0871
&detector_type=ad_ldap&nearby_findings=20&time_window=10m

Response:
{
  "findings": [
    {
      "finding": "5c661104-aaa9-484b-a91f-9cad4ae6d5f5",
      "detector_type": "others_application",
      "score": 0.000015182109564193524
    },
    {
      "finding": "2485b623-6573-42f4-a055-9b927e38a65f",
      "detector_type": "ad_ldap",
      "score": 0.000001615897872397909
    },
    {
      "finding": "051e00ad-5996-4c41-be20-f992451d1331",
      "detector_type": "windows",
      "score": 0.000016230604160227813
    },
    {
      "finding": "f11ca8a3-50d7-4074-a951-51439aa9e67b",
      "detector_type": "s3",
      "score": 0.000001759401811796124
    },
    {
      "finding": "9b86980e-5fb7-4c5a-bd1b-879a1e3baf12",
      "detector_type": "network",
      "score": 0.0000016306962606904563
    },
    {
      "finding": "e7dea5a1-164f-48f9-880e-4ba33e508713",
      "detector_type": "network",
      "score": 0.00001632626481296029
    }
  ]
}

getsaurabh02 commented 1 year ago

Agree with @sbcd90 on moving ahead with the suggested plan here and starting as a core plugin. It is also the quickest path forward, without requiring any changes in the server upfront. It also allows us to keep the changes isolated and sandboxed from a performance point of view.

dblock commented 1 year ago

What's the tl;dr of why this feature needs to be in core, especially given that it's going to start as _plugins/_correlation? Can it begin as an external plugin?

getsaurabh02 commented 1 year ago

Thanks @dblock for the review. The Correlation Engine we are proposing here aims to provide the capability to build an Events Knowledge Graph within the OpenSearch data set, which can be used to identify and store connected data events, possibly spanning multiple indices or data streams. These knowledge graphs can further help generate insights by correlating recent or historical data across custom time windows that users provide.

Since it provides an approach to help users correlate events across log sources, while allowing them to define their own correlation Rules, the framework itself can be leveraged by different end user plugins to solve different end use-cases such as those related to Security Analytics, Observability, geospatial or trace analytics.

Going forward, once the feature is well baked as a core plugin, we aim to provide the capability as a core module itself.

YANG-DB commented 1 year ago

This looks very interesting and has great potential. A few points to continue the discussion:

I'll be happy to discuss more on this Knowledge Graph!

Adding the correlation metadata Knowledge into the field mapping API

jmazanec15 commented 1 year ago

@getsaurabh02 @sbcd90 @dblock I think this should be an external plugin and not a core-plugin or module.

In https://github.com/opensearch-project/OpenSearch/pull/7350, a new vector field type is created "correlation_vector". Shouldn't we leverage in some way the existing knn_vector type here? Is the problem that the knn_vector type is implemented as an external plugin? In neural search, we added a dependency on the k-NN plugin. My concern is that we are now introducing 2 vector types in OpenSearch that have overlapping functionality, yet do not share any implementation.

@sbcd90 also, please update the RFC to include the field type interface that is added, as well as the query type interface.

cc: @vamshin @navneet1v

sbcd90 commented 1 year ago

Hi @jmazanec15, @dblock, thanks a lot for your comments. Here are my answers.

I think this should be an external plugin and not a core-plugin or module.

Events Correlation Engine is by definition an Events Knowledge Graph which can be used to identify and store connected events data spanning across multiple indices or data streams within a specific time window. KNN plugin on the other hand provides a generic interface to store vectors & also provides a query interface to run knn queries against those vectors. KNN provides a generic wrapper over faiss, nmslib as well as lucene hnsw graphs. So, by definition, Events Correlation Engine & KNN plugin are completely different & their use-cases are also hugely different from each other.

Today, OpenSearch has no functionality which can correlate events or documents across different indices within a time window. Elasticsearch supports it partially with EQL but it also does not provide correlations between events within a time window. Here is a post in the OpenSearch forum which asks for a functionality like Events Correlation Engine: https://forum.opensearch.org/t/event-correlation-on-opensearch-odfe/6276/4

The Events Correlation Engine has several use cases: finding correlations across findings generated from security logs (RFC), Cluster Insights (correlating metrics generated from an OS cluster within a time window), and geospatial use cases (finding activities happening at different locations across the globe within a time window).

Due to its diverse use cases, we decided to include the events correlation engine plugin in core.

In https://github.com/opensearch-project/OpenSearch/pull/7350, a new vector field type is created "correlation_vector". Shouldn't we leverage in some way the existing knn_vector type here?

The barebone implementation of the Events Correlation Engine is composed of 3 separate pull requests.

The main functionality of Events Correlation Engine is not to expose a new field type or query type for vectors. It is just one of the high level components of Correlation Engine today & is internal to Correlation Engine & not supposed to be used externally by users. KNN plugin provides a much more robust implementation of a new field type or query type for vectors which should be used by users.

As part of the design, we wanted to keep the graph storage & query part of the Events Correlation Engine flexible, by providing implementation not only for lucene hnsw graphs but also for pinecone, yang db as well as for Amazon NeptuneDB for managed service in future.

Lastly, we could of course replace the lean OS-to-Lucene storage/query converter introduced in PR #7350 with the KNN plugin wrapper around Lucene HNSW graphs. But the KNN plugin is currently not in core. Until it is in core, we want to continue using our converter.

also, please update the RFC to include field type interface that is added as well as query type interface.

The RFC currently does not include the low-level design of any of the components of the Events Correlation Engine. So, if we want to include the new field type or query type for vectors as part of the RFC, we would need to add the entire low-level design of all the components of the Correlation Engine, which may make the RFC too long. I would leave this question for @getsaurabh02 & @praveensameneni to answer.

@jmazanec15, @dblock, kindly let me know if you disagree with any of the points mentioned above. Also, kindly let me know if you have more questions on the design of the Events Correlation Engine. Thanks in advance.

dblock commented 1 year ago

I am struggling to convince myself one way or another of whether this plugin belongs in core. I think we could go either way. Maybe we should ask some other folks to get a strong opinion? @nknize?

nknize commented 1 year ago

The foundation correlation engine absolutely makes sense as a core plugin (to start) then possibly promoted as a module. Plugins can build on the core engine for use case specific correlation rules such as for security, observability, geospatial, etc. Core use cases (e.g., general correlation across primitives) include users providing custom correlation rules for their specific use cases that can be implemented using a core default language (e.g., PPL).

Shouldn't we leverage in some way the existing knn_vector type here?

For matrix stats this makes sense. Since matrix stats computes correlation and covariance matrices across multiple fields, I've long wanted to add vector field support to that aggregation. We should explore that separately. This correlation engine, on the other hand, is not mathematical correlation; it's event correlation based on user-defined rules. And event correlation across documents: 💯 makes sense as a core search capability.
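
To make the distinction concrete, here is a small Python sketch contrasting the two notions. `pearson` computes a statistical correlation coefficient between two numeric series (the kind of quantity a correlation matrix contains), while `events_correlated` is a hypothetical rule-plus-time-window check in the spirit of this RFC. Both function names, the rule predicates, and the sample events are illustrative assumptions, not the engine's implementation.

```python
import math
from datetime import datetime, timedelta


def pearson(xs, ys):
    """Mathematical correlation: Pearson coefficient of two equal-length series."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def events_correlated(a, b, rules, window):
    """Event correlation: both events match their index's rule and are close in time."""
    return (
        rules[a["index"]](a)
        and rules[b["index"]](b)
        and abs(a["ts"] - b["ts"]) <= window
    )


r = pearson([1, 2, 3, 4], [2, 4, 6, 8])

# Hypothetical per-index rules expressed as predicates.
rules = {
    "ad_logs": lambda e: e.get("ResultType") == 50126,
    "app_logs": lambda e: e.get("StatusCode") == 403,
}
t0 = datetime(2023, 3, 24, 12, 0)
login = {"index": "ad_logs", "ResultType": 50126, "ts": t0}
denied = {"index": "app_logs", "StatusCode": 403, "ts": t0 + timedelta(minutes=3)}
related = events_correlated(login, denied, rules, timedelta(minutes=10))
print(r, related)
```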

jmazanec15 commented 11 months ago

The main functionality of Events Correlation Engine is not to expose a new field type or query type for vectors. It is just one of the high level components of Correlation Engine today & is internal to Correlation Engine & not supposed to be used externally by users. KNN plugin provides a much more robust implementation of a new field type or query type for vectors which should be used by users. As part of the design, we wanted to keep the graph storage & query part of the Events Correlation Engine flexible, by providing implementation not only for lucene hnsw graphs but also for pinecone, yang db as well as for Amazon NeptuneDB for managed service in future.

I see. If it is not supposed to be used externally by users, we should make sure that it cannot be (not sure if this is already done). Ideally, it should just use the knn_vector type, but I understand the issue if the engine is going into core. It may make sense to start thinking about moving some k-NN functionality into core, but that is for another discussion.

The foundation correlation engine absolutely makes sense as a core plugin (to start) then possibly promoted as a module. Plugins can build on the core engine for use case specific correlation rules such as for security, observability, geospatial, etc. Core use cases (e.g., general correlation across primitives) include users providing custom correlation rules for their specific use cases that can be implemented using a core default language (e.g., PPL).

I see. I understand the argument for making it a module. I am not sure in our current project structure, where plugins are developed externally, if core plugins as a concept make sense. From my understanding, it is left over from elasticsearch, which had a different philosophy around plugins. If the case for making it a core plugin is for ease of promotion to module, that makes sense.

kmahyyg commented 10 months ago

Hi @sbcd90

Elasticsearch supports it partially with EQL but it also does not provide correlations between events within a time window.

I'm sorry if I have misunderstood your point. Elastic Security offers a detection-rule feature that searches events within a time window plus an additional look-back time, which partially allows events to be correlated within a time window. However, this correlation cannot be based on event context (for example, some fields having the same value, as in the timeline feature of MS Sentinel). See https://www.elastic.co/guide/en/security/current/rules-ui-create.html for reference.

As SOC analysts, event correlation is really important to us and can greatly enhance the usefulness of the current detection feature, which is based on Sigma rules. BTW, Sigma rules are currently evolving toward a 2.0 version, which also requires correlation.

nuubnoob commented 3 months ago

I am working on an OpenSearch Security Analytics project and want to create correlations between findings for the Linux system logs log type. One detector uses a custom rule that triggers on an authentication token. The other uses more than 100 rules, including History file deletion, chmod suspicious directory, and Linux Remote System Discovery, but all of the second detector's findings are created for the RuleToDetectWhenTaskDeleted rule. How can I create a correlation rule with this available data? Also, if any dummy data is available for creating correlations, can someone point me in that direction?