opensearch-project / security-analytics

Security Analytics enables users to detect security threats in their security event log data. It also allows them to modify or tailor the pre-packaged solution.
Apache License 2.0

[FEATURE] Stats API #362

Open eirsep opened 1 year ago

eirsep commented 1 year ago

We need a stats API that gives insight into the health of the plugin and analytics on its usage. For example, stats can tell us how many detector and rule creations have succeeded or failed at the node level.

petardz commented 1 year ago

One interesting piece of information for users could be the "progress" of detectors, for example whether monitors are keeping up with the document ingestion rate of the log indices. There is an issue on the alerting repo for implementing a Monitor Explain API, which could be used for this: https://github.com/opensearch-project/alerting/issues/751

sandeshkr419 commented 1 year ago

Thinking on the possible API structure.

Scope of stats API:

Path and HTTP methods

GET _plugins/_security_analytics/stats
GET _plugins/_security_analytics/stats/<metric>
GET _plugins/_security_analytics/<node-id>/stats
GET _plugins/_security_analytics/<node-id>/stats/<metric>

URL Parameters

node-id: node-id of the node for which the stats are required
metric: detectors, detectors_per_log_type, custom_rules, custom_rules_per_log_type
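The four proposed paths differ only in whether a node-id and a metric are present, so they can be generated by one helper. The following is a minimal sketch of that idea, not part of the plugin: the function name `stats_path` and the validation of the metric name are assumptions made for illustration.

```python
from typing import Optional

# Base path taken from the proposal above.
BASE = "_plugins/_security_analytics"

# Metric names listed under "URL Parameters" in the proposal.
METRICS = {
    "detectors",
    "detectors_per_log_type",
    "custom_rules",
    "custom_rules_per_log_type",
}


def stats_path(node_id: Optional[str] = None, metric: Optional[str] = None) -> str:
    """Build one of the four proposed stats paths (hypothetical helper)."""
    if metric is not None and metric not in METRICS:
        raise ValueError(f"unknown metric: {metric}")
    parts = [BASE]
    if node_id:
        parts.append(node_id)  # node-scoped variants put the node-id before "stats"
    parts.append("stats")
    if metric:
        parts.append(metric)  # metric-scoped variants put the metric after "stats"
    return "/".join(parts)


if __name__ == "__main__":
    print("GET", stats_path())
    print("GET", stats_path(metric="detectors"))
    print("GET", stats_path(node_id="node-1"))
    print("GET", stats_path(node_id="node-1", metric="custom_rules"))
```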

Response

TBD after Response Body Fields review

Response Body Fields

Cluster Level Statistics

| Field Name | Description |
| --- | --- |
| nodes | number of total, successful, failed nodes returned in the response. |
| cluster_name | cluster's name. |
| cluster_uuid | cluster's uuid. |
| timestamp | unix epoch time of when the cluster was last refreshed. |
| status | The cluster's health status. |
| plugin_enabled | whether security analytics plugin is enabled or not |
| detectors | details (enabled, defined, in_error) of detectors |
| detectors_per_log_type | details (enabled, defined, in_error) of detectors in each log type. The enabled, defined, and error stats of detectors are reported as part of both the detectors and detectors_per_log_type metrics. |
| custom_rules | number of custom rules defined |
| custom_rules_per_log_type | number of custom rules defined per log type |

Node Level Statistics

The node level statistics will be calculated at individual node level and will be aggregated over all nodes as well for a holistic overview.

| Field Name | Description |
| --- | --- |
| roles | node roles: cluster_manager, data, etc |
| shards_analyzed | shards spanned by enabled detectors |
| total_documents | total documents in scope of detectors |
| documents_processed | documents scanned by detectors |
| documents_behind | number of documents in a node that are yet to be processed |
| rules_matched | rules matched by detectors |
| jobs_started_on_time | detectors started on time on that node |
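The proposal says node-level statistics are also aggregated over all nodes but does not say how. The sketch below assumes the simplest rule, summing each numeric field across nodes. The function name `aggregate_node_stats`, the input shape (node-id mapped to a dict of counters), and the summing rule itself are all assumptions, not the plugin's behavior.

```python
from typing import Dict

# Numeric node-level fields from the table above; "roles" is not a counter and is skipped.
COUNTER_FIELDS = (
    "shards_analyzed",
    "total_documents",
    "documents_processed",
    "documents_behind",
    "rules_matched",
    "jobs_started_on_time",
)


def aggregate_node_stats(nodes: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Sum each counter over all nodes (assumed aggregation rule)."""
    totals = {field: 0 for field in COUNTER_FIELDS}
    for stats in nodes.values():
        for field in COUNTER_FIELDS:
            # A node that does not report a field contributes zero to the total.
            totals[field] += stats.get(field, 0)
    return totals


if __name__ == "__main__":
    # Hypothetical per-node values, for illustration only.
    sample = {
        "node-a": {"total_documents": 100, "documents_processed": 90, "documents_behind": 10},
        "node-b": {"total_documents": 50, "documents_processed": 50, "documents_behind": 0},
    }
    print(aggregate_node_stats(sample))
```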

Task Breakdown

The plan is to get a working API with minimal information ready and then add on statistics as required.

  1. Implement Cluster Level Statistics
  2. Implement Node Level Statistics
  3. [Will create a separate issue] UI / Dashboard Changes
  4. [Will create a separate issue] Documentation changes

References

Used the below APIs to decide on structure of stats API here.

eirsep commented 1 year ago

can you post an example response?

sandeshkr419 commented 12 months ago

@eirsep Sure. After iterating on the requests and responses, here is the updated proposal. I have limited the response objects to keep them cleaner and to avoid unnecessary information in the first implementation of the stats API.

Request:

GET _plugins/_security_analytics/stats

Proposing 2 sample responses:

Sample Response 1:

GET _plugins/_security_analytics/stats

{
    "detectors": {
        "total": 5,
        "enabled": 3,
        "error": 1
    },
    "detectors_per_log_type": {
        "windows": {
            "total": 2,
            "enabled": 2,
            "error": 0
        },
        "linux": {
            "total": 2,
            "enabled": 1,
            "error": 1
        },
        "custom_log_1": {
            "total": 1,
            "enabled": 1,
            "error": 0
        },
        .
        .
        .
    },
    "custom_rules": 10,
    "custom_rules_per_log_type": {
        "windows": 5,
        "linux": 3,
        "custom_log_1": 1,
        .
        .
        .
    },
    "custom_log_types": 4
}

When there are no detectors or no custom logs defined, the above response would look like:

GET _plugins/_security_analytics/stats

{
    "detectors": {
        "total": 0,
        "enabled": 0,
        "error": 0
    },
    "detectors_per_log_type": {},
    "custom_rules": 0,
    "custom_rules_per_log_type": {},
    "custom_log_types": 0
}
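To make the shape of Sample Response 1 concrete, here is a minimal sketch of how such a body could be assembled from in-memory records. It is not the plugin's implementation: the function `build_stats` and the record fields `log_type`, `enabled`, and `in_error` are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, List


def _empty_counts() -> Dict[str, int]:
    return {"total": 0, "enabled": 0, "error": 0}


def build_stats(
    detectors: List[Dict[str, Any]],
    custom_rules: List[Dict[str, Any]],
    custom_log_types: int = 0,
) -> Dict[str, Any]:
    """Assemble a body shaped like Sample Response 1 (hypothetical helper)."""
    overall = _empty_counts()
    per_type: Dict[str, Dict[str, int]] = defaultdict(_empty_counts)
    for detector in detectors:
        # Each detector counts once overall and once under its own log type.
        for bucket in (overall, per_type[detector["log_type"]]):
            bucket["total"] += 1
            bucket["enabled"] += int(detector["enabled"])
            bucket["error"] += int(detector["in_error"])

    rules_per_type: Dict[str, int] = defaultdict(int)
    for rule in custom_rules:
        rules_per_type[rule["log_type"]] += 1

    return {
        "detectors": overall,
        "detectors_per_log_type": dict(per_type),
        "custom_rules": len(custom_rules),
        "custom_rules_per_log_type": dict(rules_per_type),
        "custom_log_types": custom_log_types,
    }


if __name__ == "__main__":
    import json

    # With no detectors or custom rules, the per-log-type objects are empty.
    print(json.dumps(build_stats([], []), indent=4))
```

With this construction the top-level detectors counts always equal the sums of the per-log-type counts, which is a property a client could check when validating a response.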

Sample Response 2:

This variant drops the top-level detectors and custom_rules fields and instead adds an all sub-field to detectors_per_log_type and custom_rules_per_log_type, holding the metrics aggregated across all log types.

GET _plugins/_security_analytics/stats

{
    "detectors_per_log_type": {
        "all": {
            "total": 5,
            "enabled": 3,
            "error": 1
        },
        "windows": {
            "total": 2,
            "enabled": 2,
            "error": 0
        },
        "linux": {
            "total": 2,
            "enabled": 1,
            "error": 1
        },
        "custom_log_1": {
            "total": 1,
            "enabled": 1,
            "error": 0
        },
        .
        .
        .
    },
    "custom_rules_per_log_type": {
        "all": 10,
        "windows": 5,
        "linux": 3,
        "custom_log_1": 1,
        .
        .
        .
    },
    "custom_log_types": 4
}

When there are no detectors or no custom logs defined, the above response would look like:

GET _plugins/_security_analytics/stats

{
    "detectors_per_log_type": {
        "all": {
            "total": 0,
            "enabled": 0,
            "error": 0
        }
    },
    "custom_rules_per_log_type": {
        "all": 0
    },
    "custom_log_types": 0
}

Proposed Response

I propose Sample Response 1 over the other as it is a much cleaner implementation. The drawback of Sample Response 2 is that, when iterating over the log types in the response object, one has to explicitly check for and skip the all entry, which can be confusing. Also, with Sample Response 1, users who parse this information for metric collection and do not need log-type granularity can ignore detectors_per_log_type and custom_rules_per_log_type entirely.
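The difference for a client can be sketched as follows. The two dictionaries are trimmed, hypothetical responses in the two proposed shapes (only two log types each), and the helper names are made up for this illustration.

```python
from typing import Any, Dict, List

# Trimmed, hypothetical response in the shape of Sample Response 1.
RESPONSE_1: Dict[str, Any] = {
    "detectors": {"total": 3, "enabled": 2, "error": 1},
    "detectors_per_log_type": {
        "windows": {"total": 2, "enabled": 2, "error": 0},
        "linux": {"total": 1, "enabled": 0, "error": 1},
    },
}

# The same data in the shape of Sample Response 2, with the "all" aggregate inlined.
RESPONSE_2: Dict[str, Any] = {
    "detectors_per_log_type": {
        "all": {"total": 3, "enabled": 2, "error": 1},
        "windows": {"total": 2, "enabled": 2, "error": 0},
        "linux": {"total": 1, "enabled": 0, "error": 1},
    },
}


def log_types_shape_1(response: Dict[str, Any]) -> List[str]:
    # Every key is a real log type, so no filtering is needed.
    return sorted(response["detectors_per_log_type"])


def log_types_shape_2(response: Dict[str, Any]) -> List[str]:
    # The aggregate shares a namespace with log types and must be skipped by name.
    return sorted(k for k in response["detectors_per_log_type"] if k != "all")


if __name__ == "__main__":
    print(log_types_shape_1(RESPONSE_1))
    print(log_types_shape_2(RESPONSE_2))
    # Forgetting the filter in shape 2 would count the aggregate as a log type.
    print(len(RESPONSE_2["detectors_per_log_type"]))
```

A further consequence of shape 2, under the same assumptions, is that a custom log type named all would collide with the aggregate key.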

Future Improvements

If we require node level metrics, they can be implemented in the future behind an additional query parameter on the request:

GET _plugins/_security_analytics/stats?include_advanced_metrics

The scope of these advanced metrics can be decided after the API proposed above is implemented. The idea is to keep the default API behavior lightweight: collecting information at node-level granularity is expensive, its cost scales linearly with the node count on large clusters, and most users may not need those metrics.