[FEATURE] Stats API - Githubissues

eirsep commented 1 year ago

We need a stats API that gives insight into the health and usage of the plugin. For example, stats can tell us how many detector or rule creations have succeeded or failed at the node level.

petardz commented 1 year ago

One interesting piece of information for users could be the "progress" of detectors, for example whether monitors are keeping up with the document ingestion rate of the log indices. There is an issue on the alerting repo for implementing a Monitor Explain API, which could be used for this: https://github.com/opensearch-project/alerting/issues/751

sandeshkr419 commented 1 year ago

Thinking on the possible API structure.

Scope of stats API:

Metrics related to successes and failures of detectors, at the cluster and node level.
Information related to the correlation engine is not included in the stats API, since the engine is still experimental.

Path and HTTP methods

GET _plugins/_security_analytics/stats
GET _plugins/_security_analytics/stats/<metric>
GET _plugins/_security_analytics/<node-id>/stats
GET _plugins/_security_analytics/<node-id>/stats/<metric>

URL Parameters

node-id: ID of the node for which the stats are required
metric: detectors, detectors_per_log_type, custom_rules, custom_rules_per_log_type
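To make the four proposed paths concrete, here is a minimal sketch of how a client might assemble them. The helper name `stats_path` and the validation against the metric list are assumptions for illustration; nothing here is part of the plugin.

```python
from typing import Optional

BASE = "_plugins/_security_analytics"

# Metric names as listed in the URL Parameters section of this proposal.
METRICS = {
    "detectors",
    "detectors_per_log_type",
    "custom_rules",
    "custom_rules_per_log_type",
}


def stats_path(node_id: Optional[str] = None, metric: Optional[str] = None) -> str:
    """Build one of the four proposed stats paths (hypothetical helper)."""
    if metric is not None and metric not in METRICS:
        raise ValueError(f"unknown metric: {metric}")
    parts = [BASE]
    if node_id is not None:
        parts.append(node_id)  # node-scoped variant
    parts.append("stats")
    if metric is not None:
        parts.append(metric)  # single-metric variant
    return "/".join(parts)
```

For example, `stats_path(node_id="node-1", metric="detectors")` yields `_plugins/_security_analytics/node-1/stats/detectors`, where `node-1` is a made-up node ID.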

Response

TBD after Response Body Fields review

Response Body Fields

Cluster Level Statistics

Field Name	Description
nodes	number of total, successful, failed nodes returned in the response.
cluster_name	cluster’s name.
cluster_uuid	cluster’s uuid.
timestamp	unix epoch time of when the cluster was last refreshed.
status	The cluster’s health status.
plugin_enabled	whether security analytics plugin is enabled or not
detectors	details (enabled, defined, in_error) of detectors
detectors_per_log_type	details (enabled, defined, in_error) of detectors in each log type
enabled, defined, error	stats of detectors as part of detectors and detectors_per_log_type metric
custom_rules	number of custom rules defined
custom_rules_per_log_type	number of custom rules defined per log type

Node Level Statistics

Node-level statistics will be calculated on each individual node and also aggregated across all nodes for a cluster-wide overview.

Field Name	Description
roles	node roles: cluster_manager, data, etc
shards_analyzed	shards spanned by enabled detectors
total_documents	total documents in scope of detectors
documents_processed	documents scanned by detectors
documents_behind	number of documents in a node that are yet to be processed
rules_matched	rules matched by detectors
jobs_started_on_time	detectors started on time on that node
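As a rough illustration of "aggregated over all nodes", the sketch below sums the numeric node-level fields across nodes. That a plain sum is the right aggregation for every field is an assumption, and the function name and node IDs are hypothetical.

```python
from typing import Dict

# Numeric node-level fields from the table above (roles is not numeric, so it is skipped).
NODE_FIELDS = (
    "shards_analyzed",
    "total_documents",
    "documents_processed",
    "documents_behind",
    "rules_matched",
    "jobs_started_on_time",
)


def aggregate_node_stats(per_node: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Sum each numeric field over all nodes (assumed aggregation rule)."""
    totals = {field: 0 for field in NODE_FIELDS}
    for stats in per_node.values():
        for field in NODE_FIELDS:
            # A field missing on a node is treated as zero.
            totals[field] += stats.get(field, 0)
    return totals
```

A caller would pass a mapping of node ID to that node's stats object and read cluster-wide values such as `documents_behind` from the result.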

Task Breakdown

The plan is to get a working API with minimal information ready and then add on statistics as required.

Implement Cluster Level Statistics
Implement Node Level Statistics
[Will create a separate issue] UI / Dashboard Changes
[Will create a separate issue] Documentation changes

References

Used the below APIs to decide on structure of stats API here.

eirsep commented 1 year ago

can you post an example response?

sandeshkr419 commented 12 months ago

@eirsep Sure. After iterating on the request and responses again, here is the updated proposal. I have limited the response objects to keep the response cleaner and to avoid unnecessary information in the first implementation of the stats API.

Request:

GET _plugins/_security_analytics/stats

Proposing 2 sample responses:

Sample Response 1:

GET _plugins/_security_analytics/stats

{
    "detectors": {
        "total": 5,
        "enabled": 3,
        "error": 1
    },
    "detectors_per_log_type": {
        "windows": {
            "total": 2,
            "enabled": 2,
            "error": 0
        },
        "linux": {
            "total": 2,
            "enabled": 1,
            "error": 1
        },
        "custom_log_1": {
            "total": 1,
            "enabled": 1,
            "error": 0
        },
        .
        .
        .
    },
    "custom_rules": 10,
    "custom_rules_per_log_type": {
        "windows": 5,
        "linux": 3,
        "custom_log_1": 1,
        .
        .
        .
    },
    "custom_log_types": 4
}

When there are no detectors or no custom logs defined, the above response would look like:

GET _plugins/_security_analytics/stats

{
    "detectors": {
        "total": 0,
        "enabled": 0,
        "error": 0
    },
    "custom_rules": 0,
    "custom_log_types": 0,
    "detectors_per_log_type": {},
    "custom_rules_per_log_type": {}
}

Sample Response 2:

Considering only the per-log-type objects (detectors_per_log_type and custom_rules_per_log_type), each with a sub-field all that holds the metrics aggregated across all log types.

GET _plugins/_security_analytics/stats

{
    "detectors_per_log_type": {
        "all": {
            "total": 5,
            "enabled": 3,
            "error": 1
        },
        "windows": {
            "total": 2,
            "enabled": 2,
            "error": 0
        },
        "linux": {
            "total": 2,
            "enabled": 1,
            "error": 1
        },
        "custom_log_1": {
            "total": 1,
            "enabled": 1,
            "error": 0
        },
        .
        .
        .
    },
    "custom_rules_per_log_type": {
        "all": 10,
        "windows": 5,
        "linux": 3,
        "custom_log_1": 1,
        .
        .
        .
    },
    "custom_log_types": 4
}

When there are no detectors or no custom logs defined, the above response would look like:

GET _plugins/_security_analytics/stats

{
    "detectors_per_log_type": {
        "all": {
            "total": 0,
            "enabled": 0,
            "error": 0
        }
    },
    "custom_rules_per_log_type": {
        "all": 0
    },
    "custom_log_types": 0
}

Proposed Response

I propose Sample Response 1 over the other because it is the cleaner design. The drawback of Sample Response 2 is that, when iterating over the log types in the response object, one has to deliberately check for and skip the all entry, which can be confusing. Also, with Sample Response 1, users who parse this information for metric collection and do not need log-type granularity can ignore detectors_per_log_type and custom_rules_per_log_type entirely.
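The iteration drawback can be seen in a small sketch. The dictionaries below copy the detector values from the two sample responses, leaving out the elided entries. The function names are hypothetical and only illustrate how a consumer might list log types that have detectors in error.

```python
# Values copied from Sample Response 1 (elided log types left out).
response_1 = {
    "detectors": {"total": 5, "enabled": 3, "error": 1},
    "detectors_per_log_type": {
        "windows": {"total": 2, "enabled": 2, "error": 0},
        "linux": {"total": 2, "enabled": 1, "error": 1},
        "custom_log_1": {"total": 1, "enabled": 1, "error": 0},
    },
}

# Same data in the Sample Response 2 shape, with the aggregate under "all".
response_2 = {
    "detectors_per_log_type": {
        "all": response_1["detectors"],
        **response_1["detectors_per_log_type"],
    },
}


def log_types_in_error(resp: dict) -> list:
    """Naive iteration: every key is treated as a log type."""
    per_type = resp["detectors_per_log_type"]
    return sorted(name for name, s in per_type.items() if s["error"] > 0)


def log_types_in_error_skip_all(resp: dict) -> list:
    """Shape 2 needs an explicit check so the aggregate is not reported as a log type."""
    per_type = resp["detectors_per_log_type"]
    return sorted(
        name for name, s in per_type.items() if name != "all" and s["error"] > 0
    )
```

With the first shape the naive loop is already correct. With the second shape the same loop reports `all` as if it were a log type unless the consumer remembers to filter it out.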

Future Improvements

If we require node-level metrics, they can be implemented in the future behind an additional query parameter on the request:

 GET _plugins/_security_analytics/stats?include_advanced_metrics

The scope of these advanced metrics can be decided after the API proposed above is implemented. The idea is to keep the default API behavior lightweight: collecting information at node-level granularity is expensive, the cost scales linearly with node count on large clusters, and most users may not need those metrics.

opensearch-project / security-analytics

[FEATURE] Stats API #362