[Integrations]: Support for Observability of OpenSearch application server

Hi @anirudha,

As with any other application, we started by analyzing the OpenSearch server logs to derive metrics. The problem is that the log messages are free-form and do not follow a consistent pattern, so it is difficult to extract values from them with a regex. In addition, the logs do not carry enough information for many types of metrics, such as the number of indices or the data volume in each index.

Here is a sample from the OpenSearch server logs:

[2022-06-15T14:30:35,568][INFO ][o.o.p.PluginsService     ] [PSL-5CD1520ZT5] loaded module [rank-eval]
[2022-06-15T14:30:35,568][INFO ][o.o.p.PluginsService     ] [PSL-5CD1520ZT5] loaded module [reindex]
[2022-06-15T14:30:35,569][INFO ][o.o.p.PluginsService     ] [PSL-5CD1520ZT5] loaded module [repository-url]
[2022-06-15T14:30:35,570][INFO ][o.o.p.PluginsService     ] [PSL-5CD1520ZT5] loaded module [test-delayed-aggs]
[2022-06-15T14:30:35,571][INFO ][o.o.p.PluginsService     ] [PSL-5CD1520ZT5] loaded module [transport-netty4]
[2022-06-15T14:30:35,574][INFO ][o.o.p.PluginsService     ] [PSL-5CD1520ZT5] loaded plugin [opensearch-observability]
[2022-06-15T14:30:35,575][INFO ][o.o.p.PluginsService     ] [PSL-5CD1520ZT5] loaded plugin [opensearch-sql]
[2022-06-15T14:30:35,672][INFO ][o.o.e.NodeEnvironment    ] [PSL-5CD1520ZT5] using [1] data paths, mounts [[/mnt/d (drvfs)]], net usable_space [260.4gb], net total_space [276.3gb], types [9p]
[2022-06-15T14:30:35,673][INFO ][o.o.e.NodeEnvironment    ] [PSL-5CD1520ZT5] heap size [1gb], compressed ordinary object pointers [true]
[2022-06-15T14:30:36,654][INFO ][o.o.n.Node               ] [PSL-5CD1520ZT5] node name [PSL-5CD1520ZT5], node ID [O9ylm_SKQ0yXQbW205wnLA], cluster name [opensearch], roles [cluster_manager, remote_cluster_client, data, ingest]
[2022-06-15T14:30:36,993][WARN ][o.o.o.s.PluginSettings   ] [PSL-5CD1520ZT5] observability:Failed to load /mnt/d/opensearch-2.0.0-SNAPSHOT/config/opensearch-observability/observability.yml
[2022-06-15T14:30:42,569][INFO ][o.o.t.NettyAllocator     ] [PSL-5CD1520ZT5] creating NettyAllocator with the following configs: [name=unpooled, suggested_max_allocation_size=256kb, factors={opensearch.unsafe.use_unpooled_allocator=null, g1gc_enabled=true, g1gc_region_size=1mb, heap_size=1gb}]
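To make the limitation concrete, here is a minimal Python sketch based only on the sample above. It assumes every line starts with the same bracketed prefix (timestamp, level, logger, node name); the pattern and the `parse_line` helper are illustrative, not part of OpenSearch. The prefix parses cleanly, but the message is returned as free text, and that is where the values we care about are buried.

```python
import re

# Assumed layout, taken from the sample lines only:
# [timestamp][LEVEL][logger] [node] free-form message
LOG_LINE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]"
    r"\[(?P<level>[A-Z]+)\s*\]"
    r"\[(?P<logger>[^\]]+?)\s*\]"
    r"\s+\[(?P<node>[^\]]+)\]"
    r"\s+(?P<message>.*)$"
)


def parse_line(line):
    """Return the prefix fields and the raw message, or None if no match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = (
        "[2022-06-15T14:30:35,673][INFO ][o.o.e.NodeEnvironment    ] "
        "[PSL-5CD1520ZT5] heap size [1gb], compressed ordinary object pointers [true]"
    )
    print(parse_line(sample))
```

Anything beyond the prefix (heap size, usable space, and so on) would need a separate pattern per message type, which is the part that does not scale.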

So we started looking for alternatives. Here are a few that we found that could help us derive metrics:

Monitoring OpenSearch cluster metrics with Amazon CloudWatch
Amazon CloudTrail
Writing a custom application using the Performance Analyzer plugin
OpenSearch PerfTop utility

Monitoring OpenSearch cluster metrics with Amazon CloudWatch

When OpenSearch runs as an AWS service, CloudWatch can be configured to monitor OpenSearch resources in real time. This would let us collect and track the metrics we need.

Advantage:

The service already exists; we only need to configure it, retrieve the collected metrics, and send them to Fluentd and subsequently to OpenSearch for observability.

Disadvantage:

This works only when OpenSearch is running as an AWS service. People using the standalone version of the application would not benefit.

We are stopping further analysis of CloudWatch for now, as it requires an active AWS account.

Amazon CloudTrail

As above, when OpenSearch runs as an AWS service, Amazon CloudTrail can capture API calls to OpenSearch Service as events and write them to an Amazon S3 bucket that we specify in the configuration. Using this information, we can identify which users and accounts made requests, the source IP address from which the requests were made, and when the requests occurred.
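As a rough illustration of what could be extracted, the sketch below reads a CloudTrail log file (gzipped JSON with a top-level `Records` array) and keeps the fields mentioned above. The field names (`eventTime`, `eventName`, `sourceIPAddress`, `userIdentity`) follow the CloudTrail record format; the `es.amazonaws.com` event source filter and the helper names are our assumptions and should be verified against a real trail.

```python
import gzip
import json


def load_records(path):
    """Read one CloudTrail log file downloaded from S3 (gzipped JSON)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle).get("Records", [])


def summarize_events(records, event_source="es.amazonaws.com"):
    """Keep who/where/when for events from the assumed OpenSearch source."""
    summary = []
    for record in records:
        if record.get("eventSource") != event_source:
            continue
        identity = record.get("userIdentity", {})
        summary.append({
            "time": record.get("eventTime"),
            "action": record.get("eventName"),
            "source_ip": record.get("sourceIPAddress"),
            "account": identity.get("accountId"),
            "user_arn": identity.get("arn"),
        })
    return summary
```

Note that this yields an audit trail of API calls rather than performance metrics.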

Advantage:

The service already exists; we only need to configure it, retrieve the captured events, and send them to Fluentd and subsequently to OpenSearch for observability.

Disadvantage:

This works only when OpenSearch is running as an AWS service. People using the standalone version of the application would not benefit.
In addition, it does not capture all types of metrics, so the observability section would be limited.

We are stopping further analysis of CloudTrail for now, as it requires an active AWS account.

Writing a custom application using the Performance Analyzer plugin

The Performance Analyzer plugin provides RESTful APIs for fetching different metrics from OpenSearch. We can call those APIs at a fixed interval and, after receiving the data, write it to a file or send it over an HTTP channel, from where Fluentd can pick it up and forward it to OpenSearch for observability.
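A minimal sketch of such a poller is shown below, using only the Python standard library. It assumes Performance Analyzer is listening on its default port 9600 and that the response maps each node ID to a `timestamp` plus a `data` object with `fields` and `records`; the function names, the metric choice, and the output path are hypothetical. Each record is written as one JSON line so that a Fluentd tail source could pick it up.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumed default Performance Analyzer endpoint; adjust for your cluster.
PA_BASE_URL = "http://localhost:9600/_plugins/_performanceanalyzer/metrics"


def build_url(metrics, aggs, dim="", nodes="all", base_url=PA_BASE_URL):
    """Build a metrics query; metrics and aggs are comma-joined lists."""
    query = urllib.parse.urlencode(
        {"metrics": ",".join(metrics), "agg": ",".join(aggs),
         "dim": dim, "nodes": nodes},
        safe=",",
    )
    return f"{base_url}?{query}"


def flatten(response):
    """Turn the assumed per-node response shape into flat dict rows."""
    rows = []
    for node_id, payload in response.items():
        data = payload.get("data", {})
        names = [field["name"] for field in data.get("fields", [])]
        for record in data.get("records", []):
            row = dict(zip(names, record))
            row["node_id"] = node_id
            row["timestamp"] = payload.get("timestamp")
            rows.append(row)
    return rows


def poll_once(url, out_path):
    """Fetch one sample and append it to out_path as JSON lines."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = json.load(resp)
    with open(out_path, "a", encoding="utf-8") as out:
        for row in flatten(body):
            out.write(json.dumps(row) + "\n")


def poll_forever(url, out_path, interval_seconds=60):
    """Call the API at a fixed interval until interrupted."""
    while True:
        poll_once(url, out_path)
        time.sleep(interval_seconds)
```

For example, `poll_forever(build_url(["CPU_Utilization"], ["avg"], dim="ShardID"), "pa_metrics.jsonl")` would sample shard-level CPU once a minute. Error handling and retries are left out to keep the sketch short.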

Advantages:

Many metrics are available through the RESTful interface.
The metrics should be more accurate than anything a custom log-scraping solution could produce, since they are provided by the OpenSearch service itself.

Disadvantage:

To take advantage of this, we have to either develop a custom application or use an existing open source application that can call the APIs at a fixed interval and collect the data. This can extend development time.

OpenSearch PerfTop utility

The PerfTop CLI available in the OpenSearch project already uses the Performance Analyzer plugin to populate pre-configured dashboards for analyzing OpenSearch clusters. Currently there is no way to forward this data to OpenSearch or any other service for observability.

We can modify the PerfTop utility to provide an option to fetch all the metrics and write the output to a file instead of rendering dashboards. Fluentd can then read the file and forward the data to OpenSearch for observability.

Advantages:

We already have an application from which we can easily get the OpenSearch metrics.
We only need to extend the current application to meet our requirements, so the development effort should be less than for the previous option.

We do not see many disadvantages here: the existing application will continue to operate as is, and we only plan to add an option to write to a file. Ideally, this should not interfere with existing usage.

Let us know if we are missing something that should be considered along with the above, or if we are overlooking any advantages or disadvantages in the description above.

@spattnaik @abasatwar

opensearch-project / dashboards-observability