ps48 commented 3 years ago

OpenSearch & Dashboard Plugins Health Checks

1. Introduction

The Health Check page and APIs will be a one-stop shop for checking the health status of all OpenSearch & OpenSearch-Dashboard plugins. The page will provide a health status indicator (GREEN/YELLOW/RED) for each plugin. The API endpoints will let users check each plugin and get its detailed health status. Further, the health check responses from each plugin will be stored as logs for analysis.

2. Motivation

As OpenSearch keeps adding more plugins (community and core-team driven), it becomes hard to manage and coordinate them. Since each plugin runs independently, the other plugins keep running even if one of them fails. Such a failure may later lead to a major failure of the cluster (e.g., if a major plugin like Notifications fails, all dependent plugins are affected). Therefore, it would be useful to see recent health checks of individual plugins, to better drill down into problems in an OpenSearch cluster. This will help find the root cause of instance failures when they were caused by a plugin failure. The health check page/logs/APIs would allow users to:

Get a glance of health status of individual plugins
Begin a Root Cause Analysis on a plugin issue
Explore past health check status of individual plugins

2.1 Target Users

Our target users are DevOps Teams and Engineers who manage/support deployed OpenSearch Clusters.

2.2 How is it Different from current status page in Kibana?

The current status page in Kibana provides health indicators for each plugin and a single API endpoint for accessing the plugin status. This page shows indicators based on availability of dependencies for each plugin. The following are issues with current status page:

Doesn’t report any other internal plugin issue apart from dependencies
Doesn’t provide individual API endpoints for each plugin
Doesn’t store detailed logs for recent past plugin statuses

3. Placement of Health Check page

The Health Check page can be integrated in our current OpenSearch setup in either of the two ways:

(Preferred) Integrate with the current Status API available in OpenSearch-Dashboard. This integration will place health checks in Dashboard core, as further explained in Appendix 10.1. This method is preferred, but if it leads to backward compatibility issues we should fall back to the second method below.
Make health check a separate plugin with optionally an OpenSearch-Dashboard page.

4. Requirements

4.1 Required Features:

Health status of each plugin should be indicated by one of the three colors below:
1. GREEN → Indicates the plugin is alive and running
2. YELLOW → Indicates the plugin is alive, but has some issues like:
  1. A plugin might be in initialization stage
  2. A dependency might not be loaded
  3. A plugin might have index issues
  4. Any other misc. issue (plugin specific or node/cluster level)
3. RED → Indicates the plugin is not alive
There should be two types of health checks (a third, the readiness check, has been merged into them):
1. Startup check - This check makes sure that the plugin has initialized (after a start/restart), doesn’t have any internal issues, and is ready to take incoming requests
2. Liveness check - This check makes sure that the plugin is still running after startup, doesn’t have any internal issues, and is ready to take incoming requests
3. Readiness check - This check makes sure that the plugin doesn’t have any internal issues and is ready for incoming requests (merged into the Startup and Liveness checks)
There should be three types of triggers, to initiate the checks:
1. Startup: Called during startup
2. Time-Based: Called by time interval
3. Manual: Called at will
The health checks should be done at regular intervals (e.g., once every 2 hours) and should also be callable at will.
The health checks should run with minimal CPU/memory overhead, so that they do not impact performance.
The health status of each plugin should be logged and stored after each check cycle. The storage should be limited to past few days/weeks/months. Once the old logs expire they should be auto-deleted.
The health status plugin should enable users to see all installed plugins along with their health status indicators.
Apart from plugins, the health check page should also indicate the status of OpenSearch service or any other core service.
If not auto-triggered, the health checks should be able to get triggered manually via API endpoint.
The health checks triggered within each plugin should be completed in a stipulated time.
The health checks should be performed at the node level.
A user should be able to view detailed health check logs in a tabular format for a given time period.
It should be mandatory for each plugin to provide health check endpoints.
The health check page/plugin should have its own endpoints for self-health checks.
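
Two of the requirements above (checks must finish within a stipulated time, and must add little overhead) can be sketched as follows. This is only an illustrative sketch, not part of the proposal: the function names are hypothetical, and it assumes each plugin check is exposed as an async callable that returns its response message.

```python
import asyncio
from typing import Awaitable, Callable, Dict

# Hypothetical type: an async callable that returns a plugin's response message.
Check = Callable[[], Awaitable[str]]


async def run_check(check: Check, timeout_seconds: float) -> str:
    """Run one plugin check, reporting "Error" if it fails or exceeds the time limit."""
    try:
        return await asyncio.wait_for(check(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        return "Error"
    except Exception:
        # A crashing check is treated the same as a failed one.
        return "Error"


async def run_all_checks(checks: Dict[str, Check], timeout_seconds: float) -> Dict[str, str]:
    """Run every plugin check concurrently so one slow plugin does not delay the rest."""
    names = list(checks)
    results = await asyncio.gather(
        *(run_check(checks[name], timeout_seconds) for name in names)
    )
    return dict(zip(names, results))


# Two stand-in checks used only to demonstrate the behaviour.
async def fast_check() -> str:
    return "Alive"


async def slow_check() -> str:
    await asyncio.sleep(1.0)
    return "Alive"
```

Calling `asyncio.run(run_all_checks({"notebooks": fast_check, "alerting": slow_check}, 0.05))` reports the slow plugin as "Error" without waiting for it to finish.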

4.2 Optional Features:

The health check plots of past health metrics can be shown on the same page.
Managing logs (insertion, deletion and permissions) would be easier if they are stored as an OpenSearch index. But, this can have adverse effects on cluster performance as the I/O operations can be huge. This depends on how many health check requests are made, how often is the auto-deletion setup and how often are the logs accessed from the index.

4.3 Required Configuration:

// Following are Configuration options required for health check page
// The defaults for each of them can be adjusted based on: 
// 1. Average time taken by each plugin to startup
// 2. Max response time taken by each plugin
// 3. Average number of times a plugin responds with "Waiting" message

// NOTE: The configurations are global, i.e. they apply to health check requests made to all plugins.

"initialDelaySeconds": Waiting period before sending first startup check request 
"pingWaitSeconds": Waiting period before re-pinging a plugin
"maxPingLimit": Max number of times a request call can be repeated 
"requestTimeOutSeconds": Max Time for a request completion 
"triggerIntervalSeconds": Time interval between automated health checks

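One possible way to hold and validate the options listed in section 4.3 is sketched below. The field names come from the list; every default value is a placeholder assumption (only the 2-hour interval echoes the example in section 4.1), and `load_config` is a hypothetical helper.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict


@dataclass(frozen=True)
class HealthCheckConfig:
    # Defaults are illustrative placeholders, not values from the proposal.
    initialDelaySeconds: int = 30
    pingWaitSeconds: int = 10
    maxPingLimit: int = 5
    requestTimeOutSeconds: int = 5
    triggerIntervalSeconds: int = 7200  # "once every 2 hours", as in section 4.1

    def __post_init__(self) -> None:
        for field in fields(self):
            value = getattr(self, field.name)
            if isinstance(value, bool) or not isinstance(value, int) or value < 0:
                raise ValueError(f"{field.name} must be a non-negative integer, got {value!r}")
        if self.maxPingLimit < 1:
            raise ValueError("maxPingLimit must be at least 1")


def load_config(overrides: Dict[str, Any]) -> HealthCheckConfig:
    """Build a config from user overrides, rejecting keys that are not defined above."""
    known = {field.name for field in fields(HealthCheckConfig)}
    unknown = set(overrides) - known
    if unknown:
        raise ValueError(f"unknown configuration keys: {sorted(unknown)}")
    return HealthCheckConfig(**overrides)
```

Rejecting unknown keys catches misspelled option names early, which matters because the settings are global to all plugins.
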
5. Workflows

The health check page, as part of OpenSearch-Dashboard, should be responsible for sending health check requests, logging responses and updating indicators & plots.
The plugins should be responsible for providing API endpoints and doing their internal checks before responding to any health check requests.
If the response to any health check request is “Waiting” or “Initializing”, the health check page re-pings the plugin every “pingWaitSeconds”. After the number of tries configured as "maxPingLimit", the page marks the plugin indicator RED and stops making requests to the plugin.
Initialization Trigger:
- At first, all the plugin indicators are turned RED by the health check page. The health page then waits for a certain amount of time before requesting each plugin. The initial wait time is configured as “initialDelaySeconds”.
- The health check page sends a “startup check” async request to each plugin. If a plugin responds with “Waiting” (plugin may be waiting for a dependency to load) or “Initializing” (plugin may be in warm-up stage), its indicator is turned from RED to YELLOW.
- Once the plugins have completed initialization, they send an “Initialized” response to the health page. The indicators of these plugins are turned from YELLOW to GREEN.
- If the response from any plugin is not received/contains error the indicator for the plugin is turned RED.
Timely/Manual trigger:
- For time based or manual triggers, the health check page sends a “liveness check” async request to all the plugins. This request is accompanied by a payload of trigger type (Manual or Time-Based).
- If the response of ”liveness check“ is an error, the indicator for that plugin is turned RED.
- If the response of the “liveness check” is “waiting“, the indicator for that plugin is turned YELLOW.
- If the response from a plugin is “alive”, then the indicator stays GREEN.
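
The re-ping behaviour and indicator transitions described in this section can be sketched as a small loop. This is a minimal sketch under assumptions: `check_plugin` and its parameters are hypothetical names, the request is a plain callable returning the response message, and unknown messages are treated as RED.

```python
import time
from typing import Callable, Tuple

# Response message -> indicator colour, following the workflow text above.
INDICATOR_FOR_MESSAGE = {
    "Initialized": "GREEN",
    "Alive": "GREEN",
    "Waiting": "YELLOW",
    "Initializing": "YELLOW",
    "Error": "RED",
}
RETRY_MESSAGES = {"Waiting", "Initializing"}


def check_plugin(
    send_request: Callable[[], str],
    ping_wait_seconds: float,
    max_ping_limit: int,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, int]:
    """Ping a plugin until it gives a final answer; return (indicator, ping_count)."""
    ping_count = 0
    while ping_count < max_ping_limit:
        ping_count += 1
        try:
            message = send_request()
        except Exception:
            # No response or a failed request turns the indicator RED.
            return "RED", ping_count
        if message not in RETRY_MESSAGES:
            # Unknown messages are treated as RED (an assumption of this sketch).
            return INDICATOR_FOR_MESSAGE.get(message, "RED"), ping_count
        if ping_count < max_ping_limit:
            sleep(ping_wait_seconds)
    # Still "Waiting"/"Initializing" after maxPingLimit tries: mark RED and stop.
    return "RED", ping_count
```

Passing `sleep` as a parameter keeps the loop testable without real delays.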

6. API Design

6.1 APIs provided by plugin

The APIs below use the Notebooks plugin as an example for making requests. Each health check type gets a separate API endpoint on each plugin. The API responses should contain the availability of dependencies (available or not available) and the health of indices (green, yellow or red) used by the plugin. If a plugin uses no dependencies or indices, these arrays can be left empty in the response. The final "customMessage" object can be used by plugins to add additional information that may help in debugging plugin-specific issues.

Startup Check

// Dashboards Plugin
GET api/health/startup/<plugin id> 

// OpenSearch Plugin
GET _plugins/_health/startup/<plugin id>

RESPONSE BODY
{
    "statusCode": 200,
    "body":{
            "message": "Waiting", // indicator is turned YELLOW, Health Check Page re-pings later
            "description": "Waiting for dashboard plugin to initialize",
            "dependencies": [
                              {"dependency1": "available"},
                              {"dependency2": "not available"}
                            ],
            "indices": [
                         {"index1": "green"},
                         {"index2": "green"}
                       ], 
            "customMessage":{}
          } 
}

RESPONSE BODY
{
    "statusCode": 200,
    "body":{
            "message": "Initializing", // indicator is turned YELLOW, Health Check Page re-pings later 
            "description": "Initializing internal components",
            "dependencies": [
                              {"dependency1": "available"},
                              {"dependency2": "not available"}
                            ],
            "indices": [
                         {"index1": "green"},
                         {"index2": "green"}
                       ], 
            "customMessage":{}
          } 
}

RESPONSE BODY
{
    "statusCode": 200,
    "body":{
            "message": "Initialized", // indicator is turned GREEN, plugin is accepting traffic
            "description": "Accepting Traffic",
            "dependencies": [
                              {"dependency1": "available"},
                              {"dependency2": "not available"}
                            ],
            "indices": [
                         {"index1": "green"},
                         {"index2": "yellow"}
                       ], 
            "customMessage":{}
          } 
}

RESPONSE BODY
{
    "statusCode": 200,
    "body":{
            "message": "Error", // Health Check Page keeps plugin indicator RED
            "description": "Internal error in starting the plugin", // Can be a custom error message 
            "dependencies": [
                              {"dependency1": "available"},
                              {"dependency2": "not available"}
                            ],
            "indices": [
                         {"index1": "green"},
                         {"index2": "yellow"}
                       ], 
            "customMessage":{}
          } 
}

Liveness Check

// Dashboards Plugin
POST api/health/liveness/<plugin id> 

// OpenSearch Plugin
POST _plugins/_health/liveness/<plugin id>

REQUEST BODY
{
    "triggerType": "Manual" // Can be "Time-Based" 
}

RESPONSE BODY
{
    "statusCode": 200,
    "body":{
            "message": "Alive", // indicator is turned GREEN, plugin is accepting traffic
            "description": "Accepting Traffic",
            "dependencies": [
                              {"dependency1": "available"},
                              {"dependency2": "not available"}
                            ],
            "indices": [
                         {"index1": "green"},
                         {"index2": "green"}
                       ], 
            "customMessage":{}
          } 
}

RESPONSE BODY
{
    "statusCode": 200,
    "body":{
            "message": "Waiting", // indicator is turned YELLOW, Health Check Page re-pings later
            "description": "Waiting for dashboard plugin to initialize",
            "dependencies": [
                              {"dependency1": "available"},
                              {"dependency2": "not available"}
                            ],
            "indices": [
                         {"index1": "green"},
                         {"index2": "green"}
                       ], 
            "customMessage":{}
          } 
}

RESPONSE BODY
{
    "statusCode": 200,
    "body":{
            "message": "Error", // Health Check Page keeps plugin indicator RED
            "description": "Internal error in starting the plugin", // Can be a custom error message 
            "dependencies": [
                              {"dependency1": "available"},
                              {"dependency2": "not available"}
                            ],
            "indices": [
                         {"index1": "green"},
                         {"index2": "green"}
                       ], 
            "customMessage":{}
          } 
}
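
On the plugin side, a response in the shape shown above could be assembled by a small helper. This is a hedged sketch: `build_health_response` is a hypothetical name, dependency availability is assumed to be tracked as booleans, and the fixed 200 status code simply mirrors the examples above.

```python
import json
from typing import Any, Dict, Optional

ALLOWED_MESSAGES = {"Waiting", "Initializing", "Initialized", "Alive", "Error"}
ALLOWED_INDEX_HEALTH = {"green", "yellow", "red"}


def build_health_response(
    message: str,
    description: str,
    dependencies: Dict[str, bool],
    indices: Dict[str, str],
    custom_message: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Assemble a startup/liveness response body in the shape shown above."""
    if message not in ALLOWED_MESSAGES:
        raise ValueError(f"unsupported message: {message!r}")
    for name, health in indices.items():
        if health not in ALLOWED_INDEX_HEALTH:
            raise ValueError(f"index {name!r} has unsupported health {health!r}")
    return {
        "statusCode": 200,  # mirrors the examples above
        "body": {
            "message": message,
            "description": description,
            # Empty inputs yield empty arrays, as the text allows.
            "dependencies": [
                {name: "available" if ok else "not available"}
                for name, ok in dependencies.items()
            ],
            "indices": [{name: health} for name, health in indices.items()],
            "customMessage": custom_message or {},
        },
    }
```

The result serializes directly with `json.dumps`, so a plugin can return it from its health endpoint handler.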

~~Readiness Check~~ NOTE: Merged with startup and liveness checks

POST api/notebooks/health/readiness 

REQUEST BODY
{
"triggerType": "Startup" // Can be "Manual" or "Time-Based", if checking post startup
}

RESPONSE BODY
{
"statusCode": 200,
"message": "Ready", // Health Check Page turns indicator GREEN
"body": "Accepting Traffic" 
}

RESPONSE BODY
{
"statusCode": 200,
"message": "Waiting", // Health Check re-pings
"body": "Waiting for other requests to be completed" 
}

RESPONSE BODY
{
"statusCode": 500,
"message": "Error", // Health Check Page turns plugin indicator RED
"body": "Internal error in plugin" // Can be a custom error message
}

6.2 APIs provided by health check page

The health check page provides three types of APIs: Startup, Liveness and Readiness (Readiness has since been merged into the other two). These APIs make internal requests to all the plugins (explained above).

Startup Check

GET api/health/startup

RESPONSE BODY
{
   "healthStatus": [
        {
            "timestamp": "2021-01-01T04:04:02Z", 
            "pluginId": "notebooksDashboards@1.0.0",
            "healthCheckType": "Startup",
            "triggerType": "Startup",
            "pingCount": 5,  
            "requestTimestamp": "2021-01-01T04:02:59Z",
            "responseTime": 1.01, 
            "statusCode": 500,
            "responseMessage": "Error",
            "responseDescription": "Internal error in starting the plugin",
            "indicator": "RED"  
        },
        {
            "timestamp": "2021-01-01T04:04:02Z", 
            "pluginId": "notificationsDashboards@1.0.0",
            "healthCheckType": "Startup",
            "triggerType": "Startup",
            "pingCount": 3,  
            "requestTimestamp": "2021-01-01T04:02:59Z",
            "responseTime": 1.01, 
            "statusCode": 200,
            "responseMessage": "Initialized",
            "responseDescription": "Accepting Traffic",
            "indicator": "GREEN"  
        }
    ]
}

Liveness Check

POST api/health/liveness 

REQUEST BODY
{
    "triggerType": "Manual" // or can be "Time-Based" 
}

RESPONSE BODY
{
   "healthStatus": [
        {
            "timestamp": "2021-01-01T04:04:02Z", 
            "pluginId": "notebooksDashboards@1.0.0",
            "healthCheckType": "Liveness",
            "triggerType": "Manual",
            "pingCount": 5,  
            "requestTimestamp": "2021-01-01T04:02:59Z",
            "responseTime": 1.01, 
            "statusCode": 500,
            "responseMessage": "Error",
            "responseDescription": "Internal error in starting the plugin",
            "indicator": "RED"  
        },
        {
            "timestamp": "2021-01-01T04:04:02Z", 
            "pluginId": "notificationsDashboards@1.0.0",
            "healthCheckType": "Liveness",
            "triggerType": "Manual",
            "pingCount": 3,  
            "requestTimestamp": "2021-01-01T04:02:59Z",
            "responseTime": 1.01, 
            "statusCode": 200,
            "responseMessage": "Alive",
            "responseDescription": "Accepting Traffic",
            "indicator": "GREEN"  
        }
    ]
}
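
The aggregated response above could be produced by turning each plugin's result into one entry and wrapping the entries in a "healthStatus" array. The sketch below is illustrative only: the function names are hypothetical, and deriving the indicator from the final message alone is a simplifying assumption (it ignores the maxPingLimit rule from section 5).

```python
from datetime import datetime, timezone
from typing import Any, Dict, Iterable

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"
GREEN_MESSAGES = {"Initialized", "Alive"}
YELLOW_MESSAGES = {"Waiting", "Initializing"}


def format_utc(moment: datetime) -> str:
    """Render a timezone-aware datetime like "2021-01-01T04:04:02Z"."""
    return moment.astimezone(timezone.utc).strftime(TIMESTAMP_FORMAT)


def indicator_for(message: str) -> str:
    if message in GREEN_MESSAGES:
        return "GREEN"
    if message in YELLOW_MESSAGES:
        return "YELLOW"
    return "RED"


def status_entry(
    plugin_id: str,
    check_type: str,
    trigger_type: str,
    ping_count: int,
    requested_at: datetime,
    response_seconds: float,
    status_code: int,
    message: str,
    description: str,
    logged_at: datetime,
) -> Dict[str, Any]:
    """Build one element of the "healthStatus" array."""
    return {
        "timestamp": format_utc(logged_at),
        "pluginId": plugin_id,
        "healthCheckType": check_type,
        "triggerType": trigger_type,
        "pingCount": ping_count,
        "requestTimestamp": format_utc(requested_at),
        "responseTime": response_seconds,
        "statusCode": status_code,
        "responseMessage": message,
        "responseDescription": description,
        "indicator": indicator_for(message),
    }


def health_status_response(entries: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    return {"healthStatus": list(entries)}
```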

~~Readiness Check~~ NOTE: Merged with startup and liveness checks

POST api/health/readiness  

REQUEST BODY
{
    "triggerType": "Time-Based" // Or can be "Manual" or "Startup" 
}

RESPONSE BODY
{
   "healthStatus": [
        {
            "timestamp": "2021-01-01T04:04:02Z", 
            "pluginId": "notebooksDashboards@1.0.0",
            "healthCheckType": "Readiness",
            "triggerType": "Time-Based",
            "pingCount": 5,  
            "requestTimestamp": "2021-01-01T04:02:59Z",
            "responseTime": 1.01, 
            "statusCode": 500,
            "responseMessage": "Error",
            "responseBody": "Internal error in starting the plugin",
            "indicator": "RED"  
        },
        {
            "timestamp": "2021-01-01T04:04:02Z", 
            "pluginId": "notificationsDashboards@1.0.0",
            "healthCheckType": "Readiness",
            "triggerType": "Time-Based",
            "pingCount": 3,  
            "requestTimestamp": "2021-01-01T04:02:59Z",
            "responseTime": 1.01, 
            "statusCode": 200,
            "responseMessage": "Ready",
            "responseBody": "Accepting traffic",
            "indicator": "GREEN"  
        }
    ]
}

7. Log Fields & Example

Each health check request made should be stored as logs. The logs will give users detailed feed of availability details and access to health check history.

“timestamp”: Time of logging the health check response
“nodeId”: Id of the node in cluster
“pluginId”: Id of the plugin being checked
“startupTime”: Time of initial startup for the plugin
“lastAlive”: Time when plugin was seen last alive (passed the liveness check)
“lastReady”: Time when plugin was seen last ready (passed the readiness check)
“healthCheckType”: Type of health check request → Startup/Liveness/Readiness
“triggerType”: Type of trigger for the health check → Startup/Manual/Time-Based
“pingCount”: Number of times the health check requests were re-pinged
“requestTimestamp”: Time of sending the current health check request
“responseTime”: Response time of plugin in seconds
“responseStatusCode”: Response status code
“responseMessage”: Response message sent by the plugin → Initialized/Waiting/Initializing/Ready/Alive/Error
“responseDescription”: Response reason attached to the message
“indicator”: Indicator status after the response was received

[
    {
        "timestamp": "2021-01-01T04:04:02Z", 
        "nodeId":"USpTGYaBSIKbgSUJR2Z9lg", 
        "pluginId": "notebooksDashboards@1.0.0",
        "startupTime": "2021-01-01T03:08:48Z",
        "lastAlive": "2021-01-01T03:10:00Z",
        "lastReady": "2021-01-01T03:10:01Z",
        "healthCheckType": "Liveness",
        "triggerType": "Time-Based",
        "pingCount": 3,  
        "requestTimestamp": "2021-01-01T04:02:59Z",
        "responseTime": 1.12, 
        "responseStatusCode": 200,
        "responseMessage": "Alive",
        "responseDescription": "Accepting Traffic",
        "indicator": "GREEN" 
    },
    {
        "timestamp": "2021-01-01T04:04:03Z",
        "nodeId":"USpTGYaBSIKbgSUJR2Z9lg", 
        "pluginId": "notificationsDashboards@1.0.0",
        "startupTime": "2021-01-01T03:08:49Z",
        "lastAlive": "2021-01-01T03:10:01Z",
        "lastReady": "2021-01-01T03:10:02Z",
        "healthCheckType": "Startup",
        "triggerType": "Startup",
        "pingCount": 1,
        "requestTimestamp": "2021-01-01T04:03:00Z",
        "responseTime": 1.01,
        "responseStatusCode": 200,
        "responseMessage": "Initialized",
        "responseDescription": "Accepting Traffic",
        "indicator": "GREEN" 

    }
]
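
Section 4.1 asks for old logs to be auto-deleted and for logs to be viewable for a given time period. Assuming log records are held as a list of dictionaries with the fields above, both operations can be sketched as simple filters on "timestamp". The helper names and the retention length are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_timestamp(value: str) -> datetime:
    """Parse timestamps such as "2021-01-01T04:04:02Z" as UTC."""
    return datetime.strptime(value, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)


def prune_expired(
    records: List[Dict[str, Any]], retention_days: int, now: datetime
) -> List[Dict[str, Any]]:
    """Keep only the records logged within the retention window."""
    cutoff = now - timedelta(days=retention_days)
    return [r for r in records if parse_timestamp(r["timestamp"]) >= cutoff]


def records_between(
    records: List[Dict[str, Any]], start: datetime, end: datetime
) -> List[Dict[str, Any]]:
    """Select records for a given time period (inclusive), e.g. for a tabular view."""
    return [r for r in records if start <= parse_timestamp(r["timestamp"]) <= end]
```

If the logs end up in an OpenSearch index instead (the option discussed in section 4.2), the same cutoff logic would be expressed as a delete or range query rather than in application code.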

8. Future

Users should be able to set alerts and notifications based on changes in plugin indicators.
The health check page should trigger a performance check. This will check the cluster’s ingestion speed, CPU and memory usage.

9. References

Status.io: https://status.status.io/
New Relic Blog - Kubernetes Health Checks: https://newrelic.com/blog/how-to-relic/kubernetes-health-checks
AWS ECS health check: https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_HealthCheck.html
Kubernetes Health Probes: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/

10. Appendix

10.1 Integrating current Dashboard Status API with above proposed health check APIs

As mentioned in section 3, there are two ways to implement the health check page. If we decide to merge with the current Dashboard Status API, we would need to make the following changes:
- The Dashboard Status API calls should invoke the health check APIs for each plugin. This async request would follow the “manual trigger” workflow explained in section 5. This can be implemented by calling the APIs provided by the health check page for liveness and readiness.
- The current status API response for each plugin contains the fields “id”, “message”, “since”, “state”, “icon” and “uicolor”. We should add the "startupTime", "lastAlive" and "lastReady" fields to the response (these are explained in section 7). In addition, the already present "message" field should contain the response of the plugin’s liveness/readiness check.
- The Dashboard Status Page should add the configuration settings mentioned in section 4.3. Optionally, we could add plots of past health checks.
- Backward compatibility must be kept in mind while adding these fields to the status API. Changing the API drastically would break current usage in existing plugins.
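
One backward-compatible way to add the new fields is to copy the existing per-plugin status entry and only add keys, so that current consumers keep finding every field they already read. A minimal sketch, assuming the entry is a plain dictionary; `extend_status_entry` is a hypothetical name.

```python
from typing import Any, Dict


def extend_status_entry(
    existing: Dict[str, Any],
    startup_time: str,
    last_alive: str,
    last_ready: str,
    health_message: str,
) -> Dict[str, Any]:
    """Return a copy of a status entry with the health check fields added."""
    extended = dict(existing)  # copy, so the caller's entry is left untouched
    # "message" already exists; it now carries the liveness/readiness response.
    extended["message"] = health_message
    extended["startupTime"] = startup_time
    extended["lastAlive"] = last_alive
    extended["lastReady"] = last_ready
    return extended
```

Because no existing key is removed or renamed, clients that ignore unknown fields would be unaffected by the change.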

10.2 Is the startup/liveliness/readiness model inspired by another product? How do other systems similar to OpenSearch and OpenSearch Dashboards deal with plugin health?

The checks are inspired by Kubernetes health checks. I haven’t seen any similar system with well-designed plugin health checks. But there are a few interesting things: Splunk has a third-party health status dashboard (which internally uses REST APIs) to monitor instances, data feed, resource usage, available indices, etc. This is something we could do in the future with our health check plugins. Another search platform, Algolia, has a monitoring system for its SaaS offering. This monitoring system implements HTTPS endpoints to monitor cluster-level status requests. Datadog has a similar status page https://status.datadoghq.com/ and has a separate status page for third-party integrations https://datadogintegrations.statuspage.io

10.3 How will this API integrate with other monitoring solutions that I may be using for my operations?

Users having their own monitoring solutions today, can easily consume these health check APIs. They can create a monitoring dashboard of their own and can call health checks at will, using these APIs. This is assuming their current solutions can integrate with new REST endpoints.
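
As an illustration of how an external monitoring tool might consume the proposed API, the sketch below takes the JSON text returned by a call such as `POST api/health/liveness` and lists the plugins that need attention. Fetching the response is left to the caller's own HTTP client; `unhealthy_plugins` is a hypothetical helper name.

```python
import json
from typing import List, Tuple


def unhealthy_plugins(response_text: str) -> List[Tuple[str, str, str]]:
    """Return (pluginId, indicator, description) for every non-GREEN plugin."""
    payload = json.loads(response_text)
    return [
        (
            entry["pluginId"],
            entry.get("indicator", "RED"),  # a missing indicator is treated as RED
            entry.get("responseDescription", ""),
        )
        for entry in payload.get("healthStatus", [])
        if entry.get("indicator") != "GREEN"
    ]
```

A monitoring job could run this on every poll and raise an alert whenever the returned list is not empty.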

seraphjiang commented 3 years ago

Assuming the proposal is for Dashboards Plugin health only, not for OpenSearch Plugin.

Today, the Dashboards status API aggregates all plugin statuses, and the status page just invokes the status API. It is not clear what the new Kibana status API response is expected to look like in the new proposal.

ps48 commented 3 years ago

@seraphjiang Thank you so much for the feedback 😄, I added an Appendix section for integration of health checks with current Dashboards status API. But in short, the status API would follow the "manual trigger" workflow and internally call the health check page APIs for liveness and readiness.

dblock commented 3 years ago

I like the spirit of the problem, and have lots of questions.

What are some anecdotes of how lack of plugin health has negatively affected operations? How would this proposal address those?
What are the motivations to solve this in identical ways for both OpenSearch and OpenSearch Dashboards?
Is the startup/liveliness/readiness model inspired by another product? How do other systems similar to OpenSearch and OpenSearch Dashboards deal with plugin health?
Why is health a REST API? Are there, or will there ever be, plugins that don't expose a REST interface at all?
How does the new API interact with existing status APIs?
Will existing plugins have to do a lot of work to add health checks or are they going to inherit a basic version automagically in some release of OpenSearch and OpenSearch Dashboards?
There's an assumption that all plugins are installed everywhere. This may not always be true. The cluster can also be very heterogeneous.
There's the idea that plugins are independent and should have their own status. Is this what devops engineers really want?
A plugin might be dead on one node and very healthy on another. Do users really want to see plugin health? Node health? Will averaging health be a misleading metric?
How will this API integrate with other monitoring solutions that I may be using for my operations?
It looks like the API wants to be polled. This is potentially bad because in a dead state things start taking a long time, timing out, hanging, etc. I'd prefer my system to send events somewhere, and only poll for UX.

ps48 commented 3 years ago

Thanks @dblock for such detailed and intriguing questions. Below are my responses:

What are some anecdotes of how lack of plugin health has negatively affected operations? How would this proposal address those?
- Reporting ES internal dependency missing
- Index missing for trace analytics
- Alerting Plugin internal dependency missing
- SQL Issue due to bad index
- In the issues listed above, our health check page would have readily shown whether the plugins had problems.
What are the motivations to solve this in identical ways for both OpenSearch and OpenSearch Dashboards?
- From a user point of view, a common interface is easier to consume.
Is the startup/liveliness/readiness model inspired by another product? How do other systems similar to OpenSearch and OpenSearch Dashboards deal with plugin health?
- The checks are inspired by Kubernetes health checks. I haven’t seen any similar system with well-designed plugin health checks. But there are a few interesting things: Splunk has a third-party plugin health status dashboard (which internally uses REST APIs) to monitor instances, data feed, resource usage, available indices, etc. This is something we could do in the future with our health check plugins. Another search platform, Algolia, has a monitoring system for its SaaS offering. This monitoring system implements HTTPS endpoints to monitor cluster-level status requests. Datadog has a similar status page and has a separate status page for third-party integrations.
Why is health a REST API? Are there, or will there ever be, plugins that don't expose a REST interface at all?
- OpenSearch is already a RESTful application, hence it makes sense to keep health checks of plugins as REST APIs. As far as I know, currently all plugins have REST endpoints. Any new plugin coming should at least provide the health check APIs.
How does the new API interact with existing status APIs?
- I have explained this in Appendix 10.1. Do let me know if you see something missing there or want to know some specifics.
Will existing plugins have to do a lot of work to add health checks or are they going to inherit a basic version automagically in some release of OpenSearch and OpenSearch Dashboards?
- Exposing API endpoints shouldn’t be a lot of work. I would say major work will be around implementing their health checks and knowing where the plugins will tend to break. This will vary from plugin to plugin. Some plugins have access to indices, some have dependency on other internal plugins, in future some may have external dependencies as well. To bring in some nature of automagic-ness we can provide a library or a boilerplate code to begin with.
There's an assumption that all plugins are installed everywhere. This may not always be true. The cluster can also be very heterogeneous.
- This is a great point. I understand that each plugin can be installed on individual nodes, each node can have different roles in a cluster, and Dashboard can be connected to a single node or the whole cluster. What I don’t understand is whether plugins on different nodes can have inter-dependencies or access cross-node indices in the same cluster. I need to look into this more. Is there some documentation or guidance you can provide?
There's the idea that plugins are independent and should have their own status. Is this what devops engineers really want?
- Plugins aren’t independent, but they run independently. What I mean is that if the Notifications plugin fails, Alerting will still keep running; if the Notebooks internal index has an issue, the notebooks-dashboard plugin still keeps running. Only when these plugins try to reach their dependency or index do they start emitting error traces. This is where our health check page can come in handy. It would already show the Alerting and Notebooks plugins as YELLOW and have a detailed message from their health check responses. However, I would love to hear more from DevOps engineers too. It would be great if we could ask them to comment on this thread and share their experience.
A plugin might be dead on one node and very healthy on another. Do users really want to see plugin health? Node health? Will averaging health be a misleading metric?
- This again comes down to cluster-level vs. node-level details (similar to “question 7”). Averaging is surely misleading; node level would make more sense for our health checks.
How will this API integrate with other monitoring solutions that I may be using for my operations?
- Users having their own monitoring solutions today, can easily consume these health check APIs. They can create a health monitoring dashboard of their own and can call health checks at will, using these APIs. This is assuming their current solutions can integrate with new REST endpoints.
It looks like the API wants to be polled. This is potentially bad because in a dead state things start taking a long time, timing out, hanging, etc. I'd prefer my system to send events somewhere, and only poll for UX.
- This is a really great design question. I initially thought of a pub-sub model, where plugins publish their status and messages to logs. The logs could be stored as files, an index, or queues. The UX would just read from the logs. But this model creates a new dependency on log permissions and storage. If plugins have some issue writing/indexing their statuses, there would be no way to get the status results. I would love to discuss this more.

anirudha commented 3 years ago

M1: design review with feedback from other plugins and a PoC to validate with 1 plugin (9/15)
M2: work with all plugin teams to adopt the lib and add implementation (10/15)
M3: release candidate (10/30)
@ps48 what do you say?

anirudha commented 3 years ago

i would suggest we can start with the SQL plugin or notebooks plugin as a PoC?

ps48 commented 3 years ago

For the PoC let's start with Notebooks; this would help to create checks on both OpenSearch and Dashboard endpoints.

ylwu-amzn commented 3 years ago

Just summarizing our talk:

Is it possible to merge the startup and readiness check into one request? We can tell whether the startup/readiness done from the result. That can save one extra request.
How can we encourage community users to follow this practice and add a health check API, especially for existing plugins that want to migrate to OpenSearch? Is it possible to make the health check API path stable, like api/health/<plugin id>, so that we can easily check whether new plugins support health checks?
Health checks should not add significant extra load; heavy, deep health checks should be throttled.
How should health check logs be stored? We should discuss the impact further: if the health check log grows fast, will it affect cluster performance?

ps48 commented 3 years ago

Thank you so much @ylwu-amzn, for these pointers. I'll talk to some more plugin owners and then finally merge these comments to the design document.

ps48 commented 2 years ago

Merged the above comments and different points from plugin owners in the design.

skkosuri-amzn commented 2 years ago

A few more:

Will each plugin have a separate API? If yes, are we creating too many APIs for this?
What's the security model for these APIs? (Who is the target user? What are their permissions?)

opensearch-project / opensearch-plugins

Plugin Health Checks #74
