opensearch-project / opensearch-plugins

For all things OpenSearch plugins. You want to install, or develop a plugin? You've come to the right place.
Apache License 2.0

Plugin Health Checks #74

Open ps48 opened 3 years ago

ps48 commented 3 years ago

OpenSearch & Dashboard Plugins Health Checks

1. Introduction

The Health Check page and APIs will be a one-stop shop for checking the health status of all OpenSearch and OpenSearch Dashboards plugins. The page will provide a health status indicator (Green/Yellow/Red) for each plugin. The API endpoints will let users check each plugin and get its detailed health status. Further, the health check responses from each plugin will be stored as logs for analysis.

2. Motivation

As OpenSearch keeps adding plugins (community and core-team driven), it becomes hard to manage and coordinate them. Since each plugin runs independently, the other plugins keep running even if one of them fails. The failure of one plugin may later lead to a major failure of the cluster (e.g., if a major plugin like Notifications fails, all dependent plugins are affected). Therefore, it would be useful to see the recent health checks of individual plugins, to better drill down into problems in an OpenSearch cluster. This will help find the root cause of instance failures that were caused by a plugin failure. The health check page/logs/APIs would allow users to:

  1. Get an at-a-glance view of the health status of individual plugins
  2. Begin a Root Cause Analysis on a plugin issue
  3. Explore past health check status of individual plugins

2.1 Target Users

Our target users are DevOps Teams and Engineers who manage/support deployed OpenSearch Clusters.

2.2 How is it Different from current status page in Kibana?

The current status page in Kibana provides health indicators for each plugin and a single API endpoint for accessing the plugin status. This page shows indicators based on the availability of each plugin's dependencies. The current status page has the following issues:

  1. Doesn’t report any other internal plugin issue apart from dependencies
  2. Doesn’t provide individual API endpoints for each plugin
  3. Doesn’t store detailed logs for recent past plugin statuses

3. Placement of Health Check page

The Health Check page can be integrated in our current OpenSearch setup in either of the two ways:

  1. (Preferred) Integrate with the current Status API available in OpenSearch Dashboards. This integration will place health checks in Dashboards core, as explained further in Appendix 10.1. If this leads to backward-compatibility issues, we should fall back to the second method below.
  2. Make health check a separate plugin, optionally with an OpenSearch Dashboards page.

4. Requirements

4.1 Required Features:

  1. Health status of each plugin should be indicated by one of the three colors below:
    1. GREEN → Indicates the plugin is alive and running
    2. YELLOW → Indicates the plugin is alive, but has some issues like:
      1. A plugin might be in initialization stage
      2. A dependency might not be loaded
      3. A plugin might have index issues
      4. Any other misc. issue (plugin specific or node/cluster level)
    3. RED → Indicates the plugin is not alive
  2. There should be two types of health checks (a third, Readiness, is merged into both):
    1. Startup check - This check makes sure that the plugin has initialized (after a start/restart), doesn’t have any internal issues and is ready to take incoming requests
    2. Liveness check - This check makes sure that the plugin is still running post startup, doesn’t have any internal issues and is ready to take incoming requests
    3. Readiness check - This check makes sure that the plugin doesn’t have any internal issues and is ready for incoming requests (merged into the Startup and Liveness checks)
  3. There should be three types of triggers, to initiate the checks:
    1. Startup: Called during startup
    2. Time-Based: Called by time interval
    3. Manual: Called at will
  4. The health checks should be done at regular intervals (e.g., once every 2 hours) and could also be called at will.
  5. The health checks should run with minimal CPU/memory overhead, so that they do not impact performance.
  6. The health status of each plugin should be logged and stored after each check cycle. Storage should be limited to the past few days/weeks/months. Once old logs expire, they should be auto-deleted.
  7. The health status plugin should enable users to see all installed plugins along with their health status indicators.
  8. Apart from plugins, the health check page should also indicate the status of OpenSearch service or any other core service.
  9. If not auto-triggered, the health checks should be able to get triggered manually via API endpoint.
  10. The health checks triggered within each plugin should be completed in a stipulated time.
  11. The health checks should be performed at the node level.
  12. A user should be able to view detailed health check logs in a tabular format for a given time period.
  13. It should be mandatory for each plugin to provide health check endpoints.
  14. The health check page/plugin should have its own endpoints for self-health checks.
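
Requirement 10 (checks must complete within a stipulated time) could be enforced on the caller's side. The following is a minimal Python sketch, not part of the proposal: `check_with_deadline` is a hypothetical helper, and reporting a timed-out check as an "Error" response is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def check_with_deadline(check, timeout_seconds):
    """Run `check()` and return its response, or an Error response on timeout."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(check)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        # Assumption: a check that misses its deadline is reported as an error.
        return {"message": "Error", "description": "Health check timed out"}
    finally:
        # Do not block the caller while a stuck check is still running.
        executor.shutdown(wait=False)
```

Running the check on a worker thread keeps a hung plugin from stalling the health check page itself; the worker is left to finish on its own.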

4.2 Optional Features:

  1. The health check plots of past health metrics can be shown on the same page.
  2. Managing logs (insertion, deletion and permissions) would be easier if they are stored as an OpenSearch index. However, this can adversely affect cluster performance, as the I/O volume can be large. The impact depends on how many health check requests are made, how often auto-deletion runs and how often the logs are read from the index.

4.3 Required Configuration:

// Following are Configuration options required for health check page
// The defaults for each of them can be adjusted based on: 
// 1. Average time taken by each plugin to startup
// 2. Max response time taken by each plugin
// 3. Average number of times a plugin responds with "Waiting" message

// NOTE: The configurations are global, i.e. they apply to health check requests made to all plugins.

"initialDelaySeconds": Waiting period before sending first startup check request 
"pingWaitSeconds": Waiting period before re-pinging a plugin
"maxPingLimit": Max number of times a request call can be repeated 
"requestTimeOutSeconds": Max Time for a request completion 
"triggerIntervalSeconds": Time interval between automated health checks 

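To illustrate how these options might interact, here is a minimal Python sketch of a startup check that re-pings a plugin while it answers "Waiting" or "Initializing". The default values, the `run_startup_check` helper and the behavior once `maxPingLimit` is exhausted (the last response is returned as-is) are assumptions, not part of the proposal.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class HealthCheckConfig:
    # Field names mirror the configuration keys listed above;
    # the default values are placeholders, not recommendations.
    initial_delay_seconds: float = 5.0        # initialDelaySeconds
    ping_wait_seconds: float = 2.0            # pingWaitSeconds
    max_ping_limit: int = 5                   # maxPingLimit
    request_timeout_seconds: float = 10.0     # requestTimeOutSeconds
    trigger_interval_seconds: float = 7200.0  # triggerIntervalSeconds


# Messages meaning "not ready yet, ask again later", taken from the
# example responses in this proposal.
RETRY_MESSAGES = {"Waiting", "Initializing"}


def run_startup_check(
    ping: Callable[[float], Dict[str, str]],
    config: HealthCheckConfig,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Dict[str, str], int]:
    """Ping a plugin until it stops answering with a retry message.

    `ping` receives the request timeout and returns the response body.
    Returns the last response body and the number of pings sent.
    """
    sleep(config.initial_delay_seconds)
    response: Dict[str, str] = {}
    ping_count = 0
    while ping_count < config.max_ping_limit:
        response = ping(config.request_timeout_seconds)
        ping_count += 1
        if response.get("message") not in RETRY_MESSAGES:
            break
        if ping_count < config.max_ping_limit:
            sleep(config.ping_wait_seconds)
    return response, ping_count
```

Passing `ping` and `sleep` in as callables keeps the loop independent of any HTTP client and makes it easy to exercise without real delays.
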
5. Workflows

6. API Design

6.1 APIs provided by plugin

The APIs below use the Notebooks plugin as an example for making requests. Each health check type gets a separate API endpoint on each plugin. The API responses should contain the availability of dependencies (available or not available) and the health of indices (green, yellow or red) used by the plugin. If a plugin has no dependencies or indices, these arrays can be left empty in the response. The final "customMessage" object can be used by plugins to add information that may help in debugging plugin-specific issues.

  1. Startup Check
// Dashboards Plugin
GET api/health/startup/<plugin id> 

// OpenSearch Plugin
GET _plugins/_health/startup/<plugin id>

RESPONSE BODY
{
    "statusCode": 200,
    "body":{
            "message": "Waiting", // indicator is turned YELLOW, Health Check Page re-pings later
            "description": "Waiting for dashboard plugin to initialize",
            "dependencies": [
                              {"dependency1": "available"},
                              {"dependency2": "not available"}
                            ],
            "indices": [
                         {"index1": "green"},
                         {"index2": "green"}
                       ], 
            "customMessage":{}
          } 
}

RESPONSE BODY
{
    "statusCode": 200,
    "body":{
            "message": "Initializing", // indicator is turned YELLOW, Health Check Page re-pings later 
            "description": "Initializing internal components",
            "dependencies": [
                              {"dependency1": "available"},
                              {"dependency2": "not available"}
                            ],
            "indices": [
                         {"index1": "green"},
                         {"index2": "green"}
                       ], 
            "customMessage":{}
          } 
}

RESPONSE BODY
{
    "statusCode": 200,
    "body":{
            "message": "Initialized", // indicator is turned GREEN, plugin is accepting traffic
            "description": "Accepting Traffic",
            "dependencies": [
                              {"dependency1": "available"},
                              {"dependency2": "not available"}
                            ],
            "indices": [
                         {"index1": "green"},
                         {"index2": "yellow"}
                       ], 
            "customMessage":{}
          } 
}

RESPONSE BODY
{
    "statusCode": 200,
    "body":{
            "message": "Error", // Health Check Page keeps plugin indicator RED
            "description": "Internal error in starting the plugin", // Can be a custom error message 
            "dependencies": [
                              {"dependency1": "available"},
                              {"dependency2": "not available"}
                            ],
            "indices": [
                         {"index1": "green"},
                         {"index2": "yellow"}
                       ], 
            "customMessage":{}
          } 
}
  2. Liveness Check
// Dashboards Plugin
POST api/health/liveness/<plugin id> 

// OpenSearch Plugin
POST _plugins/_health/liveness/<plugin id>

REQUEST BODY
{
    "triggerType": "Manual" // Can be "Time-Based" 
}

RESPONSE BODY
{
    "statusCode": 200,
    "body":{
            "message": "Alive", // indicator is turned GREEN, plugin is accepting traffic
            "description": "Accepting Traffic",
            "dependencies": [
                              {"dependency1": "available"},
                              {"dependency2": "not available"}
                            ],
            "indices": [
                         {"index1": "green"},
                         {"index2": "green"}
                       ], 
            "customMessage":{}
          } 
}

RESPONSE BODY
{
    "statusCode": 200,
    "body":{
            "message": "Waiting", // indicator is turned YELLOW, Health Check Page re-pings later
            "description": "Waiting for dashboard plugin to initialize",
            "dependencies": [
                              {"dependency1": "available"},
                              {"dependency2": "not available"}
                            ],
            "indices": [
                         {"index1": "green"},
                         {"index2": "green"}
                       ], 
            "customMessage":{}
          } 
}

RESPONSE BODY
{
    "statusCode": 200,
    "body":{
            "message": "Error", // Health Check Page keeps plugin indicator RED
            "description": "Internal error in starting the plugin", // Can be a custom error message 
            "dependencies": [
                              {"dependency1": "available"},
                              {"dependency2": "not available"}
                            ],
            "indices": [
                         {"index1": "green"},
                         {"index2": "green"}
                       ], 
            "customMessage":{}
          } 
}
  3. Readiness Check (NOTE: Merged with startup and liveness checks)
POST api/notebooks/health/readiness 

REQUEST BODY
{
"triggerType": "Startup" // Can be "Manual" or "Time-Based", if checking post startup
}

RESPONSE BODY
{
"statusCode": 200,
"message": "Ready", // Health Check Page turns indicator GREEN
"body": "Accepting Traffic" 
}

RESPONSE BODY
{
"statusCode": 200,
"message": "Waiting", // Health Check re-pings
"body": "Waiting for other requests to be completed" 
}

RESPONSE BODY
{
"statusCode": 500,
"message": "Error", // Health Check Page turns plugin indicator RED
"body": "Internal error in plugin" // Can be a custom error message
}
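
The inline comments in the responses above imply a mapping from response message to indicator color. A minimal Python sketch of that mapping follows; the `indicator_for` helper is hypothetical, and treating a missing response or an unknown message as RED is an assumption.

```python
# Mapping taken from the inline comments in the example responses above.
MESSAGE_TO_INDICATOR = {
    "Initialized": "GREEN",
    "Alive": "GREEN",
    "Ready": "GREEN",
    "Waiting": "YELLOW",
    "Initializing": "YELLOW",
    "Error": "RED",
}


def indicator_for(response):
    """Return GREEN/YELLOW/RED for a plugin health check response.

    Assumption: a missing response (None) or an unknown message is
    treated as RED, since the plugin could not confirm it is alive.
    """
    if response is None:
        return "RED"
    # Startup/liveness responses nest the message under "body";
    # the readiness examples put it at the top level.
    body = response.get("body")
    if isinstance(body, dict):
        message = body.get("message")
    else:
        message = response.get("message")
    return MESSAGE_TO_INDICATOR.get(message, "RED")
```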

6.2 APIs provided by health check page

The health check page provides three types of APIs: Startup, Liveness and Readiness. These APIs make internal requests to all the plugins, using the plugin APIs explained above.

  1. Startup Check
GET api/health/startup

RESPONSE BODY
{
   "healthStatus": [
        {
            "timestamp": "2021-01-01T04:04:02Z", 
            "pluginId": "notebooksDashboards@1.0.0",
            "healthCheckType": "Startup",
            "triggerType": "Startup",
            "pingCount": 5,  
            "requestTimestamp": "2021-01-01T04:02:59Z",
            "responseTime": 1.01, 
            "statusCode": 500,
            "responseMessage": "Error",
            "responseDescription": "Internal error in starting the plugin",
            "indicator": "RED"  
        },
        {
            "timestamp": "2021-01-01T04:04:02Z", 
            "pluginId": "notificationsDashboards@1.0.0",
            "healthCheckType": "Startup",
            "triggerType": "Startup",
            "pingCount": 3,  
            "requestTimestamp": "2021-01-01T04:02:59Z",
            "responseTime": 1.01, 
            "statusCode": 200,
            "responseMessage": "Initialized",
            "responseDescription": "Accepting Traffic",
            "indicator": "GREEN"  
        }
    ]
}
  2. Liveness Check
POST api/health/liveness 

REQUEST BODY
{
    "triggerType": "Manual" // or can be "Time-Based" 
}

RESPONSE BODY
{
   "healthStatus": [
        {
            "timestamp": "2021-01-01T04:04:02Z", 
            "pluginId": "notebooksDashboards@1.0.0",
            "healthCheckType": "Liveness",
            "triggerType": "Manual",
            "pingCount": 5,  
            "requestTimestamp": "2021-01-01T04:02:59Z",
            "responseTime": 1.01, 
            "statusCode": 500,
            "responseMessage": "Error",
            "responseDescription": "Internal error in starting the plugin",
            "indicator": "RED"  
        },
        {
            "timestamp": "2021-01-01T04:04:02Z", 
            "pluginId": "notificationsDashboards@1.0.0",
            "healthCheckType": "Liveness",
            "triggerType": "Manual",
            "pingCount": 3,  
            "requestTimestamp": "2021-01-01T04:02:59Z",
            "responseTime": 1.01, 
            "statusCode": 200,
            "responseMessage": "Alive",
            "responseDescription": "Accepting Traffic",
            "indicator": "GREEN"  
        }
    ]
}
  3. Readiness Check (NOTE: Merged with startup and liveness checks)
POST api/health/readiness  

REQUEST BODY
{
    "triggerType": "Time-Based" // Or can be "Manual" or "Startup" 
}

RESPONSE BODY
{
   "healthStatus": [
        {
            "timestamp": "2021-01-01T04:04:02Z", 
            "pluginId": "notebooksDashboards@1.0.0",
            "healthCheckType": "Readiness",
            "triggerType": "Time-Based",
            "pingCount": 5,  
            "requestTimestamp": "2021-01-01T04:02:59Z",
            "responseTime": 1.01, 
            "statusCode": 500,
            "responseMessage": "Error",
            "responseBody": "Internal error in starting the plugin",
            "indicator": "RED"  
        },
        {
            "timestamp": "2021-01-01T04:04:02Z", 
            "pluginId": "notificationsDashboards@1.0.0",
            "healthCheckType": "Readiness",
            "triggerType": "Time-Based",
            "pingCount": 3,  
            "requestTimestamp": "2021-01-01T04:02:59Z",
            "responseTime": 1.01, 
            "statusCode": 200,
            "responseMessage": "Ready",
            "responseBody": "Accepting traffic",
            "indicator": "GREEN"  
        }
    ]
}
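
As a rough illustration of how the page-level APIs could assemble their responses from the per-plugin results, here is a minimal Python sketch. The `build_health_status` helper and the shape of its `results` input are hypothetical; only the output field names come from the examples above.

```python
from datetime import datetime, timezone


def build_health_status(results, health_check_type, trigger_type, now=None):
    """Build the aggregated `healthStatus` payload.

    `results` is a list of dicts, one per plugin, with the hypothetical
    keys: pluginId, pingCount, requestTimestamp, responseTime,
    statusCode, message, description, indicator.
    """
    now = now or datetime.now(timezone.utc)
    timestamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "healthStatus": [
            {
                "timestamp": timestamp,
                "pluginId": r["pluginId"],
                "healthCheckType": health_check_type,
                "triggerType": trigger_type,
                "pingCount": r["pingCount"],
                "requestTimestamp": r["requestTimestamp"],
                "responseTime": r["responseTime"],
                "statusCode": r["statusCode"],
                "responseMessage": r["message"],
                "responseDescription": r["description"],
                "indicator": r["indicator"],
            }
            for r in results
        ]
    }
```

The returned dictionary can be serialized with `json.dumps`, which also guarantees a syntactically valid JSON body.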

7. Log Fields & Example

Each health check request should be stored as a log entry. The logs will give users a detailed record of plugin availability and access to the health check history.

[
    {
        "timestamp": "2021-01-01T04:04:02Z", 
        "nodeId":"USpTGYaBSIKbgSUJR2Z9lg", 
        "pluginId": "notebooksDashboards@1.0.0",
        "startupTime": "2021-01-01T03:08:48Z",
        "lastAlive": "2021-01-01T03:10:00Z",
        "lastReady": "2021-01-01T03:10:01Z",
        "healthCheckType": "Liveness",
        "triggerType": "Time-Based",
        "pingCount": 3,  
        "requestTimestamp": "2021-01-01T04:02:59Z",
        "responseTime": 1.12, 
        "responseStatusCode": 200,
        "responseMessage": "Alive",
        "responseDescription": "Accepting Traffic",
        "indicator": "GREEN" 
    },
    {
        "timestamp": "2021-01-01T04:04:03Z",
        "nodeId":"USpTGYaBSIKbgSUJR2Z9lg", 
        "pluginId": "notificationsDashboards@1.0.0",
        "startupTime": "2021-01-01T03:08:49Z",
        "lastAlive": "2021-01-01T03:10:01Z",
        "lastReady": "2021-01-01T03:10:02Z",
        "healthCheckType": "Startup",
        "triggerType": "Startup",
        "pingCount": 1,
        "requestTimestamp": "2021-01-01T04:03:00Z",
        "responseTime": 1.01,
        "responseStatusCode": 200,
        "responseMessage": "Initialized",
        "responseDescription": "Accepting Traffic",
        "indicator": "GREEN" 

    }
]
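
Requirements 6 and 12 (auto-deleting expired logs and viewing logs for a given time period) could be served by simple filters over entries shaped like the ones above. This is a minimal Python sketch over an in-memory list; the helper names are hypothetical and the storage backend is deliberately left out.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(value):
    """Parse the ISO-8601 UTC timestamps used in the log entries."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def prune_expired(logs, retention, now):
    """Keep only entries whose timestamp is within `retention` of `now`."""
    cutoff = now - retention
    return [entry for entry in logs if parse_ts(entry["timestamp"]) >= cutoff]


def entries_between(logs, start, end, plugin_id=None):
    """Return entries with start <= timestamp <= end, optionally for one plugin."""
    return [
        entry
        for entry in logs
        if start <= parse_ts(entry["timestamp"]) <= end
        and (plugin_id is None or entry["pluginId"] == plugin_id)
    ]
```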

8. Future

9. References

  1. Status.io: https://status.status.io/
  2. New Relic Blog - Kubernetes Health Checks: https://newrelic.com/blog/how-to-relic/kubernetes-health-checks
  3. AWS ECS health check: https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_HealthCheck.html
  4. Kubernetes Health Probes: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/

10. Appendix

10.1 Integrating current Dashboard Status API with above proposed health check APIs

10.2 Is the startup/liveness/readiness model inspired by another product? How do other systems similar to OpenSearch and OpenSearch Dashboards deal with plugin health?

10.3 How will this API integrate with other monitoring solutions that I may be using for my operations?

seraphjiang commented 3 years ago

Assuming the proposal is for Dashboards Plugin health only, not for OpenSearch Plugin.

Today, the Dashboards status API aggregates the status of all plugins. The status page just invokes the status API. It is not clear what the new Kibana status API response is expected to look like in this proposal.

ps48 commented 3 years ago

@seraphjiang Thank you so much for the feedback 😄. I added an Appendix section on integrating health checks with the current Dashboards status API. In short, the status API would follow the "manual trigger" workflow and internally call the health check page APIs for liveness and readiness.

dblock commented 3 years ago

I like the spirit of the problem, and have lots of questions.

ps48 commented 3 years ago

Thanks @dblock for such detailed and intriguing questions. Below are my responses:

  1. What are some anecdotes of how lack of plugin health has negatively affected operations? How would this proposal address those?

  2. What are the motivations to solve this in identical ways for both OpenSearch and OpenSearch Dashboards?

    • From a user point of view, a common interface is easier to consume.
  3. Is the startup/liveness/readiness model inspired by another product? How do other systems similar to OpenSearch and OpenSearch Dashboards deal with plugin health?

    • The checks are inspired by Kubernetes health checks. I haven’t seen any similar system with well-designed plugin health checks. But there are a few interesting things: Splunk has a third-party plugin health status dashboard (internally uses REST APIs) to monitor instances, data feeds, resource usage, available indices, etc. This is something we could do in the future with our health check plugins. Another search platform, Algolia, has a monitoring system for its SaaS offering, which implements HTTPS endpoints to monitor cluster-level status requests. Datadog has a similar status page and a separate status page for third-party integrations.
  4. Why is health a REST API? Are there, or will there ever be, plugins that don't expose a REST interface at all?

    • OpenSearch is already a RESTful application, hence it makes sense to keep health checks of plugins as REST APIs. As far as I know, currently all plugins have REST endpoints. Any new plugin coming should at least provide the health check APIs.
  5. How does the new API interact with existing status APIs?

    • I have explained this in Appendix 10.1. Do let me know if you see something missing there or want to know some specifics.
  6. Will existing plugins have to do a lot of work to add health checks or are they going to inherit a basic version automagically in some release of OpenSearch and OpenSearch Dashboards?

    • Exposing API endpoints shouldn’t be a lot of work. I would say the major work will be in implementing the health checks and knowing where each plugin tends to break. This will vary from plugin to plugin. Some plugins have access to indices, some depend on other internal plugins, and in the future some may have external dependencies as well. To bring in some automagic-ness, we can provide a library or boilerplate code to begin with.
  7. There's an assumption that all plugins are installed everywhere. This may not always be true. The cluster can also be very heterogeneous.

    • This is a great point. I understand that each plugin can be installed on individual nodes, that each node can have different roles in a cluster, and that Dashboards can be connected to a single node or to the whole cluster. What I don’t understand is whether plugins on different nodes can depend on each other or access cross-node indices in the same cluster. I need to look into this more. Is there some documentation or guidance you can point me to?
  8. There's the idea that plugins are independent and should have their own status. Is this what devops engineers really want?

    • Plugins aren’t independent, but they run independently. What I mean is that if the Notifications plugin fails, Alerting will still keep running; if Notebooks’ internal index has an issue, the notebooks-dashboard plugin still keeps running. Only when these plugins try to reach their dependency or index do they start to emit error traces. This is where our health check page can come in handy. It would already show the Alerting and Notebooks plugins as Yellow and carry a detailed message from their health check responses. However, I would love to hear more from DevOps engineers too. It would be great if we could ask them to comment on this thread and share their experience.
  9. A plugin might be dead on one node and very healthy on another. Do users really want to see plugin health? Node health? Will averaging health be a misleading metric?

    • This again comes down to cluster-level vs. node-level details (similar to “question 7”). Averaging is surely misleading; node level would make more sense for our health checks.
  10. How will this API integrate with other monitoring solutions that I may be using for my operations?

    • Users who have their own monitoring solutions today can easily consume these health check APIs. They can create a health monitoring dashboard of their own and call health checks at will using these APIs. This assumes their current solutions can integrate with new REST endpoints.
  11. It looks like the API wants to be polled. This is potentially bad because in a dead state things start taking a long time, timing out, hanging, etc. I'd prefer my system to send events somewhere, and only poll for UX.

    • This is a really great design question. I initially thought of a pub-sub model, where plugins publish their status and messages to logs. The logs could be stored as files, an index or queues, and the UX would just read from the logs. But this model creates a new dependency on log permissions and storage. If plugins have an issue writing/indexing their statuses, there would be no way to get the status results. I would love to discuss this more.
anirudha commented 3 years ago

M1: design review with feedback from other plugins and a PoC to validate with 1 plugin, 9/15
M2: work with all plugin teams to adopt the lib and add implementation, 10/15
M3: release candidate, 10/30

@ps48 what do you say?

anirudha commented 3 years ago

I would suggest we start with the SQL plugin or the Notebooks plugin as a PoC?

ps48 commented 3 years ago

For the PoC let's start with Notebooks; this would help create checks on both OpenSearch and Dashboards endpoints.

ylwu-amzn commented 3 years ago

Just summarizing our talk:

  1. Is it possible to merge the startup and readiness checks into one request? We can tell from the result whether startup/readiness is done. That saves one extra request.
  2. How can we push community users to follow this practice and add a health check API, especially for existing plugins that want to migrate to OpenSearch? Is it possible to make the health check API stable, like api/health/<plugin id>, so that for new plugins we can easily check whether they support health checks?
  3. Health checks should not add significant extra load; heavy, deep health checks should be throttled.
  4. How should health check logs be stored? We should discuss the impact further: if health check logs grow fast, will that affect cluster performance?
ps48 commented 3 years ago

Thank you so much @ylwu-amzn, for these pointers. I'll talk to some more plugin owners and then finally merge these comments to the design document.

ps48 commented 2 years ago

Merged the above comments and various points from plugin owners into the design.

skkosuri-amzn commented 2 years ago

A few more:

  1. Will each plugin have a separate API? If yes, are we creating too many APIs for this?
  2. What's the security model for these APIs? (Who is the targeted user? What are their permissions?)