Open ps48 opened 3 years ago
Assuming the proposal is for Dashboards Plugin health only, not for OpenSearch Plugin.
Today, the Dashboards status API aggregates the status of all plugins, and the status page just invokes the status API. It is not clear what the Kibana status API response is expected to look like under the new proposal.
@seraphjiang Thank you so much for the feedback 😄, I added an Appendix section for integration of health checks with current Dashboards status API. But in short, the status API would follow the "manual trigger" workflow and internally call the health check page APIs for liveness and readiness.
I like the spirit of the problem, and have lots of questions.
Thanks @dblock for such detailed and intriguing questions. Below are my responses:
What are some anecdotes of how lack of plugin health has negatively affected operations? How would this proposal address those?
What are the motivations to solve this in identical ways for both OpenSearch and OpenSearch Dashboards?
Is the startup/liveliness/readiness model inspired by another product? How do other systems similar to OpenSearch and OpenSearch Dashboards deal with plugin health?
Why is health a REST API? Are there, or will there ever be, plugins that don't expose a REST interface at all?
How does the new API interact with existing status APIs?
Will existing plugins have to do a lot of work to add health checks or are they going to inherit a basic version automagically in some release of OpenSearch and OpenSearch Dashboards?
There's an assumption that all plugins are installed everywhere. This may not always be true. The cluster can also be very heterogeneous.
There's the idea that plugins are independent and should have their own status. Is this what devops engineers really want?
A plugin might be dead on one node and very healthy on another. Do users really want to see plugin health? Node health? Will averaging health be a misleading metric?
How will this API integrate with other monitoring solutions that I may be using for my operations?
It looks like the API wants to be polled. This is potentially bad because in a dead state things start taking a long time, timing out, hanging, etc. I'd prefer my system to send events somewhere, and only poll for UX.
M1: design review with feedback from other plugins and a PoC to validate with 1 plugin, 9/15
M2: work with all plugin teams to adopt the lib and add implementation, 10/15
M3: release candidate, 10/30
@ps48 what do you say?
I would suggest we start with the SQL plugin or the Notebooks plugin as a PoC.
For the PoC let's start with Notebooks, this would help to create checks on both OpenSearch and Dashboards endpoints.
Just summarizing our talk
api/health/<plugin id>
, so for new plugins we can easily check whether they support health checks or not.
Thank you so much @ylwu-amzn for these pointers. I'll talk to some more plugin owners and then merge these comments into the design document.
Merged the above comments and different points from plugin owners in the design.
Few more
OpenSearch & Dashboard Plugins Health Checks
1. Introduction
The Health Check page and APIs will be a one-stop shop for checking the health status of all OpenSearch & OpenSearch-Dashboards plugins. The page will provide health status indicators (Green/Yellow/Red) for each plugin. The API endpoints will let users check each plugin and get its detailed health status. Further, the health check responses from each plugin will be stored as logs for analysis.
2. Motivation
As OpenSearch keeps adding more plugins (community and core team driven), it becomes hard to manage and coordinate them. Since each plugin runs independently, the other plugins keep running even if one of them fails. The failure of one plugin may later lead to a major failure of the cluster (e.g. if a major plugin like notifications fails, all plugins that depend on it are affected). Therefore, it would be useful to see the recent health check history of individual plugins, for a better drill-down into problems in an OpenSearch cluster. This will help to find the root cause of instance failures when they were caused by a plugin failure. The health check page/logs/APIs would allow users to:
2.1 Target Users
Our target users are DevOps Teams and Engineers who manage/support deployed OpenSearch Clusters.
2.2 How is it Different from current status page in Kibana?
The current status page in Kibana provides health indicators for each plugin and a single API endpoint for accessing the plugin status. This page shows indicators based on the availability of each plugin's dependencies. The following are issues with the current status page:
3. Placement of Health Check page
The Health Check page can be integrated in our current OpenSearch setup in either of the two ways:
4. Requirements
4.1 Required Features:
Readiness check - This check makes sure that the plugin doesn’t have any internal issues and is ready for incoming requests (merged into the Startup and Liveness checks).
4.2 Optional Features:
4.3 Required Configuration:
5. Workflows
The health check page, as part of OpenSearch-Dashboards, should be responsible for sending health check requests, logging responses and updating indicators & plots.
The plugins should be responsible for providing API endpoints and doing their internal checks before responding to any health check requests.
If the response to any health request is “Waiting” or “Initializing”, the health check plugin re-pings after every configured “pingWaitSeconds”. But after a given number of tries, configured as “maxPingLimit”, the page will mark the plugin indicator as RED and stop making requests to the plugin.
Initialization Trigger:
… “initialDelaySeconds”.
Timely/Manual trigger:
If the response from a plugin is “alive”, then the indicator stays GREEN.
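The re-ping rule described in this section can be sketched as a small state transition function. This is only an illustration under stated assumptions: `pingWaitSeconds` and `maxPingLimit` come from the proposal, while the type names, the `onPingResponse` function, the YELLOW indicator while a plugin is still starting, and the treatment of any other response as RED are hypothetical choices made for the example.

```typescript
// Hypothetical names throughout; only pingWaitSeconds and maxPingLimit
// are taken from the proposal text.
type Indicator = "GREEN" | "YELLOW" | "RED";

interface RetryConfig {
  pingWaitSeconds: number; // delay before the next re-ping
  maxPingLimit: number; // number of tries before giving up
}

interface PingState {
  attempts: number; // pings answered so far for this plugin
  indicator: Indicator;
  nextPingInSeconds: number | null; // null means: stop pinging
}

const initialPingState: PingState = {
  attempts: 0,
  indicator: "YELLOW", // assumption: unknown health is shown as YELLOW
  nextPingInSeconds: 0,
};

// Compute the next state after one health check response arrives.
function onPingResponse(
  prev: PingState,
  response: string,
  config: RetryConfig
): PingState {
  const attempts = prev.attempts + 1;
  if (response === "alive") {
    return { attempts, indicator: "GREEN", nextPingInSeconds: null };
  }
  const stillStarting = response === "Waiting" || response === "Initializing";
  if (stillStarting && attempts < config.maxPingLimit) {
    // Plugin is not ready yet: schedule another ping.
    return {
      attempts,
      indicator: "YELLOW",
      nextPingInSeconds: config.pingWaitSeconds,
    };
  }
  // Retry budget exhausted, or an unexpected response (assumption): mark RED.
  return { attempts, indicator: "RED", nextPingInSeconds: null };
}
```

Keeping the transition as a pure function leaves the actual scheduling (timers, HTTP calls) to the health check page, so the retry policy can be tested without a running cluster.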
6. API Design
6.1 APIs provided by plugin
The APIs below use the Notebooks plugin as an example for making requests. Each health check type gets a separate API endpoint on each plugin. The API responses should contain the availability of dependencies (available or not available) and the health of indices (green, yellow or red) used by the plugin. If a plugin has no dependencies or uses no indices, these arrays can be left empty in the response. The final "customMessage" object can be used by plugins to add additional information that may be helpful in debugging plugin-specific issues.
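To make the response contents described above concrete, here is one possible shape written as TypeScript types. Only the ideas of a dependencies array, an indices array and a "customMessage" object come from the proposal; every field name, the `summarizePluginHealth` helper and its colour rule are assumptions for illustration, not the agreed schema.

```typescript
// Hypothetical response shape for a per-plugin health check endpoint.
type IndexHealth = "green" | "yellow" | "red";

interface DependencyStatus {
  name: string;
  available: boolean; // "available" or "not available" in the proposal
}

interface IndexStatus {
  name: string;
  health: IndexHealth;
}

interface PluginHealthResponse {
  pluginId: string;
  dependencies: DependencyStatus[]; // empty when the plugin has none
  indices: IndexStatus[]; // empty when the plugin uses none
  customMessage: Record<string, unknown>; // free-form debugging details
}

// Assumed rule: a missing dependency or a red index makes the plugin red,
// a yellow index makes it yellow, otherwise it is green.
function summarizePluginHealth(response: PluginHealthResponse): IndexHealth {
  const missingDependency = response.dependencies.some((d) => !d.available);
  const redIndex = response.indices.some((i) => i.health === "red");
  if (missingDependency || redIndex) {
    return "red";
  }
  if (response.indices.some((i) => i.health === "yellow")) {
    return "yellow";
  }
  return "green";
}

// Example payload for the Notebooks plugin (index and dependency names are placeholders).
const exampleResponse: PluginHealthResponse = {
  pluginId: "notebooks",
  dependencies: [{ name: "example-dependency", available: true }],
  indices: [{ name: "example-notebooks-index", health: "yellow" }],
  customMessage: { note: "placeholder detail" },
};
```

With this rule, `summarizePluginHealth(exampleResponse)` evaluates to `"yellow"`, because the only index is yellow and the dependency is available.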
Readiness Check
NOTE: Merged with startup and liveness checks
6.2 APIs provided by health check page
The health check page provides three types of APIs: Startup, Liveness & Readiness. These APIs make internal requests to all the plugins (explained above).
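One way the page-level APIs could combine the per-plugin results is sketched below. The types and the `aggregateHealth` function are hypothetical. The sketch takes the worst status instead of an average and keeps the per-plugin detail, which is one possible answer to the concern raised in the discussion that an averaged health value could be misleading.

```typescript
// Hypothetical aggregation of per-plugin results into one page-level response.
type Status = "green" | "yellow" | "red";

interface PluginResult {
  pluginId: string;
  status: Status;
}

interface AggregatedHealth {
  overall: Status; // worst status seen across all plugins
  plugins: Record<string, Status>; // per-plugin detail is preserved
  unhealthy: string[]; // ids of plugins that are not green
}

const severity: Record<Status, number> = { green: 0, yellow: 1, red: 2 };

function aggregateHealth(results: PluginResult[]): AggregatedHealth {
  let overall: Status = "green";
  const plugins: Record<string, Status> = {};
  const unhealthy: string[] = [];
  for (const result of results) {
    plugins[result.pluginId] = result.status;
    if (severity[result.status] > severity[overall]) {
      overall = result.status;
    }
    if (result.status !== "green") {
      unhealthy.push(result.pluginId);
    }
  }
  return { overall, plugins, unhealthy };
}
```

A caller that only needs a single indicator can read `overall`, while a drill-down view can use `plugins` and `unhealthy`.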
Readiness Check
NOTE: Merged with startup and liveness checks
7. Log Fields & Example
Each health check request should be stored as a log entry. The logs will give users a detailed feed of plugin availability and access to the health check history.
8. Future
9. References
10. Appendix
10.1 Integrating current Dashboard Status API with above proposed health check APIs
10.2 Is the startup/liveliness/readiness model inspired by another product? How do other systems similar to OpenSearch and OpenSearch Dashboards deal with plugin health?
10.3 How will this API integrate with other monitoring solutions that I may be using for my operations?