Open mike-gangl opened 1 year ago
Simple health check response format to be supplied by healthResponse
{
  "services": [
    {
      "service": "airflow",
      "landingPage": "https://unity.com/project/venue/processing/ui",
      "healthChecks": [
        {
          "status": "HEALTHY",
          "date": "2024-04-09T18:01:08Z"
        }
      ]
    },
    {
      "service": "jupyter",
      "landingPage": "https://unity.com/project/venue/ads/jupyter",
      "healthChecks": [
        {
          "status": "HEALTHY",
          "date": "2024-04-09T18:01:08Z"
        }
      ]
    },
    {
      "service": "otherService",
      "landingPage": "https://unity.com/project/venue/other_service",
      "healthChecks": [
        {
          "status": "UNHEALTHY",
          "date": "2024-04-09T18:01:08Z"
        }
      ]
    }
  ]
}
In the future, we might add more detail to a healthcheck object, such as the date of the check, an error, or a subgraph of other dependencies (database health, API health).
This should also accommodate the 'historical' record we envision in the future, where multiple healthchecks can be shown (e.g. daily health) for a given service.
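A minimal sketch of how a client might read this response, assuming the payload keeps exactly the shape shown above. The function names `latest_status` and `unhealthy_services` are hypothetical and not part of any Unity library. Because the ordering of `healthChecks` is not specified here, the sketch picks the most recent check by its `date` rather than by position, and it treats a service with no checks as not healthy (also an assumption).

```python
from datetime import datetime


def parse_date(value):
    # The sample dates are UTC timestamps with a trailing "Z".
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def latest_status(service):
    """Return the status of the most recent health check, or None if there are none."""
    checks = service.get("healthChecks", [])
    if not checks:
        return None
    latest = max(checks, key=lambda check: parse_date(check["date"]))
    return latest["status"]


def unhealthy_services(response):
    """List the names of services whose most recent check is not HEALTHY."""
    return [
        service["service"]
        for service in response.get("services", [])
        if latest_status(service) != "HEALTHY"
    ]


# Trimmed copy of the sample response above.
sample = {
    "services": [
        {
            "service": "airflow",
            "landingPage": "https://unity.com/project/venue/processing/ui",
            "healthChecks": [{"status": "HEALTHY", "date": "2024-04-09T18:01:08Z"}],
        },
        {
            "service": "otherService",
            "landingPage": "https://unity.com/project/venue/other_service",
            "healthChecks": [{"status": "UNHEALTHY", "date": "2024-04-09T18:01:08Z"}],
        },
    ]
}

print(unhealthy_services(sample))  # ['otherService']
```

If historical records are added later, the same `latest_status` helper would keep working as long as each entry still carries a `date`.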
Would like to see:
Think about: Authorization: who owns the username/password for hitting an authenticated endpoint? Multiple components for a service area.
Future: historical records and tracking 'events'.
@mike-gangl I updated the diagram, some descriptions, and some work tickets in the description above.
Updated to include SSM naming parameter:
/unity/healthCheck/<MARKETPLACE_ITEM>/<COMPONENT_NAME>
@mike-gangl NOTE: the diagram above is slightly off at this time (still needs an update to have
Regarding the sample JSON Mike posted earlier, I would like to suggest some minor changes.
Can we add a title field that is used to display the name of the service in the UI navbar and in the Health Dashboard data grid?
landingpage URLs should include the protocol, https://..
Also, in the list of healthchecks, can it be assumed that these will be stored in descending order, i.e. the most recent health check is the zeroth element in that array?
@hargitayjpl - see my comment above and the new format of the health check response you'll be writing. camelCase is really the only change, as I think you'll simply pass whatever healthcheck value was supplied by the application.
We can use title and service interchangeably as long as we're happy using "service" as the "English" label for the service.
For the keys, if we use camelCase then we can parse them into title case (e.g., "Camel Case")
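As a rough illustration of that parsing step, here is a small sketch that turns a camelCase key into a title-case label. The helper name `camel_to_title` is hypothetical, and the rule (insert a space wherever an uppercase letter follows a lowercase letter or digit) is an assumption about what "title case" should mean for these keys.

```python
import re


def camel_to_title(key):
    """Convert a camelCase key such as 'landingPage' into 'Landing Page'."""
    # Insert a space at each boundary between a lowercase letter/digit and an uppercase letter.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", key)
    # Capitalize only the first character so the rest of the key is left as written.
    return spaced[:1].upper() + spaced[1:]


for key in ("service", "landingPage", "healthChecks"):
    print(camel_to_title(key))
```

Acronym-style keys (for example, a run of consecutive capitals) are not split by this rule and would need their own handling.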
@mike-gangl @hargitayjpl I think we really need these two formats:
Shared Services Account components:
/unity/healthCheck/shared-services/<MARKETPLACE_ITEM>/<COMPONENT_NAME>
Venue account components:
/unity/healthCheck/<PROJECT>/<VENUE>/<MARKETPLACE_ITEM>/<COMPONENT_NAME>
Brandon and I discussed this in a meeting this morning, and we want the health components namespaced by the project/venue they are in. If we simply use something like /unity/healthcheck/airflowUI, the name will be ambiguous and will cause data overwrite issues.
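The two naming formats above could be generated by one helper, sketched below. The function name `health_check_param` and its arguments are hypothetical; the sketch only builds the parameter name as a string and does not call SSM. It assumes that omitting both project and venue means a Shared Services Account component.

```python
def health_check_param(marketplace_item, component_name, project=None, venue=None):
    """Build a health check SSM parameter name in one of the two proposed formats."""
    if project is None and venue is None:
        # Shared Services Account component
        scope = ["shared-services"]
    elif project and venue:
        # Venue account component, namespaced by project and venue
        scope = [project, venue]
    else:
        raise ValueError("project and venue must be given together")

    parts = ["unity", "healthCheck", *scope, marketplace_item, component_name]
    for part in parts:
        # Reject empty segments and embedded slashes, which would change the hierarchy.
        if not part or "/" in part:
            raise ValueError(f"invalid path segment: {part!r}")
    return "/" + "/".join(parts)


print(health_check_param("airflow", "airflowUI"))
print(health_check_param("airflow", "airflowUI", project="myproject", venue="dev"))
```

Here `airflow`, `airflowUI`, `myproject`, and `dev` are placeholder values. Two venues registering the same component produce distinct names, which is the overwrite problem the namespacing is meant to avoid.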
This is close to being complete. Lambda and crons for proof of concept, management API is up and exposed.
Related UI work: https://github.com/unity-sds/unity-ui/pull/32
Waiting for the U-CS health-endpoint to be ready. Placeholder JSON is being used for the draft implementations of the clients:
https://github.com/unity-sds/unity-py/issues/86 https://github.com/unity-sds/unity-ui/issues/25
@brianlee731 should we move this to the current release? This is almost done and I think some other service areas need to integrate into what U-CS built.
Managed Service Monitoring
"As an operator, I want to monitor the health of various Unity services"
NOTE: S3 bucket is defined here.
An example of the health dashboard from AWS:
per venue, you'd have something like:
Marketplace options:
Each Service needs a health endpoint
UI team to design endpoint to show health dashboard https://apigateway-for-unity/project/venue/health
Question to answer during planning:
Health check SSM params should be defined in the project venue as:
/unity/healthCheck/<MARKETPLACE_ITEM>/<COMPONENT_NAME>
Acceptance Criteria
Work Tickets
Link to work tickets required to implement the epic
Dependencies
Other epics or outside tickets required for this to work
Associated Risks
links to risk issues associated with this epic
Out of scope but future work:
previous
This overlaps with the idea of Common metrics/logs aggregation service #92.
How do we plan to monitor the deployed managed services? I think that to evolve into a full multi-tenant system we need to make sure we are monitoring:
Health of a service
Uptime of a service
Degradation (health?): if it's responding to requests, how fast does it respond?
Think of a single console that can monitor all of the managed services across multiple accounts. What does this look like? How are logs/metrics/events propagated to the "central" dashboard? Or does the dashboard reach into different accounts to view things?