Managed Service Monitoring

mike-gangl commented 1 year ago


"As an operator, i want to monitor the health of various Unity services"

NOTE: S3 bucket is defined here.

An example of the health dashboard from AWS:

[Screenshot: example AWS health dashboard]

Per venue, you'd have something like:

Service	Current
JupyterHub	🟢
Airflow	🟢
Data Catalog	🟢
Application Catalog	🟢

Market place options:

Jupyter (integrate)
SPS (integrate)
DS (buckets)
SDAP (integrate)

Each Service needs a health endpoint

includes Data Catalog and Application Catalog, app-pack-gen health metrics

UI team to design endpoint to show health dashboard https://apigateway-for-unity/project/venue/health

Questions to answer during planning:

Where is the health dashboard hosted: on the mgmt console or on the UIUX dashboard?
How do we add the health check endpoint to the marketplace description/output?
How will the mgmt console query the health endpoint?
What format will the health endpoint return for the dashboard to consume?
What permissions issues arise in accessing the health endpoints?

Health check SSM params should be defined in the project venue as: /unity/healthCheck/<MARKETPLACE_ITEM>/<COMPONENT_NAME>

Acceptance Criteria

JupyterHub is integrated into the marketplace
- includes health endpoint
Airflow/SPS is integrated into the marketplace
- includes health endpoint
U-DS data bucket is integrated into the marketplace
- includes health endpoint
Shared Service health endpoints
- takes the form: /unity/healthCheck/shared-services/
- Data Catalog
- Algorithm Catalog
- App-pack-gen (?)
- Process Mapper health check
A service can query health endpoints and generate or store json health response (e.g. every 5 minutes)
- Don't overengineer this: it could simply be a Lambda that queries the health endpoints and writes a JSON document to a bucket every 5 minutes. We can optimize later.
Unity-py client to request health status from venue endpoint and return results
UIUX developed dashboard for displaying health of existing services
- should respond to dynamic content
- should read the above generated json file (from a bucket, webservice, etc)
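
A minimal sketch of the "query health endpoints and generate a JSON health response" criterion, using only the Python standard library. The function names, the shape of the `endpoints` mapping (`landingPage`, `healthUrl`), and the rule "HTTP 200 means HEALTHY" are assumptions for illustration; the actual Lambda may pass through whatever status the application reports. Writing the result to a bucket is left out.

```python
import json
from datetime import datetime, timezone
from urllib.request import urlopen


def fetch_status(url, timeout=5):
    """Return "HEALTHY" if the endpoint answers HTTP 200, else "UNHEALTHY".

    Assumption: a plain 200 response is treated as healthy.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            return "HEALTHY" if response.status == 200 else "UNHEALTHY"
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError and socket timeouts;
        # ValueError covers malformed URLs.
        return "UNHEALTHY"


def build_health_document(endpoints, check=fetch_status, now=None):
    """Build the health document for a {service name: urls} mapping.

    `endpoints` uses the hypothetical keys "landingPage" and "healthUrl".
    `now`, if given, must be a UTC datetime.
    """
    moment = now or datetime.now(timezone.utc)
    timestamp = moment.strftime("%Y-%m-%dT%H:%M:%SZ")
    services = []
    for name, urls in endpoints.items():
        services.append({
            "service": name,
            "landingPage": urls["landingPage"],
            "healthChecks": [
                {"status": check(urls["healthUrl"]), "date": timestamp}
            ],
        })
    return {"services": services}


def to_json(document):
    """Serialize the document; the caller decides where to store it."""
    return json.dumps(document, indent=2)
```

A scheduled job could call `build_health_document(...)` every 5 minutes and store the output of `to_json(...)` wherever the dashboard reads from.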

Work Tickets

Link to work tickets required to implement the epic

[ ] TBC (to be created)
[x] https://github.com/unity-sds/unity-cs/issues/367
[x] https://github.com/unity-sds/unity-cs/issues/370
[x] https://github.com/unity-sds/unity-cs/issues/374
[x] https://github.com/unity-sds/unity-data-services/issues/351
[x] https://github.com/unity-sds/unity-cs/issues/381

Dependencies

Other epics or outside tickets required for this to work

Associated Risks

links to risk issues associated with this epic

[ ] TBC

Out of scope but future work:

Historical health (last 7 days)
Alerting a user based on health
Aggregating health - e.g. another dashboard that can monitor health across all venues (e.g. MMO)
Degradation vs healthy/not healthy
Measuring 'uptime' of a service for SLA metrics

{
  "services": [
    {
      "service": "airflow",
      "landingPage":"https://unity.com/project/venue/processing/ui",
      "healthChecks": [
        {
          "status": "HEALTHY",
          "date": "2024-04-09T18:01:08Z"
        }
      ]
    },
    {
      "service": "jupyter",
      "landingPage":"https://unity.com/project/venue/ads/jupyter",
      "healthChecks": [
        {
          "status": "HEALTHY",
          "date": "2024-04-09T18:01:08Z"
        }
      ]
    },
    {
      "service": "otherService",
      "landingPage":"https://unity.com/project/venue/other_service",
      "healthChecks": [
        {
          "status": "UNHEALTHY",
          "date": "2024-04-09T18:01:08Z"
        }
      ]
    }
  ]
}

In the future, we might add more detail to a healthcheck object, like date of check, error, or a subgraph of other dependencies (database health, api health).

This should also accommodate the 'historical' record we envision in the future, where multiple healthchecks can be shown (e.g. daily health) for a given service.
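
A rough sketch of how a client (the dashboard or a unity-py call) might turn the document above into the per-venue status table. It picks the newest entry by `date` and does not assume any ordering of `healthChecks`. The function names, the 🔴/⚪ icons, and the `UNKNOWN` fallback are assumptions, not part of the agreed format.

```python
from datetime import datetime

# Only the green icon appears in the example table; the others are assumed.
ICONS = {"HEALTHY": "🟢", "UNHEALTHY": "🔴"}


def parse_date(value):
    """Parse the UTC timestamp format used in the sample document."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def latest_check(service):
    """Return the newest health check of a service, or None if it has none."""
    checks = service.get("healthChecks", [])
    if not checks:
        return None
    return max(checks, key=lambda check: parse_date(check["date"]))


def render_rows(document):
    """Return (service name, icon) pairs, one per service."""
    rows = []
    for service in document["services"]:
        check = latest_check(service)
        status = check["status"] if check else "UNKNOWN"
        rows.append((service["service"], ICONS.get(status, "⚪")))
    return rows
```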

rtapella commented 7 months ago

Would like to see:

recent_healthy: stores the most recent timestamp of a HEALTHY response (same as “date” if status is HEALTHY)
Maybe : ”endpoint” or “source” to confirm what’s being checked?
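
One possible way to derive the suggested recent_healthy value from a service's list of health checks. This is a sketch: the function name is hypothetical, and it assumes all dates use the same UTC format as the sample (`YYYY-MM-DDTHH:MM:SSZ`), so they can be compared as plain strings.

```python
def recent_healthy(health_checks):
    """Return the newest date with status HEALTHY, or None if there is none.

    Fixed-width ISO 8601 UTC timestamps sort chronologically as strings.
    """
    healthy_dates = [
        check["date"] for check in health_checks if check["status"] == "HEALTHY"
    ]
    return max(healthy_dates) if healthy_dates else None
```

When the latest check is HEALTHY, this returns the same value as its "date".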

mike-gangl commented 7 months ago

Think about: authorization (who owns the username/password for hitting an authenticated endpoint?) and multiple components for a service area.

future: historical records and tracking 'events'

galenatjpl commented 7 months ago

@mike-gangl I updated the diagram and some descriptions, and some work tickets in the above description

mike-gangl commented 7 months ago

Updated to include SSM naming parameter:

/unity/healthCheck/<MARKETPLACE_ITEM>/<COMPONENT_NAME>

galenatjpl commented 7 months ago

@mike-gangl NOTE: the diagram above is slightly off at this time (still needs an update to have

anilnatha commented 6 months ago

Regarding the sample JSON Mike posted earlier. Would like to suggest minor changes.

Use camelCase for the keys.
Can we add a title field that is used to display the name of the service in the UI navbar and in the Health Dashboard data grid?
The landingpage URLs should include the protocol, https://...

anilnatha commented 6 months ago

Also, in the list of healthchecks, can it be assumed that these will be stored in descending order, i.e. the most recent health check is the zeroth element in that array?

mike-gangl commented 6 months ago

I don't think we plan on ordering the events by health check date.
title and service seem interchangeable?
yeah, the protocol should be a part of the entry. I was just lazy there.
as for camelCase, google agrees with you, and that's good enough for me.

mike-gangl commented 6 months ago

@hargitayjpl - see my comment above and the new format of the health check response you'll be writing. camelCase is really the only change, as I think you'll simply pass whatever healthcheck value was supplied by the application.

rtapella commented 6 months ago

We can use title and service interchangeably as long as we're happy using "service" as the "English" label for the service.

For the keys, if we use camelCase then we can parse them into title case (e.g., "Camel Case")
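
A small sketch of the camelCase to title case conversion mentioned above, assuming keys are simple camelCase such as `landingPage` (the function name is hypothetical; acronyms like `URL` inside a key would need extra handling):

```python
import re


def camel_to_title(key):
    """Convert a camelCase key to a title-case label for display."""
    # Insert a space wherever a lowercase letter or digit meets an uppercase one.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", key)
    return " ".join(word[:1].upper() + word[1:] for word in spaced.split())
```

For example, `camel_to_title("camelCase")` gives `"Camel Case"`.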

galenatjpl commented 6 months ago

@mike-gangl @hargitayjpl I think we really need these two formats:

Shared Services Account components: /unity/healthCheck/shared-services/<MARKETPLACE_ITEM>/<COMPONENT_NAME>

Venue account components: /unity/healthCheck/<PROJECT>/<VENUE>/<MARKETPLACE_ITEM>/<COMPONENT_NAME>

Brandon and I discussed this in a meeting this morning, and we want the health components namespaced by the project/venue they are in. If we simply use something like /unity/healthcheck/airflowUI, it will be ambiguous and cause data overwrite issues.
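
To illustrate the two formats, here is a sketch of a helper that builds the SSM parameter name for either case. The function name and argument names are hypothetical; only the path layouts come from the formats above.

```python
def health_check_param(marketplace_item, component, project=None, venue=None):
    """Build the health check SSM parameter name.

    With project and venue, returns the venue-account form; with neither,
    returns the shared-services form.
    """
    if (project is None) != (venue is None):
        raise ValueError("project and venue must be given together")
    scope = ["shared-services"] if project is None else [project, venue]
    parts = ["unity", "healthCheck", *scope, marketplace_item, component]
    for part in parts:
        # An empty or slash-containing segment would change the path depth.
        if not part or "/" in part:
            raise ValueError(f"invalid path segment: {part!r}")
    return "/" + "/".join(parts)
```

Because project and venue are part of the name, two venues registering the same component no longer write to the same parameter.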

mike-gangl commented 5 months ago

This is close to being complete. Lambda and crons for proof of concept, management API is up and exposed.

rtapella commented 5 months ago

rtapella commented 4 months ago

Waiting for the U-CS health-endpoint to be ready. Placeholder JSON is being used for the draft implementations of the clients:

https://github.com/unity-sds/unity-py/issues/86 https://github.com/unity-sds/unity-ui/issues/25

galenatjpl commented 2 months ago

@brianlee731 should we move this to the current release? This is almost done and I think some other service areas need to integrate into what U-CS built.

unity-sds / unity-project-management