unity-sds / unity-project-management

Container repo for project management (projects, epics, etc)
Apache License 2.0

Managed Service Monitoring #101

Open mike-gangl opened 1 year ago

mike-gangl commented 1 year ago

Managed Service Monitoring

"As an operator, I want to monitor the health of various Unity services"

[Image: screenshot, 2024-04-09 8:34:28 PM]

NOTE: S3 bucket is defined here.

An example of the health dashboard from AWS:

[Image: screenshot, 2024-03-29 1:03:27 PM]

Per venue, you'd have something like:

Service Current
Jupyterhub 🟢
Airflow 🟢
Data Catalog 🟢
Application Catalog 🟢

Marketplace options:

Each Service needs a health endpoint

UI team to design endpoint to show health dashboard https://apigateway-for-unity/project/venue/health

Question to answer during planning:

Health check SSM params should be defined in the project venue as: /unity/healthCheck/<MARKETPLACE_ITEM>/<COMPONENT_NAME>

Acceptance Criteria

Work Tickets

Link to work tickets required to implement the epic

Dependencies

Other epics or outside tickets required for this to work

Associated Risks

links to risk issues associated with this epic

Out of scope but future work:


previous


This overlaps with the idea of Common metrics/logs aggregation service #92.

How do we plan to monitor the deployed managed services? I think that to evolve into a full multi-tenant system, we need to make sure we are monitoring:

- Health of a service
- Uptime of a service
- Degradation (health?): if it's responding to requests, how fast does it respond?
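The three signals listed above could be derived from a series of probe results roughly as follows. This is only a sketch: the `Probe` record, the `summarize` function, the output key names, and the 2-second slowness threshold are all assumptions made for illustration, not part of any Unity design.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Probe:
    """One hypothetical probe result for a service."""
    ok: bool                           # did the service respond successfully?
    latency_s: Optional[float] = None  # response time in seconds, if it responded


def summarize(probes: List[Probe], slow_after_s: float = 2.0) -> dict:
    """Reduce probes (assumed oldest first) to health, uptime, and degradation."""
    if not probes:
        raise ValueError("at least one probe is required")
    up = [p for p in probes if p.ok]
    latencies = [p.latency_s for p in up if p.latency_s is not None]
    mean_latency = sum(latencies) / len(latencies) if latencies else None
    return {
        "healthy": probes[-1].ok,          # health: result of the most recent probe
        "uptime": len(up) / len(probes),   # uptime: fraction of successful probes
        "meanLatencySeconds": mean_latency,
        # degradation: responding, but slower than the assumed threshold
        "degraded": mean_latency is not None and mean_latency > slow_after_s,
    }
```

A service that answers every probe but takes longer than the threshold on average would show full uptime and still be flagged as degraded, which is the distinction the third item is asking about.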

Think of a single console that can monitor all of the managed services across multiple accounts. What does this look like? How are logs/metrics/events propagated to the "central" dashboard? Or does the dashboard reach into different accounts to view things?


mike-gangl commented 7 months ago

Simple health check response format to be supplied by healthResponse

{
  "services": [
    {
      "service": "airflow",
      "landingPage":"https://unity.com/project/venue/processing/ui",
      "healthChecks": [
        {
          "status": "HEALTHY",
          "date": "2024-04-09T18:01:08Z"
        }
      ]
    },
    {
      "service": "jupyter",
      "landingPage":"https://unity.com/project/venue/ads/jupyter",
      "healthChecks": [
        {
          "status": "HEALTHY",
          "date": "2024-04-09T18:01:08Z"
        }
      ]
    },
    {
      "service": "otherService",
      "landingPage":"https://unity.com/project/venue/other_service",
      "healthChecks": [
        {
          "status": "UNHEALTHY",
          "date": "2024-04-09T18:01:08Z"
        }
      ]
    }
  ]
}

In the future, we might add more detail to a healthcheck object, like date of check, error, or a subgraph of other dependencies (database health, api health).

This should also accommodate the 'historical' record we envision in the future, where multiple healthchecks can be shown (e.g. daily health) for a given service.
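A client consuming this format might parse it along the following lines. This is a minimal sketch using only the keys shown in the sample above; the function names and the choice to treat any status other than HEALTHY as a problem are assumptions, not an agreed contract.

```python
import json
from typing import Dict, List

REQUIRED_SERVICE_KEYS = ("service", "landingPage", "healthChecks")


def parse_health_response(payload: str) -> Dict[str, List[dict]]:
    """Map each service name to its list of health check objects."""
    doc = json.loads(payload)
    result: Dict[str, List[dict]] = {}
    for entry in doc["services"]:
        missing = [k for k in REQUIRED_SERVICE_KEYS if k not in entry]
        if missing:
            raise ValueError(f"service entry is missing keys: {missing}")
        result[entry["service"]] = list(entry["healthChecks"])
    return result


def unhealthy_services(payload: str) -> List[str]:
    """Names of services with at least one check that is not HEALTHY."""
    return [
        name
        for name, checks in parse_health_response(payload).items()
        if any(check.get("status") != "HEALTHY" for check in checks)
    ]


# Two entries taken from the sample response above.
SAMPLE = """
{"services": [
  {"service": "airflow",
   "landingPage": "https://unity.com/project/venue/processing/ui",
   "healthChecks": [{"status": "HEALTHY", "date": "2024-04-09T18:01:08Z"}]},
  {"service": "otherService",
   "landingPage": "https://unity.com/project/venue/other_service",
   "healthChecks": [{"status": "UNHEALTHY", "date": "2024-04-09T18:01:08Z"}]}
]}
"""

print(unhealthy_services(SAMPLE))  # ['otherService']
```

Because each service carries a list of checks rather than a single status, the same parsing code keeps working if the historical record is added later.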

rtapella commented 7 months ago

Would like to see:

mike-gangl commented 7 months ago

Think about:
- Authorization: who owns the username/password for hitting an authenticated endpoint?
- Multiple components for a service area

future: historical records and tracking 'events'

galenatjpl commented 7 months ago

@mike-gangl I updated the diagram and some descriptions, and some work tickets in the above description

mike-gangl commented 7 months ago

Updated to include SSM naming parameter:

/unity/healthCheck/<MARKETPLACE_ITEM>/<COMPONENT_NAME>

galenatjpl commented 7 months ago

@mike-gangl NOTE: the diagram above is slightly off at this time (still needs an update to have

anilnatha commented 6 months ago

Regarding the sample JSON Mike posted earlier. Would like to suggest minor changes.

  1. Use camelCase for the keys.
  2. Can we add a title field that is used to display the name of the service in the UI navbar and in the Health Dashboard data grid?
  3. The landingPage URLs should include the protocol, https://...

anilnatha commented 6 months ago

Also, in the list of healthchecks, can it be assumed that these will be stored in descending order, i.e. the most recent health check is the zeroth element in that array?

mike-gangl commented 6 months ago

  1. I don't think we plan on ordering the events by health check date.
  2. title and service seem interchangeable?
  3. Yeah, the protocol should be a part of the entry. I was just lazy there.
  4. As for camelCase, Google agrees with you, and that's good enough for me.

mike-gangl commented 6 months ago

@hargitayjpl - see my comment above and the new format of the health check response you'll be writing. camelCase is really the only change, as I think you'll simply pass whatever healthcheck value was supplied by the application.
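Since the healthChecks array is not expected to be ordered by date, a client that wants the current status could pick the most recent entry itself. A minimal sketch, assuming every check carries a date in the `2024-04-09T18:01:08Z` form used in the sample (the function names are hypothetical):

```python
from datetime import datetime, timezone
from typing import List, Optional


def parse_check_date(value: str) -> datetime:
    """Parse timestamps shaped like 2024-04-09T18:01:08Z as UTC."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def latest_check(health_checks: List[dict]) -> Optional[dict]:
    """Return the most recent check, or None when the list is empty."""
    if not health_checks:
        return None
    return max(health_checks, key=lambda check: parse_check_date(check["date"]))
```

Selecting by date rather than by position means the client does not depend on how the producer happens to store the entries.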

rtapella commented 6 months ago

We can use title and service interchangeably as long as we're happy using "service" as the "English" label for the service.

For the keys, if we use camelCase then we can parse them into title case (e.g., "Camel Case")
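The camelCase-to-title-case step could look something like this. It is an illustrative sketch in Python (the UI may well do this differently), and `camel_to_title` is a hypothetical helper name.

```python
import re


def camel_to_title(key: str) -> str:
    """Convert a camelCase key such as 'landingPage' to 'Landing Page'."""
    # Insert a space wherever an uppercase letter follows a lowercase letter or digit.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", key)
    # Capitalize only the first character so runs of capitals (e.g. "UI") stay intact.
    return spaced[:1].upper() + spaced[1:]
```

With this rule, `otherService` would be displayed as "Other Service" and `airflowUI` as "Airflow UI".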

galenatjpl commented 6 months ago

@mike-gangl @hargitayjpl I think we really need these two formats:

Shared Services Account components: /unity/healthCheck/shared-services/<MARKETPLACE_ITEM>/<COMPONENT_NAME>

Venue account components: /unity/healthCheck/<PROJECT>/<VENUE>/<MARKETPLACE_ITEM>/<COMPONENT_NAME>

Brandon and I discussed this in a meeting this morning, and we want the health components namespaced by the project/venue they are in. If we simply use something like /unity/healthcheck/airflowUI, it will be ambiguous and cause data overwrite issues.
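To make the two formats concrete, here is a small sketch of a helper that builds either name. The function `health_check_param`, its validation rules, and the example values (`airflow`, `airflowUI`, `myproject`, `dev`) are assumptions for illustration only.

```python
from typing import Optional

PREFIX = "/unity/healthCheck"


def health_check_param(
    marketplace_item: str,
    component_name: str,
    project: Optional[str] = None,
    venue: Optional[str] = None,
) -> str:
    """Build the SSM parameter name for a health check component.

    Without project/venue, the shared-services form is produced; with both,
    the venue-account form is produced.
    """
    if (project is None) != (venue is None):
        raise ValueError("project and venue must be given together")
    scope = ["shared-services"] if project is None else [project, venue]
    segments = scope + [marketplace_item, component_name]
    for segment in segments:
        # Reject empty segments and embedded slashes so names stay unambiguous.
        if not segment or "/" in segment:
            raise ValueError(f"invalid path segment: {segment!r}")
    return "/".join([PREFIX] + segments)


print(health_check_param("airflow", "airflowUI"))
# /unity/healthCheck/shared-services/airflow/airflowUI
print(health_check_param("airflow", "airflowUI", project="myproject", venue="dev"))
# /unity/healthCheck/myproject/dev/airflow/airflowUI
```

Two venues deploying the same component then get distinct parameter names, which avoids the overwrite problem described above.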

mike-gangl commented 5 months ago

This is close to being complete. Lambda and crons for proof of concept, management API is up and exposed.

rtapella commented 5 months ago

Related UI work: https://github.com/unity-sds/unity-ui/pull/32

rtapella commented 4 months ago

Waiting for the U-CS health-endpoint to be ready. Placeholder JSON is being used for the draft implementations of the clients:

https://github.com/unity-sds/unity-py/issues/86
https://github.com/unity-sds/unity-ui/issues/25

galenatjpl commented 2 months ago

@brianlee731 should we move this to the current release? This is almost done and I think some other service areas need to integrate into what U-CS built.