nforgeio / neonKUBE

Public NeonKUBE Kubernetes distribution related projects
https://neonkube.io
Apache License 2.0

Optimize cluster GetClusterStatusAsync() methods for cloud? #1554

Open jefflill opened 2 years ago

jefflill commented 2 years ago

We're calling IHostingManager.GetClusterStatusAsync() every 5 seconds right now from neonDESKTOP to update the task bar icon state.

This is quite costly for AWS because it requires listing all cluster VMs. That won't be too bad for smaller clusters, but it will result in significant network traffic for larger ones.

This is even worse for Azure because we need to list the cluster VMs and then also perform an individual status query for every VM!

Potential Optimization:

  1. Have neon-cluster-operator periodically query the API server for cluster node state and include this state in the cluster health status.
  2. Have hosting managers query the cluster health status first.
  3. Query the cloud VM state only when the cluster status call fails.
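The three steps above can be sketched as a simple fallback. This is only an illustration in Python, not the NeonKUBE implementation: `ClusterStatus`, `get_cluster_status`, and the two query callables are hypothetical names, and "the cluster status call fails" is assumed here to mean that the in-cluster query raises an exception.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ClusterStatus:
    state: str   # e.g. "Healthy" or "Off" (hypothetical values)
    source: str  # "cluster" when answered by the cluster, "cloud" otherwise


def get_cluster_status(
    query_cluster: Callable[[], str],
    query_cloud: Callable[[], str],
) -> ClusterStatus:
    """Ask the cluster first; fall back to the cloud API only on failure."""
    try:
        # Cheap path: one small query against the cluster health status
        # that the operator keeps up to date (steps 1 and 2).
        return ClusterStatus(query_cluster(), "cluster")
    except Exception:
        # Expensive path: list and inspect the cloud VMs (step 3).
        return ClusterStatus(query_cloud(), "cloud")
```

With this shape, the costly VM listing only happens while the cluster itself cannot answer, so a healthy cluster polled every 5 seconds never touches the cloud API.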

The nice thing about this approach is that it requires only a small, low-impact query to the cluster itself when the masters are running and reachable.

This assumes that neon-cluster-operator is itself healthy. We can mitigate this by having the cluster operator write a UTC timestamp to the cluster status whenever it updates the status, and by also including a property that specifies the latest UTC time by which the operator must have updated the status and timestamp again. Once that time has passed without an update, clients should consider the cluster to be unhealthy, regardless of what the cluster status says.
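A client-side staleness check along these lines might look like the sketch below. The names `ClusterHealth`, `updated_utc`, `deadline_utc`, and `effective_state` are hypothetical and only illustrate the idea; the actual status schema is not defined in this issue.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class ClusterHealth:
    state: str              # state reported by the operator
    updated_utc: datetime   # when the operator last wrote the status
    deadline_utc: datetime  # latest time by which it must write again


def effective_state(health: ClusterHealth, now: Optional[datetime] = None) -> str:
    """Return the reported state unless the status has gone stale."""
    now = now or datetime.now(timezone.utc)
    if now > health.deadline_utc:
        # The operator missed its own deadline, so the reported state
        # can no longer be trusted.
        return "Unhealthy"
    return health.state


# Usage: the operator promises a refresh within 30 seconds of each update.
written = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
health = ClusterHealth("Healthy", written, written + timedelta(seconds=30))
fresh = effective_state(health, now=written + timedelta(seconds=10))
stale = effective_state(health, now=written + timedelta(seconds=60))
```

Carrying the deadline in the status itself lets the operator change its update interval without clients having to hard-code a timeout.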

NOTE: This requires that neon-cluster-operator or the cluster node VMs have credentials/permissions for the cloud API. It's best to defer this until we also implement GatewayAPI/network configuration.

marcusbooyah commented 1 year ago

We can use watches instead of queries.