We're calling IHostingManager.GetClusterStatusAsync() every 5 seconds right now from neonDESKTOP to update the task bar icon state.
This is quite costly for AWS because it requires listing all cluster VMs. That won't be too bad for smaller clusters but will result in significant network traffic for larger ones.
This is even worse for Azure because we need to list the cluster VMs and then also perform an individual status query for every VM!
Potential Optimization:
1. Have neon-cluster-operator periodically query the API server for cluster node state and include this state in the cluster health status.
2. Have hosting managers query the cluster health status first.
3. Query the cloud VM state only when the cluster status call fails.
The nice thing about this approach is that, when the masters are running and reachable, it requires only a small, low-impact query to the cluster itself.
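The fallback order described above can be sketched roughly as follows. This is only an illustration in Python, not the actual hosting manager code: `ClusterStatus`, `get_cluster_status`, `query_cluster`, and `query_cloud` are hypothetical names, and the real implementation would live behind `IHostingManager.GetClusterStatusAsync()`.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ClusterStatus:
    state: str   # hypothetical values, e.g. "healthy" or "stopped"
    source: str  # "cluster" if reported by the cluster, "cloud" if from the cloud API


def get_cluster_status(
    query_cluster: Callable[[], str],
    query_cloud: Callable[[], str],
) -> ClusterStatus:
    """Try the cheap in-cluster status query first; fall back to the cloud API."""
    try:
        # Small query against the cluster's own health status.
        return ClusterStatus(query_cluster(), "cluster")
    except Exception:
        # The cluster is unreachable, so pay for the expensive VM listing.
        return ClusterStatus(query_cloud(), "cloud")
```

With this ordering, the expensive cloud listing only happens when the cluster itself cannot answer, which should be the uncommon case for a running cluster.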
This assumes that neon-cluster-operator is itself healthy. We can mitigate this by having the operator write a UTC timestamp to the cluster status whenever it updates the status. The cluster status would also include a property specifying the latest UTC time by which the operator must have updated the status and timestamp again. Past that time, clients should consider the cluster unhealthy, regardless of what the cluster status says.
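A minimal sketch of the client-side staleness check, again in Python for illustration. The function name `effective_state` and the parameter `must_update_by_utc` are assumptions standing in for whatever the cluster status property ends up being called.

```python
from datetime import datetime, timezone
from typing import Optional


def effective_state(
    reported_state: str,
    must_update_by_utc: datetime,
    now: Optional[datetime] = None,
) -> str:
    """Return the reported state unless the operator missed its update deadline."""
    if now is None:
        now = datetime.now(timezone.utc)
    if now > must_update_by_utc:
        # The operator has not refreshed the status in time, so the
        # reported state can no longer be trusted.
        return "unhealthy"
    return reported_state
```

Having the operator publish the deadline itself, instead of hardcoding a timeout in clients, lets the operator change its update interval without clients needing to know about it.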
NOTE: This requires that neon-cluster-operator or the cluster node VMs have credentials/permissions for the cloud API. It's best to defer this until we also implement GatewayAPI/network configuration.