kubermatic / machine-controller


azure machine-controller webhook timeout #1857

Open dharapvj opened 2 months ago

dharapvj commented 2 months ago

Lately, we have been seeing continuous failures when rolling out new MachineDeployments (MDs) in Azure environments.

The error is always the machine-controller-webhook timing out. It occurs in KubeOne as well as in KKP user clusters.

Some Azure API calls (mostly those listing VM sizes) appear to have become very slow, or our API calls need better filters.

Here are the logs from an MD in a KKP user cluster:

failed to create machine deployment: Internal error occurred: failed calling webhook "machine-controller.kubermatic.io-machinedeployments": failed to call webhook: Post "https://machine-controller-webhook.cluster-XXXXX.svc.cluster.local./machinedeployments?timeout=10s": context deadline exceeded
{
  "error": {
    "code": 500,
    "message": "failed to create machine deployment: admission webhook \"machine-controller.kubermatic.io-machinedeployments\" denied the request: validation failed: failed to get VM SKU: failed to list available SKUs: compute.ResourceSkusClient#List: Failure responding to request: StatusCode=200 -- Original Error: Error occurred reading http.Response#Body - Error = 'context canceled'"
  }
}

I have seen that increasing the webhook timeout to 30s improves the situation a bit.

But in general, since a webhook timeout cannot exceed 30s, we should consider caching the list of VM sizes (SKUs) to speed things up.

dharapvj commented 2 months ago

Even with a 30-second webhook timeout, it takes many attempts before the MD is finally applied.

Here are log entries from the KKP API server with the 30-second timeout in place:

{"level":"error","time":"2024-09-05T06:09:34.605Z","caller":"handler/routing.go:152","msg":"failed to create machine deployment: admission webhook \"machine-controller.kubermatic.io-machinedeployments\" denied the request: validation failed: failed to get VM SKU: failed to list available SKUs: compute.ResourceSkusClient#List: Failure responding to request: StatusCode=200 -- Original Error: Error occurred reading http.Response#Body - Error = 'context canceled'","request":"/api/v2/projects/XXX/clusters/YYY/machinedeployments"}
{"level":"error","time":"2024-09-05T06:09:42.670Z","caller":"handler/routing.go:152","msg":"failed to create machine deployment: admission webhook \"machine-controller.kubermatic.io-machinedeployments\" denied the request: validation failed: failed to get VM SKU: failed to list available SKUs: compute.ResourceSkusClient#List: Failure responding to request: StatusCode=200 -- Original Error: Error occurred reading http.Response#Body - Error = 'context canceled'","request":"/api/v2/projects/hzpqm5hzd5/clusters/evrv86lkgm/machinedeployments"}
{"level":"error","time":"2024-09-05T06:14:49.226Z","caller":"handler/routing.go:152","msg":"Cluster components are not ready yet","request":"/api/v2/projects/XXX/clusters/YYY/machinedeployments"}