Open dharapvj opened 2 months ago
Even with a 30-second timeout on the webhook, it takes many attempts before the MD is finally applied.
Here are log entries from the KKP apiserver with the 30-second timeout:
{"level":"error","time":"2024-09-05T06:09:34.605Z","caller":"handler/routing.go:152","msg":"failed to create machine deployment: admission webhook \"machine-controller.kubermatic.io-machinedeployments\" denied the request: validation failed: failed to get VM SKU: failed to list available SKUs: compute.ResourceSkusClient#List: Failure responding to request: StatusCode=200 -- Original Error: Error occurred reading http.Response#Body - Error = 'context canceled'","request":"/api/v2/projects/XXX/clusters/YYY/machinedeployments"}
{"level":"error","time":"2024-09-05T06:09:42.670Z","caller":"handler/routing.go:152","msg":"failed to create machine deployment: admission webhook \"machine-controller.kubermatic.io-machinedeployments\" denied the request: validation failed: failed to get VM SKU: failed to list available SKUs: compute.ResourceSkusClient#List: Failure responding to request: StatusCode=200 -- Original Error: Error occurred reading http.Response#Body - Error = 'context canceled'","request":"/api/v2/projects/hzpqm5hzd5/clusters/evrv86lkgm/machinedeployments"}
{"level":"error","time":"2024-09-05T06:14:49.226Z","caller":"handler/routing.go:152","msg":"Cluster components are not ready yet","request":"/api/v2/projects/XXX/clusters/YYY/machinedeployments"}
Lately, we see continuous failures when rolling out new MachineDeployments (MDs) in Azure environments.
The error is always about machine-controller-webhook timing out. It is seen in KubeOne as well as in KKP user clusters.
Some Azure API (mostly the one about VM sizes) has become very slow, or we need better filters in our API call.
Here are logs from an MD in a KKP user cluster:
I have seen that if I increase the webhook timeout to 30s, the situation improves a bit.
But in general, since a webhook can have a timeout of at most 30s, we should consider caching the list of VM sizes (SKUs) to speed things up.