franzudev closed this issue 5 months ago
This is a more general error; I had the same behavior with a different kind of error: no port found, network not attached (or something like that).
So if the VM goes into an ERROR state, the machine resource doesn't recognize it and just keeps waiting.
OpenStack throws 500 No Valid Host was Found, visible in the UI
I didn't quite get your desired behavior. You want CAPO to report the error in an event when something goes wrong? Something like what you pasted?
controller/openstackcluster "msg"="Reconciler error" "error"="failed to delete load balancer: Expected HTTP response code [] when accessing [DELETE https://10.112.70.238:9876/v2.0/lbaas/loadbalancers/ac1600c2-28ce-4939-98e2-55619fe06609?cascade=true], but got 409 instead\n{\"faultcode\": \"Client\", \"faultstring\": \"Invalid state PENDING_DELETE of loadbalancer resource ac1600c2-28ce-4939-98e2-55619fe06609\", \"debuginfo\": null}" "name"="example-error" "namespace"="example" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="OpenStackCluster"
Yes, I was wondering why on LB failure it logs the error with the HTTP status code, but when it's the instance that fails, it doesn't log the error. I think that's a bit inconsistent behavior, don't you think?
I think we can get the fault code and message from the server object even as an unprivileged user. I wouldn't want to match on the message, though, so I don't think we can distinguish this 500 from other 500s. Thinking about it, 500 doesn't seem like the right fault code here. Not 100% sure what would be most appropriate. Maybe 409? Regardless, this error code means that we can't distinguish this error from actual internal server errors.
That said, to the best of my knowledge nova never retries anything anyway, so I think we can say that as long as the server hasn't been created yet a 500 is terminal and we should mark the server failed.
How about if the controller loop sets a FAILURE condition (i.e. unrecoverable, will not retry) if we get an ERROR status before the server first becomes active, and a different, new ERROR condition with a message if we get an ERROR status after the server has become active?
In both cases the server should be deleted before the Machine is deleted.
@stephenfin thoughts?
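The proposed split might be sketched as follows. All names here are hypothetical illustrations, not CAPO's actual condition types or reconciler API:

```python
from typing import Optional

def classify_server_error(status: str, ever_active: bool) -> Optional[str]:
    """Map a Nova server status to a hypothetical machine condition.

    Returns "Failure" (terminal, will not retry) if the server errored
    before ever becoming ACTIVE, "Error" (reported, still watched) if it
    errored after having been ACTIVE, and None while the server is fine.
    """
    if status != "ERROR":
        return None
    return "Error" if ever_active else "Failure"
```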
if we get an ERROR status before the server first becomes active, and a different, new ERROR condition with a message if we get an ERROR status after the server has become active?
In both cases the server should be deleted before the Machine is deleted.
@mdbooth Could you help me with the deletion order? In my understanding, deleting the OpenStackMachine would then trigger the deletion of the server.
That's right. I was referring to the cleanup order due to the Finalizer. i.e.:
```mermaid
sequenceDiagram
    User->>API: Delete
    API->>Controller: Sees DeletionTimestamp
    Controller->>Nova: Delete
    Controller->>API: Remove Finalizer
    API->>User: Object is deleted
```
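In code, that ordering amounts to something like the sketch below. The names `nova_delete`, `api_update`, and the finalizer string are made up for illustration; the real controller uses controller-runtime in Go:

```python
def reconcile_delete(obj: dict, nova_delete, api_update) -> None:
    """Sketch of finalizer-driven cleanup: the backing Nova server is
    deleted first, and only then is the finalizer removed, which allows
    the API server to actually delete the object."""
    if obj.get("deletion_timestamp") is None:
        return  # not being deleted; nothing to do
    nova_delete(obj["server_id"])                 # Controller ->> Nova: Delete
    obj["finalizers"].remove("openstackmachine")  # Controller ->> API: Remove Finalizer
    api_update(obj)                               # API ->> User: Object is deleted
```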
I think we can get the fault code and message from the server object even as an unprivileged user. I wouldn't want to match on the message, though, so I don't think we can distinguish this 500 from other 500s. Thinking about it, 500 doesn't seem like the right fault code here. Not 100% sure what would be most appropriate. Maybe 409? Regardless, this error code means that we can't distinguish this error from actual internal server errors.
Server creation happens asynchronously, so as long as the request was accepted (i.e. the various nova services are up and the request was correctly formed) you'll get an HTTP 202 back in response, with a minimal body:
```json
{
    "server": {
        "id": "ec8d920a-a6ef-477e-9832-b116c1b191c9",
        "links": [
            ...
        ],
        "OS-DCF:diskConfig": "MANUAL",
        "security_groups": [
            ...
        ],
        "adminPass": "foo"
    }
}
```
If you want to check the state of this request, you'll need to make an additional request to `GET /server/{serverID}`. If the server failed to boot, you'll see a `fault` field present and the server will have a status of `ERROR`. The response will still be an HTTP 200 though: the 500 is indicated in the `fault.code` field:
```json
{
    "server": {
        "id": "ec8d920a-a6ef-477e-9832-b116c1b191c9",
        "name": "test-server",
        "status": "ERROR",
        "fault": {
            "code": 500,
            "created": "2023-04-19T14:46:18Z",
            "message": "No valid host was found. "
        },
        ...
    }
}
```
So you can distinguish between failures to schedule an instance and failures to talk to nova at all by looking at the status code of `GET /server/{serverID}`: if it's HTTP 200 then nova is likely working okay and the server simply failed to boot, while if it's HTTP 500 you've got a bigger issue. Heck, an HTTP 500 response to the original `POST` request would imply a serious issue.
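That distinction could be captured in a small helper like this illustrative sketch, which operates on the HTTP status code of the GET and its parsed JSON body (the function name and return strings are invented for the example):

```python
def diagnose_boot(get_status: int, body: dict) -> str:
    """Classify the outcome of GET /servers/{serverID}.

    An HTTP 5xx on the GET itself suggests nova is unhealthy; an HTTP
    200 whose server has status ERROR means nova is fine but the boot
    failed, with details (if visible) in the "fault" field.
    """
    if get_status >= 500:
        return "nova is unhealthy"
    server = body.get("server", {})
    if server.get("status") != "ERROR":
        return "no error"
    fault = server.get("fault") or {}
    return f"boot failed ({fault.get('code')}): {fault.get('message', '')}".strip()
```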
That said, to the best of my knowledge nova never retries anything anyway, so I think we can say that as long as the server hasn't been created yet a 500 is terminal and we should mark the server failed.
Agreed. It can take some time to fail though, so you need to poll. I'm guessing we're doing this already.
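A polling loop for this could be as simple as the following sketch, where `get_server` is a stand-in for whatever client call wraps `GET /servers/{serverID}`:

```python
import time

def wait_for_server(get_server, server_id: str,
                    timeout: float = 600, interval: float = 10) -> dict:
    """Poll until the server reaches a terminal status (ACTIVE or ERROR).

    A boot can take a while to fail, so we keep re-fetching the server
    until it leaves the transient BUILD state or we time out.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        server = get_server(server_id)
        if server.get("status") in ("ACTIVE", "ERROR"):
            return server
        time.sleep(interval)
    raise TimeoutError(f"server {server_id} not terminal after {timeout}s")
```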
How about if the controller loop sets a FAILURE condition (i.e. unrecoverable, will not retry) if we get an ERROR status before the server first becomes active, and a different, new ERROR condition with a message if we get an ERROR status after the server has become active?
Sounds reasonable. tbh, `fault.message == "No valid host was found."` is a pretty reliable signal to watch for wrt scheduling failures. That will capture the vast majority of them.
In both cases the server should be deleted before the Machine is deleted.
@stephenfin thoughts?
My concern here is matching on fault description strings in general. I've been burned before by:
These are just examples. Your painful memories may be different 😉
You could also just watch the `fault` field. Any value here would likely imply a scheduling failure, at least early on.
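The two approaches compare roughly as in this sketch: the message match is fragile because the human-readable text is not a stable API, while checking for the presence of any `fault` at all is more robust (function names are invented for illustration):

```python
def fragile_is_scheduling_failure(server: dict) -> bool:
    # Matches the human-readable message; liable to break if the text
    # ever differs between Nova releases or deployments.
    fault = server.get("fault") or {}
    return fault.get("message", "").startswith("No valid host was found")

def robust_is_scheduling_failure(server: dict) -> bool:
    # Any fault on an ERROR'd server, early in its life, likely means a
    # scheduling failure; no string matching required.
    return server.get("status") == "ERROR" and server.get("fault") is not None
```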
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
/reopen
/remove-lifecycle rotten
We still have this bug; we can see it in CI, in fact.
@EmilienM: Reopened this issue.
/reopen /remove-lifecycle rotten we still have this bug, we can see it in CI in fact.
Which error are you referring to?
`Invalid state PENDING_DELETE of loadbalancer resource`
This causes the LB to be deleted with errors in the middle.
I just realized this bug report isn't about this Octavia issue :man_facepalming: I'll open a new one.
/close
@mdbooth: Closing this issue.
/kind bug
What steps did you take and what happened:
OpenStack throws `500 No Valid Host was Found`, visible in the UI

What did you expect to happen:
I expect to see the error in logs and/or events, something similar to the one below

Environment:
- Cluster API Provider OpenStack version (or `git rev-parse HEAD` if manually built): 0.6.4
- Kubernetes version (use `kubectl version`): 1.23.7
- OS (e.g. from `/etc/os-release`):