No error found on OpenStack failure

franzudev commented 1 year ago

/kind bug

What steps did you take and what happened:

Create cluster
OpenStack throws 500 No Valid Host was Found, visible in the UI
No errors/events found on controllers

What did you expect to happen: I expect to see the error in logs and/or events, something similar to the one below

E0315 10:59:28.971256       1 controller.go:317] controller/openstackcluster "msg"="Reconciler error" "error"="failed to delete load balancer: Expected HTTP response code [] when accessing [DELETE https://10.112.70.238:9876/v2.0/lbaas/loadbalancers/ac1600c2-28ce-4939-98e2-55619fe06609?cascade=true], but got 409 instead\n{\"faultcode\": \"Client\", \"faultstring\": \"Invalid state PENDING_DELETE of loadbalancer resource ac1600c2-28ce-4939-98e2-55619fe06609\", \"debuginfo\": null}" "name"="example-error" "namespace"="example" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="OpenStackCluster"

Environment:

Cluster API Provider OpenStack version (Or git rev-parse HEAD if manually built): 0.6.4
Cluster-API version: 1.3.2
OpenStack version: zed
Minikube/KIND version:
Kubernetes version (use kubectl version): 1.23.7
OS (e.g. from /etc/os-release):

qeqar commented 1 year ago

This is a more general error, I had the same behavior with a different kind of error no port found, network not attached (or something like that).

So if the vm goes into error state the machine resource does not know or recognized, just keeps waiting.

jichenjc commented 1 year ago

OpenStack throws 500 No Valid Host was Found, visible in the UI

didn't get your desired behavior, you want CAPO to report error in event when something wrong? something like you pasted?

 controller/openstackcluster "msg"="Reconciler error" "error"="failed to delete load balancer: Expected HTTP response code [] when accessing [DELETE https://10.112.70.238:9876/v2.0/lbaas/loadbalancers/ac1600c2-28ce-4939-98e2-55619fe06609?cascade=true], but got 409 instead\n{\"faultcode\": \"Client\", \"faultstring\": \"Invalid state PENDING_DELETE of loadbalancer resource ac1600c2-28ce-4939-98e2-55619fe06609\", \"debuginfo\": null}" "name"="example-error" "namespace"="example" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="OpenStackCluster"

franzudev commented 1 year ago

Yes, I was wondering why on lb failure it logs the error with the HTTP status code, but when it's the instance to fail it doesn't show the error's log. I think it's a bit inconsistent behavior, don't you think?

mdbooth commented 1 year ago

I think we can get the fault code and message from the server object even as an unprivileged user. I wouldn't want to match on the message, though, so I don't think we can distinguish this 500 from other 500s. Thinking about it, 500 doesn't seem like the right fault code here. Not 100% sure what would be most appropriate. Maybe 409? Regardless, this error code means that we can't distinguish this error from actual internal server errors.

That said, to the best of my knowledge nova never retries anything anyway, so I think we can say that as long as the server hasn't been created yet a 500 is terminal and we should mark the server failed.

How about if the controller loop sets a FAILURE condition (i.e. unrecoverable, will not retry) if we get an ERROR status before the server first becomes active, and a different, new ERROR condition with a message if we get an ERROR status after the server has become active?

In both cases the server should be deleted before the Machine is deleted.

@stephenfin thoughts?

seanschneeweiss commented 1 year ago

if we get an ERROR status before the server first becomes active, and a different, new ERROR condition with a message if we get an ERROR status after the server has become active?

In both cases the server should be deleted before the Machine is deleted.

@mdbooth Could you help me with the deletion order. In my understanding, deleting the openstackmachine would then trigger the deletion of the server.

mdbooth commented 1 year ago

if we get an ERROR status before the server first becomes active, and a different, new ERROR condition with a message if we get an ERROR status after the server has become active?

In both cases the server should be deleted before the Machine is deleted.

@mdbooth Could you help me with the deletion order. In my understanding, deleting the openstackmachine would then trigger the deletion of the server.

That's right. I was referring to the cleanup order due to the Finalizer. i.e.:

sequenceDiagram
    User->>API: Delete
    API->>Controller: Sees DeletionTimestamp
    Controller->>Nova: Delete
    Controller->>API: Remove Finalizer
    API->>User: Object is deleted

stephenfin commented 1 year ago

I think we can get the fault code and message from the server object even as an unprivileged user. I wouldn't want to match on the message, though, so I don't think we can distinguish this 500 from other 500s. Thinking about it, 500 doesn't seem like the right fault code here. Not 100% sure what would be most appropriate. Maybe 409? Regardless, this error code means that we can't distinguish this error from actual internal server errors.

Server creation happens asynchronously, so as long as the request was accepted (i.e. the various nova services are up and the request was correctly formed) you'll get a HTTP 202 back in response, with a minimal body:

{
    "server": {
        "id": "ec8d920a-a6ef-477e-9832-b116c1b191c9",
        "links": [
            ...
        ],
        "OS-DCF:diskConfig": "MANUAL",
        "security_groups": [
            ...
        ],
        "adminPass": "foo"
    }
}

If you want to check the state of this request, you'll need to do an additional request to GET /server/{serverID}. If the server failed to boot, you'll see a fault field present and the server will have a status of ERROR. The response will still be a HTTP 200 though: the 500 is indicated in the fault.code field:

{
    "server": {
        "id": "ec8d920a-a6ef-477e-9832-b116c1b191c9",
        "name": "test-server",
        "status": "ERROR",
        "fault": {
            "code": 500,
            "created": "2023-04-19T14:46:18Z",
            "message": "No valid host was found. "
        },
        ...
    }
}

So you can distinguish between failures to schedule an instance and failures to talk to nova at all by looking at the status code to GET /server/{serverID}: if it's HTTP 200 then nova is likely working okay and the server simply failed to boot, while if it's HTTP 500 you've got a bigger issue. Heck, a HTTP 500 response to the original POST request would imply a serious issue.

That said, to the best of my knowledge nova never retries anything anyway, so I think we can say that as long as the server hasn't been created yet a 500 is terminal and we should mark the server failed.

Agreed. It can take some time to fail though, so you need to poll. I've guessing we're doing this already.

How about if the controller loop sets a FAILURE condition (i.e. unrecoverable, will not retry) if we get an ERROR status before the server first becomes active, and a different, new ERROR condition with a message if we get an ERROR status after the server has become active?

Sounds reasonable. tbh, fault.message == No valid host was found. is a pretty reliable signal to watch for wrt scheduling failures. That will capture the vast majority of them.

In both cases the server should be deleted before the Machine is deleted.

@stephenfin thoughts?

mdbooth commented 1 year ago

Sounds reasonable. tbh, fault.message == No valid host was found. is a pretty reliable signal to watch for wrt scheduling failures. That will capture the vast majority of them.

In both cases the server should be deleted before the Machine is deleted. @stephenfin thoughts?

My concern here is matching on fault description strings in general. I've been burned before by:

Typos being fixed
Updates for style (e.g. removing that trailing '.')
Servers whose local language is French, and the strings are translated

These are just examples. Your painful memories may be different 😉

stephenfin commented 1 year ago

You could also just watch the fault field. Any value here would likely imply a scheduling failure, at least early on.

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

stephenfin commented 1 year ago

/remove-lifecycle stale

k8s-triage-robot commented 9 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 8 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 7 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 7 months ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-openstack/issues/1506#issuecomment-2016957032): >The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. > >This bot triages issues according to the following rules: >- After 90d of inactivity, `lifecycle/stale` is applied >- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied >- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed > >You can: >- Reopen this issue with `/reopen` >- Mark this issue as fresh with `/remove-lifecycle rotten` >- Offer to help out with [Issue Triage][1] > >Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community). > >/close not-planned > >[1]: https://www.kubernetes.dev/docs/guide/issue-triage/ Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

EmilienM commented 5 months ago

/reopen /remove-lifecycle rotten we still have this bug, we can see it in CI in fact.

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-openstack/2122/pull-cluster-api-provider-openstack-e2e-test/1800607078713135104/artifacts/clusters/bootstrap/logs/capo-system/capo-controller-manager/capo-controller-manager-7d957cb6cd-5dgc2/manager.log

k8s-ci-robot commented 5 months ago

@EmilienM: Reopened this issue.

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-openstack/issues/1506#issuecomment-2161520856): >/reopen >/remove-lifecycle rotten >we still have this bug, we can see it in CI in fact. > >https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-openstack/2122/pull-cluster-api-provider-openstack-e2e-test/1800607078713135104/artifacts/clusters/bootstrap/logs/capo-system/capo-controller-manager/capo-controller-manager-7d957cb6cd-5dgc2/manager.log Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

mdbooth commented 5 months ago

/reopen /remove-lifecycle rotten we still have this bug, we can see it in CI in fact.

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-openstack/2122/pull-cluster-api-provider-openstack-e2e-test/1800607078713135104/artifacts/clusters/bootstrap/logs/capo-system/capo-controller-manager/capo-controller-manager-7d957cb6cd-5dgc2/manager.log

Which error are you referring to?

EmilienM commented 5 months ago

Invalid state PENDING_DELETE of loadbalancer resource

This causes the LB to be deleted with errors in the middle.

EmilienM commented 5 months ago

I just realized this bug report isn't about this Octavia issue :man_facepalming: I'll open a new one.

mdbooth commented 5 months ago

/close

k8s-ci-robot commented 5 months ago

@mdbooth: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-openstack/issues/1506#issuecomment-2163035071): >/close Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

kubernetes-sigs / cluster-api-provider-openstack

No error found on OpenStack failure #1506