kubernetes-sigs / cluster-api-provider-packet

Cluster API Provider Packet (now Equinix Metal)
https://deploy.equinix.com/labs/cluster-api-provider-packet/
Apache License 2.0

What to do when a provisioning failure is encountered #43

Closed: gianarb closed this issue 3 years ago

gianarb commented 4 years ago

We have encountered a provisioning failure for your device test1-g7tqp-master-0.

The Packet API returns:

403 You are not authorized to view this device

Right now the MachineController tries over and over without end, which does not sound great. Should we treat the error as we do when the controller receives a 404? In that case we assume the server is not running anymore and mark the reconciliation as a success. It is not ideal, because a 403 may be fixed by generating a new API key. I think the API is not returning the right status code here.

@deitch
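For illustration, a minimal sketch of what treating the two status codes differently in the delete path could look like. This is not the current controller code; handleDeviceGetError is a hypothetical helper, and it assumes the Packet client surfaces the *http.Response alongside the error:

```go
package sketch

import (
	"fmt"
	"net/http"
)

// handleDeviceGetError classifies the error from a "GET device by ID" call
// made while reconciling a delete. It assumes the HTTP response (and hence
// the status code) is available to the caller.
func handleDeviceGetError(resp *http.Response, err error) error {
	if err == nil {
		return nil // device found; continue the normal delete flow
	}
	if resp != nil {
		switch resp.StatusCode {
		case http.StatusNotFound:
			// 404: the device is already gone, which is the desired end state.
			return nil
		case http.StatusForbidden:
			// 403: could be a reaped/failed device or a bad API key; this issue
			// is about not retrying it forever as if it were transient.
			return fmt.Errorf("device lookup forbidden (403): %w", err)
		}
	}
	return err // anything else: let the controller requeue and retry
}
```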

deitch commented 4 years ago

This is a good point. We need better logic for it. "403 not authorized" is not the same as "404 does not exist". During a delete, if we cannot find the device, then the state is already what we want and the reconcile is a success. During a create, if we get a 404, we want to keep trying until we find it.

But a 403 needs to generate a different kind of error. What is the usual k8s controller behaviour here?
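For context, the usual controller-runtime convention is: returning a non-nil error from Reconcile requeues the request with exponential backoff (which is why a permanent 403 becomes an endless retry), while returning an empty result with a nil error stops requeueing until the object changes again. A minimal sketch of that choice, assuming the status code has been extracted from the API error; the failure-recording step is only indicative, not this provider's actual code:

```go
package sketch

import (
	"net/http"

	ctrl "sigs.k8s.io/controller-runtime"
)

// reconcileOutcome shows the two standard options a reconciler has:
//   - return an error        -> controller-runtime requeues with exponential backoff
//   - return {} and nil      -> no requeue until the object is updated again
// A terminal 403 would typically be recorded on the object's status (for
// example as a failure reason/message) and then not returned as an error,
// so the controller stops hammering the API.
func reconcileOutcome(statusCode int, err error) (ctrl.Result, error) {
	if statusCode == http.StatusForbidden {
		// Record a terminal failure on the Machine status here (omitted),
		// then stop retrying.
		return ctrl.Result{}, nil
	}
	// Treat everything else as transient: return the error so the request
	// is requeued with backoff.
	return ctrl.Result{}, err
}
```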

gianarb commented 4 years ago

As I wrote, I think the status code returned by the API makes things harder. When I open the packet.net UI I do not see that server. So for me, it should be 404.

Do we know why an unprovisioned cluster returns 403 and it is not just treated as a cluster not found?

deitch commented 4 years ago

What is the query that we made that returns 403?

gianarb commented 4 years ago

GET device by ID

deitch commented 4 years ago

That is strange. Can you replicate it regularly? 403 is an auth issue, not a device-not-found issue. Unless the API returns 403 when a device is not found because it might exist but you do not have the rights to see it?

Can you try with direct curl?
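For reference, the direct call would be roughly the following, with $PACKET_API_KEY standing in for the API key and <device-id> for the device UUID; X-Auth-Token is the auth header the Metal API expects:

curl -H "X-Auth-Token: $PACKET_API_KEY" "https://api.equinix.com/metal/v1/devices/<device-id>?include=facility"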

gianarb commented 4 years ago

No, I do not know how to reproduce a provisioning failure in our cloud.

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

deitch commented 4 years ago

/remove-lifecycle stale

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot commented 3 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot commented 3 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

k8s-ci-robot commented 3 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-packet/issues/43#issuecomment-756121879):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> Send feedback to sig-testing, kubernetes/test-infra and/or [fejta](https://github.com/fejta).
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

tahaozket commented 3 years ago

This issue still persists. Metal produces a GRUB error while provisioning the server and the server disappears from the Metal dashboard, but GET device by ID returns 403 for that machine. Because of that, cluster-api-provider-packet-controller-manager gets stuck in the Reconciling state and cannot make progress.

kubectl logs -f cluster-api-provider-packet-controller-manager

2021-05-01T15:04:16.705Z INFO controllers.PacketMachine.infrastructure.cluster.x-k8s.io/v1alpha3 Reconciling PacketMachine {"packetmachine": "default/capi-quickstart-control-plane-ppmbg", "machine": "capi-quickstart-control-plane-jlt5k", "cluster": "capi-quickstart", "packetcluster": "capi-quickstart"}
2021-05-01T15:04:16.942Z ERROR controller-runtime.controller Reconciler error {"controller": "packetmachine", "name": "capi-quickstart-control-plane-ppmbg", "namespace": "default", "error": "GET https://api.equinix.com/metal/v1/devices/XXXXXXXXX?include=facility: 403 You are not authorized to view this device"}
github.com/go-logr/zapr.(*zapLogger).Error
    /go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.14/pkg/internal/controller/controller.go:257
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.14/pkg/internal/controller/controller.go:231
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.14/pkg/internal/controller/controller.go:210
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
    /go/pkg/mod/k8s.io/apimachinery@v0.17.12/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
    /go/pkg/mod/k8s.io/apimachinery@v0.17.12/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
    /go/pkg/mod/k8s.io/apimachinery@v0.17.12/pkg/util/wait/wait.go:88

kubectl get machine

NAME                                        PROVIDERID   PHASE          VERSION
capi-quickstart-control-plane-jlt5k                      Provisioning   v1.18.16
capi-quickstart-worker-a-7766f9f9b9-d98xm                Pending        v1.18.16
capi-quickstart-worker-a-7766f9f9b9-gsp87                Pending        v1.18.16
capi-quickstart-worker-a-7766f9f9b9-l2z4m                Pending        v1.18.16

k8s-ci-robot commented 3 years ago

@tahaozket: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-packet/issues/43#issuecomment-830648737):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.