kubernetes-sigs / cluster-api-provider-packet

Cluster API Provider Packet (now Equinix Metal)
https://deploy.equinix.com/labs/cluster-api-provider-packet/
Apache License 2.0

ClusterAPI gets stuck when a provisioning failure happens #245

Closed tahaozket closed 2 years ago

tahaozket commented 3 years ago

This issue still persists. Metal produces a GRUB error while provisioning the server, and the server then disappears from the Metal dashboard. However, GET device by ID returns 403 for that machine. As a result, cluster-api-provider-packet-controller-manager gets stuck in the Reconciling state and cannot make progress.

kubectl logs -f cluster-api-provider-packet-controller-manager

2021-05-01T15:04:16.705Z INFO controllers.PacketMachine.infrastructure.cluster.x-k8s.io/v1alpha3 Reconciling PacketMachine {"packetmachine": "default/capi-quickstart-control-plane-ppmbg", "machine": "capi-quickstart-control-plane-jlt5k", "cluster": "capi-quickstart", "packetcluster": "capi-quickstart"}
2021-05-01T15:04:16.942Z ERROR controller-runtime.controller Reconciler error {"controller": "packetmachine", "name": "capi-quickstart-control-plane-ppmbg", "namespace": "default", "error": "GET https://api.equinix.com/metal/v1/devices/XXXXXXXXX?include=facility: 403 You are not authorized to view this device "}
github.com/go-logr/zapr.(zapLogger).Error
    /go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.14/pkg/internal/controller/controller.go:257
sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.14/pkg/internal/controller/controller.go:231
sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).worker
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.14/pkg/internal/controller/controller.go:210
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
    /go/pkg/mod/k8s.io/apimachinery@v0.17.12/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
    /go/pkg/mod/k8s.io/apimachinery@v0.17.12/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
    /go/pkg/mod/k8s.io/apimachinery@v0.17.12/pkg/util/wait/wait.go:88

kubectl get machine

NAME                                        PROVIDERID   PHASE          VERSION
capi-quickstart-control-plane-jlt5k                      Provisioning   v1.18.16
capi-quickstart-worker-a-7766f9f9b9-d98xm                Pending        v1.18.16
capi-quickstart-worker-a-7766f9f9b9-gsp87                Pending        v1.18.16
capi-quickstart-worker-a-7766f9f9b9-l2z4m                Pending        v1.18.16

Originally posted by @tahaozket in https://github.com/kubernetes-sigs/cluster-api-provider-packet/issues/43#issuecomment-830648663

displague commented 3 years ago

@tahaozket I believe there are upstream API changes in the works to report 404 in these conditions rather than 403.

The 403 behavior happens because the device instance ID is still technically valid after a failure. The device is removed from the account and cordoned until Equinix Metal engineers have investigated the provisioning failure (which could very well be hardware related).

That said, in the Equinix Metal Terraform provider we treat 403 as effectively 404: if you don't have access to a device, treat it as though it does not exist (so the desired device can be created). I believe this behavior fits the EM token, organization, and project permission model well.

k8s-triage-robot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 3 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

detiber commented 3 years ago

/lifecycle frozen