`stackit_ske_cluster` fails recovering `kube_config`

stackitcloud / terraform-provider-stackit

The official Terraform provider for STACKIT

https://registry.terraform.io/providers/stackitcloud/stackit

Apache License 2.0


`stackit_ske_cluster` fails recovering `kube_config` #244

Closed by JorTurFer 9 months ago

JorTurFer commented 9 months ago

Hello, we have some automations on top of the Terraform provider, and they have started failing with 400 Bad Request during apply commands:

Error: Error creating/updating cluster
│ 
│   with module.ske.stackit_ske_cluster.cluster,
│   on ../modules/SKE/02_ske.tf line 1, in resource "stackit_ske_cluster" "cluster":
│    1: resource "stackit_ske_cluster" "cluster" {
│ 
│ Getting credential: fetching cluster credentials: 400 Bad Request, status code 400, Body: {"code":"FailedPrecondition","message":"endpoint is unavailable for
│ clusters that have already obtained a kubeconfig or triggered a credentials rotation via the new endpoints","details":""}

I think this could be related to an unexpected breaking change from the recent API changes that generate short-lived tokens instead of a long-term token, but currently the provider is broken, as it cannot continue from there.

I'm willing to contribute a fix if it helps speed up the process, as we are a bit blocked by this.

JorTurFer commented 9 months ago

Given that, I'm not 100% sure about the best approach to fix this. From my (ignorant) point of view, we could:

Generate a token with a fixed expiration, customizable by users within the resource (or just use a default expiration, e.g. 1 hour or so).
Support a new resource that generates those tokens, and remove the kube_config attribute from stackit_ske_cluster.

I'd say that just generating a short-lived token could be the easiest fix for the issue while another option is applied (or not).

vicentepinto98 commented 9 months ago

Hello @JorTurFer, I will have a look into this issue as soon as possible.

From the first look, there should be an easy way to fix the issue by just replacing the deprecated endpoint in our provider implementation with the new one.

JorTurFer commented 9 months ago

> From the first look, there should be an easy way to fix the issue by just replacing the deprecated endpoint in our provider implementation with the new one.

Yeah, I think so.

> I will have a look into this issue as soon as possible.

I'm willing to open a PR later on, since this is something that blocks us. I've already reviewed the provider and the underlying client xD. If we agree on the expiration for the token, that's all I need (or at least, I hope so). I guess 1 hour sounds good.

vicentepinto98 commented 9 months ago

We just released a new version of the SKE SDK, v0.10.0, which includes the new endpoints for credentials rotation. If you want to raise the PR with the fix (replacing the endpoints), I would be happy to review and support.

vicentepinto98 commented 9 months ago

But I also want to check how exactly we should proceed, because there is now no way to GET existing kubeconfigs, which means the kubeconfig field in the resource needs to be handled specially. I am in contact with the SKE team and will give you an update on this.

JorTurFer commented 9 months ago

Getting a fresh kubeconfig with a short lifetime could work for Terraform, couldn't it? I mean, if the token is only required for Terraform operations, 15-60 min can be enough, and for longer operations, users can create their own service accounts and use their tokens (evaluated by cluster RBAC).

vicentepinto98 commented 9 months ago

Yes, I also just double checked with the SKE team.

So for the short term, we can replace the endpoint and always generate a new kube_config with the default expiration time of 1h. This will keep the current implementation working in a non-breaking way and hopefully fix your issue.

For the medium term, we will check the possibility of creating a new kube_config resource and deprecating the field on the cluster.

vicentepinto98 commented 9 months ago

Are you gonna work on the PR or should I do it?

JorTurFer commented 9 months ago

> Are you gonna work on the PR or should I do it?

As you prefer. I'm finishing the migration from the community provider to the official one, and once I finish I'll start on this.

vicentepinto98 commented 9 months ago

Hello, we have added a deprecation notice to the kubeconfig field on the stackit_ske_cluster resource and added a new resource, stackit_ske_kubeconfig, that should be used going forward (for clusters with k8s version >= 1.27, or if the new credentials flow has already been used).

More info on the new SKE credentials rotation process here.

This will be included in the next provider release.
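As a rough sketch of what moving to the new resource might look like, assuming the argument names `project_id`, `cluster_name`, and `expiration` (in seconds) and the exported attribute `kube_config`: none of these are spelled out in this thread, so please check them against the provider documentation for your version. The project ID variable and the output name are placeholders.

```hcl
# Hypothetical input; replace with however you already pass the project ID.
variable "project_id" {
  type = string
}

# Request a short-lived kubeconfig for an existing cluster instead of
# reading the deprecated field on stackit_ske_cluster.
resource "stackit_ske_kubeconfig" "example" {
  project_id   = var.project_id
  cluster_name = stackit_ske_cluster.cluster.name

  # Assumed to be a lifetime in seconds; 3600 matches the 1h default
  # discussed above.
  expiration = 3600
}

# Expose the kubeconfig to callers without printing it in plan output.
output "kubeconfig" {
  value     = stackit_ske_kubeconfig.example.kube_config
  sensitive = true
}
```

Because the kubeconfig expires, anything that needs access beyond the apply run would still be better served by a dedicated service account token, as suggested earlier in the thread.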

vicentepinto98 commented 8 months ago

A new version of the provider, v0.11.0, has been released, which includes the fix.