Closed schlichtanders closed 1 year ago
Further investigation brought me to this Terraform documentation, where it is mentioned that the `id` field is unique, but not the `name`. However, the `id` is not set at all in kube-hetzner:
data "hcloud_load_balancer" "cluster" {
count = local.has_external_load_balancer ? 0 : 1
name = var.cluster_name
depends_on = [null_resource.kustomization]
}
Also, this is a `data` definition, while the alternative control-plane load balancer is a `resource` definition. Maybe this interacts as well.
Reading the documentation for the resource `hcloud_load_balancer` (https://registry.terraform.io/providers/hetznercloud/hcloud/latest/docs/resources/load_balancer), it is clear that the real issue is not the `id` (it is not needed for the resource definition), but really the use of the `data` definition.

Why is the default load balancer a `data` definition? This has the disadvantage described above; presumably the disadvantage was not known and the `data` approach had other advantages, which is why it was done this way.
@mysticaltech you last edited this definition. Do you know why you chose to make it a `data` definition?
@schlichtanders Basically, this LB has nothing to do with terraform; it's deployed via the CCM at the request of the ingress controller. That's one of the reasons we need the cleanupkh script, same for the new instances spun up by the autoscaler: neither originates from terraform. Via terraform we just read the deployed LB to ensure that everything is fine at the end of the deployment.
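As a quick way to see that relationship in practice (a sketch; the traefik namespace/service names are assumptions based on this thread, not guaranteed defaults):

```bash
# The ingress controller's Service of type LoadBalancer that the CCM acts on
kubectl -n traefik get svc traefik -o wide

# The load balancer the CCM created from it on the Hetzner side
hcloud load-balancer list
```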
And we do set the cluster name as the lb name, so there would be an issue only if you did not change your cluster name for your new cluster!
I changed the `cluster_name`, and my load balancer previously named `jolin` got renamed to the new cluster name `jolin-20230903t120301z`. Please re-add the bug label.
EDIT: So if I understand your confidence correctly, it seems like my renaming was due to some other mistake somewhere. Still super weird, as it got renamed from the old `cluster_name` to the new `cluster_name`. This should not be possible if I understand what you are saying... I will test again tomorrow.
I was able to test it again today (took quite some time), and apparently I misunderstood how terraform works. It seems I've been running into a terraform newbie issue. I am running into completely different issues now, e.g. #963.
Closing this as caused by confusion.
It actually got renamed. I am now using terraform workspaces nicely and have everything twice, except the load balancer. It got renamed...
@schlichtanders I believe you, but please show me some hcloud CLI outputs, screenshots of the UI, or anything. I'm not sure we can do a lot about this: we set the name as shown above, so unless the cluster name changes, it shouldn't be renamed. And everything is managed by the ingress controller and the CCM automatically, so maybe there is a bug upstream. Need more info!
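For reference, something like the following would capture the relevant state (a sketch; the LB name `jolin` is taken from this thread):

```bash
# List all load balancers in the project, then inspect the one in question
hcloud load-balancer list
hcloud load-balancer describe jolin
```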
The final error goes away after waiting some time - apparently the load balancer is still being set up.
I found this documentation, which mentions that as soon as the name field is defined, it tries to import an existing load balancer.
I want to look deeper into the respective code, but my current best guess is that the load balancer could be created beforehand via terraform. Maybe that already fixes this issue.
@schlichtanders Yes, I saw the error, weird. The doc you found is super interesting; I had no idea we could create and use a previously deployed LB. That is theoretically doable, as we can configure the service part in both nginx and traefik.
@aleksasiriski @M4t7e @ifeulner What do you think about that, folks? I think it would be nice to go that route; it would simplify our code and fix the occasional LB recreation we saw in the past.
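A rough sketch of what that could look like, assuming the hcloud CCM's `load-balancer.hetzner.cloud/name` service annotation is what adopts a pre-created LB (the namespace, service name, LB type and location below are assumptions, not project defaults):

```bash
# Pre-create the load balancer outside of the CCM
hcloud load-balancer create --name jolin --type lb11 --location fsn1

# Point the ingress controller's Service at it; if an LB with this name already
# exists, the CCM adopts it instead of creating a new one
kubectl -n traefik annotate service traefik \
  load-balancer.hetzner.cloud/name=jolin --overwrite
```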
I just tried to create the load balancer `jolin` in advance, but somehow it did not work. Here is a screenshot from my Hetzner console: these two load balancers are now from a SINGLE terraform apply (EDIT: and in a different location than specified).
Looking into the kustomize service definition that got deployed, the timestamped name really is used... So something must be getting confused... Here is a pointer to the relevant source code of the CCM: https://github.com/hetznercloud/hcloud-cloud-controller-manager/blob/main/hcloud/load_balancers.go#L93-L166
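For anyone following along, a sketch of how to check which LB name the deployed Service actually carries (namespace and service name are assumptions based on this thread):

```bash
# Print the LB name the CCM will use for this Service (dots in the annotation key must be escaped)
kubectl -n traefik get svc traefik \
  -o jsonpath='{.metadata.annotations.load-balancer\.hetzner\.cloud/name}'
```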
@mysticaltech is there a way to inspect the full debug logs from the terraform apply?
On the console, only the logs without sensitive information are shown. It would be great to see whether the full logs contain some hints. (I couldn't find a way myself by googling for terraform logs, but maybe there is a kube-hetzner way of doing it.)
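Not kube-hetzner specific, but Terraform itself can write full debug logs via environment variables; a minimal sketch:

```bash
# Write verbose Terraform core and provider logs to a file for the next run
# (note: the debug log may itself contain sensitive values)
export TF_LOG=DEBUG
export TF_LOG_PATH=./terraform-debug.log
terraform apply
```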
I opened a bug report https://github.com/hetznercloud/hcloud-cloud-controller-manager/issues/504
Just a thought: it might be an interaction with the restoring (I am still trying to finish the restore procedure).
The backup may have had the old location and name...
Maybe a match of IDs could also explain why a second deployment to the same cluster somehow took over the load balancer.
I think I also understood the final missing piece: `LabelServiceUID`. The CCM matches an existing load balancer by the Service UID stored in that label, and a restore brings back the Service with its old UID. This explains why every one of my restoration deployments picked the same load balancer, leading to the renaming behaviour.
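Assuming the CCM really stores the Service UID as a label on the LB, the two values can be compared like this (a sketch; `jolin` and the traefik names come from this thread, and `jq` is an extra dependency):

```bash
# UID of the (restored) ingress Service
kubectl -n traefik get svc traefik -o jsonpath='{.metadata.uid}'

# Labels the CCM put on the load balancer
hcloud load-balancer describe jolin -o json | jq '.labels'
```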
I now added a `kubectl delete service/traefik -n traefik` line right after kubectl becomes available, and indeed the load balancer is no longer renamed... The downside is that this actually takes a while, probably because the service was already deployed again and the load balancer had already been created.

I haven't found a sound way to delete this service reference before the cluster starts, and I'm not sure what the ideal way to deal with it would be. In the other ticket I suggested a flag which makes it explicit that only the cluster name should count.
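For reference, the workaround described above as a minimal sketch (the kubeconfig path is hypothetical, the traefik names are assumptions):

```bash
# As soon as the restored cluster's kubeconfig is available, drop the restored
# ingress Service so the CCM creates a fresh LB instead of adopting and renaming the old one
export KUBECONFIG=./mycluster_kubeconfig.yaml
kubectl -n traefik delete service traefik
```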
I am closing this issue, because I am now able to have two healthy load balancers without one deployment interfering with the other. Hence it was an interaction with the restoration :+1:.
@schlichtanders Interesting! Thanks for sharing your in-depth analysis. So what are your conclusions on the ability to do restorations with this project?
Yesterday I was luckily able to do a proper restoration without the side effects of this issue. Only a tiny addition is actually needed - see my pull request https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/pull/968
Further thoughts I had:
A restoration into the same project really needs to get rid of this load-balancer Kubernetes configuration before the restoration is done (I really want to prevent any kind of side effects with existing load balancers).
`kubectl` only works once the cluster is running, but by then the restoration has already been applied, which is too late. :slightly_frowning_face: So I used `etcdctl` instead, on the raw etcd storage, and, I am so happy, it seems to have worked. :slightly_smiling_face: So now I can do an etcd restore and delete the traefik service from it before any config gets executed. This is tested now and fixes everything as hoped for.
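Very roughly, the idea looks like this (a sketch only; the etcd endpoint and the k3s TLS paths are assumptions, and the traefik namespace/name come from this thread):

```bash
# After restoring the etcd snapshot, but before the API server and CCM come up,
# delete the stored traefik Service directly from etcd so the CCM cannot adopt
# (and rename) the old load balancer. Kubernetes stores Services under
# /registry/services/specs/<namespace>/<name>.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  del /registry/services/specs/traefik/traefik
```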
Description
This just destroyed my cluster :cry:
I wanted to create a new cluster next to my previous one, but apparently the old load balancer got deleted (or renamed)?
I have no clue how this can happen... I hope for quick help to get a fix as soon as possible, and to prevent something like this happening to others.
EDIT: I just realized that the public IP of the load balancer stayed the same - hence somehow the old one got renamed (instead of an additional new one being created).
Kube.tf file
Screenshots
No response
Platform
Linux