kubermatic / machine-controller

Machine Controller crashes when creating hetzner cloud server #943

Closed: ewallat closed this issue 2 years ago

ewallat commented 3 years ago

Hi everyone,

When creating a user cluster, the machine controller crashes while creating a Hetzner Cloud server:

{"level":"info","time":"2021-04-12T11:44:44.223Z","logger":"http-prober","caller":"http-prober/main.go:109","msg":"Probing","attempt":1,"max-attempts":100,"target":"https://apiserver-external.cluster-c8ck48pcd9.svc.cluster.local./healthz"}
{"level":"info","time":"2021-04-12T11:44:44.235Z","logger":"http-prober","caller":"http-prober/main.go:98","msg":"Hostname resolved","hostname":"apiserver-external.cluster-c8ck48pcd9.svc.cluster.local.","address":"10.103.44.151:443"}
{"level":"info","time":"2021-04-12T11:44:44.241Z","logger":"http-prober","caller":"http-prober/main.go:122","msg":"Endpoint is available"}
{"level":"info","time":"2021-04-12T11:44:44.326Z","logger":"http-prober","caller":"http-prober/main.go:129","msg":"All CRDs became available"}
I0412 11:44:44.451336       1 leaderelection.go:243] attempting to acquire leader lease  kube-system/machine-controller...
I0412 11:45:00.041870       1 leaderelection.go:253] successfully acquired lease kube-system/machine-controller
W0412 11:45:15.050128       1 warnings.go:67] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
W0412 11:45:15.061807       1 warnings.go:67] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
I0412 11:45:15.143245       1 migrations.go:175] CRD machines.machine.k8s.io not present, no migration needed
I0412 11:45:15.143302       1 migrations.go:54] Starting to migrate providerConfigs to providerSpecs
I0412 11:45:15.172830       1 migrations.go:136] Successfully migrated providerConfigs to providerSpecs
I0412 11:45:15.172970       1 plugin.go:95] looking for plugin "machine-controller-userdata-centos"
I0412 11:45:15.173184       1 plugin.go:123] checking "/usr/local/bin/machine-controller-userdata-centos"
I0412 11:45:15.173459       1 plugin.go:136] found '/usr/local/bin/machine-controller-userdata-centos'
I0412 11:45:15.173484       1 plugin.go:95] looking for plugin "machine-controller-userdata-coreos"
I0412 11:45:15.173511       1 plugin.go:123] checking "/usr/local/bin/machine-controller-userdata-coreos"
I0412 11:45:15.173573       1 plugin.go:136] found '/usr/local/bin/machine-controller-userdata-coreos'
I0412 11:45:15.173582       1 plugin.go:95] looking for plugin "machine-controller-userdata-ubuntu"
I0412 11:45:15.173602       1 plugin.go:123] checking "/usr/local/bin/machine-controller-userdata-ubuntu"
I0412 11:45:15.173627       1 plugin.go:136] found '/usr/local/bin/machine-controller-userdata-ubuntu'
I0412 11:45:15.173650       1 plugin.go:95] looking for plugin "machine-controller-userdata-sles"
I0412 11:45:15.173668       1 plugin.go:123] checking "/usr/local/bin/machine-controller-userdata-sles"
I0412 11:45:15.173692       1 plugin.go:136] found '/usr/local/bin/machine-controller-userdata-sles'
I0412 11:45:15.173716       1 plugin.go:95] looking for plugin "machine-controller-userdata-rhel"
I0412 11:45:15.173733       1 plugin.go:123] checking "/usr/local/bin/machine-controller-userdata-rhel"
I0412 11:45:15.173762       1 plugin.go:136] found '/usr/local/bin/machine-controller-userdata-rhel'
I0412 11:45:15.173783       1 plugin.go:95] looking for plugin "machine-controller-userdata-flatcar"
I0412 11:45:15.173801       1 plugin.go:123] checking "/usr/local/bin/machine-controller-userdata-flatcar"
I0412 11:45:15.173839       1 plugin.go:136] found '/usr/local/bin/machine-controller-userdata-flatcar'
I0412 11:45:15.174111       1 main.go:412] machine controller startup complete
I0412 11:45:15.376242       1 machineset_controller.go:148] Reconcile machineset fervent-leakey-cd8758f7c
I0412 11:45:15.376406       1 status.go:56] Unable to get node for machine fervent-leakey-cd8758f7c-8lhff, machine has no node ref
I0412 11:45:15.376453       1 status.go:56] Unable to get node for machine fervent-leakey-cd8758f7c-8m2mz, machine has no node ref
I0412 11:45:15.376464       1 status.go:56] Unable to get node for machine fervent-leakey-cd8758f7c-fgwb6, machine has no node ref
I0412 11:45:15.573881       1 machine_controller.go:672] Validated machine spec of fervent-leakey-cd8758f7c-8lhff
I0412 11:45:15.587958       1 machine_controller.go:672] Validated machine spec of fervent-leakey-cd8758f7c-fgwb6
I0412 11:45:16.066094       1 machine_controller.go:672] Validated machine spec of fervent-leakey-cd8758f7c-8m2mz
E0412 11:45:17.176609       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 368 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x2e955e0, 0x5581e60)
    k8s.io/apimachinery@v0.19.4/pkg/util/runtime/runtime.go:74 +0xa6
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
    k8s.io/apimachinery@v0.19.4/pkg/util/runtime/runtime.go:48 +0x89
panic(0x2e955e0, 0x5581e60)
    runtime/panic.go:969 +0x175
github.com/hetznercloud/hcloud-go/hcloud.(*ServerClient).Create(0xc0009b00d0, 0x3a82ce0, 0xc000058018, 0xc0009b3100, 0x1e, 0xc000d26880, 0xc0001700d0, 0xc00011c758, 0x1, 0x1, ...)
    github.com/hetznercloud/hcloud-go@v1.23.1/hcloud/server.go:306 +0x409
github.com/kubermatic/machine-controller/pkg/cloudprovider/provider/hetzner.(*provider).Create(0xc000182728, 0xc000030f00, 0xc0009a4510, 0xc00112e800, 0x436c, 0x0, 0x0, 0x0, 0x0)
    github.com/kubermatic/machine-controller/pkg/cloudprovider/provider/hetzner/provider.go:284 +0x8b8
github.com/kubermatic/machine-controller/pkg/cloudprovider.(*cachingValidationWrapper).Create(0xc000212780, 0xc000030f00, 0xc0009a4510, 0xc00112e800, 0x436c, 0x0, 0x0, 0x0, 0x35dc201)
    github.com/kubermatic/machine-controller/pkg/cloudprovider/validationwrapper.go:77 +0x5f
github.com/kubermatic/machine-controller/pkg/controller/machine.(*Reconciler).createProviderInstance(0xc000b2d8c0, 0x3aaafe0, 0xc000212780, 0xc000030f00, 0xc00112e800, 0x436c, 0x0, 0x0, 0x0, 0x0)
    github.com/kubermatic/machine-controller/pkg/controller/machine/machine_controller.go:329 +0x137
github.com/kubermatic/machine-controller/pkg/controller/machine.(*Reconciler).ensureInstanceExistsForMachine(0xc000b2d8c0, 0x3a82d60, 0xc0008caea0, 0x3aaafe0, 0xc000212780, 0xc000030f00, 0x3a23ac0, 0xc000989f00, 0xc000339e00, 0x0, ...)
    github.com/kubermatic/machine-controller/pkg/controller/machine/machine_controller.go:706 +0x8bd
github.com/kubermatic/machine-controller/pkg/controller/machine.(*Reconciler).reconcile(0xc000b2d8c0, 0x3a82d60, 0xc0008caea0, 0xc000030f00, 0x55e44c0, 0xc0009b3060, 0x1e)
    github.com/kubermatic/machine-controller/pkg/controller/machine/machine_controller.go:403 +0x7ac
github.com/kubermatic/machine-controller/pkg/controller/machine.(*Reconciler).Reconcile(0xc000b2d8c0, 0x3a82d60, 0xc0008caea0, 0xc0004b71e0, 0xb, 0xc0009b3060, 0x1e, 0xc0008caea0, 0x40a3ff, 0xc00003a000, ...)
    github.com/kubermatic/machine-controller/pkg/controller/machine/machine_controller.go:357 +0x5e8
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0003399a0, 0x3a82ca0, 0xc00028d500, 0x302d020, 0xc000a2b3a0)
    sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:263 +0x317
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0003399a0, 0x3a82ca0, 0xc00028d500, 0xc000295600)
    sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:235 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1(0x3a82ca0, 0xc00028d500)
    sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:198 +0x4a
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
    k8s.io/apimachinery@v0.19.4/pkg/util/wait/wait.go:185 +0x37
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000295750)
    k8s.io/apimachinery@v0.19.4/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000d97f50, 0x3a29520, 0xc0008cae10, 0xc00028d501, 0xc00097dc80)
    k8s.io/apimachinery@v0.19.4/pkg/util/wait/wait.go:156 +0xad
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000295750, 0x3b9aca00, 0x0, 0x1, 0xc00097dc80)
    k8s.io/apimachinery@v0.19.4/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext(0x3a82ca0, 0xc00028d500, 0xc0002126a0, 0x3b9aca00, 0x0, 0x1)
    k8s.io/apimachinery@v0.19.4/pkg/util/wait/wait.go:185 +0xa6
k8s.io/apimachinery/pkg/util/wait.UntilWithContext(0x3a82ca0, 0xc00028d500, 0xc0002126a0, 0x3b9aca00)
    k8s.io/apimachinery@v0.19.4/pkg/util/wait/wait.go:99 +0x57
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
    sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:195 +0x4e7
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
    panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x22ca249]

It seems to me that some information is not passed properly to the Hetzner Cloud Go client (hcloud-go), resulting in a nil pointer dereference.
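For illustration, here is a minimal hcloud-go sketch of my own (not the machine-controller code) showing one way such a nil pointer can arise: Network.Get returns a nil *hcloud.Network together with a nil error when the network does not exist, and passing that nil entry on to Server.Create leads to the panic seen in the trace. The token, network name, server type and image below are placeholder values.

package main

import (
	"context"
	"log"
	"os"

	"github.com/hetznercloud/hcloud-go/hcloud"
)

func main() {
	client := hcloud.NewClient(hcloud.WithToken(os.Getenv("HCLOUD_TOKEN")))
	ctx := context.Background()

	// Placeholder name; in practice this would come from the MachineDeployment's
	// hetzner networks list. Get returns a nil network and a nil error when no
	// such network exists in the project.
	network, _, err := client.Network.Get(ctx, "kubermatic")
	if err != nil {
		log.Fatalf("network lookup failed: %v", err)
	}
	if network == nil {
		// Without this guard the nil pointer would end up in
		// ServerCreateOpts.Networks and Server.Create would panic.
		log.Fatal("network not found in this Hetzner Cloud project")
	}

	result, _, err := client.Server.Create(ctx, hcloud.ServerCreateOpts{
		Name:       "example-node",
		ServerType: &hcloud.ServerType{Name: "cx21"},
		Image:      &hcloud.Image{Name: "ubuntu-20.04"},
		Networks:   []*hcloud.Network{network},
	})
	if err != nil {
		log.Fatalf("server creation failed: %v", err)
	}
	log.Printf("created server %d", result.Server.ID)
}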

So far I have not been able to successfully create a user cluster on Hetzner.

Tested with a basic installation of Kubermatic CE, versions 2.16.7 and 2.16.8.

Test seed:

apiVersion: kubermatic.k8s.io/v1
kind: Seed
metadata:
  name: kubermatic
  namespace: kubermatic
spec:
  datacenters:
    hetzner-fsn1:
      country: DE
      location: Falkenstein 1 DC 14
      spec:
        hetzner:
          datacenter: "fsn1-dc14"
    do-ams:
      country: NL
      location: Amsterdam DC3
      spec:
        digitalocean:
          # Datacenter location, e.g. "ams3". A list of existing datacenters can be found
          # at https://www.digitalocean.com/docs/platform/availability-matrix/
          region: "ams3"
kron4eg commented 3 years ago

When you create a user cluster, the corresponding MachineDeployment object(s) are created in the user cluster's kube-apiserver, and they contain a Networks (Hetzner) parameter. Please make sure that the network IDs listed there actually exist and that you can retrieve them with hcloud network describe <ID-FROM-MachineDeployment>.

ewallat commented 3 years ago

Hi @kron4eg,

thank you for your feedback.

Just for my understanding: where is the network ID supposed to come from? It was actually missing in the CRD; I added it manually and the nodes were created.

The moment I adjust the pool, for example by scaling it up, the network ID is removed and the machine controller starts crashing again.

I never had the opportunity to specify a network; is this perhaps more of a dashboard issue?

kron4eg commented 3 years ago

It looks like a complex bug involving two (or even three) components at the same time:

1) the machine-controller webhook lacks validation of the presence of the network
2) Kubermatic fails to provide the network to the MachineDeployment
3) the dashboard fails to make the network a required field

kron4eg commented 3 years ago

@ewallat here's the plan.

ewallat commented 3 years ago

@kron4eg that sounds good.

Regarding the dashboard, I had a look once: there already seems to be the possibility to define networks. It has already been merged and will probably ship with version 2.17.

Issue: https://github.com/kubermatic/dashboard/issues/3124

kron4eg commented 3 years ago

Yes, it's there; however, it's "optional".

shibumi commented 3 years ago

@ewallat can you describe how you fixed this on Hetzner? I am running into the same problem right now.

ewallat commented 3 years ago

Hey @shibumi,

I have defined a standard network in my seed configuration:

spec:
  datacenters:
    hetzner-fsn1:
      country: DE
      location: Falkenstein 1 DC 14
      spec:
        hetzner:
          datacenter: "fsn1-dc14"
          network: "default"
    hetzner-nbg1:
      country: DE
      location: Nürnberg 1 DC 3
      spec:
        hetzner:
          datacenter: "nbg1-dc3"
          network: "default"

I then create only one user cluster per Hetzner Cloud project and create the network "default" in advance. I use 192.168.0.0/16 as the subnet, but you can also use a different one. Instead of default you can probably use network-1 or any other name; it just has to match the name in the seed configuration.
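For reference, the "default" network can also be created up front with a small hcloud-go program instead of the console or CLI. A rough sketch; the name, IP ranges and network zone are just my example values:

package main

import (
	"context"
	"log"
	"net"
	"os"

	"github.com/hetznercloud/hcloud-go/hcloud"
)

func main() {
	client := hcloud.NewClient(hcloud.WithToken(os.Getenv("HCLOUD_TOKEN")))
	ctx := context.Background()

	// Parent IP range of the network; example value, pick what fits your setup.
	_, ipRange, err := net.ParseCIDR("192.168.0.0/16")
	if err != nil {
		log.Fatal(err)
	}
	// An example subnet inside the parent range for the eu-central zone.
	_, subnetRange, err := net.ParseCIDR("192.168.0.0/24")
	if err != nil {
		log.Fatal(err)
	}

	// Create the "default" network that the seed's hetzner datacenters reference.
	network, _, err := client.Network.Create(ctx, hcloud.NetworkCreateOpts{
		Name:    "default",
		IPRange: ipRange,
		Subnets: []hcloud.NetworkSubnet{
			{
				Type:        hcloud.NetworkSubnetTypeCloud,
				NetworkZone: hcloud.NetworkZoneEUCentral,
				IPRange:     subnetRange,
			},
		},
	})
	if err != nil {
		log.Fatalf("network creation failed: %v", err)
	}
	log.Printf("created network %q (ID %d)", network.Name, network.ID)
}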

With default as the name, you don't have to specify a network in the Kubermatic wizard anymore.

I think the whole thing is just a big workaround, but it works quite well for me so far.

kron4eg commented 3 years ago

Yeah, it's a workaround, but the alternative is a backward-compatibility-breaking change (https://github.com/kubermatic/machine-controller/pull/944).

shibumi commented 3 years ago

@kron4eg Oh, this is fine. I am not using Hetzner in production yet; right now we are just using Hetzner for evaluation purposes. Our main goal is to host Kubermatic on VMware.

kubermatic-bot commented 3 years ago

Issues go stale after 90d of inactivity. After a further 30 days, they will turn rotten. Mark the issue as fresh with /remove-lifecycle stale.

If this issue is safe to close now please do so with /close.

/lifecycle stale

kubermatic-bot commented 3 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

kron4eg commented 3 years ago

/remove-lifecycle rotten

kubermatic-bot commented 2 years ago

Issues go stale after 90d of inactivity. After a further 30 days, they will turn rotten. Mark the issue as fresh with /remove-lifecycle stale.

If this issue is safe to close now please do so with /close.

/lifecycle stale

kubermatic-bot commented 2 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

kubermatic-bot commented 2 years ago

Issues go stale after 90d of inactivity. After a further 30 days, they will turn rotten. Mark the issue as fresh with /remove-lifecycle stale.

If this issue is safe to close now please do so with /close.

/lifecycle stale

kubermatic-bot commented 2 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

kubermatic-bot commented 2 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

/close

kubermatic-bot commented 2 years ago

@kubermatic-bot: Closing this issue.

In response to [this](https://github.com/kubermatic/machine-controller/issues/943#issuecomment-1187334509):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.