tinkerbell / playground

Example deployments of the Tinkerbell Stack for use as playground environments
Apache License 2.0

The tink_worker in terraformed sandbox doesn't get provisioned #132

Closed wokalski closed 1 year ago

wokalski commented 2 years ago

Two disclaimers:

  1. I am still investigating this issue
  2. I am new to all of this but I did follow the guide and tried to do some basic troubleshooting.

After running terraform apply, I reboot the tink_worker for the first time and it doesn't get provisioned.

My first intuition was a networking issue, especially since I can see a couple of "this doesn't work as it's supposed to" comments in the terraform file. I'll run a tcpdump on the server on port 67 to check. That said, the network does seem to be set up correctly when I check it on the Equinix Metal portal. It's not a networking issue.

If that's correct, I'm going to tinker in the worker itself, maybe I'm hitting #130? It's a bit odd though, I reran it a couple of times and it consistently didn't work.

Expected Behaviour

Tink-worker connects to the provisioner and one can see the worker under tink workflow events

Current Behaviour

The workflow is stuck in the PENDING state.

Steps to Reproduce (for bugs)

Run the instructions from here

Context

I was just trying to take Tinkerbell for a spin!

Your Environment

I'm running it on macOS, and I'm using the terraform sandbox with Equinix Metal.

wokalski commented 2 years ago

Here are the logs from the boots container:

{"level":"info","ts":1650192223.6683865,"caller":"dhcp4-go@v0.0.0-20190402165401-39c137f31ad3/handler.go:105","msg":"","service":"github.com/tinkerbell/boots","pkg":"dhcp","pkg":"dhcp","event":"recv","mac":"0c:42:a1:97:f6:48","via":"0.0.0.0","iface":"enp2s0f1","xid":"\"3d:45:49:49\"","type":"DHCPDISCOVER","secs":4}
{"level":"info","ts":1650192223.6685946,"caller":"boots/dhcp.go:78","msg":"parsed option82/circuitid","service":"github.com/tinkerbell/boots","pkg":"main","mac":"0c:42:a1:97:f6:48","circuitID":""}
{"level":"info","ts":1650192223.6712575,"caller":"boots/dhcp.go:91","msg":"retrieved job is empty","service":"github.com/tinkerbell/boots","pkg":"main","type":"DHCPDISCOVER","mac":"0c:42:a1:97:f6:48","err":"discover from dhcp message: get hardware by mac from tink: rpc error: code = Unknown desc = SELECT: sql: no rows in result set","errVerbose":"rpc error: code = Unknown desc = SELECT: sql: no rows in result set\nget hardware by mac from tink\ngithub.com/tinkerbell/boots/packet.(*client).DiscoverHardwareFromDHCP\n\t/opt/actions-runner/_work/boots/boots/packet/endpoints.go:108\ngithub.com/tinkerbell/boots/job.discoverHardwareFromDHCP.func1\n\t/opt/actions-runner/_work/boots/boots/job/fetch.go:17\ngithub.com/golang/groupcache/singleflight.(*Group).Do\n\t/home/github/go/pkg/mod/github.com/golang/groupcache@v0.0.0-20190702054246-869f871628b6/singleflight/singleflight.go:56\ngithub.com/tinkerbell/boots/job.discoverHardwareFromDHCP\n\t/opt/actions-runner/_work/boots/boots/job/fetch.go:19\ngithub.com/tinkerbell/boots/job.CreateFromDHCP\n\t/opt/actions-runner/_work/boots/boots/job/job.go:106\nmain.dhcpHandler.serveDHCP\n\t/opt/actions-runner/_work/boots/boots/cmd/boots/dhcp.go:89\nmain.dhcpHandler.ServeDHCP.func1\n\t/opt/actions-runner/_work/boots/boots/cmd/boots/dhcp.go:50\ngithub.com/gammazero/workerpool.(*WorkerPool).dispatch.func1\n\t/home/github/go/pkg/mod/github.com/gammazero/workerpool@v0.0.0-20200311205957-7b00833861c6/workerpool.go:169\nruntime.goexit\n\t/opt/actions-runner/_work/_tool/go/1.16.3/x64/src/runtime/asm_amd64.s:1371\ndiscover from dhcp message"}
{"level":"info","ts":1650192227.7058403,"caller":"dhcp4-go@v0.0.0-20190402165401-39c137f31ad3/handler.go:105","msg":"","service":"github.com/tinkerbell/boots","pkg":"dhcp","pkg":"dhcp","event":"recv","mac":"0c:42:a1:97:f6:48","via":"0.0.0.0","iface":"enp2s0f1","xid":"\"3d:45:49:49\"","type":"DHCPDISCOVER","secs":8}
{"level":"info","ts":1650192227.7061045,"caller":"boots/dhcp.go:78","msg":"parsed option82/circuitid","service":"github.com/tinkerbell/boots","pkg":"main","mac":"0c:42:a1:97:f6:48","circuitID":""}
{"level":"info","ts":1650192227.7088065,"caller":"boots/dhcp.go:91","msg":"retrieved job is empty","service":"github.com/tinkerbell/boots","pkg":"main","type":"DHCPDISCOVER","mac":"0c:42:a1:97:f6:48","err":"discover from dhcp message: get hardware by mac from tink: rpc error: code = Unknown desc = SELECT: sql: no rows in result set","errVerbose":"rpc error: code = Unknown desc = SELECT: sql: no rows in result set\nget hardware by mac from tink\ngithub.com/tinkerbell/boots/packet.(*client).DiscoverHardwareFromDHCP\n\t/opt/actions-runner/_work/boots/boots/packet/endpoints.go:108\ngithub.com/tinkerbell/boots/job.discoverHardwareFromDHCP.func1\n\t/opt/actions-runner/_work/boots/boots/job/fetch.go:17\ngithub.com/golang/groupcache/singleflight.(*Group).Do\n\t/home/github/go/pkg/mod/github.com/golang/groupcache@v0.0.0-20190702054246-869f871628b6/singleflight/singleflight.go:56\ngithub.com/tinkerbell/boots/job.discoverHardwareFromDHCP\n\t/opt/actions-runner/_work/boots/boots/job/fetch.go:19\ngithub.com/tinkerbell/boots/job.CreateFromDHCP\n\t/opt/actions-runner/_work/boots/boots/job/job.go:106\nmain.dhcpHandler.serveDHCP\n\t/opt/actions-runner/_work/boots/boots/cmd/boots/dhcp.go:89\nmain.dhcpHandler.ServeDHCP.func1\n\t/opt/actions-runner/_work/boots/boots/cmd/boots/dhcp.go:50\ngithub.com/gammazero/workerpool.startWorker\n\t/home/github/go/pkg/mod/github.com/gammazero/workerpool@v0.0.0-20200311205957-7b00833861c6/workerpool.go:218\nruntime.goexit\n\t/opt/actions-runner/_work/_tool/go/1.16.3/x64/src/runtime/asm_amd64.s:1371\ndiscover from dhcp message"}
{"level":"info","ts":1650192235.779676,"caller":"dhcp4-go@v0.0.0-20190402165401-39c137f31ad3/handler.go:105","msg":"","service":"github.com/tinkerbell/boots","pkg":"dhcp","pkg":"dhcp","event":"recv","mac":"0c:42:a1:97:f6:48","via":"0.0.0.0","iface":"enp2s0f1","xid":"\"3d:45:49:49\"","type":"DHCPDISCOVER","secs":12}
{"level":"info","ts":1650192235.7798727,"caller":"boots/dhcp.go:78","msg":"parsed option82/circuitid","service":"github.com/tinkerbell/boots","pkg":"main","mac":"0c:42:a1:97:f6:48","circuitID":""}
{"level":"info","ts":1650192235.7824285,"caller":"boots/dhcp.go:91","msg":"retrieved job is empty","service":"github.com/tinkerbell/boots","pkg":"main","type":"DHCPDISCOVER","mac":"0c:42:a1:97:f6:48","err":"discover from dhcp message: get hardware by mac from tink: rpc error: code = Unknown desc = SELECT: sql: no rows in result set","errVerbose":"rpc error: code = Unknown desc = SELECT: sql: no rows in result set\nget hardware by mac from tink\ngithub.com/tinkerbell/boots/packet.(*client).DiscoverHardwareFromDHCP\n\t/opt/actions-runner/_work/boots/boots/packet/endpoints.go:108\ngithub.com/tinkerbell/boots/job.discoverHardwareFromDHCP.func1\n\t/opt/actions-runner/_work/boots/boots/job/fetch.go:17\ngithub.com/golang/groupcache/singleflight.(*Group).Do\n\t/home/github/go/pkg/mod/github.com/golang/groupcache@v0.0.0-20190702054246-869f871628b6/singleflight/singleflight.go:56\ngithub.com/tinkerbell/boots/job.discoverHardwareFromDHCP\n\t/opt/actions-runner/_work/boots/boots/job/fetch.go:19\ngithub.com/tinkerbell/boots/job.CreateFromDHCP\n\t/opt/actions-runner/_work/boots/boots/job/job.go:106\nmain.dhcpHandler.serveDHCP\n\t/opt/actions-runner/_work/boots/boots/cmd/boots/dhcp.go:89\nmain.dhcpHandler.ServeDHCP.func1\n\t/opt/actions-runner/_work/boots/boots/cmd/boots/dhcp.go:50\ngithub.com/gammazero/workerpool.(*WorkerPool).dispatch.func1\n\t/home/github/go/pkg/mod/github.com/gammazero/workerpool@v0.0.0-20200311205957-7b00833861c6/workerpool.go:169\nruntime.goexit\n\t/opt/actions-runner/_work/_tool/go/1.16.3/x64/src/runtime/asm_amd64.s:1371\ndiscover from dhcp message"}
{"level":"info","ts":1650192251.8749926,"caller":"dhcp4-go@v0.0.0-20190402165401-39c137f31ad3/handler.go:105","msg":"","service":"github.com/tinkerbell/boots","pkg":"dhcp","pkg":"dhcp","event":"recv","mac":"0c:42:a1:97:f6:48","via":"0.0.0.0","iface":"enp2s0f1","xid":"\"3d:45:49:49\"","type":"DHCPDISCOVER","secs":16}
{"level":"info","ts":1650192251.8751898,"caller":"boots/dhcp.go:78","msg":"parsed option82/circuitid","service":"github.com/tinkerbell/boots","pkg":"main","mac":"0c:42:a1:97:f6:48","circuitID":""}
{"level":"info","ts":1650192251.8778894,"caller":"boots/dhcp.go:91","msg":"retrieved job is empty","service":"github.com/tinkerbell/boots","pkg":"main","type":"DHCPDISCOVER","mac":"0c:42:a1:97:f6:48","err":"discover from dhcp message: get hardware by mac from tink: rpc error: code = Unknown desc = SELECT: sql: no rows in result set","errVerbose":"rpc error: code = Unknown desc = SELECT: sql: no rows in result set\nget hardware by mac from tink\ngithub.com/tinkerbell/boots/packet.(*client).DiscoverHardwareFromDHCP\n\t/opt/actions-runner/_work/boots/boots/packet/endpoints.go:108\ngithub.com/tinkerbell/boots/job.discoverHardwareFromDHCP.func1\n\t/opt/actions-runner/_work/boots/boots/job/fetch.go:17\ngithub.com/golang/groupcache/singleflight.(*Group).Do\n\t/home/github/go/pkg/mod/github.com/golang/groupcache@v0.0.0-20190702054246-869f871628b6/singleflight/singleflight.go:56\ngithub.com/tinkerbell/boots/job.discoverHardwareFromDHCP\n\t/opt/actions-runner/_work/boots/boots/job/fetch.go:19\ngithub.com/tinkerbell/boots/job.CreateFromDHCP\n\t/opt/actions-runner/_work/boots/boots/job/job.go:106\nmain.dhcpHandler.serveDHCP\n\t/opt/actions-runner/_work/boots/boots/cmd/boots/dhcp.go:89\nmain.dhcpHandler.ServeDHCP.func1\n\t/opt/actions-runner/_work/boots/boots/cmd/boots/dhcp.go:50\ngithub.com/gammazero/workerpool.(*WorkerPool).dispatch.func1\n\t/home/github/go/pkg/mod/github.com/gammazero/workerpool@v0.0.0-20200311205957-7b00833861c6/workerpool.go:169\nruntime.goexit\n\t/opt/actions-runner/_work/_tool/go/1.16.3/x64/src/runtime/asm_amd64.s:1371\ndiscover from dhcp message"}
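
For readers skimming the logs above: every "retrieved job is empty" entry carries the same error, "no rows in result set", for the same MAC, which suggests tink has no hardware record for that address. A minimal sketch for pulling those MACs out of a boots log, assuming one JSON object per line as shown here (the helper name `failed_hardware_lookups` and the trimmed sample lines are illustrative, not part of boots):

```python
import json


def failed_hardware_lookups(log_lines):
    """Return the set of MACs whose hardware lookup returned no rows.

    Assumes each line is a JSON object with optional "mac" and "err"
    fields, as in the boots output above. Non-JSON lines are skipped.
    """
    macs = set()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate stray non-JSON output
        if not isinstance(entry, dict):
            continue
        if "no rows in result set" in entry.get("err", ""):
            macs.add(entry.get("mac", "unknown"))
    return macs


# Trimmed copies of two of the log entries above (fields shortened for brevity).
SAMPLE_LINES = [
    '{"level":"info","msg":"","event":"recv","mac":"0c:42:a1:97:f6:48","type":"DHCPDISCOVER"}',
    '{"level":"info","msg":"retrieved job is empty","mac":"0c:42:a1:97:f6:48",'
    '"err":"get hardware by mac from tink: rpc error: code = Unknown desc = SELECT: sql: no rows in result set"}',
]

print(sorted(failed_hardware_lookups(SAMPLE_LINES)))
```

Any MAC this reports is one that boots saw on the wire but could not match to a hardware record, so it is the address to compare against whatever was pushed into tink.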

wokalski commented 2 years ago

Ok, I'm super confused. I understand where it's coming from:

The hardware spec inserted into the database is hardcoded rather than dynamically generated based on the worker. I believe it's going to work after I insert a proper hardware definition.

It is, however, super unclear from the docs that I should do that, and honestly, it probably could be terraformed too? It's too glaring an oversight to be real; I must've overlooked something in the docs, but I'm not sure what.

mmlb commented 2 years ago

This does sound very not right :D, can you retry using the code from #126?

wokalski commented 2 years ago

I haven't used it, adding a correct hardware definition did work. I can't see how your PR fixes it though; it doesn't change anything about the hardware definitions, they are not converted into templates (as they should be).

My gut feeling is that somehow someone got it working for them consistently because the MAC seems to be not-so-random. When I created and destroyed the worker multiple times IIRC it got the same MAC.

mmlb commented 2 years ago

> I haven't used it, adding a correct hardware definition did work. I can't see how your PR fixes it though; it doesn't change anything about the hardware definitions, they are not converted into templates (as they should be).
>
> My gut feeling is that somehow someone got it working for them consistently because the MAC seems to be not-so-random. When I created and destroyed the worker multiple times IIRC it got the same MAC.

I've used the tf setup a bunch on whatever machines Equinix Metal ends up provisioning, so there's no way a MAC stays the same. It gets updated here: https://github.com/tinkerbell/sandbox/blob/main/deploy/compose/create-tink-records/create.sh#L20-L28. This happens (in my branch) by way of:

  1. terraform creates cloud-config userdata and populates the WORKER_MAC using the data from the api (https://github.com/mmlb/tinkerbell-sandbox/blob/terraform-love/deploy/terraform/main.tf#L103)
  2. which then runs setup.sh which overrides the mac in the .env file https://github.com/mmlb/tinkerbell-sandbox/blob/terraform-love/deploy/terraform/setup.sh#L163 -> https://github.com/mmlb/tinkerbell-sandbox/blob/terraform-love/deploy/terraform/setup.sh#L116-L127
  3. so that when docker-compose up is run it will pick up the worker's mac address (https://github.com/mmlb/tinkerbell-sandbox/blob/terraform-love/deploy/compose/docker-compose.yml#L191-L197) and update the hardware description before feeding it into tink https://github.com/mmlb/tinkerbell-sandbox/blob/terraform-love/deploy/compose/create-tink-records/create.sh#L20-L28
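
The last step, updating the hardware description with the worker's MAC before it is fed into tink, can be sketched as follows. This is an illustration only, not the logic in create.sh: the function name `with_worker_mac`, the placeholder MAC, and the trimmed hardware layout (`network.interfaces[].dhcp.mac`) are assumptions made for the example.

```python
import copy
import re

# Accepts colon-separated MACs only; an assumption for this sketch.
MAC_PATTERN = re.compile(r"([0-9a-f]{2}:){5}[0-9a-f]{2}")

# Hypothetical, heavily trimmed hardware description with a placeholder MAC.
EXAMPLE_HARDWARE = {
    "id": "example-hardware-id",
    "network": {
        "interfaces": [
            {"dhcp": {"mac": "00:00:00:00:00:00"}, "netboot": {"allow_pxe": True}}
        ]
    },
}


def with_worker_mac(hardware, worker_mac):
    """Return a copy of `hardware` whose first interface uses `worker_mac`.

    The input is left untouched so the placeholder template can be reused.
    """
    mac = worker_mac.strip().lower()
    if not MAC_PATTERN.fullmatch(mac):
        raise ValueError(f"not a MAC address: {worker_mac!r}")
    updated = copy.deepcopy(hardware)
    updated["network"]["interfaces"][0]["dhcp"]["mac"] = mac
    return updated


# The MAC from the boots logs earlier in this thread.
record = with_worker_mac(EXAMPLE_HARDWARE, "0C:42:A1:97:F6:48")
print(record["network"]["interfaces"][0]["dhcp"]["mac"])
```

If the MAC in the record pushed to tink differs from the one boots logs in its DHCPDISCOVER entries, the lookup fails with the "no rows in result set" error seen above.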

wokalski commented 2 years ago

Indeed, my bad! Ok, it definitely didn't work on master for some reason. I hope your branch has a fix for it.

micahhausler commented 2 years ago

@wokalski did #126 fix things for you?

wokalski commented 2 years ago

I didn't test it. I made it work with my local tweaks. I hope it does, though!

CAcquaviva commented 2 years ago

Hello @wokalski, I'm trying to do exactly the same thing: running the terraform sandbox from macOS to spin up a Provisioner and a Worker on Equinix Metal, but the workflow gets stuck in the PENDING state. Could you please share your tweaks? TIA.

wokalski commented 2 years ago

@CAcquaviva I did make it work, but I didn't end up productizing this setup. Tinkerbell is undergoing a huge transition when it comes to internals. The issue you're hitting is most likely one of these:

  1. Certs don't match between the registry and the worker (if that's the case, you can see it on the worker machine in /var/log/bootkit)
  2. Or, if that step worked, the tink worker :latest is most likely not compatible with the sandbox. You need to pin a correct (older) version from a couple of months ago (try 1.5 months ago or so)

If you are thinking about creating a production setup using Tinkerbell and you have a small network, I'd encourage you to take a look at matchbox from Poseidon. I really like the architecture of Tinkerbell, but in my opinion it's just too much of a work in progress right now.

chrisdoherty4 commented 1 year ago

The project has moved on quite a bit since the issue was raised: we no longer use the Postgres backend, and the tink CLI has been deprecated.

This may still be an issue, but it's unclear what the next steps are. We'll take an action to validate the Terraform setup separately and raise issues as needed.