Can't complete `docker_compose` quickstart guide

varilci commented 2 years ago

Greetings,

I am trying to "play" with Tinkerbell to provision bare metal servers with OSes (e.g with Ubuntu Focal). I am following https://github.com/tinkerbell/sandbox/blob/main/docs/quickstarts/COMPOSE.md and I haven't been able to finish the provisioning steps -I've re-tried the steps 5 times in a row, and saw 2 different outcomes-. On my first 3 tries, Boots can't recognize the DHCP request and logs the info written below. On my 4th try, it picked it up, but the workflow's action state stayed/stuck in STATE_PENDING state -I've left it like that for 2 hours, and I think that is long enough time for it to at least start working-. Then I tried it one more time, and it didn't get pick up by Boots like the first 3 tries.

Any recommendation/tips are welcome. If you have more known ways of making Tinkerbell work -a bare metal server provisioning abother one with an OS-, I am also willing to give a try to them.

Expected Behaviour

I am expecting to see similar outcome for the steps described in https://github.com/tinkerbell/sandbox/blob/main/docs/quickstarts/COMPOSE.md

Current Behaviour

Either Boots doesn't pick up the machine, or it gets picked up but the workflow stays in 0%, STATE_ENDING stage.

Boots doesn't recognize the DHCP request. For # echo $TINKERBELL_CLIENT_MAC e4:43:4b:3d:75:b8, I encountered the following output in Boots logs and machine doesn't picked up by Tinkerbell stack;

boots_1                     | {"level":"info","ts":1649949865.6316814,"caller":"dhcp4-go@v0.0.0-20190402165401-39c137f31ad3/handler.go:105","msg":"","service":"github.com/ti
nkerbell/boots","pkg":"dhcp","pkg":"dhcp","event":"recv","mac":"e4:43:4b:3d:75:b8","via":"0.0.0.0","iface":"eno1","xid":"\"4b:3d:75:b8\"","type":"DHCPDISCOVER","secs":28}   boots_1                     | {"level":"info","ts":1649949865.6318014,"caller":"boots/dhcp.go:78","msg":"parsed option82/circuitid","service":"github.com/tinkerbell/boots","
pkg":"main","mac":"e4:43:4b:3d:75:b8","circuitID":""}                                                                                                                        boots_1                     | {"level":"info","ts":1649949865.6336043,"caller":"boots/dhcp.go:91","msg":"retrieved job is empty","service":"github.com/tinkerbell/boots","pkg
":"main","type":"DHCPDISCOVER","mac":"e4:43:4b:3d:75:b8","err":"discover from dhcp message: get hardware by mac from tink: rpc error: code = Unknown desc = SELECT: sql: no rows in result set","errVerbose":"rpc error: code = Unknown desc = SELECT: sql: no rows in result set\nget hardware by mac from tink\ngithub.com/tinkerbell/boots/packet.(*client).DiscoverHardwareFromDHCP\n\t/opt/actions-runner/_work/boots/boots/packet/endpoints.go:108\ngithub.com/tinkerbell/boots/job.discoverHardwareFromDHCP.func1\n\t/opt/actions-runner/_work/boots/boots/job/fetch.go:17\ngithub.com/golang/groupcache/singleflight.(*Group).Do\n\t/home/github/go/pkg/mod/github.com/golang/groupcache@v0.0.0-20190702054246-869f871628b6/singleflight/singleflight.go:56\ngithub.com/tinkerbell/boots/job.discoverHardwareFromDHCP\n\t/opt/actions-runner/_work/boots/boots/job/fetch.go:19\ngithub.com/tinkerbell/boots/job.CreateFromDHCP\n\t/opt/actions-runner/_work/boots/boots/job/job.go:106\nmain.dhcpHandler.serveDHCP\n\t/opt/actions-runner/_work/boots/boots/cmd/boots/dhcp.go:89\nmain.dhcpHandler.ServeDHCP.func1\n\t/opt/actions-runner/_work/boots/boots/cmd/boots/dhcp.go:50\ngithub.com/gammazero/workerpool.startWorker\n\t/home/github/go/pkg/mod/github.com/gammazero/workerpool@v0.0.0-20200311205957-7b00833861c6/workerpool.go:218\nruntime.goexit\n\t/opt/actions-runner/_work/_tool/go/1.16.3/x64/src/runtime/asm_amd64.s:1371\ndiscover from dhcp message"}

Workflow doesn't progress. On my 4th try, the machine got picked up by Boots, but then the workflow got "stuck", and it was visible during Step 6 of the linked guide above;

Every 1.0s: tink workflow events c263defc-c0b1-11ec-9ab9-0242ac130006; tink workflow state c263defc-c0b1-11ec-9ab9-0242ac130006 

+-----------+-----------+-------------+----------------+---------+---------------+
| WORKER ID | TASK NAME | ACTION NAME | EXECUTION TIME | MESSAGE | ACTION STATUS |
+-----------+-----------+-------------+----------------+---------+---------------+
+-----------+-----------+-------------+----------------+---------+---------------+
+----------------------+--------------------------------------+
| FIELD NAME           | VALUES                               |
+----------------------+--------------------------------------+
| Workflow ID          | c263defc-c0b1-11ec-9ab9-0242ac130006 |
| Workflow Progress    | 0%                                   |
| Current Task         |                                      |
| Current Action       |                                      |
| Current Worker       |                                      |
| Current Action State | STATE_PENDING                        |
+----------------------+--------------------------------------+

On the KVM screen, this screenshot was visible -on the 4th run when Boots was able to pick up the server's request-, and during that 2 hours it didn't change;

Possible Solution

N/A

Steps to Reproduce (for bugs)

Follow the https://github.com/tinkerbell/sandbox/blob/main/docs/quickstarts/COMPOSE.md guide.
Process gets stuck on Step 6, either Boots can't recognize the DHCP request being sent by the defined MAC address in Step 3 or it gets picked up, but the workflow doesn't progress and provision the machine.

Context

Your Environment

Operating System and version (e.g. Linux, Windows, MacOS):

Ubuntu 20.04.4 LTS

How are you running Tinkerbell? Using Vagrant & VirtualBox, Vagrant & Libvirt, on Packet using Terraform, or give details:

Bare metal provisioner, trying to provision another bare metal server with docker-compose method outlined in https://github.com/tinkerbell/sandbox/blob/main/docs/quickstarts/COMPOSE.md

Link to your project or a code example to reproduce issue:

N/A

digi-talo commented 2 years ago

I can confirm that I found similar behavior, I'm running on Ubuntu 21.10 , docker 20.10.14 with docker-compose 2.4.1. In my testing Boots picked up each request though but the workflow never starts after booting in the LinuxKit environment. I verified that the hardware config, template and workflow are in place with the correct details. I found the following error in the docker registry log:

compose-web-assets-server-1              | 10.180.5.155 - - [21/Apr/2022:09:47:29 +0000] "GET /workflow/ca.pem HTTP/1.1" 200 623 "-" "Go-http-client/1.1" "-"
compose-registry-1                       | 10.180.5.87 - - [21/Apr/2022:09:47:30 +0000] "GET / HTTP/1.1" 200 0 "" "Wget"
compose-registry-1                       | 2022/04/21 09:47:32 http: TLS handshake error from 10.180.5.155:41030: remote error: tls: bad certificate

JulianHBuecher commented 2 years ago

Hi @digi-talo and @varilci, did you tried the Vagrand and VirtualBox Tutorials too? I have the exact same issues there :D And could not figure it out why... Regarding to https://github.com/tinkerbell/boots/issues/125 this should be fixed.

Edited: This behaviour could be replicated with the Vagrand and libvirt Quickstart too.

double-p commented 2 years ago

@varilci at that kvm console, can you check the tink-worker logs please:

(press enter, there IS a prompt "hidden" with the kernel message along IPVS)
# ctr -n services.linuxkit task exec -t --exec-id testing docker sh
# cd /var/lib/docker/containers/*
# less *json

Is there an X509 unable to verify message?

digi-talo commented 2 years ago

@double-p I took the liberty to check myself but found a different error:

double-p commented 2 years ago

yeah, that's the one. What I found so far, that this is likely related to https://github.com/tinkerbell/tink/pull/584

basically the whole thing wont work with "unofficial"/selfsigned certs any longer.

In sandbox I did some dial backs, but I think using tink-worker:sha-453f0fda should already do the trick.

digi-talo commented 2 years ago

I gave it a shot with tink-worker:sha-453f0fda but that doesn't seem to take me any further :disappointed:

double-p commented 2 years ago

Ok, the "full" change to make my sandbox-vagrant-vbox setup going is with these hashes and a change on the Boots cmdline, since they changed somewhen around that. Might be more than needed, but "it works for me". https://gist.github.com/double-p/ea7597da76956fac6f90251ad2e2f175

JulianHBuecher commented 2 years ago

Hi @double-p, your setup for VBox and Vagrant works fine 👍 Only provisioning with libvirt does not work... Trying libvirt the same error as @varilci mentioned occur. Very strange behavior... But thanks a lot for you help.

Edit: The only problem is now, if you try to reload the machine via vagrant reload machine1 it stucks at Initializing XFRM netlink socket again. And ssh via ssh tink@192.168.50.43 does not work too.

double-p commented 2 years ago

There's a second reason this can hang (just when hook tries to download tink-worker) and that's along "ca.pem". Remove that after failed tries before starting the next up/provision of the "provisioner".

ssh from the provisioner will not work before you also see a normal getty/login screen on the machine1 GUI, too

digi-talo commented 2 years ago

@double-p thanks for all the help. I managed the provision a VM once with the docker-compose setup but I can't seem to get this to work reliably.

It just hangs at:

While I notice the following in the docker logs:

web-assets-server_1              | 10.180.5.125 - - [02/May/2022:09:13:19 +0000] "GET /workflow/ca.pem HTTP/1.1" 200 623 "-" "Go-http-client/1.1" "-"
registry_1                       | time="2022-05-02T09:13:22.311102921Z" level=warning msg="error authorizing context: basic authentication challenge for realm "Registry Realm": invalid authorization credential" go.version=go1.11.2 http.request.host=10.180.5.87 http.request.id=8a452cdb-edfe-455c-bd6f-4acff898c547 http.request.method=GET http.request.remoteaddr="10.180.5.125:48606" http.request.uri="/v2/" http.request.useragent="docker/19.03.8 go/go1.12.17 git-commit/afacb8b7f0 kernel/5.10.57-linuxkit os/linux arch/amd64 UpstreamClient(Go-http-client/1.1)

You mention:

There's a second reason this can hang (just when hook tries to download tink-worker) and that's along "ca.pem". Remove that after failed tries before starting the next up/provision of the "provisioner".

Where do you exactly remove the ca.pem?

double-p commented 2 years ago

Hmm, can't tell why/where the 2nd attempt hangs at. Check logs of tink-docker again.

ca.pem is in deploy/compose/state/webroot/workflow after the initial provisioning - only delete it for a fresh start, not a subsequent deployment

digi-talo commented 2 years ago

Not sure what config file this refers to?

double-p commented 2 years ago

That's an OPTIONAL config (go viper) of tink-worker, since all that (as I understand) via ENV and cmdline-ops, it's not needed and not really an error. Basically the worker is started (you should see something about that in tink-server logs) and waits to get a workflow assigned and work it.

Basically, no more certificate errors and also no networking issues (there would be a connection error or timeout after "starting" in that log).

so.. check tink-server and 'tink workflow get' and so on

digi-talo commented 2 years ago

No idea what changed but I tried it again today on a clean environment and I'm stuck again with the bad certificate. The changes of @double-p no longer seem to do the trick :disappointed:

double-p commented 2 years ago

the sha-references are immutable so that's very weird. cannot try before weekend since @ kubecon

digi-talo commented 2 years ago

Yeah you are right, I found that you'll still get the bad cert error if the systems clock deviates too much of the actual time.

double-p commented 2 years ago

Ah.. that was "fixed" in a newer version -- which just introduced other problems.

You can also adapt that in deploy/compose/generate-tls-certs/ca-{config,csr}.json if a clock fix is difficult

tinkerbell / playground