Closed varilci closed 1 year ago
I can confirm that I found similar behavior, I'm running on Ubuntu 21.10 , docker 20.10.14 with docker-compose 2.4.1. In my testing Boots picked up each request though but the workflow never starts after booting in the LinuxKit environment. I verified that the hardware config, template and workflow are in place with the correct details. I found the following error in the docker registry log:
compose-web-assets-server-1 | 10.180.5.155 - - [21/Apr/2022:09:47:29 +0000] "GET /workflow/ca.pem HTTP/1.1" 200 623 "-" "Go-http-client/1.1" "-"
compose-registry-1 | 10.180.5.87 - - [21/Apr/2022:09:47:30 +0000] "GET / HTTP/1.1" 200 0 "" "Wget"
compose-registry-1 | 2022/04/21 09:47:32 http: TLS handshake error from 10.180.5.155:41030: remote error: tls: bad certificate
Hi @digi-talo and @varilci, did you tried the Vagrand and VirtualBox Tutorials too? I have the exact same issues there :D And could not figure it out why... Regarding to https://github.com/tinkerbell/boots/issues/125 this should be fixed.
Edited: This behaviour could be replicated with the Vagrand and libvirt Quickstart too.
@varilci at that kvm console, can you check the tink-worker logs please:
(press enter, there IS a prompt "hidden" with the kernel message along IPVS)
# ctr -n services.linuxkit task exec -t --exec-id testing docker sh
# cd /var/lib/docker/containers/*
# less *json
Is there an X509 unable to verify message?
@double-p I took the liberty to check myself but found a different error:
yeah, that's the one. What I found so far, that this is likely related to https://github.com/tinkerbell/tink/pull/584
basically the whole thing wont work with "unofficial"/selfsigned certs any longer.
In sandbox
I did some dial backs, but I think using tink-worker:sha-453f0fda should already do the trick.
I gave it a shot with tink-worker:sha-453f0fda
but that doesn't seem to take me any further :disappointed:
Ok, the "full" change to make my sandbox-vagrant-vbox setup going is with these hashes and a change on the Boots cmdline, since they changed somewhen around that. Might be more than needed, but "it works for me". https://gist.github.com/double-p/ea7597da76956fac6f90251ad2e2f175
Hi @double-p, your setup for VBox and Vagrant works fine 👍 Only provisioning with libvirt does not work... Trying libvirt the same error as @varilci mentioned occur. Very strange behavior... But thanks a lot for you help.
Edit:
The only problem is now, if you try to reload the machine via vagrant reload machine1
it stucks at Initializing XFRM netlink socket
again. And ssh via ssh tink@192.168.50.43
does not work too.
There's a second reason this can hang (just when hook tries to download tink-worker) and that's along "ca.pem". Remove that after failed tries before starting the next up/provision of the "provisioner".
ssh from the provisioner will not work before you also see a normal getty/login screen on the machine1 GUI, too
@double-p thanks for all the help. I managed the provision a VM once with the docker-compose setup but I can't seem to get this to work reliably.
It just hangs at:
While I notice the following in the docker logs:
web-assets-server_1 | 10.180.5.125 - - [02/May/2022:09:13:19 +0000] "GET /workflow/ca.pem HTTP/1.1" 200 623 "-" "Go-http-client/1.1" "-"
registry_1 | time="2022-05-02T09:13:22.311102921Z" level=warning msg="error authorizing context: basic authentication challenge for realm "Registry Realm": invalid authorization credential" go.version=go1.11.2 http.request.host=10.180.5.87 http.request.id=8a452cdb-edfe-455c-bd6f-4acff898c547 http.request.method=GET http.request.remoteaddr="10.180.5.125:48606" http.request.uri="/v2/" http.request.useragent="docker/19.03.8 go/go1.12.17 git-commit/afacb8b7f0 kernel/5.10.57-linuxkit os/linux arch/amd64 UpstreamClient(Go-http-client/1.1)
You mention:
There's a second reason this can hang (just when hook tries to download tink-worker) and that's along "ca.pem". Remove that after failed tries before starting the next up/provision of the "provisioner".
Where do you exactly remove the ca.pem
?
Hmm, can't tell why/where the 2nd attempt hangs at. Check logs of tink-docker again.
ca.pem is in deploy/compose/state/webroot/workflow
after the initial provisioning - only delete it for a fresh start, not a subsequent deployment
Not sure what config file this refers to?
That's an OPTIONAL config (go viper) of tink-worker, since all that (as I understand) via ENV and cmdline-ops, it's not needed and not really an error. Basically the worker is started (you should see something about that in tink-server logs) and waits to get a workflow assigned and work it.
Basically, no more certificate errors and also no networking issues (there would be a connection error or timeout after "starting" in that log).
so.. check tink-server and 'tink workflow get' and so on
No idea what changed but I tried it again today on a clean environment and I'm stuck again with the bad certificate. The changes of @double-p no longer seem to do the trick :disappointed:
the sha-references are immutable so that's very weird. cannot try before weekend since @ kubecon
Yeah you are right, I found that you'll still get the bad cert error if the systems clock deviates too much of the actual time.
Ah.. that was "fixed" in a newer version -- which just introduced other problems.
You can also adapt that in deploy/compose/generate-tls-certs/ca-{config,csr}.json
if a clock fix is difficult
Greetings,
I am trying to "play" with Tinkerbell to provision bare metal servers with OSes (e.g with Ubuntu Focal). I am following https://github.com/tinkerbell/sandbox/blob/main/docs/quickstarts/COMPOSE.md and I haven't been able to finish the provisioning steps -I've re-tried the steps 5 times in a row, and saw 2 different outcomes-. On my first 3 tries, Boots can't recognize the DHCP request and logs the info written below. On my 4th try, it picked it up, but the workflow's action state stayed/stuck in
STATE_PENDING
state -I've left it like that for 2 hours, and I think that is long enough time for it to at least start working-. Then I tried it one more time, and it didn't get pick up by Boots like the first 3 tries.Any recommendation/tips are welcome. If you have more known ways of making Tinkerbell work -a bare metal server provisioning abother one with an OS-, I am also willing to give a try to them.
Expected Behaviour
I am expecting to see similar outcome for the steps described in https://github.com/tinkerbell/sandbox/blob/main/docs/quickstarts/COMPOSE.md
Current Behaviour
Either Boots doesn't pick up the machine, or it gets picked up but the workflow stays in
0%, STATE_ENDING
stage.# echo $TINKERBELL_CLIENT_MAC e4:43:4b:3d:75:b8
, I encountered the following output in Boots logs and machine doesn't picked up by Tinkerbell stack;Step 6
of the linked guide above;On the KVM screen, this screenshot was visible -on the 4th run when Boots was able to pick up the server's request-, and during that 2 hours it didn't change;
Possible Solution
N/A
Steps to Reproduce (for bugs)
Context
Your Environment
Ubuntu 20.04.4 LTS
Bare metal provisioner, trying to provision another bare metal server with
docker-compose
method outlined in https://github.com/tinkerbell/sandbox/blob/main/docs/quickstarts/COMPOSE.mdN/A