The different stages of provisioning should be displayed to the user so
that it is clear where an error happens and how to react to it.
For this purpose the provisioning state is queried by looking at
created marker files, checking whether the Kubernetes API is reachable,
and seeing which nodes have registered themselves with the cluster.
This is combined with the return code of lokoctl (or terraform for
plain Flatcar). An additional sanity check is done at the beginning to
verify that the BMCs are correctly configured before even attempting a
PXE boot.
The user is pointed to error logs and debugging commands (a new ipmi
helper subcommand "diag" can show the server summary), and the user
is offered the choice to exclude a problematic server so that the
cluster can still come up. Later on, the user can see similar
information in a new "racker status" command that shows what was
provisioned and what did not work.
(Progress is animated with ., .., ...)
$ racker bootstrap --onfailure=exclude -- -provision lokomotive -ip-addrs "$(cat ip_addrs)"
➤ Checking BMC connectivity (35/35)... ✓ done
➤ OS installation via PXE (20/35)... × failed
Failed to provision the following 15 nodes.
11:11:a1:19:fb:22 11:11:da:7f:9d:02 11:11:da:7f:9d:5a […]
You can see logs in /home/core/lokomotive/logs/2021-04-06_16-21-51, run 'ipmi <MAC|DOMAIN> diag' for a short overview of a node, connect to the serial console via 'ipmi <MAC|DOMAIN>', or try to connect via SSH.
Something went wrong, removing 15 nodes from config and retrying 1/3
➤ OS installation via PXE (20/20)... ✓ done
➤ Kubernetes bring-up... ✓ done
➤ Cluster health check (20/20 nodes seen)... ✓ done
➤ Lokomotive component installation... ✓ done
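The state query behind these stages could be sketched roughly as below. The marker directory, the marker file names, and the use of ipmitool for the BMC pre-check are illustrative assumptions for this sketch, not Racker's actual implementation:

```shell
#!/bin/sh
# Sketch: derive the current provisioning stage from observable state.
# MARKER_DIR and the file names inside it are hypothetical.
MARKER_DIR="${MARKER_DIR:-/opt/racker-state}"

# Pre-flight sanity check: can we reach a node's BMC at all?
# (ipmitool over lanplus is an assumption; credentials come from env vars.)
bmc_reachable() {
  ipmitool -I lanplus -H "$1" -U "$IPMI_USER" -P "$IPMI_PASSWORD" \
    chassis status >/dev/null 2>&1
}

provisioning_stage() {
  # Stage 1: was the OS installation marker file written after PXE boot?
  if [ ! -f "${MARKER_DIR}/os-installed" ]; then
    echo "pxe-install"
    return
  fi
  # Stage 2: is the Kubernetes API reachable?
  if ! kubectl version >/dev/null 2>&1; then
    echo "kubernetes-bringup"
    return
  fi
  # Stage 3: have all expected nodes registered themselves?
  expected=$(wc -l < "${MARKER_DIR}/expected-nodes")
  seen=$(kubectl get nodes --no-headers 2>/dev/null | wc -l)
  if [ "$seen" -lt "$expected" ]; then
    echo "cluster-health"
    return
  fi
  echo "done"
}

echo "Current stage: $(provisioning_stage)"
```

Each stage check only inspects state that is already observable from the bootstrap machine, so the same logic can back both the progress display during "racker bootstrap" and a later "racker status" query.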