threefoldtech / 0-bootstrap

Zero-OS Bootstrap Webservice
Apache License 2.0

Zos bootstrap hangs #19

Closed: scottyeager closed 1 year ago

scottyeager commented 1 year ago

Farmers sometimes report that the bootstrapping process hangs at "Downloading Zero-OS image...". Rebooting the node often resolves this, but not always.

One farmer noticed this behavior when attempting to boot multiple nodes simultaneously:

I found an interesting quirk where if I took a running node offline, then a node I was having trouble with getting zOS to download to 100% would breeze right through.

Said working node that I took offline would then halt at some point in the zOS download process. Rinse and repeat with other nodes at random. I'm ultimately left with one node that will stall.

I just now remoted into my one stubborn node via iLO Console and did a reboot. Loaded right up.

With Wake-on-LAN (WoL) coming, having nodes stuck in this state will be a much bigger problem. Although the root cause may indeed be within the farmer's network or at the flist hub, the bootstrap code should at least be able to recover via some kind of timeout and retry in cases where a manual reboot is sufficient.
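To illustrate the kind of timeout-and-retry behavior requested here, a minimal Python sketch follows. It is not the bootstrap or iPXE code; the function name `download_with_retry`, its parameters, and the placeholder URL are all hypothetical, and the real download is performed by iPXE, which would need its own equivalent.

```python
import time
import urllib.request
from typing import Callable, Optional

# Placeholder only; not the real bootstrap endpoint.
KERNEL_URL = "https://example.invalid/kernel"


def download_with_retry(
    fetch: Callable[[float], bytes],
    attempts: int = 3,
    timeout: float = 30.0,
    backoff: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch(timeout) until it succeeds or the attempts run out."""
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(timeout)
        except OSError as exc:  # TimeoutError and URLError are OSError subclasses
            last_error = exc
            if attempt < attempts:
                # Wait a little longer after each failed attempt.
                sleep(backoff * attempt)
    raise RuntimeError(f"download failed after {attempts} attempts") from last_error


def fetch_kernel(timeout: float) -> bytes:
    """Hypothetical fetcher: a stalled socket read raises after `timeout` seconds."""
    with urllib.request.urlopen(KERNEL_URL, timeout=timeout) as resp:
        return resp.read()
```

With this shape, a stalled transfer surfaces as an exception after the timeout instead of hanging indefinitely, and the caller retries rather than waiting for a manual reboot. Calling `download_with_retry(fetch_kernel)` would be the intended usage under these assumptions.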

muhamadazmy commented 1 year ago

Unfortunately, this issue is not zos related. The initial download is done by the node BIOS (iPXE), so this can be due to one of the following causes:

In all cases this can't be a zos issue, because zos has not even booted yet. Hence I will move it to 0-initramfs.

maxux commented 1 year ago

Kernels are served by the bootstrap service, which, for the kernel part, only forwards the binary. I'll try to move that serving from Flask to a Caddy frontend to offload the web service. I have no clue yet whether that will help, but it's worth a try. I'll keep you posted when it's done.

Otherwise, yes, it's more likely a client-side issue, but it could be related to iPXE not supporting the firmware well. I can update iPXE for some clients to see whether that brings any improvement.

maxux commented 1 year ago

It's pushed into production. Kernels are served by the frontend Caddy now; let's see if there is any improvement :)

scottyeager commented 1 year ago

Great, thanks @maxux. I'll keep an eye on reports from farmers to see if this helps, and also suggest ensuring that firmware is up to date.

maxux commented 1 year ago

@scottyeager Anything new regarding this? Can I close the issue? :)

scottyeager commented 1 year ago

No reports of this issue lately. I'll close. Thanks @maxux.