TestE2ECoreHA sometimes fails because it reboots one node and then waits for a different node to become healthy. As a result, it reboots the next node almost immediately, before the previous node has come back up.
This happens because, when the test cluster is launched, node IDs are appended to the NodeIDs slice in the order in which the nodes first appear, which may not be the order in which the qemu processes were started.
To fix this, we need some way to find out which qemu process corresponds to which node ID. One way to do this would be to assign a serial number to each qemu process (e.g. -smbios type=1,serial=node1), and then retrieve it via the Metropolis API. However, there is currently no way to obtain the serial number through the API. The existing hwreport contains the serial number, but it is currently specific to the cloud agent. We could add a new RPC to the NodeManagement service for getting a hwreport.
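A minimal sketch of how the launch code could use such serial numbers, assuming the serial reported by each node can be read back somehow (for example through the proposed hwreport RPC, which does not exist yet). The names serialFor, smbiosArg and orderByLaunch are hypothetical and are not part of the existing launch code; the serial-to-node-ID map stands in for whatever the API would return.

```go
package main

import "fmt"

// serialFor returns the SMBIOS serial assigned to the i-th launched qemu
// process (0-based). The "nodeN" scheme follows the example in the issue.
func serialFor(i int) string {
	return fmt.Sprintf("node%d", i+1)
}

// smbiosArg returns the value to pass after qemu's -smbios flag for the
// i-th launched process.
func smbiosArg(i int) string {
	return "type=1,serial=" + serialFor(i)
}

// orderByLaunch returns node IDs in qemu launch order. idBySerial maps the
// serial reported by each node to its node ID; how that map is obtained is
// an assumption (e.g. a new RPC returning the hwreport).
func orderByLaunch(n int, idBySerial map[string]string) ([]string, error) {
	ids := make([]string, 0, n)
	for i := 0; i < n; i++ {
		serial := serialFor(i)
		id, ok := idBySerial[serial]
		if !ok {
			return nil, fmt.Errorf("no node reported serial %q", serial)
		}
		ids = append(ids, id)
	}
	return ids, nil
}
```

With this, NodeIDs could be filled from the result of orderByLaunch instead of from the order in which nodes first appear, so that index i always refers to the i-th qemu process regardless of which node registers first.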