ncsa / xcat-tools

Useful tools for xCAT
BSD 3-Clause "New" or "Revised" License

Improve backup-node_configs.sh to be aware of xcat status of node #60

Open billglick opened 5 months ago

billglick commented 5 months ago

We ran into an issue today where a node had been partially deployed during or after some hardware diagnostics, and the daily backup-node_configs.sh script backed up broken Puppet client configs and SSL certificates. Subsequent reboots of this server then restored those broken Puppet configs and SSL certificates from the node-config backup.

A couple of recommendations to resolve this and make it easier to recover a node from this issue:

  1. Update the backup-node_configs.sh script to back up a node only if xCAT reports the node's status as booted. If a node is in a sit-and-spin type postscript, its status should instead be postbooting, and if a postscript exits with a bad error code, its status should be failed. So checking for a status of booted should ensure we only take backups when the node is in the expected state.
  2. Would it be possible to keep one old version of each node's various node_configs backup files on the xCAT server? I don't have a proposed solution to implement this, but being able to at least see what changed with a node-configs backup file in the last few days would be helpful, as well as being able to restore old versions of the node-configs backup files.
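
As a rough illustration of both recommendations, here is a minimal sketch, not the actual backup-node_configs.sh logic. It assumes that `nodels <node> nodelist.status` prints a line of the form `<node>: <status>`. The function names, the `.prev` suffix, and the idea of passing the archive-producing command as arguments are all hypothetical choices made for this example.

```shell
#!/usr/bin/env bash
# Sketch only: function names, the ".prev" suffix, and the file layout are
# hypothetical and not taken from backup-node_configs.sh.

# Print the xCAT status of a node (e.g. "booted", "postbooting", "failed").
# Assumes `nodels <node> nodelist.status` prints "<node>: <status>".
get_node_status() {
    local node="$1"
    nodels "$node" nodelist.status 2>/dev/null | awk -F': *' 'NR==1 {print $2}'
}

# Succeed only when the node's status is exactly "booted".
node_is_booted() {
    [ "$(get_node_status "$1")" = "booted" ]
}

# Keep one previous generation: move the current backup to "<file>.prev".
rotate_backup() {
    local file="$1"
    if [ -f "$file" ]; then
        mv -f -- "$file" "$file.prev"
    fi
}

# Back up a node only when it is booted. The command that produces the
# archive is supplied by the caller, so nothing is assumed about how the
# real script collects files. The new backup is written to a temporary
# file first, so a failed run never replaces or rotates the existing one.
backup_node_if_booted() {
    local node="$1" dest="$2"
    shift 2
    if ! node_is_booted "$node"; then
        echo "skipping $node: xCAT status is not 'booted'" >&2
        return 1
    fi
    local tmp="$dest.tmp.$$"
    if "$@" > "$tmp"; then
        rotate_backup "$dest"
        mv -f -- "$tmp" "$dest"
    else
        rm -f -- "$tmp"
        return 1
    fi
}
```

With something like this, a node stuck in postbooting or failed would be skipped, and the previous day's backup would remain available as `<file>.prev` for comparison or restore.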

billglick commented 5 months ago

I'm somewhat surprised we hadn't run into this issue before. A couple of other examples of when this might be an issue include: