voxpupuli / puppet-nomad

Puppet module for managing Nomad
Apache License 2.0

Nomad server recover helper #77

Closed maxadamo closed 1 year ago

maxadamo commented 1 year ago

Affected Puppet, Ruby, OS and module versions/distributions

n/a

How to reproduce

Bring down the Nomad daemon on all your Nomad servers.

What are you seeing

You won't be able to restart the daemon.

What behaviour did you expect instead

A procedure, a script, or a Bolt task to recover the cluster.

Output log

n/a

Proposed solution

Recovering from an outage is a time-consuming operation, but it can be partially automated. The manifest below creates a script which can be run on the servers to recover the cluster. If you have PuppetDB you can use nomad_server_regex; otherwise you need to pre-fill a hash and use nomad_server_hash.

# This class is used to generate a peers.json and a recovery script file for Nomad servers.
# It is used to recover from a Nomad server outage.
#
# @example using PuppetDB
#  class { 'nomad::server_recovery':
#    nomad_server_regex => 'nomad-server0',
#    iface              => 'eth0',
#  }
#
# @example using a Hash
#  class { 'nomad::server_recovery':
#    nomad_server_hash => {
#      '192.168.1.10' => 'a1b2c3d4-1234-5678-9012-3456789abcde',
#      '192.168.1.11' => 'b2c3d4e5-2345-6789-0123-456789abcdef',
#    },
#    iface             => 'eth0',
#  }
#
# @param nomad_server_regex
#  Regex to match Nomad server hostnames within the same Puppet environment
# @param nomad_server_hash
#  If you don't have PuppetDB, you can supply a Hash of server IPs and their corresponding node IDs
# @param iface
#  NIC where the Nomad server IP is configured
# @param port
#  Nomad server port
#
class nomad::server_recovery (
  String $iface,
  Optional[String] $nomad_server_regex = undef,
  Optional[Hash] $nomad_server_hash    = undef,
  Stdlib::Port $port                   = 4647,
) {
  if ($nomad_server_regex) and ($nomad_server_hash) {
    fail('You can only use one of the parameters: nomad_server_regex or nomad_server_hash')
  }
  elsif !($nomad_server_regex) and !($nomad_server_hash) {
    fail('You must use one of the parameters: nomad_server_regex or nomad_server_hash')
  }
  if ($facts['nomad_node_id']) {
    if ($nomad_server_regex) {
      $nomad_server_inventory = puppetdb_query(
        "inventory[facts.networking.hostname, facts.networking.interfaces.${iface}.ip, facts.nomad_node_id] {
          facts.networking.hostname ~ '${nomad_server_regex}' and facts.agent_specified_environment = '${facts['agent_specified_environment']}'
        }"
      )
      $nomad_server_pretty_inventory = $nomad_server_inventory.map |$item| {
        {
          'id' => $item['facts.nomad_node_id'],
          'address' => "${item["facts.networking.interfaces.${iface}.ip"]}:${port}",
          'non_voter' => false
        }
      }
    } else {
      # keys() returns an Array, so match against an Array of IP addresses
      if $nomad_server_hash.keys() !~ Array[Stdlib::IP::Address::Nosubnet] {
        fail('The keys of the nomad_server_hash parameter must be valid IP addresses')
      }
      $nomad_server_pretty_inventory = $nomad_server_hash.map |$key, $value| {
        {
          'id' => $value,
          'address' => "${key}:${port}",
          'non_voter' => false
        }
      }
    }

    file {
      default:
        owner => 'root',
        group => 'root';
      '/tmp/peers.json':
        mode    => '0640',
        content => to_json_pretty($nomad_server_pretty_inventory);
      '/usr/local/bin/nomad-server-outage-recover.sh':
        mode    => '0750',
        content => "#!/bin/bash
PATH=/bin:/usr/bin:/sbin:/usr/sbin
systemctl stop nomad.service
install -o root -g root -m 644 /tmp/peers.json /var/lib/nomad/server/raft/peers.json
systemctl start nomad.service\n";
    }
  }
}

maxadamo commented 1 year ago

Ideally, we could create a Bolt task to trigger the execution of the script.

bastelfreak commented 1 year ago

@sebastianrakel @attachmentgenie do you have some thoughts here? Maybe we should have a Bolt plan for this?

attachmentgenie commented 1 year ago

I also feel a Bolt task would be more appropriate; in a meltdown situation I don't see anyone changing and pushing Hiera changes.

maxadamo commented 1 year ago

@attachmentgenie the idea is to create the peers.json file ahead of time, not only when you need it, pulling the data from PuppetDB (and falling back to Hiera only if you don't have PuppetDB). The file will always be there, ready to be used.
It would be the same with Bolt, but without PuppetDB it gets even worse: you would need to input all the data in the middle of a meltdown, at which point it is easier to create peers.json manually.

IMO the Bolt plan would be, if anything, an addition to the Puppet manifests. And if you don't have PuppetDB, I would recommend filling in the data in advance.
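
For illustration only, the Bolt plan discussed in this thread could be a thin wrapper around the script that the manifest already deploys. This is a minimal sketch and not something proposed in the issue: the plan name `nomad::recover_servers`, its file location, and the `script` parameter are assumptions. It relies only on Bolt's built-in `run_command` function and the `_run_as` option.

```puppet
# plans/recover_servers.pp (hypothetical plan name and location)
plan nomad::recover_servers (
  TargetSpec $targets,
  # Assumes the manifest above has already deployed this script on each server
  String[1]  $script = '/usr/local/bin/nomad-server-outage-recover.sh',
) {
  # Run the pre-deployed recovery script as root on every target
  $results = run_command(
    $script,
    $targets,
    'Recover Nomad servers from outage',
    { '_run_as' => 'root' },
  )

  return $results
}
```

Under these assumptions it would be invoked with something like `bolt plan run nomad::recover_servers --targets nomad-server01,nomad-server02` (hostnames are placeholders). The plan takes no cluster data as input, so it only works if peers.json was generated in advance by the Puppet run.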

maxadamo commented 1 year ago

@attachmentgenie are you also good with the change, and is it clear how it works? If all your servers are down, you just need to run /usr/local/bin/nomad-server-outage-recover.sh on your Nomad servers (not the agents). I can merge it straight away; I am asking because I have already started working on the next PR to fix #84.