linux-system-roles / nbde_client

Ansible role for configuring Network Bound Disk Encryption clients (e.g. clevis)
https://linux-system-roles.github.io/nbde_client/
MIT License

New system not able to unlock after running role #150

Open aheath1992 opened 9 months ago

aheath1992 commented 9 months ago

A new system is unable to unlock after running the nbde_client role. The role run reports success in Ansible, but upon reboot the system stops at the LUKS encryption screen.

    - name: Import nbde_client role
      ansible.builtin.import_role:
        name: linux-system-roles.nbde_client
      vars:
        nbde_client_bindings:
          - device: "{{ root_disk | d('/dev/vda2') }}"
            encryption_password: "{{ current_password }}"
            servers: "{{ tang_servers }}"

Screenshot from 2024-02-01 16-02-08

richm commented 9 months ago

What version of the role are you using? What version of ansible are you using? What is the platform/version of your control node? What is the platform/version of your managed node? @sergio-correia what other debugging information do we need?

aheath1992 commented 9 months ago

What version of the role are you using - 1.71.1
What version of ansible are you using - 2.16
What is the platform/version of your control node - Fedora 39
What is the platform/version of your managed node - RHEL 8.8

sergio-correia commented 9 months ago

Hello. Here is some more information that may be helpful to debug this:

@richm: I wonder if it makes sense to have some "action"/"state" to collect some of this information from the managed hosts, to help troubleshoot such issues?

aheath1992 commented 9 months ago

sergio-correia commented 9 months ago

At first glance, it looks OK. Could you also check journalctl to see if any useful information shows up, please? (I forgot to mention beforehand, but feel free to redact any IP addresses, if required.)

journalctl -xf -u clevis-luks-askpass.service

aheath1992 commented 9 months ago

Feb 02 15:21:20 clevis-test.ansi-001.prod.iad2.dc.redhat.com clevis-luks-askpass[11941]: Error communicating with the server http://tang1
Feb 02 15:21:20 clevis-test.ansi-001.prod.iad2.dc.redhat.com clevis-luks-askpass[11942]: Error communicating with the server http://tang2

telnet tang1 80
Trying tang1...
Connected to tang1.
Escape character is '^]'.

telnet tang2 80
Trying tang2...
Connected to tang2.
Escape character is '^]'.

richm commented 3 months ago

@sergio-correia any idea?

xeluior commented 3 months ago

I've been seeing this as well. I have found that adding the _netdev option to the relevant fstab entry allows the unlocking to proceed (tested on Rocky 8 and 9 clients, both early and late boot, and Debian 11 and 12 clients, late boot only). I have added an awk script task into my playbook after the role runs to add this option.

- name: Update fstab options
  ansible.builtin.shell: |
    name="$(awk '$2 == "{{ item.device }}" { print $1 }' /etc/crypttab | head -n 1)"
    awk -v mapper_path="/dev/mapper/$name" '{
      if ($1 == mapper_path && index($4, "_netdev") == 0) {
        $4 = $4 ",_netdev"
      }
      print
    }' /etc/fstab > /tmp/fstab
    diff -q /tmp/fstab /etc/fstab || echo changed
    mv /tmp/fstab /etc/fstab
  loop: '{{ nbde_client_bindings }}'
  register: fstab
  changed_when: '"changed" in fstab.stdout'

I believe this behavior is tied to systemd's ordering of mount units: it orders fstab entries marked _netdev after network-online.target, which clevis needs in order to reach the Tang servers. (ref)

sergio-correia commented 3 months ago

Yeah, this is likely in the right direction.

We may need _netdev in crypttab to mark the device as requiring the network. To prevent a dependency loop, we also need to add _netdev to fstab if the device is listed there for a mount point. Additionally, we may have to enable the remote-cryptsetup.target unit.
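
As a rough sketch of those two table edits, the helper below appends an option to the options column of a matching entry. It is only an illustration: `add_option` and the mapping name `luks-data` are hypothetical and not part of the role, and it relies on the options being the fourth column in both /etc/crypttab and /etc/fstab. It prints the result instead of editing the file in place, and awk rebuilds any modified line with single spaces between columns.

```shell
#!/bin/sh
# add_option FILE NAME OPTION  (hypothetical helper, not part of the role)
# Print FILE to stdout with OPTION appended to column 4 of every line whose
# first column equals NAME. Lines that already carry OPTION are left as-is.
add_option() {
    awk -v name="$2" -v opt="$3" '
        $1 == name {
            # crypttab allows the key file and options columns to be omitted
            if (NF < 3) $3 = "none"
            if (NF < 4) {
                $4 = opt
            } else {
                n = split($4, have, ",")
                found = 0
                for (i = 1; i <= n; i++) if (have[i] == opt) found = 1
                if (!found) $4 = $4 "," opt
            }
        }
        { print }
    ' "$1"
}

# Possible use on a managed node (assumption: mapping is named luks-data):
#   add_option /etc/crypttab luks-data _netdev
#   add_option /etc/fstab /dev/mapper/luks-data _netdev
#   systemctl enable remote-cryptsetup.target
```

Writing to stdout keeps the sketch side-effect free; a real task would still need to replace the file while preserving its ownership, mode and SELinux context.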

xeluior commented 2 months ago

I have some more information from testing this role that should probably be considered here. I did have to add the _netdev option in both /etc/fstab and /etc/crypttab for automatic unlock. This works fine; however, on systemd versions < 245, the crypttab generator creates a weird ordering issue with the dev-mapper-{name}.device unit that will hang shutdown indefinitely. This can be fixed by also adding the x-systemd.requires=systemd-cryptsetup@{name}.service option to the appropriate device in /etc/fstab. I have an Ansible-native solution in the playbook I used to deploy this, which I could turn into a PR, but it requires several new options per device in nbde_client_bindings so that it can create the appropriate crypttab and fstab entries.

EDIT: the systemd issue mentioned above: https://github.com/systemd/systemd/issues/8472
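
A minimal sketch of how the version condition could be checked on a managed node before adding the extra fstab option. The function name `needs_shutdown_workaround` is hypothetical, the 245 threshold is taken from the comment above rather than verified independently, and the parsing assumes the first line of `systemctl --version` looks like `systemd 239 (239-74.el8)`.

```shell
#!/bin/sh
# needs_shutdown_workaround VERSION_OUTPUT  (hypothetical helper)
# Succeeds (exit 0) when the systemd major version found in the first line
# of VERSION_OUTPUT is below 245, fails otherwise or when it cannot be parsed.
needs_shutdown_workaround() {
    major=$(printf '%s\n' "$1" | awk 'NR == 1 { print $2 }')
    major=${major%%.*}   # drop any minor suffix such as ".5"
    [ "$major" -lt 245 ] 2>/dev/null
}

# Possible use on a managed node ({name} is the crypttab mapping name):
#   if needs_shutdown_workaround "$(systemctl --version)"; then
#       echo "add x-systemd.requires=systemd-cryptsetup@{name}.service to fstab"
#   fi
```

Mapping names containing characters that systemd escapes in unit names may need extra handling, which this sketch does not cover.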