New system not able to unlock after running role

aheath1992 commented 9 months ago

A new system is unable to unlock after running the nbde_client role. The role run gets an all good from Ansible, but upon reboot the system stops at the LUKS encryption screen.

    - name: Import nbde_client role
      ansible.builtin.import_role:
        name: linux-system-roles.nbde_client
      vars:
        nbde_client_bindings:
          - device: "{{ root_disk | d('/dev/vda2') }}"
            encryption_password: "{{ current_password }}"
            servers: "{{ tang_servers }}"

Screenshot from 2024-02-01 16-02-08

richm commented 9 months ago

What version of the role are you using? What version of ansible are you using? What is the platform/version of your control node? What is the platform/version of your managed node? @sergio-correia what other debugging information do we need?

aheath1992 commented 9 months ago

What version of the role are you using? - 1.71.1
What version of ansible are you using? - 2.16
What is the platform/version of your control node? - fedora 39
What is the platform/version of your managed node? - RHEL 8.8

sergio-correia commented 9 months ago

Hello. Here is some more info that may be helpful to debug this:

what is the version of clevis in the managed node?
what are all the encrypted devices in the managed node? is it that /dev/vda2 or do we have others?
please check whether the initrd from the managed node includes the clevis machinery to perform the unlocking in early boot (if that is the case); something like lsinitrd | grep clevis can help here
also, check whether network is enabled for early boot, which it will need in order to access the tang servers; the role includes "rd.neednet=1" to indicate this. It will likely be included in the initrd in a file named etc/cmdline.d/01-default.conf. Perhaps something like this could help to verify this: lsinitrd /boot/initramfs-$(uname -r).img etc/cmdline.d/01-default.conf
if you have access to the tang server, also please check whether there are any requests coming from the client (clevis)
check also whether clevis-luks-askpass.path unit is enabled: systemctl status clevis-luks-askpass.path; it will be used if we are going to decrypt a disk in late boot phase
check also the clevis configuration for the specified device; e.g.: clevis luks list -d /dev/vda2

@richm: I wonder if it makes sense to have some "action"/"state" to collect some of this information from the managed hosts, to help troubleshoot such issues?
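
The checks listed above could be gathered in one pass with a small script run on the managed node. This is only a sketch: the function names `run_check` and `collect_nbde_diagnostics` are hypothetical (not part of the role), `rpm -q clevis` assumes an RPM-based node, and `/dev/vda2` is simply the device from this report.

```shell
# Run one check, printing a header first. A missing or failing command is
# reported but does not abort the remaining checks.
run_check() {
  title="$1"; shift
  printf '=== %s ===\n' "$title"
  if ! command -v "$1" >/dev/null 2>&1; then
    printf 'skipped: %s not found\n' "$1"
    return 0
  fi
  "$@" 2>&1 || printf 'command exited with status %s\n' "$?"
}

# Collect the information requested in this thread for one encrypted device.
collect_nbde_diagnostics() {
  dev="${1:-/dev/vda2}"   # assumption: device taken from this report
  run_check "clevis version" rpm -q clevis
  run_check "clevis in initrd" sh -c 'lsinitrd | grep clevis'
  run_check "early-boot network" lsinitrd "/boot/initramfs-$(uname -r).img" etc/cmdline.d/01-default.conf
  run_check "askpass path unit" systemctl status clevis-luks-askpass.path
  run_check "clevis bindings" clevis luks list -d "$dev"
}
```

Running `collect_nbde_diagnostics /dev/vda2` as root prints one labelled section per check, which can then be pasted into the issue after redacting anything sensitive.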

aheath1992 commented 9 months ago

what is the version of clevis in the managed node? - clevis-15-15.el8.x86_64
what are all the encrypted devices in the managed node? is it that /dev/vda2 or do we have others? - just the root device in this case /dev/vda2

please check whether the initrd from the managed node includes the clevis machinery to perform the unlocking in early boot (if that is the case); something like lsinitrd | grep clevis can help here

lsinitrd | grep clevis
clevis
clevis-pin-null
clevis-pin-sss
clevis-pin-tang
clevis-pin-tpm2
lrwxrwxrwx   1 root     root           48 Jan 20  2023 etc/systemd/system/cryptsetup.target.wants/clevis-luks-askpass.path -> /usr/lib/systemd/system/clevis-luks-askpass.path
-rwxr-xr-x   1 root     root         1679 Jan 20  2023 usr/bin/clevis
-rwxr-xr-x   1 root     root         1654 Oct 28  2020 usr/bin/clevis-decrypt
-rwxr-xr-x   1 root     root         1148 Jan 20  2023 usr/bin/clevis-decrypt-null
-rwxr-xr-x   1 root     root        25296 Jan 20  2023 usr/bin/clevis-decrypt-sss
-rwxr-xr-x   1 root     root         3560 Jan 20  2023 usr/bin/clevis-decrypt-tang
-rwxr-xr-x   1 root     root         5121 Oct 28  2020 usr/bin/clevis-decrypt-tpm2
-rw-r--r--   1 root     root        32885 Jan 20  2023 usr/bin/clevis-luks-common-functions
-rwxr-xr-x   1 root     root         2115 Oct 28  2020 usr/bin/clevis-luks-list
-rwxr-xr-x   1 root     root         2466 Jan 20  2023 usr/libexec/clevis-luks-askpass
-rw-r--r--   1 root     root          302 Oct 28  2020 usr/lib/systemd/system/clevis-luks-askpass.path
-rw-r--r--   1 root     root          190 Jan 20  2023 usr/lib/systemd/system/clevis-luks-askpass.service

lsinitrd /boot/initramfs-$(uname -r).img etc/cmdline.d/01-default.conf
rd.neednet=1

systemctl status clevis-luks-askpass.path
● clevis-luks-askpass.path - Forward Password Requests to Clevis Directory Watch
Loaded: loaded (/usr/lib/systemd/system/clevis-luks-askpass.path; enabled; vendor preset: enabled)
Active: active (waiting) since Fri 2024-02-02 13:54:24 UTC; 24s ago
 Docs: man:clevis-luks-unlockers(7)

clevis luks list -d /dev/vda2
1: sss '{"t":1,"pins":{"tang":[{"url":"http://tang1"},{"url":"http://tang2"}]}}'

sergio-correia commented 9 months ago

At first glance, it looks OK. Could you also check journalctl to see if any useful information shows up, please? (I forgot to mention it beforehand, but feel free to redact any IP addresses, if required.)

journalctl -xf -u clevis-luks-askpass.service

aheath1992 commented 9 months ago

Feb 02 15:21:20 clevis-test.ansi-001.prod.iad2.dc.redhat.com clevis-luks-askpass[11941]: Error communicating with the server http://tang1
Feb 02 15:21:20 clevis-test.ansi-001.prod.iad2.dc.redhat.com clevis-luks-askpass[11942]: Error communicating with the server http://tang2

telnet tang1 80
Trying tang1...
Connected to tang1.
Escape character is '^]'.

telnet tang2 80
Trying tang2...
Connected to tang2.
Escape character is '^]'.
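
A successful telnet only shows that the TCP port is open. Tang also serves its advertisement over HTTP at `/adv`, so fetching that from the client is a slightly stronger check. The following is a sketch under assumptions: `tang_adv_url` and `check_tang` are hypothetical helper names, `curl` is assumed to be installed, and the check runs in the booted system, so it says nothing about whether the network is up in early boot.

```shell
# Build the advertisement URL for a Tang server base URL,
# tolerating a single trailing slash.
tang_adv_url() {
  printf '%s/adv\n' "${1%/}"
}

# Fetch the advertisement and report whether the HTTP request succeeded.
check_tang() {
  url="$(tang_adv_url "$1")"
  if curl -fsS --max-time 5 -o /dev/null "$url"; then
    printf 'ok: %s\n' "$url"
  else
    printf 'FAILED: %s\n' "$url"
    return 1
  fi
}
```

For the servers in this report that would be `check_tang http://tang1` and `check_tang http://tang2`.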

richm commented 3 months ago

@sergio-correia any idea?

xeluior commented 3 months ago

I've been seeing this as well. I have found that adding the _netdev option to the relevant fstab entry allows the unlocking to proceed (tested on Rocky 8 and 9 clients, both early and late boot, and Debian 11 and 12 clients, late boot only). I have added an awk script task into my playbook after the role runs to add this option.

- name: Update fstab options
  ansible.builtin.shell: |
    # Look up the mapper name that crypttab assigns to this device.
    name="$(awk '$2 == "{{ item.device }}" { print $1 }' /etc/crypttab | head -n 1)"
    # Append _netdev to the options of the matching fstab entry, if missing.
    awk -v mapper_path="/dev/mapper/$name" '{
      if ($1 == mapper_path && index($4, "_netdev") == 0) {
        $4 = $4 ",_netdev"
      }
      print
    }' /etc/fstab > /tmp/fstab
    # Report a change only when the rewritten file differs from the original.
    diff -q /tmp/fstab /etc/fstab || echo changed
    mv /tmp/fstab /etc/fstab
  loop: '{{ nbde_client_bindings }}'
  register: fstab
  changed_when: '"changed" in fstab.stdout'

I believe this behavior is tied to systemd's ordering of mount units: it orders fstab entries marked _netdev after network-online.target, which is necessary for clevis to work. (ref)

sergio-correia commented 3 months ago

Yeah, this is likely in the right direction.

We may need to have _netdev in crypttab, to mark the device as requiring network. To prevent a dependency loop, we would also need to add _netdev to fstab, if the device is specified there for a mount point. Additionally, we may have to enable the remote-cryptsetup.target unit.
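
Since the options live in the fourth field of both /etc/crypttab and /etc/fstab, one filter could cover both files. This is a minimal sketch, not what the role does: `add_netdev` is a hypothetical helper, it matches entries by their first field only (so fstab entries written as UUID= would need that string as the key), and it leaves entries with fewer than three fields untouched.

```shell
# Append _netdev to the options (4th field) of the entry whose first field
# equals $1. Reads the table on stdin and writes the result to stdout.
# Modified lines are re-joined with single spaces; other lines pass through.
add_netdev() {
  awk -v key="$1" '
    $1 == key && NF >= 3 && index($4, "_netdev") == 0 {
      $4 = ($4 == "" ? "_netdev" : $4 ",_netdev")
    }
    { print }
  '
}
```

For example, `add_netdev luks-example < /etc/crypttab` and `add_netdev /dev/mapper/luks-example < /etc/fstab` print the updated tables, where `luks-example` stands in for the real mapper name. Running the filter a second time changes nothing, because entries that already contain _netdev are skipped.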

xeluior commented 2 months ago

I have some more information from testing with this role that should probably be considered here. I did have to add the _netdev option in both /etc/fstab and /etc/crypttab for automatic unlock. This works fine; however, on systemd versions < 245, the crypttab generator creates an ordering problem with the dev-mapper-{name}.device unit that hangs shutdown indefinitely. This can be fixed by also adding the x-systemd.requires=systemd-cryptsetup@{name}.service option to the appropriate device in /etc/fstab. I have an Ansible-native solution in the playbook I used to deploy this, which I could turn into a PR, but it requires several new options per device in nbde_client_bindings so that it can create the appropriate crypttab and fstab entries.

EDIT: the systemd issue mentioned above: https://github.com/systemd/systemd/issues/8472
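
The version-dependent workaround described above could be decided with a couple of small helpers. This is a sketch under assumptions: the function names are hypothetical, the 245 threshold is taken from the comment above, and the mapper name is assumed to contain no characters that systemd would escape in a unit name (otherwise `systemd-escape` would be needed).

```shell
# Print the numeric systemd version, e.g. 239, taken from the first line
# of `systemctl --version`.
systemd_version() {
  systemctl --version | awk 'NR == 1 { print $2 }'
}

# Succeed when the given systemd version is older than 245.
needs_requires_workaround() {
  [ "$1" -lt 245 ]
}

# Print the fstab option that ties a mount to its cryptsetup unit.
cryptsetup_requires_option() {
  printf 'x-systemd.requires=systemd-cryptsetup@%s.service\n' "$1"
}
```

A caller would then add the output of `cryptsetup_requires_option <name>` to the fstab options only when `needs_requires_workaround "$(systemd_version)"` succeeds.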

linux-system-roles / nbde_client

New system not able to unlock after running role #150