letsencrypt / boulder

An ACME-based certificate authority, written in Go.
Mozilla Public License 2.0

ca: implement deep health check #6929

Open pgporada opened 1 year ago

pgporada commented 1 year ago

Discussing health checks with @beautifulentropy, I got nerd sniped and went down a rabbit hole. Here's a canned CA health check I came up with.


Set up softhsm2

export SOFTHSM2_CONF=$PWD/softhsm.conf
mkdir softhsm
echo "directories.tokendir = ./softhsm/" > "$SOFTHSM2_CONF"
softhsm2-util --init-token --slot 0 --label "intermediate signing key (ecdsa)" --so-pin 1234 --pin 1234
softhsm2-util --init-token --slot 1 --label "intermediate signing key (rsa)" --so-pin 1234 --pin 1234
softhsm2-util --init-token --slot 2 --label "my hopes and dreams" --so-pin 1234 --pin 1234

This bit of code simulates a CA retrieving some data from an HSM. There's a bit of gymnastics when iterating over the output of GetSlotList: it returns a []uint of slot IDs, and the loop deliberately skips the last element. In the output below that last element is 3, which matches the number of tokens initialized above, but afaict it is the ID of the extra uninitialized slot that SoftHSM always exposes rather than a count reported by the HSM. Some of this is example code taken from the miekg/pkcs11 Go docs.

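package main

import (
    "fmt"

    "github.com/miekg/pkcs11"
)

func main() {
    // pkcs11.New returns nil if the module cannot be loaded.
    p := pkcs11.New("/usr/lib/softhsm/libsofthsm2.so")
    if p == nil {
        panic("failed to load PKCS#11 module")
    }
    err := p.Initialize()
    if err != nil {
        panic(err)
    }

    // Deferred calls run in reverse order: Finalize first, then Destroy.
    defer p.Destroy()
    defer p.Finalize()

    // List only the slots that currently have a token present.
    slots, err := p.GetSlotList(true)
    if err != nil {
        panic(err)
    }
    fmt.Println(slots)

    // Iterate over every slot except the last entry in the list.
    numSlots := len(slots) - 1
    for i := 0; i < numSlots; i++ {
        info, err := p.GetTokenInfo(slots[i])
        if err != nil {
            panic(err)
        }
        fmt.Println(info)
    }
}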

The output of that code is

$ go run main.go 
[688307114 1096782350 1107801826 3]
{intermediate signing key (rsa) SoftHSM project SoftHSM v2 c7dd7943a906bbaa 1069 0 18446744073709551615 0 18446744073709551615 255 4 18446744073709551615 18446744073709551615 18446744073709551615 18446744073709551615 {2 6} {2 6} 2023060920510900}
{my hopes and dreams SoftHSM project SoftHSM v2 01fe4de9c15f920e 1069 0 18446744073709551615 0 18446744073709551615 255 4 18446744073709551615 18446744073709551615 18446744073709551615 18446744073709551615 {2 6} {2 6} 2023060920510900}
{intermediate signing key (ecdsa) SoftHSM project SoftHSM v2 bb8648814207b6e2 1069 0 18446744073709551615 0 18446744073709551615 255 4 18446744073709551615 18446744073709551615 18446744073709551615 18446744073709551615 {2 6} {2 6} 2023060920510900}

The boulder-ca's PKCS#11 config contains a credential which is essentially a cached PED key (physical key used to access the HSM during ceremonies). This is called running with an "activated partition". Calling GetSlotList allows us to look inside the HSM and see the slot(s)/partition(s). From there we can investigate all the returned slots with GetTokenInfo. We could say, "hey these partitions don't contain the key objects I expect, bail out" or something. Being able to list slots is pretty cool, but checking that the intermediate key object is available seems even better.
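
As a rough sketch of the "bail out" idea, the labels collected from GetTokenInfo could be compared against the labels the CA expects. The helper name missingLabels and the hard-coded label lists are made up for illustration, and matching on token labels is a weaker proxy than checking for the key objects themselves. The sketch uses only the standard library, so the found slice stands in for values that would come from the Label field of each TokenInfo.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// missingLabels returns the expected token labels that do not appear in found.
// PKCS#11 labels are space padded, so trailing spaces are trimmed defensively.
// This is a hypothetical helper, not part of Boulder or miekg/pkcs11.
func missingLabels(found, expected []string) []string {
	seen := make(map[string]bool, len(found))
	for _, l := range found {
		seen[strings.TrimRight(l, " ")] = true
	}
	var missing []string
	for _, l := range expected {
		if !seen[l] {
			missing = append(missing, l)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Assumed input: labels gathered by calling GetTokenInfo on each slot.
	found := []string{
		"intermediate signing key (rsa)",
		"my hopes and dreams",
	}
	expected := []string{
		"intermediate signing key (ecdsa)",
		"intermediate signing key (rsa)",
	}
	if m := missingLabels(found, expected); len(m) > 0 {
		fmt.Printf("unhealthy: missing tokens %q\n", m)
		return
	}
	fmt.Println("healthy")
}
```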

Example PKCS#11 config from integration tests.

$ cat ./.hierarchy/intermediate-signing-key-ecdsa.pkcs11.json
{"module": "/usr/lib/softhsm/libsofthsm2.so", "tokenLabel": "intermediate signing key (ecdsa)", "pin": "1234"}
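
For reference, a config of this shape could be loaded with encoding/json along these lines. The pkcs11Config struct and parseConfig function are illustrative names only and are not Boulder's actual types; the field names are taken from the JSON above.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// pkcs11Config mirrors the three fields in the example file.
// Hypothetical type, named here for illustration.
type pkcs11Config struct {
	Module     string `json:"module"`
	TokenLabel string `json:"tokenLabel"`
	PIN        string `json:"pin"`
}

// parseConfig decodes the JSON and rejects configs that lack the
// module path or token label.
func parseConfig(data []byte) (pkcs11Config, error) {
	var c pkcs11Config
	if err := json.Unmarshal(data, &c); err != nil {
		return c, err
	}
	if c.Module == "" || c.TokenLabel == "" {
		return c, errors.New("module and tokenLabel must be set")
	}
	return c, nil
}

func main() {
	raw := []byte(`{"module": "/usr/lib/softhsm/libsofthsm2.so", "tokenLabel": "intermediate signing key (ecdsa)", "pin": "1234"}`)
	c, err := parseConfig(raw)
	if err != nil {
		panic(err)
	}
	// The PIN is deliberately left out of the output.
	fmt.Printf("module=%s tokenLabel=%q\n", c.Module, c.TokenLabel)
}
```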

pgporada commented 1 year ago

After discussion during standup today, the check should instead perform an action that doesn't use HSM credentials. Too many failed logins to an HSM can lock it, leading to an unplanned staff datacenter trip to re-activate said HSM partition.

The deep health check is not meant to catch every failure mode, just perform a basic liveness check with another component. If our HSMs have some type of trivial call such as outputting version info, that would probably be a sufficient check.