Open pgporada opened 1 year ago
After discussion during standup today, the check should instead perform an action that doesn't use HSM credentials. Too many failed logins to an HSM can lock it, leading to an unplanned staff datacenter trip to re-activate said HSM partition.
The deep health check is not meant to catch every failure mode, just perform a basic liveness check with another component. If our HSMs have some type of trivial call such as outputting version info, that would probably be a sufficient check.
Discussing health checks with @beautifulentropy, I got nerd sniped and went down a rabbit hole. Here's a canned CA health check I came up with.
Set up softhsm2
This bit of code simulates a CA retrieving some data from an HSM. There's a bit of gymnastics ranging over the output from `GetSlotList` because it returns a uint and afaict `range` requires an int. Thankfully the last element of the slice is the number of slots returned from the HSM. Some of this is example code taken from miekg/pkcs11 go docs.
The output of that code is
The boulder-ca's PKCS#11 config contains a credential which is essentially a cached PED key (physical key used to access the HSM during ceremonies). This is called running with an "activated partition". Calling `GetSlotList` allows us to look inside the HSM and see the slot(s)/partition(s). From there we can investigate all the returned slots with `GetTokenInfo`. We could say, "hey these partitions don't contain the key objects I expect, bail out" or something. Being able to list slots is pretty cool, but checking that the intermediate key object is available seems even better.
Example PKCS#11 config from integration tests.
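Returning to the slot and token inspection described above, here is a minimal sketch of the "bail out if the expected partition isn't there" idea. It only matches on token labels; it does not look up key objects. Everything named here is an assumption for illustration: the `tokenInspector` interface, the `TokenLabel` adapter (which would wrap `GetTokenInfo` in real code), the `fakeHSM` type, and the `"intermediate"` label. `GetSlotList` is declared with the same signature as the method on miekg/pkcs11's `*Ctx`, so the sketch runs against a fake without needing softhsm2.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenInspector is a hypothetical interface for this sketch. GetSlotList
// mirrors the method of the same name on miekg/pkcs11's *Ctx; TokenLabel is
// an assumed adapter that would wrap GetTokenInfo and return the label.
type tokenInspector interface {
	GetSlotList(tokenPresent bool) ([]uint, error)
	TokenLabel(slotID uint) (string, error)
}

// fakeHSM maps slot IDs to token labels so the sketch runs without a module.
type fakeHSM map[uint]string

func (f fakeHSM) GetSlotList(bool) ([]uint, error) {
	slots := make([]uint, 0, len(f))
	for id := range f {
		slots = append(slots, id)
	}
	return slots, nil
}

func (f fakeHSM) TokenLabel(slotID uint) (string, error) {
	label, ok := f[slotID]
	if !ok {
		return "", fmt.Errorf("no token in slot %d", slotID)
	}
	return label, nil
}

// findToken returns the slot whose token carries the wanted label.
// Neither call involves a login.
func findToken(h tokenInspector, want string) (uint, error) {
	slots, err := h.GetSlotList(true)
	if err != nil {
		return 0, fmt.Errorf("listing slots: %w", err)
	}
	// Ranging over a []uint yields an int index and a uint value.
	for _, id := range slots {
		label, err := h.TokenLabel(id)
		if err != nil {
			return 0, fmt.Errorf("reading token in slot %d: %w", id, err)
		}
		// PKCS#11 token labels are fixed width and blank padded.
		if strings.TrimSpace(label) == want {
			return id, nil
		}
	}
	return 0, fmt.Errorf("no token labeled %q in %d slot(s)", want, len(slots))
}

func main() {
	hsm := fakeHSM{0: "intermediate      ", 1: "other"}
	slot, err := findToken(hsm, "intermediate")
	fmt.Println(slot, err)
}
```

Checking that a specific key object exists would go one step further than this and would need a session on the slot, which is why the sketch stops at token labels.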