letsencrypt / boulder

An ACME-based certificate authority, written in Go.
Mozilla Public License 2.0

sa: add authz reuse table #7715

Open jsha opened 2 months ago

jsha commented 2 months ago

Right now our authz2 table looks like this:

CREATE TABLE `authz2` (
  `id` bigint(20) UNSIGNED NOT NULL AUTO_INCREMENT,
  `identifierType` tinyint(4) NOT NULL,
  `identifierValue` varchar(255) NOT NULL,
  `registrationID` bigint(20) NOT NULL,
  `status` tinyint(4) NOT NULL,
  `expires` datetime NOT NULL,
  `challenges` tinyint(4) NOT NULL,
  `attempted` tinyint(4) DEFAULT NULL,
  `attemptedAt` datetime DEFAULT NULL,
  `token` binary(32) NOT NULL,
  `validationError` mediumblob DEFAULT NULL,
  `validationRecord` mediumblob DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `regID_expires_idx` (`registrationID`,`status`,`expires`),
  KEY `regID_identifier_status_expires_idx` (`registrationID`,`identifierType`,`identifierValue`,`status`,`expires`),
  KEY `expires_idx` (`expires`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
 PARTITION BY RANGE(id)
(PARTITION p_start VALUES LESS THAN (MAXVALUE));

Those indexes are pretty big, particularly the regID_identifier_status_expires_idx one. Among other problems, it has a datetime field, which is high cardinality.

One of the main things (only thing?) we use that index for is authz reuse. We can do that more efficiently with a separate table, and we can also design that table to better fit the assumptions of key-value storage: only one key, and it's the primary key.

Roughly speaking, this table would map:

(account, identifier) -> (authz ID, expiration)

Each time an authz is successfully validated, we would append or update a row in this table. Compared to the current system, this defers some amount of work until successful validation, which is nice because so many validations fail.

We can skip encoding the identifier type, because the only two identifier types we ever plan to support have completely non-overlapping syntax (hostnames and IP addresses).

If we deploy this table in a database that supports TTLs, we would set the TTL of this row to the expiration of the authz, and update it as new authzs are written to it. Old rows would automatically be removed by the database system. If we choose to deploy it in a database that does not support TTLs, we could prepend a rough granularity epoch (e.g. number of 90-day periods since Jan 2024), making the key (epoch, account, identifier). That would allow partitioning and dropping of old partitions.

If the key is (epoch, account, identifier), that means querying for authzs to reuse would have to query for multiple keys: one in the current epoch, one in the previous epoch, and potentially longer-ago epochs. If we assume the authz lifetime is always less than the epoch (which would be true with our current 30-day authzs and a hypothetical epoch of 90 days), then we would only ever have to query for two epochs, current and previous.

To find authzs for reuse for a new order, we would query for the appropriate account and identifier, check the result's expiration, then fetch the corresponding authz (to check whether it has been deactivated). This will require one additional round trip compared to our current system, which queries the authz2 table directly and so gets status right away. This can be a batch query for several identifiers (using IN syntax) or it could be several parallel queries.

One refinement could be: when an authz2 is deactivated, we delete its row in the reuse table. That would allow us to directly incorporate returned authzs into a new order without a second query to check their status.

jsha commented 2 months ago

Schema could look like:

CREATE TABLE `authzReuse` (
  `accountID_identifier` VARCHAR(300) NOT NULL,
  `authzID` VARCHAR(255) NOT NULL,
  `expires` DATETIME NOT NULL,
  PRIMARY KEY (`accountID_identifier`)
)

Note that authzID is string-valued to accommodate possible future random generation of authz IDs (though we could also use random BIGINT generation, with low probability of collision due to a 64-bit space).

As per usual, we should create the new schema in the sa/db-next/boulder_sa/ directory, and gate the SA code that references it behind a new feature flag (features/features.go) that is set in test/config-next/sa.json.

This will require updating:

I said in the original comment:

Each time an authz is successfully validated, we would append or update a row in this table. Compared to the current system, this defers some amount of work until successful validation, which is nice because so many validations fail.

This was incorrect, because we currently allow pending authz reuse as well. So to faithfully implement our current behavior we would have to write to the authzReuse table on every authz creation (i.e. also update SQLStorageAuthority.NewOrderAndAuthzs). When creating a pending authz we might need to do a check-and-set on the authzReuse table to make sure we don't overwrite the record of a valid authz with the record of a pending authz. But probably we could rely on this assumption: so long as authz reuse is working properly, we would never be creating a pending authz if there was a valid authz available.

The other possibility regarding pending authz reuse would be to remove it. It exists solely for our benefit[^1]: in the case of clients that are repeatedly requesting the same names but not fulfilling them, it allows us to avoid creating database objects. However, these days order reuse probably fulfills the same purpose and we could turn off pending authz reuse without much impact. We should measure this, and if it checks out we should do it as a precursor to this issue.

Also in researching this issue I was reminded that our order-creation process is weird: The RA is responsible for checking if there are any reusable authzs (by calling the SA.GetAuthorizations2), and then calls SA.NewOrderAndAuthzs with a NewOrderRequest that includes a list of v2Authorizations. In other words, the RA tells the SA which authorizations should get reused for a given order. This could probably be simplified by making the SA solely responsible for authz reuse (when applicable).

[^1]: Of course, valid authz reuse is different and some subscribers do rely on it.

aarongable commented 1 month ago

But probably we could rely on this assumption: so long as authz reuse is working properly, we would never be creating a pending authz if there was a valid authz available.

This isn't quite true today -- we create a new (pending) authz even when there is an existing (valid) authz if that valid authz is going to expire very soon, to ensure that the order (whose lifetime is limited by the shortest authz attached to it) has a reasonably-far-out expiration time.

jsha commented 1 month ago

This isn't quite true today -- we create a new (pending) authz even when there is an existing (valid) authz if that valid authz is going to expire very soon, to ensure that the order (whose lifetime is limited by the shortest authz attached to it) has a reasonably-far-out expiration time.

You're right. I should have said "so long as authz reuse is working properly, we would never be creating a pending authz if there was a valid authz reusable."

In other words, when we create that new pending authz, it's okay to overwrite the authzReuse entry pointing at an almost-expired valid authz, because no future request is going to reuse that authz.

jsha commented 4 weeks ago

Edited the initial post to add:

If the key is (epoch, account, identifier), that means querying for authzs to reuse would have to query for multiple keys: one in the current epoch, one in the previous epoch, and potentially longer-ago epochs. If we assume the authz lifetime is always less than the epoch (which would be true with our current 30-day authzs and a hypothetical epoch of 90 days), then we would only ever have to query for two epochs, current and previous.