Open juagargi opened 1 year ago
Some notes and a proposal:
Fix the update process so that:
The current update process has the following steps:

- `domainEntries` table in DB with the material from (1).
- `tree` table in DB with (3).

Steps 2 to 4 happen in small batches, i.e. a batch of e.g. 1000 elements is taken from step 1 and passed through steps 2, 3 and 4.
This batch processing should happen in parallel, but it seldom does, as the update of a certificate C for a domain D requires retrieval of the certificate collection for D, insertion of C into D (following certain rules), and writing the result back to the DB in the `domainEntries` table.
We propose to change it to:

- `upsert` or similar) a new record per new certificate C and domain D.
- `dirty` (formerly known as the `updates` table).
- `dirty`.
- `tree` table in DB.
- `dirty` table.

For performance reasons, no foreign keys exist in any table.
- `certs` table
  - `id`: PK, this is the SHA256 of the certificate.
  - `domain`: index, this is the SHA256 of the domain.
  - `serialized`: this is the certificate, serialized.
  - `parent`: this is the parent certificate, in the trust chain, or `NULL` if root.
- `domains` table. This table is constructed in DB from the `certs` table
  - `domain`: PK, SHA256 of the domain
  - `hash`: SHA256 of the serialized certificate collection for the domain. This comes from all the certificates that have their `certs.domain` equal to this `domains.domain`, serialized following certain rules.
- `tree` table, remains the same as before
  - `id`: PK, auto increment.
  - `key`: index, whatever the SMT library uses as key, 32 bytes.
  - `value`: whatever the SMT library uses as value.
- `root` table. Should contain zero or one elements.
  - `id`: PK, 32 bytes, SHA256 of the root of the SMT.
- `dirty` table
  - `key`: PK, SHA256 of each of the modified domains.

The `dirty` table should always be non-empty when the SMT update process starts.
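To make the hashed columns concrete, here is a minimal Go sketch of how a `certs` row could be derived. The `CertRow` struct, the `NewCertRow` helper and the Go types are assumptions for illustration, not existing code. In particular it assumes that `parent` holds the `id` of the parent certificate and that the domain is hashed as its plain name string.

```go
package main

import "crypto/sha256"

// CertRow mirrors one record of the proposed certs table.
// Field names follow the column names; the Go types are assumptions.
type CertRow struct {
	ID         [32]byte  // PK: SHA256 of the serialized certificate
	Domain     [32]byte  // indexed: SHA256 of the domain name
	Serialized []byte    // the certificate, serialized
	Parent     *[32]byte // assumed: ID of the parent certificate, nil (NULL) if root
}

// NewCertRow builds the row for a certificate and one domain.
// It is a hypothetical helper, named here only for this sketch.
func NewCertRow(domain string, serialized []byte, parent *[32]byte) CertRow {
	return CertRow{
		ID:         sha256.Sum256(serialized),
		Domain:     sha256.Sum256([]byte(domain)),
		Serialized: serialized,
		Parent:     parent,
	}
}
```

With this shape, two certificates for the same domain share `Domain` but have different `ID`s, and a leaf points to its issuer through `Parent`. The `domains.hash` column is left out on purpose, since the serialization rules for the certificate collection are not spelled out here.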
Keeping in mind that this "bunch of certificates" could easily be 10^9 entries, spread over multiple CSV files, we cannot keep everything in memory. We will piggyback on the DB to keep track of the updated domains, and for that we will have two main steps inside one update cycle:
- `certs` if the certificate doesn't exist yet.
- `tree` table. For that we have to load and then update the SMT structure from the DB.

We will process all the certificates in batches, whose size depends on the row count of the CSV files (if local ingest) or on the download batch size. Let's pick 10^5 as a possible batch size example.
- `certs` and `dirty` tables, until we are done with all batches.
- `certs` table directly.
- `certs.domain` field into the `dirty.key` table.

Now we insert the certificates that are part of the trust chain:
- `certs` and `dirty` tables, until we are done with all batches.
- `certs` table.
- `certs` table.
- `upsert` them is more performant on average (depends on the number of times we encounter existing identical certificates).
- `dirty` table.

We wait until all certificates are inserted into their appropriate rows in `certs` and `dirty`.
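The batch insertion described above can be sketched with an in-memory stand-in for the two tables. Everything named here (`Cert`, `Store`, `InsertBatch`) is hypothetical and only illustrates the idea: insert a certificate if it doesn't exist yet, and record the hash of its domain as dirty. The sketch assumes one domain per certificate, so it sidesteps the question of certificates with several names, and it assumes a domain only becomes dirty when a certificate is actually new.

```go
package main

import "crypto/sha256"

// Cert is a hypothetical parsed input record: one certificate and one domain.
type Cert struct {
	Domain     string
	Serialized []byte
}

// Store is an in-memory stand-in for the certs and dirty tables.
type Store struct {
	Certs map[[32]byte][]byte   // certs.id -> certs.serialized
	Dirty map[[32]byte]struct{} // set of dirty.key values
}

// NewStore returns an empty Store.
func NewStore() *Store {
	return &Store{
		Certs: make(map[[32]byte][]byte),
		Dirty: make(map[[32]byte]struct{}),
	}
}

// InsertBatch stores every certificate of the batch that is not present yet
// and marks its domain as dirty. It returns the number of new certificates.
func (s *Store) InsertBatch(batch []Cert) int {
	inserted := 0
	for _, c := range batch {
		id := sha256.Sum256(c.Serialized)
		if _, exists := s.Certs[id]; exists {
			continue // identical certificate already stored, nothing changes
		}
		s.Certs[id] = c.Serialized
		s.Dirty[sha256.Sum256([]byte(c.Domain))] = struct{}{}
		inserted++
	}
	return inserted
}
```

Because the dirty set is keyed by the domain hash, many certificates for the same domain collapse into a single dirty entry, and re-running the same batch is a no-op. In the real DB the existence check and insert would presumably be one statement (`upsert` or similar) rather than two steps.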
This process is quite straightforward, as done previously, with an SMT updater, which is an object that maintains the mutexes, etc. required for updating using multiple goroutines. We will divide the job into batches, e.g. batches of 10^6 elements. These batches will be processed in parallel, using sub-batches (because sending one key-value pair at a time to a channel to be picked up by a goroutine is probably going to be too much overhead, so we will group them in sub-batches of e.g. 10K elements).
The steps done for each batch are:

- `root` table.
- `dirty` table:
  - `dirty.key` as key in the tree entry structure.
  - `domains.hash` and use it as value.

After all batches have been processed, we can commit the SMT to the DB. We may want to disable indices before doing this (we would have to test in real life if it improves performance).
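As a rough illustration of the sub-batch idea, the following Go sketch splits a batch of key-value pairs into sub-batches and hands each sub-batch, rather than each single pair, to a pool of goroutines over a channel. `KeyValue`, `SubBatches` and `ProcessBatch` are made-up names, and the `update` callback stands in for whatever the SMT updater would do with a sub-batch. It is not the actual updater.

```go
package main

import "sync"

// KeyValue is one tree entry: dirty.key as key, domains.hash as value.
type KeyValue struct {
	Key   [32]byte
	Value [32]byte
}

// SubBatches splits batch into consecutive slices of at most size elements.
func SubBatches(batch []KeyValue, size int) [][]KeyValue {
	if size < 1 {
		size = 1 // guard against a non-terminating loop
	}
	var out [][]KeyValue
	for start := 0; start < len(batch); start += size {
		end := start + size
		if end > len(batch) {
			end = len(batch)
		}
		out = append(out, batch[start:end])
	}
	return out
}

// ProcessBatch sends the sub-batches of batch to a pool of goroutines.
// Each goroutine calls update once per sub-batch it receives, so update
// must be safe for concurrent use. It returns when all calls are done.
func ProcessBatch(batch []KeyValue, subBatchSize, workers int, update func([]KeyValue)) {
	if workers < 1 {
		workers = 1
	}
	ch := make(chan []KeyValue)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for sub := range ch {
				update(sub)
			}
		}()
	}
	for _, sub := range SubBatches(batch, subBatchSize) {
		ch <- sub // one channel send per sub-batch, not per element
	}
	close(ch)
	wg.Wait()
}
```

With a batch of 10^6 elements and sub-batches of 10K, this does 100 channel sends instead of 10^6. Whether that is the right trade-off would have to be measured.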
One thing I'm a bit confused about is the `domain` field in the `certs` table.

- `CommonName` field? What if the `CommonName` field of a certificate is empty? Would it then be set to NULL? Or the first `SAN` entry?
- `SubjectAlternativeName` (`SAN`) domains? Would you create a separate entry in the `certs` table for each `SAN`?

In some cases we have tens of `SAN` entries for a single certificate, which means we would store the same serialized certificate multiple times in the `certs` table. On average, the number of domains is quite low (~1-2), so that might be acceptable. But it can become quite big (see #domains hist).

Btw., here are hists for the #intermediate certs and validity times, which can be useful for performance estimations.
The update process disallows pushing certificate data directly to the DB, unless the previous certificates for existing domains are fetched (and transformed together with the new certificate). This makes concurrency hard, and disallows pushing data from CSV files directly to the DB.
A proposal is to have one record per certificate.