netsec-ethz / fpki

4 stars 1 forks source link

DB Design prevents Concurrency #29

Open juagargi opened 1 year ago

juagargi commented 1 year ago

The update process disallows pushing certificate data directly to the DB, unless the previous certificates for existing domains are fetched (and transformed together with the new certificate). This makes concurrency hard, and disallows pushing data from CSV files directly to the DB.

A proposal is to have one record per certificate.

juagargi commented 1 year ago

Some notes and a proposal:

Fix the update process so that:

  1. It can easily be processed by several routines.
  2. Data can be pushed directly from CSV files into the DB.

The current update process has the following steps:

  1. Obtain the data (download, from file, etc).
  2. Update the domainEntries table in DB with the material from (1).
  3. Update the Sparse Merkle Tree with the material from (2).
  4. Update the tree table in DB with (3).

Steps 2 to 4 happen in small batches. I.e. a batch of e.g. 1000 elements is taken from step 1, and passed thru steps 2, 3 and 4. This batch processing should happen in parallel, but it seldom does, as the update of a certificate C for a domain D requires retrieval of the certificate collection for D, insertion of C into D (following certain rules), and back to the DB in the domainEntries table.

We propose to change it to:

  1. Obtain the data.
  2. Create (upsert or similar) a new record per new certificate C and domain D.
  3. Write the modified domains into a table dirty (formerly known as the updates table).
  4. In DB and via a stored procedure, serialize the certificate collection (following certain rules) and write it, plus its SHA256, to the table.
  5. Wait until all batches have finished.
  6. Update the SMT with the material from (4), and using the domains in dirty.
  7. Store the tree table in DB.
  8. Truncate the dirty table.

Tables

For performance reasons, no foreign keys exist in any table.

The dirty table should always be non-empty when the SMT update process starts.

Detailed Steps to Update a Bunch of Certificates

Keeping in mind that this "bunch of certificates" could easily be 109 entries, spread into multiple CSV files, we cannot keep every thing in memory. We will piggyback into the DB to keep track of the updated domains, and for that we will have two main steps inside one update cycle:

  1. Update the certificates themselves. This means adding a new entry in certs if the certificate doesn't exist yet.
  2. Update the tree table. For that we have to load and then update the SMT structure from the DB.

1. Updating the Certificates

We will process all the certificates in batches, whose size depends on the row count of the CSV files (if local ingest) of download batch size. Let's pick 105 as a possible batch size example.

If we have CSV files, for each one of the batches:

  1. Disable indices on the certs and dirty tables, until we are done with all batches.
  2. The trust chain will be only referenced by the entries. The content of the certificates that are part of the trust chain is inserted later.
  3. Insert the CSV into the certs table directly.
  4. If step 1 is not possible, create a temporary CSV file holding the appropriate values, and then do step 1.
  5. Insert only the certs.domain field into the dirty.key table.

Now we insert the certificates part of the trust chain:

  1. TODO(juagargi) This would be much better done if the local files are already in the format we expect, uncompressed.

If we instead have in-memory downloads, we do:

  1. Disable indices on the certs and dirty tables, until we are done with all batches.
  2. We will treat the data to split it into two parts: domains pertaining to the certificates, and certificates themselves. The certificates can come from the domains or from the trust chain.
  3. We divide the data into batches of e.g. 1000 certificate entries.
  4. For each batch:
    1. Get a bit mask of the already present certificates in the certs table.
    2. For those certificates not present, insert them into the certs table.
    3. TODO(juagargi) check if upsert them is more performant on average (depends on the number of times we encounter existing identical certificates).
    4. We update the dirty table.
    5. We group all the trust chain certificates into a map.
    6. We get a bit mask of those certificates not present in the DB.
    7. We insert those certificates not present in the bit mask.

We wait until all certificates are inserted into their appropriate rows in certs and dirty.

2. Updating the SMT

This process is quite straight forward, as done previously, with an SMT updater, which is an object that maintains the mutexes, etc. required for the updating using multiple go routines. We will divide the job into batches, e.g. batches of 106 elements. These batches will be processed in parallel, using sub-batches (because probably sending one key-value to a channel to be picked up by a goroutine is going to be too much overhead, so we will group in sub-batches of e.g. 10K elements).

The steps done for each batch are:

  1. Load the root from root table.
  2. For each entry in the dirty table:
    1. Use the dirty.key as key in the tree entry structure.
    2. Retrieve its domains.hash and use it as value.
  3. Wait for all the routines to finish their sub-batch.

After all batches have been processed, we can commit the SMT to the DB. We may want to disable indices before doing this (we would have to test in real life if it improves performance).

cyrill-k commented 1 year ago

One thing I'm a bit confused about is the domain field in the certs table.

Btw., here are hists for the #intermediate certs and validity times, which can be useful for performance estimations.