
nexus oid cache is broken across schema migrations #5561

Closed: davepacheco closed this issue 6 months ago

davepacheco commented 6 months ago

We updated dogfood today and ran into a new problem with instance provisions:

error_message_internal = saga ACTION error at node "sled_id": unexpected database error: type with ID 218 does not exist

After some digging, it appears that Nexus's OID caching is hanging onto a stale value that was changed by the schema migration. Namely, a sequence like this happens during the upgrade:

1. Nexus starts up and establishes database connections, populating each connection's OID cache from the pre-migration schema.
2. The schema migration drops and re-creates a type, so the type name now maps to a new OID.
3. The running Nexus instances keep using the cached, now-stale OID, and queries that use the type fail.

The workaround is to restart Nexus instances after this happens: when they come back up, they re-populate their cache with the correct value. The real fix will be to somehow invalidate this cache after schema migrations, but we're still figuring out how to do that.

davepacheco commented 6 months ago

Summary

Synopsis: Any time the database OID for a particular type name changes, as happened with the schema migration in #5287, queries that use this type, executed by any running Nexus instance, will fail with a 500 error or equivalent. The impact is pretty severe.

This problem was introduced in #4735 and recently triggered by #5287. Other schema migrations that change the OID for a type name will trigger this again.

Workaround: Restart all Nexus instances.

Plan: Fixing this properly is non-trivial. More on this below. For now, we're going to disable the proactive cache fill, which should mitigate the problem in practice at the expected cost of an extra second or so of wall clock time for Nexus integration tests that use the database. See #5565.

Props to @andrewjstone for foreseeing this class of problem.

Problem details

Background: OIDs, prepared statements, Diesel's OID caching

Database users typically refer to tables, views, and types by their name. For example, Omicron uses an enum type called sled_resource_kind. Internally, CockroachDB (like PostgreSQL) maps these names to OIDs, or object identifiers.

While people usually write SQL that refers to the names, the OIDs are used explicitly in some client-server interactions. In particular, they appear to be used in the execution of prepared statements. Concretely, when you run this Rust code:

diesel::insert_into(dsl::sled_resource)
    .values(SledResource {
        id: Uuid::new_v4(),
        sled_id: Uuid::new_v4(),
        kind: SledResourceKind::Instance,
        // ... remaining fields elided ...
    })

Diesel appears to execute that as a prepared statement. That means that instead of stringifying values like the uuids or SledResourceKind::Instance directly into the SQL and sending the whole SQL string to the server, it first sends what is basically a SQL template to the server that looks like this:

INSERT INTO sled_resource (id, sled_id, kind, ...) VALUES ($1, $2, $3, ...)

The server recognizes $1, $2, etc. as placeholders that will be filled in later, whenever the prepared statement gets executed. Those values are called bind parameters. (Prepared statements are an important mechanism for avoiding SQL injection because the values never get into the SQL string.)

Diesel then executes the statement that it just prepared, using the bind parameters ($1 will come from Uuid::new_v4(), $3 will come from SledResourceKind::Instance, etc.). In serializing the values, it appears to include each type's OID.
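To make that concrete, here's a self-contained sketch (diesel 2.x with its uuid support enabled; the table definition is a simplified stand-in for the real one, omitting the enum column since custom types need extra plumbing) that prints the SQL Diesel would prepare, placeholders and all:

use diesel::prelude::*;
use uuid::Uuid;

diesel::table! {
    sled_resource (id) {
        id -> Uuid,
        sled_id -> Uuid,
    }
}

fn main() {
    use sled_resource::dsl;

    let query = diesel::insert_into(dsl::sled_resource)
        .values((dsl::id.eq(Uuid::new_v4()), dsl::sled_id.eq(Uuid::new_v4())));

    // Prints something like:
    //   INSERT INTO "sled_resource" ("id", "sled_id") VALUES ($1, $2)
    //   -- binds: [<uuid>, <uuid>]
    println!("{}", diesel::debug_query::<diesel::pg::Pg, _>(&query));
}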

All of this means that although people don't have to deal with OIDs, clients may be aware of them. In order to execute the above INSERT, Diesel needs to know the OID for the sled_resource_kind enum. This can be queried from the database -- there's an explicit pg_type catalog table that includes the name and OID of every known type.
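For example, here's a minimal sketch (diesel 2.x, synchronous PgConnection) of doing that lookup by hand; the helper name is ours, but pg_type and its columns are real:

use diesel::prelude::*;
use diesel::pg::PgConnection;
use diesel::sql_types::{Oid, Text};

#[derive(QueryableByName, Debug)]
struct PgTypeRow {
    #[diesel(sql_type = Oid)]
    oid: u32,
    #[diesel(sql_type = Text)]
    typname: String,
}

// Hypothetical helper: fetch the OID for a type by name, the same way a
// client driver can.
fn lookup_type_oid(conn: &mut PgConnection, name: &str) -> QueryResult<PgTypeRow> {
    diesel::sql_query("SELECT oid, typname FROM pg_type WHERE typname = $1")
        .bind::<Text, _>(name)
        .get_result(conn)
}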

Now, rather than querying the OIDs every time they're needed, Diesel caches the mapping between type name and OID. (Uh oh.)

Beyond that, Nexus proactively populates this cache with the OIDs for all type names in the omicron database -- specifically to avoid the one-off queries that would otherwise populate the cache lazily, type by type, on first use. This cache is per-connection, and it's populated every time a new connection is established.
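In sketch form, that proactive fill is a connection-pool hook along these lines (plain bb8 shown for brevity -- Nexus actually goes through async-bb8-diesel -- and fill_oid_cache is a hypothetical stand-in for the real cache-population logic):

use async_trait::async_trait;

#[derive(Debug)]
struct OidCachePopulator;

#[async_trait]
impl<C, E> bb8::CustomizeConnection<C, E> for OidCachePopulator
where
    C: Send + 'static,
    E: 'static,
{
    // Runs once for each connection the pool establishes -- which is
    // exactly why a schema change after this point leaves stale OIDs.
    async fn on_acquire(&self, conn: &mut C) -> Result<(), E> {
        fill_oid_cache(conn).await
    }
}

// Hypothetical: query pg_type for all of omicron's types and store each
// name -> OID mapping in the connection's Diesel metadata cache.
async fn fill_oid_cache<C, E>(_conn: &mut C) -> Result<(), E> {
    Ok(())
}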

The problem

Okay, so:

The problem is that the mapping between a type name and its OID can change. In #5287, we wanted to remove one of the variants of a database enum. You can't do this directly, so the migration first dropped the type and then added it back with fewer variants. To do this, it had to first remove the column from the table that was using it and then re-add it later. So the whole sequence looks like this:

ALTER TABLE omicron.public.sled_resource DROP COLUMN IF EXISTS kind;
DROP TYPE IF EXISTS omicron.public.sled_resource_kind;
CREATE TYPE IF NOT EXISTS omicron.public.sled_resource_kind AS ENUM ('instance');
ALTER TABLE omicron.public.sled_resource
    ADD COLUMN IF NOT EXISTS
        kind omicron.public.sled_resource_kind
        NOT NULL
        DEFAULT 'instance';
ALTER TABLE omicron.public.sled_resource ALTER COLUMN kind DROP DEFAULT;

After dropping sled_resource_kind and adding a new type called sled_resource_kind, the new type will have a different OID than the previous one. This means that if Diesel had cached the OID for sled_resource_kind before all this, then after all this, that cache entry would refer to an OID for a type that no longer exists. That's exactly what happened here.
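Using the hypothetical lookup_type_oid helper from the earlier sketch, the effect is easy to picture:

fn demonstrate_oid_change(conn: &mut PgConnection) -> QueryResult<()> {
    let before = lookup_type_oid(conn, "sled_resource_kind")?;

    // ... the DROP TYPE / CREATE TYPE migration above runs here ...

    let after = lookup_type_oid(conn, "sled_resource_kind")?;
    // Same name, different identity: anything still holding `before.oid`
    // now refers to a type that no longer exists.
    assert_ne!(before.oid, after.oid);
    Ok(())
}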

More precisely:

1. When each Nexus instance established its database connections, the proactive cache fill recorded the OID of the old sled_resource_kind type on every connection.
2. The migration dropped that type and created a new one with the same name but a different OID.
3. Subsequent queries serialized bind parameters using the stale OID, and the server rejected them with errors like "type with ID 218 does not exist".

How do we fix this?

Arguably, this is a Diesel bug. We believe this problem can be seen in Diesel outside of any Oxide software. We plan to test this and file an upstream bug. It's not clear how you'd fix it, though.

In practice, it should be easier to deal with in our case. Hitting this problem requires a schema change while Nexus is running, and we control when schema changes happen. The DataStore already tries to prevent Nexus at large from interacting with the database until the database is known to match the schema that this Nexus was built for.

Why doesn't that protect us here? Because the cache filling happens inside the connection pool, before the DataStore has even been constructed. That's well before Nexus has even checked what the current version of the schema is. That means the cache is being populated with values from some arbitrarily old schema version and then not updated once the schema is brought up to date.

Right now, there are several layers between a database connection and the DataStore: Diesel's PgConnection at the bottom, wrapped in a DTraceConnection, handed out by the bb8 connection pool, with the DataStore on top.

These were intended to be layered atop each other: everything above the raw connection goes through the pool, and everything in Nexus goes through the DataStore, which is supposed to gate database access on the schema-version check.

Critically, the cache fill implemented today happens each time bb8 (the connection pool) establishes a new connection. Right now, DataStore uses a connection from the pool in order to check the schema version -- thus violating our goal of not filling the cache until Nexus knows that it's looking at the correct schema version.

Fixing this is harder than it sounds. As explained above, DataStore is what knows about schema versions. But by the time you get to DataStore::new(), the pool has already been constructed, so connections may have already been established, and the cache may already have been filled (incorrectly).

Can Nexus somehow tell bb8 not to create any connections until it's ready? There doesn't appear to be any direct mechanism for this. It's possible that by using build_unchecked() (as opposed to build()), Nexus is already getting this behavior. At best, this is implicit -- build_unchecked() is not documented to avoid establishing connections, only to not wait for connection establishment before returning. Even if this worked and we were willing to rely on implicit, undocumented behavior, DataStore would need some other way to get a database connection for the schema check. But by the time Nexus gets here, it has thrown away the configuration it would need to establish its own connection (and for good reason). We could consider using dedicated_connection() for this. Note that a dedicated connection may well get its own broken OID cache: it's not documented whether it does, but it appears that it would. That might actually be okay, since this code shouldn't use any of the affected types, but it sure seems dicey.

Can Nexus tell bb8 to close all of its open connections, establish new ones, and then continue working as normal? There does not appear to be an interface for doing that. We could maybe define a third level of wrapper around Connection (e.g., OurConnection<DTraceConnection<PgConnection>>) whose validity check (1) initially reports connections as valid (so that we can use one for the initial schema check), then (2) reports connections established before the schema check as invalid once the schema version has been verified (causing them to be closed), while (3) connections established after the check are always valid. The details seem tricky and the result seems baroque. A sketch of the idea follows.
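A minimal sketch of that idea, expressed with a generation counter rather than bb8's exact traits (all names here are hypothetical):

use std::sync::atomic::{AtomicU64, Ordering};

// Bumped once DataStore has verified the schema version.
static POOL_GENERATION: AtomicU64 = AtomicU64::new(0);

struct TrackedConnection {
    generation: u64,
    // ... the wrapped DTraceConnection<PgConnection> would live here ...
}

impl TrackedConnection {
    fn establish() -> Self {
        Self { generation: POOL_GENERATION.load(Ordering::SeqCst) }
    }

    // What the pool's validity hook would consult: connections from a
    // stale generation report themselves invalid and get re-established
    // (with a freshly filled OID cache).
    fn is_valid(&self) -> bool {
        self.generation == POOL_GENERATION.load(Ordering::SeqCst)
    }
}

// Called after the schema-version check succeeds.
fn invalidate_pre_check_connections() {
    POOL_GENERATION.fetch_add(1, Ordering::SeqCst);
}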

Maybe the problem is populating the cache when connections are established. Instead, what if we populate it the first time a connection is used by a DataStore consumer? Fortunately, almost all uses of any database connection go through one of two DataStore functions (pool_connection_authorized() or pool_connection_unauthorized()), and we could totally hook in there to populate the cache. The challenge: how do we ensure that we do this exactly once for each distinct connection? We could just do it every single time, but this could in principle be expensive (it's sort of like a HashMap clone that's O(number of database types)). We could check the cache for some known value; we'd probably want to use some made-up sentinel value. This is cheesy but could work. A sketch of the exactly-once bookkeeping appears below.
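A sketch of that bookkeeping, with a hypothetical wrapper type and fill function:

use std::sync::atomic::{AtomicBool, Ordering};

struct PoolConnection {
    // ... the real wrapped connection would live here ...
    oid_cache_filled: AtomicBool,
}

impl PoolConnection {
    // Called from pool_connection_authorized() / pool_connection_unauthorized().
    fn ensure_oid_cache_filled(&self) {
        // swap() returns the previous value, so the fill runs exactly once
        // per connection even if callers race.
        if !self.oid_cache_filled.swap(true, Ordering::SeqCst) {
            fill_oid_cache(self);
        }
    }
}

// Hypothetical stand-in for whatever actually populates Diesel's
// per-connection metadata cache.
fn fill_oid_cache(_conn: &PoolConnection) {}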

Can something just clear or reset the cache once DataStore has determined that it's looking at the correct schema version? Diesel does not provide an explicit way to clear the cache or remove individual entries, but it does make a mutable reference available, and we could swap in an empty cache to clear it, then populate whatever entries we want. But we need to do this for each connection, ideally exactly once -- the same problem as above. A sketch follows.
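In sketch form: PgMetadataCache itself is public in diesel 2.x's diesel::pg module, so assuming we can borrow the connection's cache mutably (the open question above), resetting it is just swapping in an empty one:

use diesel::pg::PgMetadataCache;

fn reset_oid_cache(cache: &mut PgMetadataCache) {
    // Drop every cached name -> OID entry; subsequent lookups (or an
    // explicit re-fill) will repopulate it from the current schema.
    *cache = PgMetadataCache::new();
}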

These are the options we discussed; others are likely possible, too. None is particularly appealing. That's why, for the time being, we're disabling the proactive cache fill to preserve correctness while we evaluate which, if any, of these options makes the most sense.

Note that by disabling our proactive cache fill, we aren't rigorously eliminating this problem. We'd still get the wrong behavior if some code manages to use a type that changes across the schema update before the update happens. It's just a lot less likely, because the code that runs prior to the schema update is (by design) much more limited. To really eliminate the problem, we'd need to fix all per-connection caches immediately after any schema update.

davepacheco commented 6 months ago

References: diesel-rs/diesel#2088 and diesel-rs/diesel#2071

This problem appears to be not entirely unknown upstream. Quoting from those issues:

    On master we've implement a caching solution so there is only a single lookup per type and connection. This makes it much faster, but can lead to incorrect results if someone switches the schema at runtime.

davepacheco commented 6 months ago

Resolved by #5565. If we decide to re-add this, hopefully the context above is useful.