Closed davepacheco closed 6 months ago
Synopsis: Any time the database OID for a particular type name changes, as happened with the schema migration in #5287, queries executed by any running Nexus instance that use that type will fail with a 500 error or equivalent. This impact is pretty severe.
This problem was introduced in #4735 and recently triggered by #5287. Other schema migrations that change the OID for a type name will trigger this again.
Workaround: Restart all Nexus instances.
Plan: Fixing this properly is non-trivial. More on this below. For now, we're going to disable the proactive cache fill, which should mitigate the problem in practice at the expected cost of an extra second or so of wall clock time for Nexus integration tests that use the database. See #5565.
Props to @andrewjstone for foreseeing this class of problem.
Database users typically refer to tables, views, and types by their name. For example, Omicron uses an enum type called `sled_resource_kind`. Internally, CockroachDB (like PostgreSQL) maps these names to OIDs, or object identifiers.
While people usually write SQL that refers to the names, the OIDs are used explicitly in some client-server interactions. In particular, they appear to be used in the execution of prepared statements. Concretely, when you run this Rust code:
```rust
diesel::insert_into(dsl::sled_resource)
    .values(SledResource {
        id: Uuid::new_v4(),
        sled_id: Uuid::new_v4(),
        kind: SledResourceKind::Instance,
        ...
    })
```
Diesel appears to execute that as a prepared statement. That means that instead of stringifying values like the UUIDs or `SledResourceKind::Instance` directly into the SQL and sending the whole SQL string to the server, it first sends what is basically a SQL template to the server that looks like this:
```sql
INSERT INTO sled_resource (id, sled_id, kind, ...) VALUES ($1, $2, $3, ...)
```
The server recognizes `$1`, `$2`, etc. as placeholders that will be filled in later, whenever the prepared statement gets executed. Those values are called bind parameters. (Prepared statements are an important mechanism for avoiding SQL injection because the values never get into the SQL string.)
Diesel then executes the statement that it just prepared, supplying the bind parameters (`$1` will come from `Uuid::new_v4()`, `$3` will come from `SledResourceKind::Instance`, etc.). In serializing the values, it appears to include each type's OID.
All of this means that although people don't have to deal with OIDs, clients may be aware of them. In order to execute the above `INSERT`, Diesel needs to know the OID for the `sled_resource_kind` enum. This can be queried from the database -- there's an explicit `pg_type` table that includes the name and OID of all known types.
Now, rather than querying the OIDs every time they're needed, Diesel caches the mapping between type name and OID. (Uh oh.)
Beyond that, Nexus proactively populates this cache with the OIDs for all type names in the `omicron` database -- specifically to avoid one-off queries to populate each one in the cache lazily, when it's first used. This cache is per-connection, and it's populated every time a new connection is established.
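A simplified model of that proactive fill, assuming nothing about Diesel's actual cache internals: the server's `pg_type` catalog maps names to OIDs, and each new connection copies the whole mapping into a private cache at connect time.

```rust
use std::collections::HashMap;

// Illustrative names only -- this is not Diesel's or Nexus's real API.
type Catalog = HashMap<String, u32>; // server-side: type name -> OID

struct Connection {
    // Per-connection cache, filled once when the connection is established.
    oid_cache: HashMap<String, u32>,
}

impl Connection {
    fn establish(catalog: &Catalog) -> Self {
        // One up-front bulk copy instead of a lazy lookup per type.
        Connection { oid_cache: catalog.clone() }
    }
}

fn main() {
    let mut catalog = Catalog::new();
    catalog.insert("sled_resource_kind".to_string(), 218);

    let conn = Connection::establish(&catalog);
    assert_eq!(conn.oid_cache.get("sled_resource_kind"), Some(&218));
}
```

The cache is a snapshot: nothing ever refreshes it for the lifetime of the connection, which is exactly the property that goes wrong below.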
Okay, so:
The problem is that the mapping between a type name and its OID can change. In #5287, we wanted to remove one of the variants of a database enum. You can't do this directly, so the migration first dropped the type and then added it back with fewer variants. To do this, it had to first remove the column from the table that was using it and then re-add it later. So the whole sequence looks like this:
```sql
ALTER TABLE omicron.public.sled_resource DROP COLUMN IF EXISTS kind;
DROP TYPE IF EXISTS omicron.public.sled_resource_kind;
CREATE TYPE IF NOT EXISTS omicron.public.sled_resource_kind AS ENUM ('instance');
ALTER TABLE omicron.public.sled_resource
    ADD COLUMN IF NOT EXISTS
    kind omicron.public.sled_resource_kind
    NOT NULL
    DEFAULT 'instance';
ALTER TABLE omicron.public.sled_resource ALTER COLUMN kind DROP DEFAULT;
```
After dropping `sled_resource_kind` and adding a new type called `sled_resource_kind`, the new type will have a different OID than the previous one. This means that if Diesel had cached the OID for `sled_resource_kind` before all this, then after all this, that cache entry would refer to an OID for a type that no longer exists. That's exactly what happened here.
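A toy simulation of that failure mode (the `Server` type and OID numbering are invented for illustration): the server allocates a fresh OID for every `CREATE TYPE`, so drop-and-recreate under the same name leaves any previously cached OID dangling.

```rust
use std::collections::HashMap;

// Toy stand-in for the database's type catalog.
struct Server {
    next_oid: u32,
    types: HashMap<String, u32>, // name -> current OID
}

impl Server {
    fn create_type(&mut self, name: &str) -> u32 {
        let oid = self.next_oid;
        self.next_oid += 1;
        self.types.insert(name.to_string(), oid);
        oid
    }

    fn drop_type(&mut self, name: &str) {
        self.types.remove(name);
    }

    fn lookup_by_oid(&self, oid: u32) -> Result<&str, String> {
        self.types
            .iter()
            .find(|&(_, &o)| o == oid)
            .map(|(name, _)| name.as_str())
            .ok_or_else(|| format!("type with ID {oid} does not exist"))
    }
}

fn main() {
    let mut server = Server { next_oid: 218, types: HashMap::new() };
    let cached_oid = server.create_type("sled_resource_kind"); // client caches 218

    // The schema migration: drop and re-create the type under the same name.
    server.drop_type("sled_resource_kind");
    let new_oid = server.create_type("sled_resource_kind");
    assert_ne!(cached_oid, new_oid);

    // Using the stale cached OID now fails, matching the reported error.
    let err = server.lookup_by_oid(cached_oid).unwrap_err();
    assert_eq!(err, "type with ID 218 does not exist");
}
```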
More precisely:

1. Nexus starts up and establishes database connections. For each one, it queries `pg_type` and populates the per-connection Diesel cache with a mapping from all known types to their OIDs. Note that at this point, Nexus is operating against a database schema version that it doesn't necessarily understand, but it doesn't know that yet.
2. The schema migration runs. The OID for `sled_resource_kind` has changed, and the previous OID for this type name no longer corresponds to any type. But Nexus/Diesel don't know this -- their cache is wrong.
3. Nexus tries to insert a row into `sled_resource`. It serializes a value of type `SledResourceKind` with an OID that the database no longer knows about. We get the error reported initially: `type with ID 218 does not exist`.

Arguably, this is a Diesel bug. We believe this problem can be seen in Diesel outside of any Oxide software. We plan to test this and file an upstream bug. It's not clear how you'd fix it, though.
In practice, it should be easier for us to deal with in our case. Hitting this problem requires a schema change while Nexus is running. And we control when schema changes happen. The `DataStore` already tries to prevent Nexus at large from interacting with the database until the database is known to match the schema that this Nexus was built for.
Why doesn't that protect us here? Because the cache filling happens inside the connection pool, before the `DataStore` has even been constructed. That's well before Nexus has even checked what the current version of the schema is. That means the cache is being populated with values from some arbitrarily old schema version and then not updated once the schema is brought up to date.
Right now, there are several layers between a database connection and the `DataStore`:

- A `PgConnection` that impls diesel's `Connection` trait.
- A `DTraceConnection` that wraps the `PgConnection` and also impls `Connection`.
- A `bb8::Pool` connection pool that wraps several connections.
- A `nexus_db_queries::db::pool::Pool` that's intended to abstract the bb8 pool so that we can swap bb8 out with something else later.
- A `nexus_db_queries::db::datastore::DataStore` that uses this latter `Pool` type and provides application-specific database methods. `DataStore::new()` is what checks the schema version and waits until it matches what's expected before returning.

These were intended to be layered atop each other:

- Consumers construct a `Pool`, possibly using their own customized `bb8::Builder`.
- They then hand that `Pool` to `DataStore::new()`.

Critically, the cache fill implemented today happens each time `bb8` (the connection pool) establishes a new connection. Right now, `DataStore` uses a connection from the pool in order to check the schema version -- thus violating our goal of not filling the cache until Nexus knows that it's looking at the correct schema version.
Fixing this is harder than it sounds. As explained above, `DataStore` is what knows about schema versions. But by the time you get to `DataStore::new()`, the pool has already been constructed, so connections may have already been established, and the cache may already have been filled (incorrectly).
Can Nexus somehow tell `bb8` not to create any connections until it's ready? There doesn't appear to be any direct mechanism for this. It's possible that by using `build_unchecked()` (as opposed to `build()`), Nexus is already getting this behavior. At best, this is implicit -- it's not documented to not establish connections, only that it doesn't wait for connection establishment before returning. Even if this worked and we were willing to rely on implicit, undocumented behavior, `DataStore` would need some other way to get a database connection that it could use for the schema check. But again, by the time Nexus got here, it has thrown away the configuration that it'd need to establish its own (and for good reason). We could consider using `dedicated_connection()` for this. Note that a dedicated connection may well get its own broken OID cache. It's not documented whether or not it does, but it appears that it would. That might actually be okay, since this code shouldn't use any of those types, but it sure seems dicey.
Can Nexus tell `bb8` to close all of its open connections, establish new ones, and then continue working as normal? There does not appear to be an interface for doing that. We could maybe define a third level of wrapper around `Connection` (e.g., `OurConnection<DTraceConnection<PgConnection>>`) whose `is_valid` impl initially returns `true` (so that we can use it for the initial check), then returns `false` after the schema version has been verified (causing those connections to be closed), with connections established after the schema version has been checked returning `true`. The details seem tricky and the result seems baroque.
Maybe the problem is populating the cache when connections are established. Instead, what if we populated it the first time a connection is used by a `DataStore` consumer? Fortunately, almost all uses of any database connection go through one of two `DataStore` functions (`pool_connection_authorized()` or `pool_connection_unauthorized()`), and we could hook in at that point to populate the cache. The challenge here is: how do we ensure that we do this exactly once for each distinct connection? We could just do it every single time, but this could in principle be expensive (it's sort of like a HashMap clone with O(number of database types) entries). We could check the cache for some known value -- we'd probably want to use some made-up sentinel value. This is cheesy but could work.
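The sentinel trick can be sketched like this (the sentinel name and `populate_once` helper are made up for illustration): on each checkout, look for a marker entry in the connection's cache; if it's absent, populate the cache and insert the marker so subsequent checkouts are no-ops.

```rust
use std::collections::HashMap;

// Hypothetical marker key that would never collide with a real type name.
const SENTINEL: &str = "__omicron_cache_populated__";

// Returns true if this call actually populated the cache.
fn populate_once(cache: &mut HashMap<String, u32>, catalog: &HashMap<String, u32>) -> bool {
    if cache.contains_key(SENTINEL) {
        return false; // already populated for this connection
    }
    cache.extend(catalog.iter().map(|(k, v)| (k.clone(), *v)));
    cache.insert(SENTINEL.to_string(), 0);
    true
}

fn main() {
    let mut catalog = HashMap::new();
    catalog.insert("sled_resource_kind".to_string(), 219);

    let mut conn_cache = HashMap::new();
    assert!(populate_once(&mut conn_cache, &catalog));  // first checkout: fills
    assert!(!populate_once(&mut conn_cache, &catalog)); // later checkouts: cheap no-op
    assert_eq!(conn_cache.get("sled_resource_kind"), Some(&219));
}
```

The per-checkout cost drops to one cache lookup, at the price of smuggling a fake entry into a cache we don't own.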
Can something just clear or reset the cache once `DataStore` has determined that it's looking at the correct schema version? Diesel does not provide an explicit way to clear the cache or remove individual entries, but it does make a mutable reference available, and we could replace the cache with an empty one to clear it. Then we could populate whatever entries we want. But we need to do this for each connection, ideally exactly once -- this is the same problem as above.
These are the options we discussed. Others are likely possible, too. None is particularly appealing. That's why, for the time being, we're disabling this proactive cache fill behavior to preserve correctness while we evaluate which, if any, of these options makes the most sense.
Note that by disabling our proactive cache fill, we aren't rigorously eliminating this problem. We'd still get the wrong behavior if some code manages to use a type that changes across the schema update before the schema update happens. It's just a lot less likely because the code that runs prior to the schema update is (by design) much more limited. To really eliminate the problem, we'd need to fix all per-connection caches immediately after any schema update.
References: diesel-rs/diesel#2088 and diesel-rs/diesel#2071
This problem appears not entirely unknown:

> On master we've implement a caching solution so there is only a single lookup per type and connection. This makes it much faster, but can lead to incorrect results if someone switches the schema at runtime.
Resolved by #5565. If we decide to re-add this, hopefully the context above is useful.
We updated dogfood today and ran into a new problem with instance provisions:
After some digging, it appears that Nexus's OID caching is hanging onto a stale value that was changed by the schema migration. Namely, some sequence like this happens during the upgrade:
1. Nexus caches the mapping of type names (like `sled_resource_kind`) to their numeric database OID.
2. The schema migration drops and re-creates the `sled_resource_kind` enum. This invalidates the cache entry because now the name `sled_resource_kind` points to a different OID.
3. When Nexus later uses the cached OID (e.g., in `INSERT` statements), it uses the old OID, which does not correspond to any existing type any more, and we get this error from the database.

The workaround is to restart Nexus instances after this happens because when they come back up they will re-populate their cache with the correct value. The real fix will be to somehow invalidate this cache after schema migrations but we're still figuring out how to do that.