oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0
252 stars 40 forks source link

how does diesel report network errors during a query? #263

Open davepacheco opened 3 years ago

davepacheco commented 3 years ago

While working on #262, I was looking at the Diesel error types, and I'm a little worried about what happens if there's a network error while a query is outstanding. That would include a fatal error on the socket (e.g., ETIMEDOUT or ECONNRESET) encountered either while attempting to write the request (this is one of the harder ones to test) or while attempting to read response data back. It could also include protocol violations resulting from sockets gracefully closed (e.g., because CockroachDB crashed) at a point that's not legal for the protocol.

I'm worried because diesel::Result::Error claims to cover "all the ways that a query can fail". None of the variants looks suitable to these network issues. The closest one is "DatabaseError", but that's documented as "the database returned an error" -- that's not what will happen in this case. And if we look at the variants of that (well, really DatabaseErrorKind), again, it's not clear which one would fit. In the main branch there are a few other variants, including ClosedConnection, but the docs say "This error is only detected for PostgreSQL and is emitted on a best-effort basis and may be missed." I'm not sure what that means.

smklein commented 3 years ago

Thanks for filing, knowing what to expect will be useful for these situations.

Within Diesel, the Postgres Connection subdirectory seems most relevant for this class of errors. In particular, while executing batch_execute, the error translation from "raw Postgres error" to diesel::result::Error occurs - this seems to be the spot where the "output of libpq is parsed into a structured rust type".

In particular, the DatabaseErrorKind::ClosedConnection seems like it gets created here, if a particular connection closed message is found:

https://github.com/diesel-rs/diesel/blob/a51ebbd712afea1218b3ff417a24509ebab3674a/diesel/src/pg/connection/result.rs#L63-L67

IMO, this is the "right spot, and right structure" to capture this sort of error (immediately after accessing the underlying postgres layer), but the mechanism seems a little... hacky. Checking equality against a very specific string to distinguish between DatabaseErrorKind::Unknown and DatabaseErrorKind::ClosedConnection seems brittle.

Ultimately, it looks like Diesel is able to do what it can with the result of PQresultStatus from libpq plus PQresultErrorField, also from libpq - in particular when probing the PG_DIAG_SQLSTATE field.

The list of constants to check against exists here: https://docs.rs/libpq/1.3.3/libpq/state/index.html - I could patch diesel and make more of these translate into a ClosedConnection error as we see fit?

Regardless - this seems like the "raw interface to connection errors, and where Diesel translates them for Postgres".

smklein commented 3 years ago

Well, postgres makes me sad.

I tried modifying Diesel to match on these codes explicitly, instead of the string value.

I setup a postgres server to run SELECT pg_sleep(100);, and tried killing it (first with Control-C, but that propagates an admin_shutdown error, so I tried kill -9 instead).

Turns out, when a server crashes, no error code is returned, and the string is the only value Diesel has available to match on... which probably explains why they're doing this string matching in the first place.

davepacheco commented 3 years ago

Ultimately, it looks like Diesel is able to do what it can with the result of PQresultStatus from libpq plus PQresultErrorField, also from libpq - in particular when probing the PG_DIAG_SQLSTATE field.

I came to a similar conclusion -- we used to look at the SQLSTATE: https://github.com/oxidecomputer/omicron/blob/2f841804673e3e7254037f3da086279df0943c47/omicron-common/src/db.rs#L73-L97

I see that that's actually still in "main", but I don't think we're using it anywhere. Is omicron-common/src/db.rs just dead code? It looks like it.

smklein commented 3 years ago

https://github.com/oxidecomputer/omicron/pull/266 to remove dead code