oxidecomputer / omicron

Omicron: Oxide control plane

Nexus holds on to specific service backends of crdb and crucible-pantry for too long #3763

Open askfongjojo opened 1 year ago

askfongjojo commented 1 year ago

While testing service failover on rack2, I noticed that nexus held on to the same cockroachdb backend without attempting to use any of the other nodes in the 5-node database cluster, causing requests to fail until the one it favored came back.

The same happened with pantry requests for disk import blocks / bulk writes. I haven't gotten to the point of seeing the TTL being exhausted: I tried waiting for up to 5 minutes, and the request still couldn't succeed until the pantry zone Nexus had been using prior to its outage came back up.

smklein commented 1 year ago

I think ignoring the TTL makes sense (not that it's right, just that it makes sense) in the context of this code:

https://github.com/oxidecomputer/omicron/blob/a29f08ba1c0d943101f6811436c6e6715b5ce252/nexus/src/context.rs#L167-L195

Our API to interface with CockroachDB from Nexus involves "constructing a URL", and right now, we embed a single point-in-time address of CockroachDB into this URL.

We should probably avoid doing this, and embed the service name into the URL instead, so that it actually uses DNS during later lookups.
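
For illustration, the difference is roughly this (both URLs below are made-up examples, not values taken from the linked code):

// Today: a single point-in-time CockroachDB address is baked into the URL for
// the lifetime of the pool (the address here is illustrative).
const STATIC_URL: &str =
    "postgresql://root@[fd00:1122:3344:105::3]:32221/omicron?sslmode=disable";

// What we'd want: a service name that is re-resolved via internal DNS on later
// lookups, so the pool follows the service if it moves (hostname illustrative).
const DNS_URL: &str =
    "postgresql://root@cockroach.control-plane.oxide.test:32221/omicron?sslmode=disable";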

askfongjojo commented 1 year ago

Understood. The current behavior, in conjunction with #3613, results in a 1 in 15 probability that a user will not be able to use the system at all. I happened to be in this situation when my workstation always hit the one Nexus that was using the first CRDB backend I brought down to test failover. :(

jmpesp commented 1 year ago

The same happened with pantry requests for disk import blocks / bulk writes. I haven't gotten to the point of seeing the TTL being exhausted: I tried waiting for up to 5 minutes, and the request still couldn't succeed until the pantry zone Nexus had been using prior to its outage came back up.

In this case, there's no TTL: a specific pantry from the list of available ones is selected once and used for a disk until that disk is finalized.

One thing that could be done is to check whether the selected pantry is still responsive and choose another one if not, but it was important to keep as much code out of the import chunk hot path as possible, as any checks there will be multiplied by the number of chunks to import and slow down imports. That being said, slow imports are better than non-working imports :) I'll give this some thought.
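
To sketch that idea (names like is_responsive below are hypothetical, not the actual crucible-pantry client API), the probe would live in pantry selection rather than in the per-chunk upload path:

use std::net::SocketAddr;

// Hypothetical sketch: keep the pantry already chosen for a disk if it still
// responds, otherwise fall back to another candidate. The probe runs once per
// selection, not once per imported chunk.
async fn select_pantry(
    current: Option<SocketAddr>,
    candidates: &[SocketAddr],
) -> Option<SocketAddr> {
    if let Some(addr) = current {
        if is_responsive(addr).await {
            return Some(addr);
        }
    }
    for &addr in candidates {
        if Some(addr) != current && is_responsive(addr).await {
            return Some(addr);
        }
    }
    None
}

// Hypothetical health probe; a real check might hit the pantry's status
// endpoint instead of just opening a TCP connection.
async fn is_responsive(addr: SocketAddr) -> bool {
    tokio::net::TcpStream::connect(addr).await.is_ok()
}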

askfongjojo commented 1 year ago

Sorry for not being clear in pantry's case. I understand the need to stay with the same pantry for the same disk snapshot. But the issue I ran into is with new/separate snapshot requests still holding on to the unresponsive one.

luqmana commented 1 year ago

#3783 partially addresses this for CRDB. We now grab all the CockroachDB hosts via internal DNS at Nexus startup and add them all to the connection string. Whenever a new connection is established, it'll try the listed hosts in order and use the first one that successfully connects. While that's an improvement in that Nexus won't fail to serve requests if one CRDB instance goes down, there are still some issues:

  1. The list of hosts is resolved once and fixed at Nexus startup. If new CockroachDB zones get spun up, it'll take a restart of Nexus for it to pick them up. As-is, I don't think we're dynamically spinning up new services yet; we just create the service plan during RSS.
  2. Internal DNS returns records in the same order every time (couldn't find an existing issue, so I created one: #3805). This means every Nexus zone tries to connect to the set of CRDB hosts in the same order.

smklein commented 4 months ago

I'm poking at this issue again, now that we're looking at possibly expunging zones which are running CRDB nodes.

Where the System Currently Stands

CockroachDB nodes, when initializing, access internal DNS to find the IP addresses of other nodes within the cluster:

https://github.com/oxidecomputer/omicron/blob/cfa6bd94b544cf12fb6af2b54f1e8d1b25804231/smf/cockroachdb/method_script.sh#L13-L19

Nexus, when creating a connection to a pool, performs a "one-time DNS lookup" here:

https://github.com/oxidecomputer/omicron/blob/d80cd2926102034a3d5d4a0003e10b0f2a4eeb89/nexus/src/context.rs#L189-L224

So, specifically eyeing the case of "a single CockroachDB node fails, what do we do?":

  • The CockroachDB nodes themselves should be resilient. I've tested this manually with a three-node setup, and as expected, forcefully killing one node means that the others remain responsive to requests.
  • The Nexus connection pool may become unresponsive, depending on which IP address was returned out of the original DNS lookup.

What do we want to do

I've tried to dig into the CockroachDB and Postgres docs to explore our options in this area when constructing the postgresql:// URL from the Nexus side. The end-goal here would be to have an implementation where Nexus can easily migrate from querying one CockroachDB node to the next, as we continue removing old nodes and adding new ones.

Options

Specify a host address in the form of a hostname, use DNS to resolve

It should hopefully be possible to provide a hostname in the postgresql:// URL which identifies the CockroachDB service, and which uses our internal DNS server to resolve the name of the host.

As far as I can tell -- feedback welcome if folks see alternate pathways -- the mechanism to point postgres clients at a particular DNS server is by setting the resolv.conf file with the IP address of the desired internal DNS server.

For Nexus, this would mean pointing resolv.conf at the IP address of our internal DNS server.

Subsequent experimentation is necessary to determine how failures are propagated back to Nexus when some of these nodes die, and to identify if new nodes are actually made accessible.

Specify multiple hosts during nexus-side construction of postgresql:// URL

It is possible to specify multiple hostnames, according to libpq:

See: https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-MULTIPLE-HOSTS

Pros: Would give us redundancy from single-node failure.
Cons: Would not actually update the set of "all known CRDB nodes", so after enough failures, this would also fall over.

Experimentally, I spun up a three-node cluster, killed one, and observed via a psql shell that I could re-connect to another. However, it's also notable that this failure is visible to the client - my postgres client sent a request that failed (to the dead node) before reconnecting to a live one, and this process was not automated.
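
For reference, the multi-host form from the libpq docs looks roughly like this (hosts and ports here are illustrative):

// Illustrative libpq-style URL listing several CockroachDB nodes. libpq tries
// the hosts left to right and uses the first one that accepts a connection;
// the list itself is still fixed at the time the URL is built.
const MULTI_HOST_URL: &str =
    "postgresql://root@[::1]:7709,[::1]:7710,[::1]:7711/defaultdb?sslmode=disable";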

smklein commented 4 months ago

Here's the result of some of my experimentation. I'd really rather have a local test here - that's what I was trying to build out in https://github.com/oxidecomputer/omicron/pull/5628 - but it's quite difficult to create all this in an isolated environment, since changing the DNS server used by postgres relies on changing "system-wide" config in /etc/resolv.conf.

Setup

I spun up a DNS server within Omicron, sitting on port 53:

sudo ./target/debug/dns-server --http-address [::1]:0 --dns-address [::1]:53 --config-file $PWD/dns-server/examples/config.toml

I then spun up three CockroachDB nodes talking to each other within a cluster. These are all on localhost, on ports 7709, 7710, and 7711.

cockroach start --insecure --join [::1]:7709,[::1]:7710,[::1]:7711 --store /var/tmp/crdb1 --listen-addr [::1]:7709 --http-addr :0
cockroach start --insecure --join [::1]:7709,[::1]:7710,[::1]:7711 --store /var/tmp/crdb2 --listen-addr [::1]:7710 --http-addr :0
cockroach start --insecure --join [::1]:7709,[::1]:7710,[::1]:7711 --store /var/tmp/crdb3 --listen-addr [::1]:7711 --http-addr :0
cockroach init --insecure --host [::]:7709

Then I populated my internal DNS server with these records:

# SRV records
./target/debug/dnsadm -a "[::1]:45901" add-srv control-plane.oxide.test _cockroach._tcp 0 0 7709 c6cda479-5fde-49a0-a079-7c960022baff.host.control-plane.oxide.test
./target/debug/dnsadm -a "[::1]:45901" add-srv control-plane.oxide.test _cockroach._tcp 0 0 7710 ac33791c-62c6-43b0-bcbd-b15e7727b533.host.control-plane.oxide.test
./target/debug/dnsadm -a "[::1]:45901" add-srv control-plane.oxide.test _cockroach._tcp 0 0 7711 6eb5fbb1-fa70-4ee9-aabf-53c450e138f7.host.control-plane.oxide.test 

# AAAA records
./target/debug/dnsadm -a "[::1]:45901" add-aaaa control-plane.oxide.test ac33791c-62c6-43b0-bcbd-b15e7727b533.host ::1
./target/debug/dnsadm -a "[::1]:45901" add-aaaa control-plane.oxide.test 6eb5fbb1-fa70-4ee9-aabf-53c450e138f7.host ::1
./target/debug/dnsadm -a "[::1]:45901" add-aaaa control-plane.oxide.test c6cda479-5fde-49a0-a079-7c960022baff.host ::1

Next, I added my nameserver running on localhost to /etc/resolv.conf:

# added to `resolv.conf`
nameserver ::1

To check that DNS is up and running, I used dig:

 $ dig _cockroach._tcp.control-plane.oxide.test

; <<>> DiG 9.18.18-0ubuntu0.22.04.2-Ubuntu <<>> _cockroach._tcp.control-plane.oxide.test
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 38788
;; flags: qr rd; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 3
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;_cockroach._tcp.control-plane.oxide.test. IN A

;; ANSWER SECTION:
_cockroach._tcp.control-plane.oxide.test. 0 IN SRV 0 0 7709 c6cda479-5fde-49a0-a079-7c960022baff.host.control-plane.oxide.test.
_cockroach._tcp.control-plane.oxide.test. 0 IN SRV 0 0 7710 ac33791c-62c6-43b0-bcbd-b15e7727b533.host.control-plane.oxide.test.
_cockroach._tcp.control-plane.oxide.test. 0 IN SRV 0 0 7711 6eb5fbb1-fa70-4ee9-aabf-53c450e138f7.host.control-plane.oxide.test.

;; ADDITIONAL SECTION:
c6cda479-5fde-49a0-a079-7c960022baff.host.control-plane.oxide.test. 0 IN AAAA ::1
ac33791c-62c6-43b0-bcbd-b15e7727b533.host.control-plane.oxide.test. 0 IN AAAA ::1
6eb5fbb1-fa70-4ee9-aabf-53c450e138f7.host.control-plane.oxide.test. 0 IN AAAA ::1

;; Query time: 0 msec
;; SERVER: ::1#53(::1) (UDP)
;; WHEN: Mon Apr 29 14:04:58 PDT 2024
;; MSG SIZE  rcvd: 400

Which looks like I'd expect -- I'm seeing those 7709 - 7711 ports in the SRV records, and a bunch of references to ::1 in the AAAA records.

I can use the Cockroach shell to connect directly to a node via IP:

cockroach sql --url "postgresql://root@[::1]:7709?sslmode=disable"

However, using a hostname appears to be hitting issues:

 $ cockroach sql --url "postgresql://root@_cockroach._tcp.control-plane.oxide.test?sslmode=disable"
#
# Welcome to the CockroachDB SQL shell.
# All statements must be terminated by a semicolon.
# To exit, type: \q.
#
ERROR: cannot dial server.
Is the server running?
If the server is running, check --host client-side and --advertise server-side.

dial tcp: lookup _cockroach._tcp.control-plane.oxide.test: no such host
Failed running "sql"
luqmana commented 4 months ago

Resolving using SRV records doesn't work with the cli (https://github.com/cockroachdb/cockroach/issues/64439).

But you should be able to have the nodes discover each other via DNS without explicitly listing them out for --join (though that's perhaps behind another flag --experimental-srv-dns depending on the version?) But afaik that's just limited to the initial bootstrapping and unsure how it deals with the set changing at runtime.

smklein commented 4 months ago

Resolving using SRV records doesn't work with the cli (cockroachdb/cockroach#64439).

But you should be able to have the nodes discover each other via DNS without explicitly listing them out for --join (though that's perhaps behind another flag --experimental-srv-dns depending on the version?) But afaik that's just limited to the initial bootstrapping and unsure how it deals with the set changing at runtime.

Thanks for the pointer, I'll look into this flag! To be clear, that would be for CockroachDB nodes to connect to each other using DNS, right?

Just being clear that it's distinct from any attempts by e.g. Nexus to use a libpq client to connect to CockroachDB

luqmana commented 4 months ago

Correct


smklein commented 4 months ago

Here's my follow-up -- I'm digging into libpq to try to figure out why DNS resolution isn't happening from Nexus, or any of the psql / cockroachdb sql clients:

https://github.com/postgres/postgres/blob/dd0183469bb779247c96e86c2272dca7ff4ec9e7/src/interfaces/libpq/fe-connect.c#L872

It appears the PQconnectStart method parses connection options, then calls pqConnectDBStart to actually try connecting to the backend. This itself calls PQconnectPoll after setting up some config options (most importantly, setting status to CONNECTION_NEEDED).

I'm pretty sure I'm bottoming out here, because this matches the error I was seeing:

https://github.com/postgres/postgres/blob/dd0183469bb779247c96e86c2272dca7ff4ec9e7/src/interfaces/libpq/fe-connect.c#L2737-L2747

Under the hood, this appears to be calling getaddrinfo. I believe this is only compatible with A/AAAA records and cannot properly parse SRV records -- this matches my local experiments, where I spun up a small C program and could resolve the "node name" c6cda479-5fde-49a0-a079-7c960022baff.host.control-plane.oxide.test, but not _cockroach._tcp.control-plane.oxide.test. This does, admittedly, defeat the point of doing the DNS lookup, since that AAAA record would presumably change when the CRDB service is provisioned to a new node.

Out of curiosity, I dug into tokio-postgres too, to see if they did any better than libpq:

https://github.com/sfackler/rust-postgres/blob/98f5a11bc0a8e451552d8941ffa078c7eb6cd60c/tokio-postgres/src/connect.rs#L98

This calls into https://docs.rs/tokio/latest/tokio/net/fn.lookup_host.html, which also appears (through local testing) to "work with AAAA records when you also supply the port, but not with SRV records". This pretty much matches the getaddrinfo client issues libpq is facing.
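
For comparison, here's a rough sketch of doing the SRV step ourselves and only handing resolved host:port pairs to a postgres client, using the hickory-resolver crate (formerly trust-dns). The names and the DNS server address come from the experiment above; this is a sketch, not the code Nexus actually uses:

use hickory_resolver::config::{NameServerConfigGroup, ResolverConfig, ResolverOpts};
use hickory_resolver::TokioAsyncResolver;
use std::net::{IpAddr, Ipv6Addr};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Point the resolver at the internal DNS server from the experiment above
    // ([::1] port 53) rather than whatever /etc/resolv.conf says.
    let nameservers =
        NameServerConfigGroup::from_ips_clear(&[IpAddr::V6(Ipv6Addr::LOCALHOST)], 53, true);
    let config = ResolverConfig::from_parts(None, vec![], nameservers);
    let resolver = TokioAsyncResolver::tokio(config, ResolverOpts::default());

    // The SRV lookup that getaddrinfo / tokio::net::lookup_host won't do for us.
    let srv = resolver
        .srv_lookup("_cockroach._tcp.control-plane.oxide.test.")
        .await?;
    for record in srv.iter() {
        // Each SRV record carries a target name and port; resolve the target's
        // AAAA/A records to get something a postgres client can actually dial.
        let ips = resolver.lookup_ip(record.target().to_utf8()).await?;
        for ip in ips.iter() {
            println!("cockroach node at [{}]:{}", ip, record.port());
        }
    }
    Ok(())
}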

smklein commented 4 months ago

I'm kinda getting the sense we have two options for a path forward here:

  1. Nexus resolves IP addresses by doing the SRV lookup itself. This is, more or less, what we're doing today. To tolerate CockroachDB node failures, however, we should strengthen this approach a bit by providing multiple hosts (one per CRDB node) in the connection string, and periodically refreshing our connections to postgres with an updated set of hosts that we re-acquire from DNS, again by doing the SRV lookups ourselves (see the sketch after this list).

There are some drawbacks to this approach that we would need to work through.

  2. We patch libpq to add lookup via SRV. We aren't actually the first ones to propose this, but I think it would require us to do some upstream work, which may or may not be accepted, and would definitely introduce a delay to deployment.
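
As a rough sketch of what option 1 could look like (the trait and function names here are hypothetical; how the refreshed host set actually gets applied depends on the pool/URL changes discussed below):

use std::net::SocketAddr;
use std::time::Duration;

// Hypothetical: whatever mechanism ends up existing for swapping the set of
// CockroachDB hosts used for new connections (e.g. rebuilding the URL).
trait UpdatePool {
    fn set_hosts(&self, hosts: Vec<SocketAddr>);
}

// Hypothetical: SRV + AAAA resolution against internal DNS, along the lines of
// the hickory-resolver sketch earlier in this issue.
async fn resolve_cockroach_hosts() -> Vec<SocketAddr> {
    unimplemented!("SRV lookup of _cockroach._tcp against internal DNS")
}

// Periodically re-resolve the CockroachDB SRV records and hand the fresh host
// list to the pool. A real version would be driven by an RPW rather than a
// fixed timer, so DNS is only queried when the set is known to have changed.
async fn refresh_crdb_hosts(pool: &dyn UpdatePool) {
    loop {
        let hosts = resolve_cockroach_hosts().await;
        if !hosts.is_empty() {
            pool.set_hosts(hosts);
        }
        tokio::time::sleep(Duration::from_secs(60)).await;
    }
}
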
luqmana commented 4 months ago

Currently, the connection manager APIs used by Nexus (and Diesel) are static -- the postgresql:// URL is supplied once, and exists for the duration of the connection pool. To implement something like this on top of such an API, I think we'd need to create a new connection pool and switch new connections over to it. Alternatively, we could modify the connection pooling libraries to change the URL which we use for connections dynamically, and internalize this detail to the pool itself.

Right, last I looked it involved changes across multiple crates:

We create an async-bb8-diesel::ConnectionManager with a fixed database URL string. That wraps diesel::r2d2::ConnectionManager and implements bb8::ManageConnection. Inside that is the blocking call to the underlying r2d2::ManageConnection::connect, which is implemented directly in Diesel and calls Connection::establish with the fixed URL we created at startup.
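
A stripped-down illustration of why the URL ends up frozen, sketched against bb8 0.8's trait shape (this is a standalone toy, not the actual async-bb8-diesel/diesel code):

use async_trait::async_trait;

// Stand-ins for the real Diesel connection and error types.
struct DbConnection;
#[derive(Debug)]
struct ConnectError;

// The manager captures the URL once; bb8 then calls `connect` whenever the
// pool needs a new connection, so every connection uses that same fixed URL.
struct StaticUrlManager {
    database_url: String,
}

#[async_trait]
impl bb8::ManageConnection for StaticUrlManager {
    type Connection = DbConnection;
    type Error = ConnectError;

    async fn connect(&self) -> Result<Self::Connection, Self::Error> {
        // The real chain bottoms out in Connection::establish(&self.database_url).
        let _ = &self.database_url;
        Ok(DbConnection)
    }

    async fn is_valid(&self, _conn: &mut Self::Connection) -> Result<(), Self::Error> {
        Ok(())
    }

    fn has_broken(&self, _conn: &mut Self::Connection) -> bool {
        false
    }
}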

smklein commented 4 months ago

I'll see if I can modify Diesel and async-bb8-diesel to make that URL mutable. That seems like it would help us form a foundation to modify things from nexus, if we are unable to rely on libpq to act as a DNS client on our behalf.

smklein commented 4 months ago

Okay, seems possible to let the database URL be mutable from Nexus. That would at least let Nexus update the "set of valid CRDB nodes" with the latest info it knows about.

See: https://github.com/oxidecomputer/async-bb8-diesel/pull/62 , https://github.com/diesel-rs/diesel/pull/4005

luqmana commented 4 months ago

Nice! One thing, though, is that we'd need to explicitly call that for every Nexus every time the set of CRDB nodes changes, compared to changing connect to somehow get the URL dynamically, where we could just query DNS.

smklein commented 4 months ago

Nice! One thing, though, is that we'd need to explicitly call that for every Nexus every time the set of CRDB nodes changes, compared to changing connect to somehow get the URL dynamically, where we could just query DNS.

Totally, but this was my takeaway from reading libpq's source: someone on the client-side needs to make a decision to go query DNS, get a new set of nodes, and update the set of IPs we end up talking to. If libpq handled this for us, that would be nice, but if it doesn't, I think it ends up being functionally somewhat similar for Nexus to take this responsibility too.

Arguably, I think Nexus could be more optimal here, since it can perform this action as a downstream operation of an RPW (Reliable Persistent Workflow) to avoid unnecessary queries to DNS until the set is known to have changed.

davepacheco commented 4 months ago

This use case is really similar to the case of managing HTTP clients for our internal services: there's DNS resolution that ideally would be decoupled from connection establishment, and connection establishment that would be decoupled from usage. This is another case where we'll want a more sophisticated connection management component, which likely means building our own pool to better control the behavior here. We basically decided in the update call today to pursue this, so the rest of this might be moot. But in case it's useful, here are responses to a few things above.


So, specifically eyeing the case of "a single CockroachDB node fails, what do we do?":

  • The CockroachDB nodes themselves should be resilient. I've tested this manually with a three-node setup, and as expected, forcefully killing one node means that the others remain responsive to requests.
  • The Nexus connection pool may become unresponsive, depending on which IP address was returned out of the original DNS lookup.

This is a good summary. The CockroachDB part (in the start method) was explicitly designed this way in order to stay in sync with the latest cluster topology and I'd be more surprised if it didn't work! But we've known from the start that there was at least some work to be done to make the pool survive database failures better, and likely we were going to have to do our own pool.


Resolving using SRV records doesn't work with the cli (cockroachdb/cockroach#64439). ... Under the hood, this appears to be calling getaddrinfo. I believe this is only compatible with A/AAAA records, and cannot properly parse SRV records

For what it's worth, I've almost never seen non-application-specific components do SRV lookups.

But you should be able to have the nodes discover each other via DNS without explicitly listing them out for --join (though that's perhaps behind another flag --experimental-srv-dns depending on the version?) But afaik that's just limited to the initial bootstrapping and unsure how it deals with the set changing at runtime.

I think we've basically solved this part already (see above). We considered --experimental-dns-srv at the time but ruled it out, if for no other reason than that it was documented as unstable and subject to removal or change. I feel like I also found some problem with its behavior, but I do not remember what it was. Thinking out loud: such a facility would need to use the right DNS servers (have we set up /etc/resolv.conf correctly?) and deal well with those being down; and, importantly, if any of it fails, it should sit there waiting for the right records to show up, not just give up and exit. It may have been this last one that was the sticking point.


I'm kinda getting the sense we have two options for a path forward here:

  1. Nexus resolves IP addresses by doing the SRV lookup itself...
  2. We patch libpq to add lookup via SRV...

It makes sense to me that the DNS resolution and even TCP connection establishment behavior would ultimately be application-specific and would happen outside libpq. While it's no doubt convenient that clients like tokio_postgres allow you to specify a DNS name at all, that falls apart in situations like ours because there are so many policy choices hardcoded here that we'd probably want to change: which nameservers to use? how many are queried? with what concurrency? on failure, do we fail or retry? how many times? how often? Then most of those same questions apply again for TCP connection establishment. And the challenge is that we likely want to make these choices at a high layer, not deep in tokio_postgres when we're in the process of trying to establish a connection for a client that's currently waiting on it.

askfongjojo commented 3 months ago

In my most recent testing, the behavior for CRDB connection failures has changed (or perhaps this is now the expected behavior?). When the CRDB node in use went down, I saw messages like this in the Nexus logs:

17:42:35.701Z ERRO 2898657e-4141-4c05-851b-147bffc6bbbd (ServerContext): database connection error
    database_url = postgresql://root@[fd00:1122:3344:109::3]:32221,[fd00:1122:3344:105::3]:32221,[fd00:1122:3344:10b::3]:32221,[fd00:1122:3344:107::3]:32221,[fd00:1122:3344:108::3]:32221/omicron?sslmode=disable
    error_message = Connection error: server is shutting down
    file = nexus/db-queries/src/db/pool.rs:100

(Note: The crdb instance in use is not necessarily the first one listed; in my case, disabling the 105 instance is what triggered the error.)

Nexus was able to continue serving both read and update API requests in spite of the above error. It's possible that requests in flight against the CRDB node at the moment it went down could have failed, but all requests I made afterwards succeeded.

@davepacheco took a look at the code to understand what changed the behavior. Here are his comments:

I didn't think we'd changed anything here since https://github.com/oxidecomputer/omicron/issues/3763 was filed, but I see there was https://github.com/oxidecomputer/omicron/issues/3763#issuecomment-1659306277, which may have fixed this? Subsequent comments suggested it might not have, but it seems like it should have. Anyway, it's good that it worked.

smklein commented 2 months ago

Just to be explicit: https://github.com/oxidecomputer/omicron/pull/5876 will fix this issue for CockroachDB -- in that PR, we use https://github.com/oxidecomputer/qorb to access each CRDB node individually, and create a pool of connections for each one.