askfongjojo opened this issue 1 year ago
I think ignoring the TTL makes sense (not that it's right, just that it makes sense) in the context of this code:
Our API to interface with CockroachDB from Nexus involves "constructing a URL", and right now, we embed a single point-in-time address of CockroachDB into this URL.
We should probably avoid doing this, and embed the service name into the URL, so that it actually uses DNS during later lookups.
Understood. The current behavior, in conjunction with #3613, results in a 1 in 15 probability that a user will not be able to use the system at all. I happened to be in this situation when my workstation always used the one Nexus that had the first CRDB backend I brought down to test failover. :(
The same happened with pantry requests for disk import blocks / bulk writes. I haven't gotten to the point of seeing the TTL being exhausted. I tried waiting for up to 5 minutes and the requests still couldn't succeed until the pantry zone Nexus had been using prior to its outage came back up.
In this case, there's no TTL: a specific pantry from the list of available ones is selected once and used for a disk until that disk is finalized.
One thing that could be done is to check if the selected pantry is still responsive and choose another one if not, but it was important to keep as much code out of the import chunk hot-path as possible, as any checks there will be multiplied by the number of chunks to import and slow down imports. That being said, slow imports are better than non-working imports :) I'll give this some thought.
Sorry for not being clear in pantry's case. I understand the need to stay with the same pantry for the same disk snapshot. But the issue I ran into is with new/separate snapshot requests still holding on to the unresponsive one.
I'm poking at this issue again, now that we're looking at possibly expunging zones which are running CRDB nodes.
CockroachDB nodes, when initializing, access internal DNS to find the IP addresses of other nodes within the cluster:
Nexus, when creating a connection pool, performs a "one-time DNS lookup" here:
So, specifically eyeing the case of "a single CockroachDB node fails, what do we do?":
- The CockroachDB nodes themselves should be resilient. I've tested this manually with a three-node setup, and as expected, forcefully killing one node means that the others remain responsive to requests.
- The Nexus connection pool may become unresponsive, depending on which IP address was returned out of the original DNS lookup.
I've tried to dig into the CockroachDB and Postgres docs to explore our options in this area when constructing the postgresql:// URL from the Nexus side. The end-goal here would be to have an implementation where Nexus can easily migrate from querying one CockroachDB node to the next, as we continue removing old nodes and adding new ones.
It should hopefully be possible to provide a hostname to the postgresql:// URL which identifies the CockroachDB service, and which uses our internal DNS server to resolve the name of the host.
As far as I can tell -- feedback welcome if folks see alternate pathways -- the mechanism to point postgres clients at a particular DNS server is by setting the resolv.conf file with the IP address of the desired internal DNS server.
For Nexus, this would mean:
- Setting up an /etc/resolv.conf file which includes a nameserver entry pointing to the IP address of our internal DNS server.
- Changing the postgresql:// URL to refer to the CockroachDB service, and relying on the nameserver to perform translation.
Subsequent experimentation is necessary to determine how failures are propagated back to Nexus when some of these nodes die, and to identify if new nodes are actually made accessible.
It is possible to specify multiple hostnames in the postgresql:// URL, according to libpq.
See: https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-MULTIPLE-HOSTS
Pros: Would give us redundancy from single-node failure.
Cons: Would not actually update the set of "all known CRDB nodes", so after enough failures, this would also fall over.
Experimentally, I spun up a three-node cluster, killed one, and observed via a psql
shell that I could re-connect to another. However, it's also notable that this failure is visible to the client - my postgres client sent a request that failed (to the dead node) before reconnecting to a live one, and this process was not automated.
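For illustration, here's a rough sketch of what leaning on that multi-host fallback could look like from the Rust side. This is an assumption-heavy sketch, not code we run today: it assumes tokio-postgres' Config builder accepts repeated host()/port() calls (one port per host) and tries the hosts in order, and it points at three local nodes on ports 7709-7711 like the cluster used in the experiments below.

// Sketch only: multi-host fallback via tokio-postgres (assumed behavior:
// hosts are tried in order until one accepts the connection).
use tokio_postgres::{Config, NoTls};

#[tokio::main]
async fn main() -> Result<(), tokio_postgres::Error> {
    let mut cfg = Config::new();
    cfg.user("root").dbname("defaultdb");
    // Three candidate CockroachDB nodes; losing one shouldn't make the
    // client unusable, but note that nothing ever refreshes this set.
    for port in [7709u16, 7710, 7711] {
        cfg.host("::1");
        cfg.port(port);
    }
    let (client, connection) = cfg.connect(NoTls).await?;
    tokio::spawn(connection);
    // CockroachDB integers are INT8, hence i64.
    let row = client.query_one("SELECT 1", &[]).await?;
    let one: i64 = row.get(0);
    println!("connected, SELECT 1 returned {one}");
    Ok(())
}

This has the same cons as the libpq version: the client gets redundancy across the listed nodes, but nothing updates the list itself.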
Here's the result of some of my experimentation. I'd really rather have a local test here - that's what I was trying to build out in https://github.com/oxidecomputer/omicron/pull/5628 - but it's quite difficult to create all this in an isolated environment, since changing the DNS server used by postgres relies on changing "system-wide" config in /etc/resolv.conf.
I spun up a DNS server within Omicron, sitting on port 53:
sudo ./target/debug/dns-server --http-address [::1]:0 --dns-address [::1]:53 --config-file $PWD/dns-server/examples/config.toml
I then spun up three CockroachDB nodes talking to each other within a cluster. These are all on localhost, on ports 7709, 7710, and 7711.
cockroach start --insecure --join [::1]:7709,[::1]:7710,[::1]:7711 --store /var/tmp/crdb1 --listen-addr [::1]:7709 --http-addr :0
cockroach start --insecure --join [::1]:7709,[::1]:7710,[::1]:7711 --store /var/tmp/crdb2 --listen-addr [::1]:7710 --http-addr :0
cockroach start --insecure --join [::1]:7709,[::1]:7710,[::1]:7711 --store /var/tmp/crdb3 --listen-addr [::1]:7711 --http-addr :0
cockroach init --insecure --host [::]:7709
Then I populated my internal DNS server with these records:
# SRV records
./target/debug/dnsadm -a "[::1]:45901" add-srv control-plane.oxide.test _cockroach._tcp 0 0 7709 c6cda479-5fde-49a0-a079-7c960022baff.host.control-plane.oxide.test
./target/debug/dnsadm -a "[::1]:45901" add-srv control-plane.oxide.test _cockroach._tcp 0 0 7710 ac33791c-62c6-43b0-bcbd-b15e7727b533.host.control-plane.oxide.test
./target/debug/dnsadm -a "[::1]:45901" add-srv control-plane.oxide.test _cockroach._tcp 0 0 7711 6eb5fbb1-fa70-4ee9-aabf-53c450e138f7.host.control-plane.oxide.test
# AAAA records
./target/debug/dnsadm -a "[::1]:45901" add-aaaa control-plane.oxide.test ac33791c-62c6-43b0-bcbd-b15e7727b533.host ::1
./target/debug/dnsadm -a "[::1]:45901" add-aaaa control-plane.oxide.test 6eb5fbb1-fa70-4ee9-aabf-53c450e138f7.host ::1
./target/debug/dnsadm -a "[::1]:45901" add-aaaa control-plane.oxide.test c6cda479-5fde-49a0-a079-7c960022baff.host ::1
Next, I added my nameserver running on localhost to /etc/resolv.conf:
# added to `resolv.conf`
nameserver ::1
To check that DNS is up and running, I used dig:
$ dig _cockroach._tcp.control-plane.oxide.test
; <<>> DiG 9.18.18-0ubuntu0.22.04.2-Ubuntu <<>> _cockroach._tcp.control-plane.oxide.test
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 38788
;; flags: qr rd; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 3
;; WARNING: recursion requested but not available
;; QUESTION SECTION:
;_cockroach._tcp.control-plane.oxide.test. IN A
;; ANSWER SECTION:
_cockroach._tcp.control-plane.oxide.test. 0 IN SRV 0 0 7709 c6cda479-5fde-49a0-a079-7c960022baff.host.control-plane.oxide.test.
_cockroach._tcp.control-plane.oxide.test. 0 IN SRV 0 0 7710 ac33791c-62c6-43b0-bcbd-b15e7727b533.host.control-plane.oxide.test.
_cockroach._tcp.control-plane.oxide.test. 0 IN SRV 0 0 7711 6eb5fbb1-fa70-4ee9-aabf-53c450e138f7.host.control-plane.oxide.test.
;; ADDITIONAL SECTION:
c6cda479-5fde-49a0-a079-7c960022baff.host.control-plane.oxide.test. 0 IN AAAA ::1
ac33791c-62c6-43b0-bcbd-b15e7727b533.host.control-plane.oxide.test. 0 IN AAAA ::1
6eb5fbb1-fa70-4ee9-aabf-53c450e138f7.host.control-plane.oxide.test. 0 IN AAAA ::1
;; Query time: 0 msec
;; SERVER: ::1#53(::1) (UDP)
;; WHEN: Mon Apr 29 14:04:58 PDT 2024
;; MSG SIZE rcvd: 400
Which looks like I'd expect -- I'm seeing those 7709 - 7711 ports in the SRV records, and a bunch of references to ::1 in the AAAA records.
I can use the Cockroach shell to connect directly to a node via IP:
cockroach sql --url "postgresql://root@[::1]:7709?sslmode=disable"
However, using a hostname appears to be hitting issues:
$ cockroach sql --url "postgresql://root@_cockroach._tcp.control-plane.oxide.test?sslmode=disable"
#
# Welcome to the CockroachDB SQL shell.
# All statements must be terminated by a semicolon.
# To exit, type: \q.
#
ERROR: cannot dial server.
Is the server running?
If the server is running, check --host client-side and --advertise server-side.
dial tcp: lookup _cockroach._tcp.control-plane.oxide.test: no such host
Failed running "sql"
Resolving using SRV records doesn't work with the cli (https://github.com/cockroachdb/cockroach/issues/64439). But you should be able to have the nodes discover each other via DNS without explicitly listing them out for --join (though that's perhaps behind another flag, --experimental-srv-dns, depending on the version?). But afaik that's just limited to the initial bootstrapping, and I'm unsure how it deals with the set changing at runtime.
Thanks for the pointer, I'll look into this flag! To be clear, that would be for CockroachDB nodes to connect to each other using DNS, right?
Just being clear that it's distinct from any attempts by e.g. Nexus to use a libpq client to connect to CockroachDB
Correct
Here's my follow-up -- I'm digging into libpq to try to figure out why DNS resolution isn't happening from Nexus, or from any of the psql / cockroach sql clients:
It appears the PQConnectStart method parses connection options, then calls pqConnectDBStart to actually try connecting to the backend. This itself calls PQconnectPoll after setting up some config options (most importantly, setting status to CONNECTION_NEEDED).
I'm pretty sure I'm bottoming out here, because this matches the error I was seeing:
Under the hood, this appears to be calling getaddrinfo. I believe this is only compatible with A/AAAA records, and cannot properly parse SRV records -- this appears to match my experiments locally, where I spun up a small C program and could resolve the "node name" c6cda479-5fde-49a0-a079-7c960022baff.host.control-plane.oxide.test, but not _cockroach._tcp.control-plane.oxide.test. This does, admittedly, defeat the point of doing DNS lookup, since that AAAA record would presumably change when the CRDB service is provisioned to a new node.
Out of curiosity, I dug into tokio-postgres too, to see if it does any better than libpq:
This calls into https://docs.rs/tokio/latest/tokio/net/fn.lookup_host.html, which also appears (through local testing) to "work with AAAA records when you also supply the port, but not with SRV records". This pretty much matches the getaddrinfo issues the libpq client is facing.
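To make that concrete, here's a small sketch of the lookup behavior I'm describing, using tokio's lookup_host directly against the DNS records set up earlier in this thread. The point is just that A/AAAA names with a port resolve, while the SRV name does not.

// Sketch: tokio::net::lookup_host handles A/AAAA names (with a port), but has
// no notion of SRV records, mirroring getaddrinfo.
use tokio::net::lookup_host;

#[tokio::main]
async fn main() -> std::io::Result<()> {
    // AAAA lookup with an explicit port: resolves to [::1]:7709.
    let name = "c6cda479-5fde-49a0-a079-7c960022baff.host.control-plane.oxide.test:7709";
    for addr in lookup_host(name).await? {
        println!("resolved {addr}");
    }

    // SRV-style name: there is no AAAA record for it, so this fails even
    // though the SRV records exist and carry the ports we want.
    match lookup_host("_cockroach._tcp.control-plane.oxide.test:0").await {
        Ok(_) => println!("unexpectedly resolved"),
        Err(e) => println!("not resolvable via lookup_host: {e}"),
    }
    Ok(())
}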
I'm kinda getting the sense we have two options for a path forward here:
- Nexus resolves IP addresses by doing the SRV lookup itself...
- We patch libpq to add lookup via SRV...
There are some drawbacks to this approach that we would need to work through:
Currently, the connection manager APIs used by Nexus (and Diesel) are static -- the postgresql:// URL is supplied once, and exists for the duration of the connection pool. To implement something like this on top of such an API, I think we'd need to create a new connection pool and switch new connections over to it. Alternatively, we could modify the connection pooling libraries to change the URL which we use for connections dynamically, and internalize this detail to the pool itself.
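A minimal sketch of the "create a new connection pool and switch new connections over to it" option, with a placeholder Pool type standing in for whatever bb8/diesel pool Nexus actually holds (this is a hypothetical wrapper, not existing Nexus code):

// Sketch: swap in a whole new pool when the set of CRDB nodes changes.
// `Pool` is a stand-in for the real connection pool type.
use std::sync::Arc;
use tokio::sync::RwLock;

struct Pool {
    url: String,
    // ... connections, config, etc.
}

impl Pool {
    fn new(url: &str) -> Arc<Self> {
        Arc::new(Pool { url: url.to_string() })
    }
}

struct SwappablePool {
    inner: RwLock<Arc<Pool>>,
}

impl SwappablePool {
    fn new(url: &str) -> Self {
        SwappablePool { inner: RwLock::new(Pool::new(url)) }
    }

    // Callers always grab the current pool for new checkouts.
    async fn current(&self) -> Arc<Pool> {
        self.inner.read().await.clone()
    }

    // Called when DNS tells us the set of CRDB nodes changed: new checkouts
    // go to the new pool, and the old one drains as its Arc refs drop.
    async fn replace(&self, new_url: &str) {
        *self.inner.write().await = Pool::new(new_url);
    }
}

#[tokio::main]
async fn main() {
    let pools = SwappablePool::new("postgresql://root@[::1]:7709/defaultdb?sslmode=disable");
    pools
        .replace("postgresql://root@[::1]:7710/defaultdb?sslmode=disable")
        .await;
    println!("now using {}", pools.current().await.url);
}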
Right, last I looked it involved changes across multiple crates: We create an async-bb8-diesel::ConnectionManager with a fixed database URL string. That wraps diesel::r2d2::ConnectionManager and implements bb8::ManageConnection. In that is the blocking call to the underlying r2d2::ManageConnection::connect, which is implemented directly in diesel and calls Connection::establish with the fixed URL we created at startup.
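For reference, the shape of that construction is roughly the following. This is a sketch from my reading of those crates -- the constructor and builder names are assumptions rather than code copied from Nexus -- and the point is just that the URL is captured once and every later connect() reuses it.

// Sketch of the fixed-URL pool construction described above; names are based
// on my reading of async-bb8-diesel/bb8 and may not match the crates exactly.
use async_bb8_diesel::ConnectionManager;
use diesel::PgConnection;

fn make_pool(url: &str) -> bb8::Pool<ConnectionManager<PgConnection>> {
    // The URL is baked into the manager here; nothing re-resolves it later,
    // so every connection the pool ever makes goes to the same address(es).
    let manager = ConnectionManager::<PgConnection>::new(url);
    bb8::Pool::builder().build_unchecked(manager)
}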
I'll see if I can modify Diesel and async-bb8-diesel to make that URL mutable. That seems like it would help us form a foundation to modify things from nexus, if we are unable to rely on libpq to act as a DNS client on our behalf.
Okay, seems possible to let the database URL be mutable from Nexus. That would at least let Nexus update the "set of valid CRDB nodes" with the latest info it knows about.
See: https://github.com/oxidecomputer/async-bb8-diesel/pull/62 , https://github.com/diesel-rs/diesel/pull/4005
Nice! One thing, though, is that with that approach we'd need to explicitly call it for every Nexus every time the set of CRDB nodes changes, compared to changing connect to somehow get the URL dynamically, where we could just query DNS.
Totally, but this was my takeaway from reading libpq's source: someone on the client-side needs to make a decision to go query DNS, get a new set of nodes, and update the set of IPs we end up talking to. If libpq handled this for us, that would be nice, but if it doesn't, I think it ends up being functionally somewhat similar for Nexus to take this responsibility too.
Arguably, I think Nexus could be more optimal here, since it can perform this action as a downstream operation of an RPW to avoid unnecessary queries to DNS until the set is known to have changed.
This use case is really similar to the case of managing HTTP clients for our internal services: there's DNS resolution that ideally would be decoupled from connection establishment, and connection establishment that would be decoupled from usage. This is another case where we'll want a more sophisticated connection management component, which likely means building our own pool to better control the behavior here. We basically decided in the update call today to pursue this, so the rest of this might be moot. But in case it's useful, here's responding to a few things above.
So, specifically eyeing the case of "a single CockroachDB node fails, what do we do?":
- The CockroachDB nodes themselves should be resilient. I've tested this manually with a three-node setup, and as expected, forcefully killing one node means that the others remain responsive to requests.
- The Nexus connection pool may become unresponsive, depending on which IP address was returned out of the original DNS lookup.
This is a good summary. The CockroachDB part (in the start method) was explicitly designed this way in order to stay in sync with the latest cluster topology and I'd be more surprised if it didn't work! But we've known from the start that there was at least some work to be done to make the pool survive database failures better, and likely we were going to have to do our own pool.
Resolving using SRV records doesn't work with the cli (cockroachdb/cockroach#64439). ... Under the hood, this appears to be calling getaddrinfo. I believe this is only compatible with A/AAAA records, and cannot properly parse SRV records
For what it's worth, I've almost never seen non-application-specific components do SRV lookups.
But you should be able to have the nodes discover each other via DNS without explicitly listing them out for --join (though that's perhaps behind another flag --experimental-srv-dns depending on the version?) But afaik that's just limited to the initial bootstrapping and unsure how it deals with the set changing at runtime.
I think we've basically solved this part already (see above). We considered --experimental-dns-srv at the time but ruled it out, if for no other reason than that it was documented as unstable and subject to removal or change. I feel like I also found some problem with its behavior but I do not remember what it was. Thinking out loud: such a facility would need to use the right DNS servers (have we set up /etc/resolv.conf correctly?) and deal well with those being down; and importantly, if any of it fails, it should sit there waiting for the right records to show up, not just give up and exit. It may have been this last one that was the sticking point.
I'm kinda getting the sense we have two options for a path forward here:
- Nexus resolves IP addresses by doing the SRV lookup itself...
- We patch libpq to add lookup via SRV...
It makes sense to me that the DNS resolution and even TCP connection establishment behavior would ultimately be application-specific and would happen outside libpq. While it's no doubt convenient that clients like tokio_postgres allow you to specify a DNS name at all, that falls apart in situations like ours because there are so many policy choices hardcoded here that we'd probably want to change: which nameservers to use? how many are queried? with what concurrency? on failure, do we fail or retry? how many times? how often? Then most of those same questions apply again for TCP connection establishment. And the challenge is that we likely want to make these choices at a high layer, not deep in tokio_postgres when we're in the process of trying to establish a connection for a client that's currently waiting on it.
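To make the "high layer" idea a bit more concrete, here's a minimal sketch of that separation under some assumptions: a background task owns all the DNS policy and publishes the current backend set over a watch channel, and connection establishment only reads the latest set. resolve_cockroach_backends is a hypothetical stand-in for whatever SRV lookup we'd actually do; none of this is existing omicron code.

// Sketch: DNS resolution decoupled from connection establishment via a watch
// channel. All policy (which servers, how often, failure handling) lives in
// the resolver task.
use std::net::SocketAddr;
use std::time::Duration;
use tokio::sync::watch;

// Hypothetical stand-in for an SRV + AAAA lookup against internal DNS.
async fn resolve_cockroach_backends() -> Vec<SocketAddr> {
    vec!["[::1]:7709".parse().unwrap(), "[::1]:7710".parse().unwrap()]
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = watch::channel(Vec::<SocketAddr>::new());

    // Resolver task: re-queries DNS on its own schedule and only notifies
    // readers when the backend set actually changes.
    tokio::spawn(async move {
        loop {
            let backends = resolve_cockroach_backends().await;
            tx.send_if_modified(|current| {
                if *current != backends {
                    *current = backends.clone();
                    true
                } else {
                    false
                }
            });
            tokio::time::sleep(Duration::from_secs(30)).await;
        }
    });

    // Connection-establishment side: consult the latest known set when a new
    // connection is needed; no DNS lookup happens in this path.
    rx.changed().await.unwrap();
    let backends: Vec<SocketAddr> = rx.borrow().clone();
    println!("would try these backends in order: {backends:?}");
}

The interval, retry, and failure policy all live in the resolver task, where we can tune them without a client waiting on a connection in the meantime.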
In my most recent testing, the behavior for CRDB connection failures has changed (or maybe this is expected now?). When the CRDB node in use had gone down, I saw messages like this in the Nexus logs:
17:42:35.701Z ERRO 2898657e-4141-4c05-851b-147bffc6bbbd (ServerContext): database connection error
database_url = postgresql://root@[fd00:1122:3344:109::3]:32221,[fd00:1122:3344:105::3]:32221,[fd00:1122:3344:10b::3]:32221,[fd00:1122:3344:107::3]:32221,[fd00:1122:3344:108::3]:32221/omicron?sslmode=disable
error_message = Connection error: server is shutting down
file = nexus/db-queries/src/db/pool.rs:100
(Note: The crdb instance in use is not necessarily the first one listed; in my case, disabling the 105 instance is what triggered the error.)
Nexus was able to continue serving both read and update API requests in spite of the above error. It's possible that any ongoing requests against the crdb node right at the moment it's going down could have failed but all requests I made afterwards succeeded.
@davepacheco took a look at the code to understand what changed the behavior. Here are his comments:
I didn't think we'd changed anything here since https://github.com/oxidecomputer/omicron/issues/3763 was filed, but I see there was https://github.com/oxidecomputer/omicron/issues/3763#issuecomment-1659306277, which maybe fixed this? Subsequent comments suggested it might not have, but it seems like it should have. Anyway, it's good that it worked.
Just to be explicit: https://github.com/oxidecomputer/omicron/pull/5876 will fix this issue for CockroachDB -- in that PR, we use https://github.com/oxidecomputer/qorb to access each CRDB node individually, and create a pool of connections for each one.
While testing service failover on rack2, I noticed that nexus held on to the same cockroachdb backend without attempting to use any of the other nodes in the 5-node database cluster, causing requests to fail until the one it favored came back.