jgallagher closed this issue 14 hours ago
Looks like we failed to listen on a given port: https://buildomat.eng.oxide.computer/wg/0/artefact/01J8NBGFP64GFZMFMX6AANQZ9Q/2JlsoaNKXCzjK8M2hVXhgQ2IGRNjSgL2OQpZW7XlLcTeifnn/01J8NBH6JNRNGS9Z9AZMZB7BYB/01J8NH4A9GV5305DS4B0SWJGY2/clickhouse-keeper.err.log?format=x-bunyan
Peers 2 and 3 cannot connect to peer 1. This appears to be because peer 1 could not bind to its TCP port:
2024.09.25 20:43:43.399296 [ 1 ] {} KeeperLogStore: No logs exists in /var/tmp/omicron_tmp/integration_test-13fe08fd3bd99210-test_raft_config_parsing.10491.1-oximeter-clickward-test/keeper-1/coordination/log. It's Ok if it's the first run of clickhouse-keeper.
2024.09.25 20:43:43.399690 [ 1 ] {} KeeperLogStore: Removing all changelogs
2024.09.25 20:43:43.401841 [ 1 ] {} KeeperStateManager: No state was read
2024.09.25 20:43:44.521426 [ 13 ] {} RaftInstance: Election timeout, initiate leader election
2024.09.25 20:43:45.044039 [ 16 ] {} RaftInstance: failed to accept a rpc connection due to error 47, Operation canceled
2024.09.25 20:43:45.055806 [ 9 ] {} KeeperLogStore: Raft server is not set in LogStore.
2024.09.25 20:43:45.072153 [ 1 ] {} KeeperLogStore: Changelog is shut down
2024.09.25 20:43:45.093563 [ 1 ] {} Application: Code: 210. DB::Exception: Listen [::1]:39001 failed: Poco::Exception. Code: 1000, e.code() = 125, Net Exception: Address already in use: [::1]:39001 (version 23.8.7.1). (NETWORK_ERROR), Stack trace (when copying this message, always include the lines below):
Unfortunately, replicated ClickHouse forces us to allocate ports ahead of time, and it's always possible that another test running in parallel can steal our port. We could repeatedly tear down the cluster and retry until we succeed, which seems like the best option, if we could somehow detect the bind problem. The other alternative is to just pass the test in this case, since this type of flake should be pretty rare. The real problem, though, is detecting that this is a bind error.
Being forced to allocate ports ahead of time makes me sad. Is it tests that are conflicting with other tests? Ugh.
> Being forced to allocate ports ahead of time makes me sad. Is it tests that are conflicting with other tests? Ugh.
Yeah, me too. I don't remember who recommended it, but we could probably get away with allocating a different temporary IP address for each test to avoid this.
Can it use UNIX sockets instead of TCP ports?
@jclulow Not as far as I know; it really only supports TCP as a transport. Later versions of ClickHouse have a way to do dynamic cluster reconfiguration, but it's not clear whether that could be made to work: you have to start up the cluster first and then reconfigure it, and it's that first step that requires knowing the ports ahead of time.
How many TCP listen sockets does each child process actually open? Is it just one?
I believe it's two: one for the inter-node communication, and one bound for external clients.
Oh, another question: is 39001 something we pick explicitly? It would seem to be in the anonymous/ephemeral port range:
$ ipadm show-prop -p smallest_anon_port,largest_anon_port tcp
PROTO PROPERTY PERM CURRENT PERSISTENT DEFAULT POSSIBLE
tcp smallest_anon_port rw 32768 -- 32768 1024-65535
tcp largest_anon_port rw 65535 -- 65535 1024-65535
I think this means that the system might be picking it at random for use as a source port for TCP connections, which I think would conflict with a bind(3SOCKET) to the same port number?
I'm not sure if the bind failed because there was something already bound, or if a client was using it to initiate connections. I think the way we have the tests set up, either kind of conflict is possible. We start a number of servers that bind any available ephemeral port, e.g., CRDB. We also obviously have a lot of clients making requests in the tests too, and those get some ephemeral port as well. None of these should conflict with each other.
The hard-coding is of the port to which we want the ClickHouse keeper to bind. If that port is in use, for either of the above reasons, then we're hosed.
> Oh, another question: is 39001 something we pick explicitly? It would seem to be in the anonymous/ephemeral port range:
>
> $ ipadm show-prop -p smallest_anon_port,largest_anon_port tcp
> PROTO PROPERTY PERM CURRENT PERSISTENT DEFAULT POSSIBLE
> tcp smallest_anon_port rw 32768 -- 32768 1024-65535
> tcp largest_anon_port rw 65535 -- 65535 1024-65535
>
> I think this means that the system might be picking it at random for use as a source port for TCP connections, which I think would conflict with a bind(3SOCKET) to the same port number?
Ok, this is good to know. This port is entirely configurable. I can take it out of the ephemeral range, and the problem should not occur. I think that's why earlier tests didn't have this problem and have been running for months: they weren't using ports from the ephemeral range.
> the problem should not occur
As long as we're only running one cluster at a time, right?
> I'm not sure if the bind failed because there was something already bound, or if a client was using it to initiate connections. I think the way we have the tests set up, either kind of conflict is possible.
I agree that they would both presumably result in that error you see up there; i.e., I can't, it's in use. From a diagnostic perspective it would probably help to run netstat -uan (as root) or connstat or something right away, to at least collect the list of open TCP connections around that time.
> We start a number of servers that bind any available ephemeral port, e.g., CRDB. We also obviously have a lot of clients making requests in the tests too, and those get some ephemeral port as well. None of these should conflict with each other.
If you're talking about using a 0 value in sin_port during a bind(3SOCKET) call to have the OS dynamically allocate a listen port, note that as per inet(4P):
> The sin_port field of the sockaddr_in structure specifies a port number
> used by TCP or UDP. [...] If the local port address is specified as 0,
> then the system picks a unique port address greater than IPPORT_RESERVED.
The value of IPPORT_RESERVED is 1024, i.e., one above the highest privileged port (ports below 1024 require root to bind). As far as I know it's not also restricted to the anonymous range.
> The hard-coding is of the port to which we want the ClickHouse keeper to bind. If that port is in use, for either of the above reasons, then we're hosed.
Yeah, that's tough. Do you know if it will accept a 0 in those fields, at least, to get a random port number? If we could then figure out a useful way to discover what had been picked?
> > the problem should not occur
>
> As long as we're only running one cluster at a time, right?
No, we just put them all on separate ports. The ports are configurable.
> Yeah, that's tough. Do you know if it will accept a 0 in those fields, at least, to get a random port number? If we could then figure out a useful way to discover what had been picked?
We can't put a zero here. All the nodes must know all the ports. I've tried to think through staging mechanisms to do this, but it's a lot of code change and I'm not sure it would work.
So are you saying that ephemeral listeners that use port 0 don't only pull from the anonymous range? If that's the case then this is unsolvable without stronger isolation.
I'm not actually sure -- I can go code spelunking to find out! I was just calling out what's documented as the stable behaviour; it's totally possible it's more strict in practice and we should fix the manual.
I agree that it not being limited to the anonymous ports would seem a bit shithouse.
> I'm not actually sure -- I can go code spelunking to find out! I was just calling out what's documented as the stable behaviour; it's totally possible it's more strict in practice and we should fix the manual.
>
> I agree that it not being limited to the anonymous ports would seem a bit shithouse.
That would be appreciated. Someday I should learn how to spelunk the OS code myself lol.
In the meantime, I have a PR coming that will steer clear of the anonymous range, in case that is where ephemeral ports come from. In any case, changing these ports doesn't hurt.
This is a tentative fix if the system only binds to ports in the anonymous range: https://github.com/oxidecomputer/omicron/pull/6713
Looking briefly it seems like that one is prior to the fix with the new port numbers:
2024.09.28 15:28:38.355194 [ 43351 ] {} <Error> Application: DB::Exception: Listen [::1]:39003 failed: Poco::Exception. Code: 1000, e.code() = 98, Net Exception: Address already in use: [::1]:39003 (version 23.8.7.24 (official build))
Yeah, I'm caught up to main as of last night and the fix is not in. Was just dropping a link in case there is further useful information in the CI run.
Given how frequently this is being hit and the fact that we haven't ever seen it with our existing replica tests that don't allocate from the anonymous range, I'm going to guess that changing the ports as in #6713 will fix this. We can re-open if it gets hit again.
This test failed on a CI run on #6668:
https://github.com/oxidecomputer/omicron/pull/6668/checks?check_run_id=30667041839
Log showing the specific test failure:
https://buildomat.eng.oxide.computer/wg/0/details/01J8NBGFP64GFZMFMX6AANQZ9Q/2JlsoaNKXCzjK8M2hVXhgQ2IGRNjSgL2OQpZW7XlLcTeifnn/01J8NBH6JNRNGS9Z9AZMZB7BYB
Excerpt from the log showing the failure:
Cc @karencfv @andrewjstone