oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

test failed in CI: test_raft_config_parsing timeout #6689

Closed by jgallagher 14 hours ago

jgallagher commented 3 days ago

This test failed on a CI run on #6668:

https://github.com/oxidecomputer/omicron/pull/6668/checks?check_run_id=30667041839

Log showing the specific test failure:

https://buildomat.eng.oxide.computer/wg/0/details/01J8NBGFP64GFZMFMX6AANQZ9Q/2JlsoaNKXCzjK8M2hVXhgQ2IGRNjSgL2OQpZW7XlLcTeifnn/01J8NBH6JNRNGS9Z9AZMZB7BYB

Excerpt from the log showing the failure:

--- STDOUT:              omicron-clickhouse-admin::integration_test test_raft_config_parsing ---

running 1 test
Deploying keeper: /var/tmp/omicron_tmp/integration_test-13fe08fd3bd99210-test_raft_config_parsing.10491.1-oximeter-clickward-test/keeper-3
Deploying keeper: /var/tmp/omicron_tmp/integration_test-13fe08fd3bd99210-test_raft_config_parsing.10491.1-oximeter-clickward-test/keeper-1
Deploying keeper: /var/tmp/omicron_tmp/integration_test-13fe08fd3bd99210-test_raft_config_parsing.10491.1-oximeter-clickward-test/keeper-2
Deploying clickhouse server: /var/tmp/omicron_tmp/integration_test-13fe08fd3bd99210-test_raft_config_parsing.10491.1-oximeter-clickward-test/clickhouse-1
test test_raft_config_parsing ... FAILED

failures:

failures:
    test_raft_config_parsing

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 1 filtered out; finished in 33.67s

--- STDERR:              omicron-clickhouse-admin::integration_test test_raft_config_parsing ---
log file: /var/tmp/omicron_tmp/integration_test-13fe08fd3bd99210-test_raft_config_parsing.10491.0.log
note: configured to log to "/var/tmp/omicron_tmp/integration_test-13fe08fd3bd99210-test_raft_config_parsing.10491.0.log"
Error: failed to contact all keepers: [KeeperId(1), KeeperId(2), KeeperId(3)]

Caused by:
    timed out after 33.023433142s

Cc @karencfv @andrewjstone

andrewjstone commented 3 days ago

Looks like we failed to listen on a given port: https://buildomat.eng.oxide.computer/wg/0/artefact/01J8NBGFP64GFZMFMX6AANQZ9Q/2JlsoaNKXCzjK8M2hVXhgQ2IGRNjSgL2OQpZW7XlLcTeifnn/01J8NBH6JNRNGS9Z9AZMZB7BYB/01J8NH4A9GV5305DS4B0SWJGY2/clickhouse-keeper.err.log?format=x-bunyan

Peers 2 and 3 cannot connect to peer 1. This appears to be because peer 1 could not bind to its TCP port:

2024.09.25 20:43:43.399296 [ 1 ] {} KeeperLogStore: No logs exists in /var/tmp/omicron_tmp/integration_test-13fe08fd3bd99210-test_raft_config_parsing.10491.1-oximeter-clickward-test/keeper-1/coordination/log. It's Ok if it's the first run of clickhouse-keeper.
2024.09.25 20:43:43.399690 [ 1 ] {} KeeperLogStore: Removing all changelogs
2024.09.25 20:43:43.401841 [ 1 ] {} KeeperStateManager: No state was read
2024.09.25 20:43:44.521426 [ 13 ] {} RaftInstance: Election timeout, initiate leader election
2024.09.25 20:43:45.044039 [ 16 ] {} RaftInstance: failed to accept a rpc connection due to error 47, Operation canceled
2024.09.25 20:43:45.055806 [ 9 ] {} KeeperLogStore: Raft server is not set in LogStore.
2024.09.25 20:43:45.072153 [ 1 ] {} KeeperLogStore: Changelog is shut down
2024.09.25 20:43:45.093563 [ 1 ] {} Application: Code: 210. DB::Exception: Listen [::1]:39001 failed: Poco::Exception. Code: 1000, e.code() = 125, Net Exception: Address already in use: [::1]:39001 (version 23.8.7.1). (NETWORK_ERROR), Stack trace (when copying this message, always include the lines below):

Unfortunately, replicated ClickHouse forces us to allocate ports ahead of time, and it's always possible that another test running in parallel can steal our port. We could repeatedly tear down the cluster and retry until we succeed, which seems like the best option if we could somehow detect the bind problem. The other alternative is to just pass the test in this case, since this type of flake should be pretty rare. The real problem, though, is detecting that this is a bind error.
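One way to make the retry idea concrete: probe the candidate ports before deploying the cluster, and pick a new base if any bind fails. This is a hypothetical helper, not part of clickward's API, and the probe is inherently racy -- another process can still grab the port between the check and the keeper's own bind, which is exactly why a full teardown-and-retry loop would still be needed:

```rust
use std::net::{Ipv6Addr, TcpListener};

/// Best-effort check that `port` can currently be bound on ::1.
/// Racy by nature: the port can be taken again the moment the
/// probe listener is dropped.
fn port_is_free(port: u16) -> bool {
    TcpListener::bind((Ipv6Addr::LOCALHOST, port)).is_ok()
}

fn main() {
    // Hypothetical candidate bases; scan until all three keeper ports
    // derived from a base are simultaneously bindable.
    let base = (39001u16..=39900)
        .step_by(10)
        .find(|&b| (b..b + 3).all(port_is_free));
    println!("usable port base: {:?}", base);
}
```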

sunshowers commented 2 days ago

Being forced to allocate ports ahead of time makes me sad. Is it tests that are conflicting with other tests? Ugh.

andrewjstone commented 2 days ago

Being forced to allocate ports ahead of time makes me sad. Is it tests that are conflicting with other tests? Ugh.

Yeah, me too. I don't remember who recommended it, but we could probably get away with allocating a different temporary IP address for each test to avoid this.
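For what it's worth, the per-test-IP idea works out of the box on systems where the whole 127.0.0.0/8 block routes to loopback (Linux does this by default; illumos generally needs the alias plumbed first). A sketch of the idea, with made-up addresses -- two "tests" share the same fixed port on different loopback addresses without colliding:

```rust
use std::net::{Ipv4Addr, TcpListener};

fn main() -> std::io::Result<()> {
    // Same port, different loopback addresses: no conflict.
    // Addresses and port are illustrative.
    let test_a = TcpListener::bind((Ipv4Addr::new(127, 0, 0, 2), 39001))?;
    let test_b = TcpListener::bind((Ipv4Addr::new(127, 0, 0, 3), 39001))?;
    println!("a = {}, b = {}", test_a.local_addr()?, test_b.local_addr()?);
    Ok(())
}
```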

jclulow commented 2 days ago

Can it use UNIX sockets instead of TCP ports?

bnaecker commented 2 days ago

@jclulow Not so far as I know, it really only supports TCP as a transport. Later versions of ClickHouse have a way to do dynamic cluster reconfiguration, but it's not clear if that could be made to work. You have to start up the cluster first, and then reconfigure it, and it's that first step that requires knowing the ports ahead of time.

jclulow commented 2 days ago

How many TCP listen sockets does each child process actually open? Is it just one?

bnaecker commented 2 days ago

I believe it's two: one for the inter-node communication, and one bound for external clients.

jclulow commented 2 days ago

Oh, another question: is 39001 something we pick explicitly? It would seem to be in the anonymous/ephemeral port range:

$ ipadm show-prop -p smallest_anon_port,largest_anon_port tcp
PROTO PROPERTY              PERM CURRENT      PERSISTENT   DEFAULT      POSSIBLE
tcp   smallest_anon_port    rw   32768        --           32768        1024-65535
tcp   largest_anon_port     rw   65535        --           65535        1024-65535

I think this means that the system might be picking it at random for use as a source port for TCP connections, which I think would conflict with a bind(3SOCKET) to the same port number?

bnaecker commented 2 days ago

I'm not sure if the bind failed because there was something already bound, or if a client was using it to initiate connections. I think the way we have the tests set up, either kind of conflict is possible. We start a number of servers that bind any available ephemeral port, e.g., CRDB. We also obviously have a lot of clients making requests in the tests too, and those get some ephemeral port as well. None of these should conflict with each other.

The hard-coding is of the port to which we want the ClickHouse keeper to bind. If that port is in use, for either of the above reasons, then we're hosed.
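Both failure modes described above reduce to the same race: the OS hands some other socket the port the keeper's config had already claimed. The listener side of the collision is easy to reproduce (a sketch; the real keeper is configured via XML, not bound from Rust):

```rust
use std::io::ErrorKind;
use std::net::{Ipv6Addr, TcpListener};

fn main() -> std::io::Result<()> {
    // Another test binds port 0 and the OS picks an ephemeral port
    // for it, the way CRDB and other test servers do.
    let other_test = TcpListener::bind((Ipv6Addr::LOCALHOST, 0))?;
    let stolen = other_test.local_addr()?.port();

    // If the keeper's pre-allocated port happens to be that same port,
    // its bind fails with EADDRINUSE -- the NETWORK_ERROR in the CI log.
    let keeper = TcpListener::bind((Ipv6Addr::LOCALHOST, stolen));
    assert_eq!(keeper.unwrap_err().kind(), ErrorKind::AddrInUse);
    Ok(())
}
```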

andrewjstone commented 2 days ago

Oh, another question: is 39001 something we pick explicitly? It would seem to be in the anonymous/ephemeral port range:

$ ipadm show-prop -p smallest_anon_port,largest_anon_port tcp
PROTO PROPERTY              PERM CURRENT      PERSISTENT   DEFAULT      POSSIBLE
tcp   smallest_anon_port    rw   32768        --           32768        1024-65535
tcp   largest_anon_port     rw   65535        --           65535        1024-65535

I think this means that the system might be picking it at random for use as a source port for TCP connections, which I think would conflict with a bind(3SOCKET) to the same port number?

Ok, this is good to know. This port is entirely configurable. I can take it out of the ephemeral range and the problem should not occur. I think that's why earlier tests didn't have this problem and have been running for months: they weren't using ports from the ephemeral range.
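The fix amounts to choosing keeper/server ports below the anonymous range. A sketch of the constraint (the base port and helper here are illustrative, not the values in #6713):

```rust
// illumos default anonymous range starts here (ipadm smallest_anon_port);
// Linux's default ip_local_port_range starts at the same value.
const SMALLEST_ANON_PORT: u16 = 32768;
const IPPORT_RESERVED: u16 = 1024;

/// Hypothetical port assignment for an n-node clickward test cluster.
fn keeper_ports(base: u16, n: u16) -> Vec<u16> {
    (0..n).map(|i| base + i).collect()
}

fn main() {
    let ports = keeper_ports(29000, 3); // base chosen for illustration
    // Every port must be unprivileged but outside the ephemeral range.
    assert!(ports
        .iter()
        .all(|&p| p > IPPORT_RESERVED && p < SMALLEST_ANON_PORT));
    println!("{:?}", ports);
}
```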

bnaecker commented 2 days ago

the problem should not occur

As long as we're only running one cluster at a time, right?

jclulow commented 2 days ago

I'm not sure if the bind failed because there was something already bound, or if a client was using it to initiate connections. I think the way we have the tests set up, either kind of conflict is possible.

I agree that they would both presumably result in that error you see up there; i.e., I can't, it's in use. From a diagnostic perspective it would probably help to run netstat -uan (as root) or connstat or something right away, to at least collect the list of open TCP connections around that time.

We start a number of servers that bind any available ephemeral port, e.g., CRDB. We also obviously have a lot of clients making requests in the tests too, and those get some ephemeral port as well. None of these should conflict with each other.

If you're talking about using a 0 value in sin_port during a bind(3SOCKET) call to have the OS dynamically allocate a listen port, note that as per inet(4P):

   The sin_port field of the sockaddr_in structure specifies a port number
   used by TCP or UDP.
                            ........
                                              If the local port address is
   specified as 0, then the system picks a unique port address greater than
   IPPORT_RESERVED.

The value IPPORT_RESERVED is port 1024; i.e., the highest privileged port (for which root access is required to bind). As far as I know it's not also restricted to the anonymous range.
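A quick empirical check of the documented behaviour is straightforward, though it only verifies the weaker inet(4P) guarantee, not whether the pick is actually confined to the anonymous range:

```rust
use std::net::{Ipv4Addr, TcpListener};

fn main() -> std::io::Result<()> {
    // sin_port = 0: let the system pick. inet(4P) only promises the
    // result is greater than IPPORT_RESERVED (1024); whether it always
    // falls within [smallest_anon_port, largest_anon_port] is the open
    // question here.
    let l = TcpListener::bind((Ipv4Addr::LOCALHOST, 0))?;
    let port = l.local_addr()?.port();
    assert!(port > 1024);
    println!("system-picked port: {}", port);
    Ok(())
}
```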

The hard-coding is of the port to which we want the ClickHouse keeper to bind. If that port is in use, for either of the above reasons, then we're hosed.

Yeah, that's tough. Do you know if it will accept a 0 in those fields, at least, to get a random port number? If we could then figure out a useful way to discover what had been picked?

andrewjstone commented 2 days ago

the problem should not occur

As long as we're only running one cluster at a time, right?

No, we just put them all on separate ports. The ports are configurable.

andrewjstone commented 2 days ago

Yeah, that's tough. Do you know if it will accept a 0 in those fields, at least, to get a random port number? If we could then figure out a useful way to discover what had been picked?

We can't put a zero here. All the nodes must know all the ports. I've tried to think through staging mechanisms to do this, but it's a lot of code change and I'm not sure it would work.

So are you saying that ephemeral listeners that use port 0 don't only pull from the anonymous range? If that's the case then this is unsolvable without stronger isolation.

jclulow commented 2 days ago

I'm not actually sure -- I can go code spelunking to find out! I was just calling out what's documented as the stable behaviour; it's totally possible it's more strict in practice and we should fix the manual.

I agree that it not being limited to the anonymous ports would seem a bit shithouse.

andrewjstone commented 2 days ago

I'm not actually sure -- I can go code spelunking to find out! I was just calling out what's documented as the stable behaviour; it's totally possible it's more strict in practice and we should fix the manual.

I agree that it not being limited to the anonymous ports would seem a bit shithouse.

That would be appreciated. Someday I should learn how to spelunk the OS code myself lol.

In the meantime, I have a PR coming that will steer clear of the anonymous range, in case that is where ephemeral ports come from. In any case, changing these ports doesn't hurt.

andrewjstone commented 2 days ago

This is a tentative fix if the system only binds to ports in the anonymous range: https://github.com/oxidecomputer/omicron/pull/6713

rcgoodfellow commented 1 day ago

Hit this here

jclulow commented 1 day ago

Looking briefly it seems like that one is prior to the fix with the new port numbers:

2024.09.28 15:28:38.355194 [ 43351 ] {} <Error> Application: DB::Exception: Listen [::1]:39003 failed: Poco::Exception. Code: 1000, e.code() = 98, Net Exception: Address already in use: [::1]:39003 (version 23.8.7.24 (official build))

rcgoodfellow commented 1 day ago

Yeah, I'm caught up to main as of last night and the fix is not in. Was just dropping a link in case there is further useful information in the CI run.

andrewjstone commented 14 hours ago

Given how frequently this is being hit and the fact that we haven't ever seen it with our existing replica tests that don't allocate from the anonymous range, I'm going to guess that changing the ports as in #6713 will fix this. We can re-open if it gets hit again.