Closed karencfv closed 2 months ago
Unassigned myself from this as I have to focus on other work for the time being
I recommend that we close this as not planned. I believe we'll always want the ability to run single-node functionality for tests that interact with the oximeter database itself. Many of those tests do not care whether they're running against a replicated or single-node table, e.g., something that just inserts data and fetches it back out to verify serialization. For those tests which do care, we already have the ability to start a replicated cluster.
Heya @bnaecker! I raise this point in the "Open Questions" section of RFD 468. Do you mind writing your concerns there, so we can discuss them and keep a record of why we made one choice over another?
What I'm talking about is orthogonal to how we deploy ClickHouse on production racks. I don't have much insight to add on whether we should remove deployment of single-node ClickHouse zones in those environments. You and @andrewjstone have thought much more deeply about that at this point than me!
I'm asking more about test and development environments. This issue seems to say we should remove the ability to run single-node ClickHouse completely, including in unit tests, integration tests, or other development situations. I disagree. I find it extremely valuable to quickly run tests that rely on an oximeter
database, and spinning up a single-node version is many times faster and more reliable than a cluster. I also think it's a non-starter to completely serialize every single Nexus integration test, as we'd have to if we want to reliably run ClickHouse clusters during those tests.
I'd be fine keeping this issue open, if you'd prefer to refine it so that it refers to production environments, e.g., the deployment of single-node ClickHouse zones by the sled-agent's service manager.
Ah! gotcha. Yeah, I can see how spinning up an entire cluster to test a quick thing could be annoying. The only downside I see to keeping single node functionality is maintaining two different sets of SQL init files.
In the new clickward-based integration tests there's one where we only use a 1 keeper 1 server configuration. Could this be a good middle ground so we don't have to maintain two sets of SQL init files? This configuration starts up a lot faster than starting up a bunch of replicas and keepers, but I'm not sure if it's fast enough for what you need. WDYT?
In the new clickward-based integration tests there's one where we only use a 1 keeper 1 server configuration. Could this be a good middle ground so we don't have to maintain two sets of SQL init files?
Unless I'm missing something, that's not quite enough. That will still require us to completely serialize all the tests that deploy ClickHouse. The problem is the port numbers. I don't think we've found a way to start a cluster where the OS assigns the ports, specifically to the Keepers, right? Without that, there is no way to reliably start two clusters at once, since there will be random port conflicts that make the tests flaky.
Unfortunately this is true. I had high hopes that we would be able to get rid of single-node ClickHouse deployments in tests, for the same reason as @karencfv. Configuring ClickHouse clusters requires us to write out the config files, which requires us to choose the ports. There's no good way to do this without a TOCTTOU. In a4x2, Ry and co tried to reserve ports by squatting on them via OS allocation and then releasing them back to the pool so they could be used explicitly. But any process that grabs a port during that release window can break this scheme, and the result is very intermittent flakes.
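The race described above can be sketched as follows. This is a minimal illustration with a hypothetical `reserve_port` helper, not the actual a4x2 code: binding to port 0 lets the OS pick a free port, but the reservation evaporates the moment the listener is dropped.

```rust
use std::net::TcpListener;

// Hypothetical sketch of the port-squatting scheme: bind to port 0 so
// the OS assigns a free port, record it, then release it so a Keeper
// config file can name it explicitly.
fn reserve_port() -> u16 {
    let listener = TcpListener::bind("127.0.0.1:0")
        .expect("failed to bind an ephemeral port");
    let port = listener.local_addr().unwrap().port();
    // Dropping the listener releases the port. The TOCTTOU window opens
    // here: any other process may bind `port` before the Keeper that was
    // configured to use it actually starts, producing intermittent flakes.
    drop(listener);
    port
}

fn main() {
    let port = reserve_port();
    println!("reserved (but now unprotected) port: {port}");
}
```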
I'd probably recommend either changing this issue to state that it's for production only, or closing it altogether. I'm not sure it buys us anything, since we know we are following the plan in RFD 468 for production purposes. And if we figure out a way to only deploy clusters and run tests in parallel, we'll almost certainly do it regardless of whether there's an issue. Otherwise, this issue will stay open and unactionable forever.
Of course! The ports, gargh. Yeah, ok, that makes sense. I think I'll close this then. When we need to track the work for removing single-node ClickHouse from production, the item list will be different from what this issue has anyway. I guess it makes sense to close this one.
Once replicated mode has been enabled, there will be no further use for single-node ClickHouse. The init SQL is wildly different between replicated and single-node mode, which means that any testing and/or development done on a single node will be unreliable on a replicated setup.
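To illustrate the divergence, here is a generic ClickHouse example (a hypothetical table, not the actual oximeter schema): the single-node init uses a plain `MergeTree` engine, while the replicated variant needs an `ON CLUSTER` clause and a `ReplicatedMergeTree` engine parameterized by a Keeper path and replica macro.

```sql
-- Single-node init: plain MergeTree, no coordination required.
CREATE TABLE measurements (
    timeseries_name String,
    timestamp DateTime64(9, 'UTC'),
    value Float64
)
ENGINE = MergeTree()
ORDER BY (timeseries_name, timestamp);

-- Replicated init: the same logical table, but created on every node of
-- the cluster, with replication state coordinated through Keeper.
CREATE TABLE measurements ON CLUSTER oximeter_cluster (
    timeseries_name String,
    timestamp DateTime64(9, 'UTC'),
    value Float64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/measurements', '{replica}')
ORDER BY (timeseries_name, timestamp);
```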
Several parts of the codebase will need to be modified: