scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com
GNU Affero General Public License v3.0

Add to test/alternator/run an option to run multiple Scylla nodes #7236

Open · nyh opened 4 years ago

nyh commented 4 years ago

The Alternator test suite is usually run via the script test/alternator/run. This script starts a single Scylla node (with two shards) and runs the tests against it. However, some bugs may be specific to multi-node clusters, so once in a while we should run the same tests against a multi-node cluster. This issue requests adding an option to the test/alternator/run script to run multiple Scylla nodes.

As in the existing script, we need to keep the tests reasonably fast. Scylla's boot speed matters most, because it determines how long it takes to run a single test (something done often during development). For example, the --skip-wait-for-gossip-to-settle option that run uses was very important for reducing boot delay, and it would be good to keep it, but this may require starting the Scylla processes in order instead of in parallel. Another example is --alternator-streams-time-window-s, which can slow down all the Streams tests: even if we can no longer use 0 (?), we shouldn't use 10 seconds either.
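Starting the nodes in order, as suggested above, might look like the following Python sketch (Python being the language of the test suite). node_command, wait_for_port and start_cluster_sequentially are hypothetical names. Apart from --skip-wait-for-gossip-to-settle and --alternator-streams-time-window-s, which come from this issue, the flags are assumptions modeled on a typical local Scylla setup; they should be checked against `scylla --help`, and the list is not necessarily complete. The sketch also assumes Linux, where every 127.0.0.x address is a usable loopback address, and it treats the Alternator port accepting connections as a rough sign that a node has finished booting.

```python
import os
import socket
import subprocess
import time

ALTERNATOR_PORT = 8000  # assumption: same port on every node, distinct IPs


def node_command(scylla, index, workdir_base, smp=2, streams_window_s=1):
    """Hypothetical helper: command line for node `index` (0-based) of a local
    cluster. Node i listens on 127.0.0.(i+1); node 0 is every node's seed."""
    ip = f"127.0.0.{index + 1}"
    return [
        scylla,
        "--smp", str(smp),
        "--workdir", f"{workdir_base}/node{index}",
        # Assumed flags for a local developer cluster (verify with --help).
        "--developer-mode", "1",
        "--listen-address", ip,
        "--rpc-address", ip,
        "--api-address", ip,
        "--seed-provider-parameters", "seeds=127.0.0.1",
        "--alternator-address", ip,
        "--alternator-port", str(ALTERNATOR_PORT),
        "--alternator-write-isolation", "always_use_lwt",
        # Named in the issue: skip the boot-time wait for gossip to settle.
        "--skip-wait-for-gossip-to-settle", "0",
        # Named in the issue: keep the Streams window small but nonzero.
        # The default of 1 second is a guess, not a measured choice.
        "--alternator-streams-time-window-s", str(streams_window_s),
    ]


def wait_for_port(ip, port, timeout=120.0):
    """Poll until ip:port accepts TCP connections; return False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((ip, port), timeout=1):
                return True
        except OSError:
            time.sleep(0.1)
    return False


def start_cluster_sequentially(scylla, n, workdir_base):
    """Boot nodes one at a time: node i+1 starts only after node i's
    Alternator port is open. Kills everything if a node fails to come up."""
    procs = []
    for i in range(n):
        cmd = node_command(scylla, i, workdir_base)
        os.makedirs(cmd[cmd.index("--workdir") + 1], exist_ok=True)
        procs.append(subprocess.Popen(cmd))
        if not wait_for_port(f"127.0.0.{i + 1}", ALTERNATOR_PORT):
            for p in procs:
                p.kill()
            raise RuntimeError(f"node {i} did not open its Alternator port")
    return procs
```

Waiting for each node before starting the next makes total boot time grow roughly linearly with N; that is the trade-off the issue anticipates in exchange for keeping --skip-wait-for-gossip-to-settle.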

nyh commented 4 years ago

When we have a test/alternator/run option to run a cluster of multiple Scylla nodes, we can also have special tests which need such a cluster and are skipped when the cluster is smaller. For example, the test I wrote for issue #7236 (https://github.com/scylladb/scylla-dtest/pull/1698) requires a cluster of (at least) two nodes, and can reproduce the bug in a few seconds. Right now I'm planning to move this test to dtest, but in principle it could also live in test/alternator. dtest shines when the test needs to change the cluster on the fly, but this test does not, so it could have been a test/alternator test.

We could put the multi-node tests in a separate source file, with fixtures and utilities along the lines of the following ideas:

  1. Just like we have the "dynamodb" fixture today containing the boto3 resource to connect to the server, we will have a fixture which will open N separate resources to N separate nodes. On Alternator, this will use the /localnodes request on the single known IP address to find all Alternator nodes, and skip the test if we don't have N of them.
  2. Just like we have the "test_table" fixture today, we will have a fixture creating test_table and returning N "table" objects to access the same table through N different nodes.
  3. However, fixtures may not actually be the best approach here. The above fixtures are enough for tests like https://github.com/scylladb/scylla-dtest/pull/1698 where each client thread accesses a different node. But in more general tests, we may want one thread to reach all nodes, so we need each client thread to open N different resources or tables, separately, because boto3 isn't thread-safe. So maybe these shouldn't be a "fixture" at all, but rather a function that will be called explicitly in each thread, and will return new resources to connect to N different nodes, and N "Table" objects using those N resources (the table creation itself will only happen once, of course, through the existing test_table fixture).
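
The per-thread function proposed in item 3 could look roughly like the following Python sketch. It assumes /localnodes returns a JSON list of node addresses, and that every node serves Alternator on the same scheme and port as the node that was queried. parse_localnodes, fetch_endpoints, have_enough_nodes, connect_to_nodes and run_in_threads are hypothetical names, and the credentials are placeholders. The table itself is still created once, by the existing test_table fixture; each thread only builds its own boto3 objects pointing at it.

```python
import json
import threading
import urllib.parse
import urllib.request


def parse_localnodes(body, scheme="http", port=8000):
    """Turn a /localnodes response body (assumed: a JSON list of node
    addresses) into one endpoint URL per node."""
    return [f"{scheme}://{addr}:{port}" for addr in json.loads(body)]


def fetch_endpoints(url):
    """Ask the single known node for all Alternator nodes. Assumes every
    node uses the same scheme and port as `url`."""
    parts = urllib.parse.urlsplit(url)
    with urllib.request.urlopen(f"{parts.scheme}://{parts.netloc}/localnodes",
                                timeout=5) as resp:
        body = resp.read().decode()
    return parse_localnodes(body, parts.scheme, parts.port or 8000)


def have_enough_nodes(endpoints, n):
    """A test needing n nodes should be skipped unless this returns True."""
    return len(endpoints) >= n


def connect_to_nodes(endpoints, table_name, region="us-east-1"):
    """Hypothetical helper, called explicitly inside each client thread:
    returns N fresh boto3 resources (one per node) and N Table objects that
    all refer to the same, already-created table."""
    import boto3  # imported lazily so the rest of this sketch runs without boto3
    # A new Session per call, because boto3 resources must not be shared
    # across threads.
    session = boto3.session.Session()
    resources = [
        session.resource("dynamodb", endpoint_url=e, region_name=region,
                         # Placeholder credentials for a local test server.
                         aws_access_key_id="alternator",
                         aws_secret_access_key="secret_pass")
        for e in endpoints
    ]
    return resources, [r.Table(table_name) for r in resources]


def run_in_threads(nthreads, connect, work):
    """Start `nthreads` threads; each calls `connect()` itself (so no boto3
    object is shared between threads) and passes the result to
    `work(thread_index, connection)`. Re-raises the first worker error."""
    errors = []

    def worker(i):
        try:
            work(i, connect())
        except Exception as e:  # collected so the calling test fails visibly
            errors.append(e)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    if errors:
        raise errors[0]
```

In a test, one might call fetch_endpoints() on the single known URL, call pytest.skip() when have_enough_nodes(endpoints, 2) is false, and then pass `lambda: connect_to_nodes(endpoints, test_table.name)` as `connect`, so each thread gets its own N Table objects. Passing `connect` in as a parameter also keeps the threading logic testable without a running cluster.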

nyh commented 1 year ago

In #11645, we noticed that several tests of test/cql-pytest (we didn't check test/alternator) currently fail against a 3-node cluster and nobody noticed. Most of these failures are caused by problems in the tests, not in Scylla. This demonstrates why having the ability to run tests on a 3-node cluster once in a while would be useful.

nyh commented 1 year ago

Note that by now, we have much more elaborate support in test.py and test/pylib/ for tests written in Python that can use clusters of arbitrary and even changing sizes, similar to dtest's capabilities. So we don't want to replicate these features in test/alternator/run.

It may still be useful, however, to have the ability to start a test cluster with N nodes instead of 1 and run all the usual single-node tests on it, just as a user can start an N-node cluster manually and run "pytest" against it.