scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

Test node operations with streaming (NOT with RBNO) #8289

Closed: mykaul closed this 1 month ago

mykaul commented 2 months ago

We apparently do not test node operations with streaming - only with RBNO, which is the default. That's fine for the majority of the tests, but we need some sanity coverage around streaming. Please make sure we have simple add/decommission/replace nemesis tests with streaming (RBNO disabled).
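For illustration, a streaming-only run could be expressed as a small set of scylla.yaml overrides. This is only a sketch: the helper names `node_ops_overrides` and `to_yaml_lines` are hypothetical, and the option names `enable_repair_based_node_ops` and `allowed_repair_based_node_ops` are assumed to match the Scylla version under test, so they should be verified before use.

```python
# Hypothetical helper, not part of SCT. The option names below are assumed
# to exist in the Scylla version under test; verify before relying on them.
ALL_NODE_OPS = ("replace", "removenode", "rebuild", "bootstrap", "decommission")


def node_ops_overrides(rbno_ops=()):
    """Return scylla.yaml overrides; an empty rbno_ops means streaming for everything."""
    unknown = set(rbno_ops) - set(ALL_NODE_OPS)
    if unknown:
        raise ValueError(f"unknown node operations: {sorted(unknown)}")
    if not rbno_ops:
        # RBNO disabled entirely, so node operations fall back to streaming.
        return {"enable_repair_based_node_ops": False}
    return {
        "enable_repair_based_node_ops": True,
        # Keep a stable order so the generated config is easy to diff.
        "allowed_repair_based_node_ops": ",".join(
            op for op in ALL_NODE_OPS if op in rbno_ops
        ),
    }


def to_yaml_lines(overrides):
    """Render flat overrides as simple 'key: value' YAML lines."""
    lines = []
    for key, value in overrides.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"{key}: {value}")
    return "\n".join(lines)


print(to_yaml_lines(node_ops_overrides()))
```

How such overrides are passed to a test (for example through a per-job configuration file) depends on the SCT job setup and is not shown here.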

CC @kbr-scylla

pehala commented 2 months ago

@kbr-scylla @fruch Are you aware of any problems that GrowShrinkClusterNemesis, TerminateAndRemoveNodeMonkey, NodeTerminateAndReplace, DecommissionMonkey might have with streaming instead of RBNO? If not, then this task would be just adding the respective jobs/testcases into our regular CI?

mykaul commented 2 months ago

> @kbr-scylla @fruch Are you aware of any problems that GrowShrinkClusterNemesis, TerminateAndRemoveNodeMonkey, NodeTerminateAndReplace, DecommissionMonkey might have with streaming instead of RBNO? If not, then this task would be just adding the respective jobs/testcases into our regular CI?

The only question is whether we wish to clone more jobs and change the method from RBNO to streaming. I prefer converting SOME of the existing jobs to use streaming instead of RBNO. I understand the concern with randomizing - it is less predictable - but it also has its value: as we continue to test more features, they will be tested with streaming as well. I do suggest we just change some existing longevities to use streaming for the time being.

fruch commented 2 months ago

> @kbr-scylla @fruch Are you aware of any problems that GrowShrinkClusterNemesis, TerminateAndRemoveNodeMonkey, NodeTerminateAndReplace, DecommissionMonkey might have with streaming instead of RBNO? If not, then this task would be just adding the respective jobs/testcases into our regular CI?
>
> The only question is whether we wish to clone more jobs and change the method from RBNO to streaming. I prefer converting SOME of the existing jobs to use streaming instead of RBNO. I understand the concern with randomizing - it is less predictable - but it also has its value: as we continue to test more features, they will be tested with streaming as well. I do suggest we just change some existing longevities to use streaming for the time being.

Adding new, or converting some, both are o.k.

Someone that owns the feature and its testing can take those calls.

Since it hasn't been tested with streaming for quite some time, we don't have any information about which parts are ready for it and which are not; at some point they did work with streaming.

Picking the cases, should be random, but it might.

We are not going to randomize it at test run time. Every time we did such a thing, it wasted 10x of people's time, chasing the wind because no one remembered it was randomized.

kbr-scylla commented 2 months ago

> We are not going to randomize it at test run time. Every time we did such a thing, it wasted 10x of people's time, chasing the wind because no one remembered it was randomized.

What exactly was the problem? Yes it can waste people's time, but I think it won't if we do it in a controlled manner: it must be clear which parameters were randomized in this test run, and how to rerun the test with exactly the same param values (most conveniently by passing the seed that was used)
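A minimal sketch of what "controlled" randomization could look like, assuming a single randomized parameter. The function `pick_params` and the parameter name `node_ops_method` are hypothetical and only illustrate the idea of recording a seed and replaying it:

```python
import random


def pick_params(seed=None):
    """Pick randomized test parameters from a seed and return both.

    If no seed is given, a fresh one is generated; reporting it alongside
    the chosen values is what makes the run repeatable.
    """
    if seed is None:
        seed = random.SystemRandom().randrange(2**32)
    rng = random.Random(seed)  # private generator, unaffected by other random use
    params = {
        # Hypothetical parameter name, used here only for illustration.
        "node_ops_method": rng.choice(["rbno", "streaming"]),
    }
    return seed, params


seed, params = pick_params()
print(f"randomized params (seed={seed}): {params}")

# Rerunning with the recorded seed yields exactly the same choices.
assert pick_params(seed) == (seed, params)
```

Using a dedicated `random.Random` instance keeps the choice independent of any other randomness in the test process, so the seed alone is enough to reproduce the parameter values.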

kbr-scylla commented 2 months ago

Note that there's a lot of randomness in longevity already. cassandra-stress loads are generated by random distributions. And this is the less convenient randomization case: you cannot really repeat what cassandra-stress did in a given run, it all depends on timing and the environment etc.

The kind of randomization I propose is much more manageable

fruch commented 2 months ago

> Note that there's a lot of randomness in longevity already. cassandra-stress loads are generated by random distributions. And this is the less convenient randomization case: you cannot really repeat what cassandra-stress did in a given run, it all depends on timing and the environment etc.
>
> The kind of randomization I propose is much more manageable

I.e. someone would need to manage it. Are you volunteering? :)

soyacz commented 2 months ago

one idea would be enhancing SCT to include randomized params along with disruption name in Nemesis tab in Argus (and a log message in sct.log, close to disruption start if possible). So it's much clearer how/what we tested.
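A rough sketch of the log-message half of this idea, assuming the randomized values are available as a plain dict when the disruption starts. The logger name and both function names are hypothetical and do not refer to existing SCT or Argus APIs:

```python
import json
import logging

# Hypothetical logger name; a real integration would use the test's own logger.
LOGGER = logging.getLogger("sct.sketch")


def format_disruption_start(name, randomized_params):
    """Build one greppable line with the disruption name and its randomized params."""
    # sort_keys keeps the output stable, so lines from different runs compare easily.
    rendered = json.dumps(randomized_params, sort_keys=True)
    return f"Starting disruption {name} with randomized params: {rendered}"


def log_disruption_start(name, randomized_params):
    """Log the line right before the disruption starts and return it."""
    message = format_disruption_start(name, randomized_params)
    LOGGER.info(message)
    return message


logging.basicConfig(level=logging.INFO)
log_disruption_start("DecommissionMonkey", {"node_ops_method": "streaming"})
```

The same rendered string could be attached to whatever record is reported for the disruption, so the log and the report show identical values.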

fruch commented 2 months ago

> one idea would be enhancing SCT to include randomized params along with disruption name in Nemesis tab in Argus (and a log message in sct.log, close to disruption start if possible). So it's much clearer how/what we tested.

It might be relevant if it's a nemesis-level call, but we are talking about Scylla configuration that controls which operations use RBNO and which do not, so it's at the test-case level.

pehala commented 2 months ago

> one idea would be enhancing SCT to include randomized params along with disruption name in Nemesis tab in Argus (and a log message in sct.log, close to disruption start if possible). So it's much clearer how/what we tested.

I think that is a good idea regardless; I think SCT needs more transparency in general. But I do not think it solves the randomization problem completely. By adding another layer of randomization we lower the chances that the configuration we want actually gets run, add an additional level of review (did we run it with streaming or RBNO, do we need to run again to test the other one as well?), and make it even less transparent than it is today.

I also think randomization is not scalable. Truth be told, the currently proposed solution (i.e. switching some tests to streaming) is also not scalable, and we need to discuss after this "hotfix" how to deal with these issues in the future, but at least it does not make SCT even less transparent.