Closed mykaul closed 1 month ago
@kbr-scylla @fruch
Are you aware of any problems that GrowShrinkClusterNemesis, TerminateAndRemoveNodeMonkey, NodeTerminateAndReplace, DecommissionMonkey might have with streaming instead of RBNO? If not, then this task would be just adding the respective jobs/testcases into our regular CI?
The only question is whether we want to clone more jobs and change the method from RBNO to streaming. I prefer converting SOME of the existing jobs to use streaming instead of RBNO. I understand the concern with randomizing (it is less predictable), but it also has value: as we keep testing more features, they will get tested with streaming too. I do suggest we just change some existing longevities to use streaming for the time being.
Adding new jobs or converting some existing ones, both are o.k.
Someone who owns the feature and its testing can make those calls.
Since it hasn't been tested with streaming for quite some time, we don't have any information about which parts are ready for it; at some point they did work with streaming.
Picking the cases, should be random, but it might.
We are not gonna randomize it at test run time. Every time we did such a thing, it wasted 10x of people's time chasing the wind, because no one remembered that it was randomized.
What exactly was the problem? Yes, it can waste people's time, but I think it won't if we do it in a controlled manner: it must be clear which parameters were randomized in a given test run, and how to rerun the test with exactly the same param values (most conveniently by passing the seed that was used).
Note that there's a lot of randomness in longevity already. cassandra-stress loads are generated by random distributions. And this is the less convenient randomization case: you cannot really repeat what cassandra-stress did in a given run, it all depends on timing and the environment etc.
The kind of randomization I propose is much more manageable
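A minimal sketch of what such controlled randomization could look like. It assumes a hypothetical `pick_params` helper and a made-up parameter space; none of this is existing SCT code. All random choices come from one seeded `random.Random`, the seed is logged together with the chosen values, and passing the same seed back reproduces them.

```python
import logging
import random
import time

LOGGER = logging.getLogger("randomized_params")

# Hypothetical parameter space; names and values are illustrative only.
PARAM_CHOICES = {
    "node_ops_method": ["rbno", "streaming"],
}


def pick_params(seed=None):
    """Pick one value per parameter from a seeded RNG and log the choice."""
    if seed is None:
        # No seed given: derive one, so it can still be logged and reused.
        seed = time.time_ns() % (2 ** 32)
    rng = random.Random(seed)
    # Iterate in sorted order so the same seed always yields the same picks.
    params = {name: rng.choice(PARAM_CHOICES[name]) for name in sorted(PARAM_CHOICES)}
    LOGGER.info("randomized params (seed=%d): %s", seed, params)
    return seed, params
```

A first run would call `seed, params = pick_params()`, and a rerun with identical values would call `pick_params(seed)` with the seed taken from the log.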
I.e. someone would need to manage it. Are you volunteering? :)
One idea would be enhancing SCT to include the randomized params along with the disruption name in the Nemesis tab in Argus (and a log message in sct.log, close to the disruption start if possible). That would make it much clearer how/what we tested.
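A rough sketch of the log-message half of that idea, assuming hypothetical helper names (`format_disruption_record`, `log_disruption_start`) that are not part of SCT. It builds one line that carries both the disruption name and the randomized values, and emits it right before the disruption starts.

```python
import json
import logging

LOGGER = logging.getLogger("nemesis")


def format_disruption_record(disruption_name, randomized_params):
    """Build one line naming the disruption and the randomized values used."""
    # sort_keys keeps the output stable, which makes log lines easy to grep and diff.
    params_json = json.dumps(randomized_params, sort_keys=True)
    return f"disruption={disruption_name} randomized_params={params_json}"


def log_disruption_start(disruption_name, randomized_params):
    """Emit the record right before the disruption starts and return it."""
    record = format_disruption_record(disruption_name, randomized_params)
    LOGGER.info(record)
    return record
```

The returned string could also be what gets attached next to the disruption name in Argus; that part is left out here because it depends on Argus's own API.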
It might be relevant if it's a nemesis-level call, but we are talking about Scylla configuration that controls which operations use RBNO and which do not, so it's at the test-case level.
I think that is a good idea regardless. I think SCT needs more transparency in general, but I do not think it solves the randomization problem completely. By adding another layer of randomization we lower the chances that the configuration we want actually gets run, add another level of required review (did we run it with streaming or RBNO, do we need to run again to test the other one as well?) and make it even less transparent than it is today.
I also think randomization is not scalable. Truth be told, the currently proposed solution (i.e. switching some tests to streaming) is also not scalable, and after this "hotfix" we need to discuss how to deal with these issues in the future, but at least it does not make SCT even less transparent.
We apparently do not test node operations with streaming - only with RBNO, which is the default. That's fine for the majority of the tests, but we need some sanity around streaming. Please make sure we have simple add/decommission/replace nemesis tests with streaming (RBNO disabled).
CC @kbr-scylla
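For illustration only, a small sketch of how a test case could be switched to streaming by overriding the Scylla configuration. It assumes that the scylla.yaml option is `enable_repair_based_node_ops` and that the test config carries scylla.yaml overrides under an `append_scylla_yaml` key; both names should be checked against the Scylla and SCT versions in use. `with_streaming` is a hypothetical helper.

```python
# Assumption: scylla.yaml option name; check it against the Scylla version under test.
STREAMING_OVERRIDES = {"enable_repair_based_node_ops": False}


def with_streaming(test_config):
    """Return a copy of a test config dict with RBNO disabled in its scylla.yaml overrides."""
    updated = dict(test_config)
    # 'append_scylla_yaml' is assumed to be the key holding scylla.yaml overrides.
    scylla_yaml = dict(updated.get("append_scylla_yaml", {}))
    scylla_yaml.update(STREAMING_OVERRIDES)
    updated["append_scylla_yaml"] = scylla_yaml
    return updated
```

For example, `with_streaming({"nemesis_class_name": "DecommissionMonkey"})` would return the same config plus the override, leaving the input dict untouched.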