nyh closed this 1 year ago
@nyh seems like there are some places in dtest using `watch_log_for_alive`, we probably should replace them after this one gets merged
Do we run dtests also on Cassandra? If we do, my new function won't work there :-(
We could run with Cassandra, but we don't do that regularly. We could overload `watch_log_for_alive` to always go to `watch_rest_for_alive` if it's Scylla, and then tests would still work with no change needed.
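A minimal sketch of that overload, assuming ccm's class layout and `watch_log_for_alive`'s usual signature (`watch_rest_for_alive` being the new method this PR adds):

```python
from ccmlib.node import Node  # ccm's base node class

class ScyllaNode(Node):
    def watch_log_for_alive(self, nodes, from_mark=None, timeout=720, filename='system.log'):
        # On Scylla, skip log scraping entirely: ask the node itself,
        # through its REST API, whether it sees the other nodes as alive.
        self.watch_rest_for_alive(nodes, timeout=timeout)
```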
Pushed a new version with that "_" thing. Didn't change the sleep from 0.1 seconds - I can change it to anything, but I don't know what to pick...
Pushed a new version: the REST API now also waits for nodes to know other nodes' tokens. With this patch, when dtest is modified to disable "wait for gossip to settle", we're down from the 170 failures when I started to "just" 38.
Ping.
@bhalevy, as someone who refactored it a few times, care to also give it a look?
@nyh let's do a run of this change without the wait-for-gossip change, to see that we aren't breaking any tests with it. And then we could merge it.
@eliransin - do you need to review this to get this moving forward?
Running this with a full suite of dtests, to verify it's not causing regressions (or not too many of them): https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/314/
It's a change that affects all of the tests; I would want it tested first (@nyh has tested it so far only with `'skip_wait_for_gossip_to_settle': 0`).
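For context, a sketch of how a dtest would opt in; the option name comes from this thread, and wiring it through ccm's `set_configuration_options` is an assumption about the test setup:

```python
def start_cluster_without_gossip_settle(cluster, n=3):
    # Setting the option to 0 makes Scylla skip the boot-time
    # "wait for gossip to settle" loop entirely.
    cluster.set_configuration_options(values={'skip_wait_for_gossip_to_settle': 0})
    cluster.populate(n).start(wait_other_notice=True)
```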
Pushed a new version - rebased, and changed `{}` to `set()` as @bhalevy noticed.
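For the record, the pitfall @bhalevy caught is that a bare `{}` literal in Python is an empty dict, not an empty set:

```python
assert type({}) is dict          # {} never creates a set
assert type(set()) is set        # the only spelling of an empty set
assert type({'a', 'b'}) is set   # non-empty set literals are fine
```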
I think you did this above, but I don't know how to read the results (if I understood correctly, we can't expect zero failures?).
@nyh only a handful of failures.
Some are timing out the 2-minute timeout of the newly introduced wait function. I'm now running only the gating suite with it: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/315/
this test expects the boot to fail; the test should enable waiting for others to notice
fails a bit like the failures in debug runs...
missing data in node3; it doesn't fail like that on master runs
@nyh gating passed o.k.
So we are left with 3 tests that seem to regress (and we didn't test enterprise features, even though most of them are single-node).
@bhalevy @eliransin, it's your call: are we o.k. with fixing those tests later, or should this be blocked until someone is able to attend to those?
I don't see how it would be ok to introduce regressions
It's not ok, but to be honest, I don't have the peace of mind to fix random buggy dtests while I keep getting requests to do other things at top priority, so I'm not going to fix this any time soon.
I started a new run with this: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/
The 3 test failures @fruch mentioned are directly related but look quite easy to fix, so we can assign someone to fix them.
However, this PR by itself won't improve runtime and may possibly even degrade it.
Thanks.
The idea is to have a dtest PR which runs Scylla without huge sleeps. The branch is https://github.com/nyh/scylla-dtest/commits/quick-boot-2. There is a one-line patch https://github.com/nyh/scylla-dtest/commit/0b23742b4622c9b6e18f4613722ce433640cf41c plus one fix for one test's bug - https://github.com/nyh/scylla-dtest/commit/0b23742b4622c9b6e18f4613722ce433640cf41c. There will need to be more patches because a few dozen tests still fail (not as many as before the ccm patches, but still a non-zero list).
The next step can be to make the dtest patch an optional parameter (so developers can use it), but @fruch and @eliransin suggested that it can be the default, and we just add a tag for tests that can't work in this "sleepless" mode so that only they will be run in the slow mode.
I'm handling the tests that need to be fixed; it looks like the problem is identical in all 3 tests: we need to not `wait_other_notice`, and that solves the issue.
So this last run had 8 failures; 6 of them would be solved by https://github.com/scylladb/scylla-dtest/pull/3313, and the remaining 2 are identical to master failures (which means we have only 2 failures on master ????)
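The shape of that fix, as a hypothetical test fragment (the point is only to stop passing `wait_other_notice` where the boot is expected to fail):

```python
# Nobody will ever "notice" a node that is expected to fail its boot,
# so don't wait for that.
node.start(wait_other_notice=False)  # was: wait_other_notice=True
```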
I was skeptical, so I went to check: https://jenkins.scylladb.com/job/scylla-master/job/dtest-daily-release/306/testReport/
For the last 3 runs, we have only 2 failing tests.
One more test that needs to be addressed: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/317/testReport/throughput_limits_test/TestStreamingLimitThroughput/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split005___test_limit_rbno_bootstrap_throughput/
@nyh This is incorrect. It may fix some problems but introduces others. Consider the following:

- 2 nodes A, B
- kill node A
- start node A with `wait_other_notice=True`

The `start` call will often return quickly, but not because node B noticed that A restarted, but because node B didn't even notice that A was killed yet! That's why the old check used the log markers: the log markers ensure that we don't depend on obsolete state for `wait_other_notice`. That is, other nodes really have to notice this new node restart!

We're actually now observing this scenario trying to unflake one dtest with @temichus (scylladb/scylla-dtest#3991)...
@kbr-scylla I've created a PR where the issue can be reproduced: https://github.com/scylladb/scylla-dtest/pull/4042/files - with approx. 500 runs, we should get at least 1 error.
You can stop A with wait_other_notice=True to ensure node B notices that A went down, before restarting it.
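In ccm terms the suggested sequencing looks roughly like this (a sketch; `stop()` accepting `wait_other_notice` matches ccm's API, the node names are illustrative):

```python
node_a.stop(wait_other_notice=True)   # B first marks A as DOWN...
node_a.start(wait_other_notice=True)  # ...so the "A is UP" that B reports is fresh
```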
Which is only a workaround for the problem; the implementation of `wait_other_notice` is still incorrect.
Then it needs to be fixed. Issue nr?
After starting a multi-node cluster, it's important to wait until all nodes are aware that all other nodes are available, otherwise if the user sends a CL=ALL request to one node, it might not be aware that the other replicas are usable, and fail the request.
For this reason a cluster's `start()` has a `wait_other_notice=True` option, and dtests correctly use it. However, the implementation doesn't wait for the right thing... To check if node A thinks that B is available, it checks that A's log contains the message "InetAddress B is now UP". But this message is unreliable - when it is printed, A still doesn't think that B is fully available - it can still think that B is in a "joining" state for a while longer. If `wait_other_notice` returns at this point, and the user sends a CL=ALL request to node A, it will fail.
The solution I propose in this patch uses the REST API, instead of the log, to wait until node A thinks node B is both live and finished joining.
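A minimal sketch of such a wait, polling with the 0.1-second sleep mentioned earlier. The endpoint paths and JSON shapes below are assumptions about Scylla's REST API on port 10000, not a verbatim copy of the patch:

```python
import time
import requests

def watch_rest_for_alive(node, others, timeout=120):
    """Wait until `node` reports every address in `others` as live, not
    joining, and owning tokens, all through `node`'s own REST API."""
    tofind = {other.address() for other in others}
    base = f"http://{node.address()}:10000"
    deadline = time.time() + timeout
    while time.time() < deadline:
        live = set(requests.get(f"{base}/gossiper/endpoint/live").json())
        joining = set(requests.get(f"{base}/storage_service/nodes/joining").json())
        # Assumed shape: tokens_endpoint returns a list of
        # {"key": <token>, "value": <endpoint>} pairs.
        tokens = requests.get(f"{base}/storage_service/tokens_endpoint").json()
        have_tokens = {entry["value"] for entry in tokens}
        if tofind <= (live - joining) & have_tokens:
            return
        time.sleep(0.1)
    raise TimeoutError(f"{node.address()} doesn't see {sorted(tofind)} as usable")
```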
This patch is needed if Scylla is modified to boot up faster: we start seeing dtests which use CL=ALL at the beginning of a test failing, because the node we contact doesn't know that the other nodes are usable.
Fixes #461