scylladb / scylla-ccm

Cassandra Cluster Manager, modified for Scylla
Apache License 2.0
20 stars 62 forks source link

scylla_node: watch_rest_for_alive: wait for others to be considered normal token owners #523

Closed bhalevy closed 6 months ago

bhalevy commented 8 months ago

It is not enough for node to know about other nodes' tokens, as they might not be reflected in the token_metadata map.

Instead, check the /storage_service/host_id api that provides a list of nodes that are normal token owners and ready to be used by queries.

Refs https://github.com/scylladb/scylladb/issues/15146

bhalevy commented 8 months ago

@nyh, @fruch please review

fruch commented 8 months ago

Seems o.k.

But I'll want to take a ride in at least in gating

fruch commented 8 months ago

@bhalevy

giving it a go:

fruch commented 8 months ago

@bhalevy

giving it a go:

@bhalevy seems like we are getting None as host_id all over the place

bhalevy commented 8 months ago

@bhalevy seems like we are getting None as host_id all over the place

should be fixed now (at least it passes for me locally, for example update_cluster_layout_tests.py::TestUpdateClusterLayout::test_simple_removenode_3[api-True])

fruch commented 8 months ago

@bhalevy seems like we are getting None as host_id all over the place

should be fixed now (at least it passes for me locally, for example update_cluster_layout_tests.py::TestUpdateClusterLayout::test_simple_removenode_3[api-True])

running gating again:

bhalevy commented 8 months ago

so to force reload of the host_id after restart as the node may wake up with a different host_id (after being wiped)

bhalevy commented 8 months ago

https://jenkins.scylladb.com/job/scylla-staging/job/dtest-pytest-gating/142/

bhalevy commented 8 months ago

Depends on https://github.com/scylladb/scylla-dtest/pull/3716

bhalevy commented 8 months ago

In 54332a9:

fruch commented 8 months ago

giving one last run, this is some that gonna happen on every single test:

@bhalevy also keep in mind it might break non gating tests... we'll need to keep that in mind following this merge

bhalevy commented 8 months ago

@bhalevy also keep in mind it might break non gating tests... we'll need to keep that in mind following this merge

Right. That's why I added the wait_normal_token_owner flag.

fruch commented 8 months ago

@bhalevy also keep in mind it might break non gating tests... we'll need to keep that in mind following this merge

Right. That's why I added the wait_normal_token_owner flag.

And which tests are gonna use that ? Only one we will be unstable?

bhalevy commented 8 months ago

@bhalevy also keep in mind it might break non gating tests... we'll need to keep that in mind following this merge

Right. That's why I added the wait_normal_token_owner flag.

And which tests are gonna use that ? Only one we will be unstable?

Yes. This category of tests monitor starting nodes' log looking for particular events. In most cases we already call start with wait_other_notice=False, wait_for_binary_protocol=False, but in cases we didn't do that we'll have to figure out what's the right thing to do.

bhalevy commented 8 months ago

giving one last run, this is some that gonna happen on every single test:

There's only https://jenkins.scylladb.com/job/scylla-staging/job/dtest-pytest-gating/146/testReport/junit/cql_tests/TestTruncate/FullDtest___full_split000___test_cql_query_filtering_without_indexes/ which seems unrelated. Need to analyse it further

bhalevy commented 8 months ago

This is a test issue: https://github.com/scylladb/scylla-dtest/issues/3717

bhalevy commented 7 months ago

@nyh since you're the last to change watch_rest_for_alive, any comments before we merge this change?

bhalevy commented 7 months ago

@nyh please review (since you were the last one to change this area)

bhalevy commented 7 months ago

In 5fb352a:

bhalevy commented 7 months ago

@nyh please re-review. I addressed your review comments.

bhalevy commented 7 months ago

@nyh please re-review. I addressed your review comments.

@nyh ping, when do you think you can re-review?

bhalevy commented 6 months ago
bhalevy commented 6 months ago

All checks have passed, please merge

bhalevy commented 6 months ago

@fruch can you please backport to 5.4 and 2024.1?

mykaul commented 6 months ago

This kinda reminds a (very) long discussion with the operator team, where they did not know when can they safely do a rolling restart of Scylla pods - and the request was a simple API to let them know 'yes, it's real, this node is really up and serving traffic and is part of the cluster and you can safely move to the next node and restart it'. See https://github.com/scylladb/scylladb/issues/8275