scylla_node: watch_rest_for_alive: wait for others to be considered normal token owners

bhalevy commented 8 months ago

It is not enough for node to know about other nodes' tokens, as they might not be reflected in the token_metadata map.

Instead, check the /storage_service/host_id api that provides a list of nodes that are normal token owners and ready to be used by queries.

Refs https://github.com/scylladb/scylladb/issues/15146

bhalevy commented 8 months ago

@nyh, @fruch please review

fruch commented 8 months ago

Seems o.k.

But I'll want to take a ride in at least in gating

fruch commented 8 months ago

@bhalevy

giving it a go:

[ ] https://jenkins.scylladb.com/job/scylla-staging/job/dtest-pytest-gating/134/ (OSS)
[ ] https://jenkins.scylladb.com/job/scylla-staging/job/dtest-pytest-gating/135/ (OSS + raft)
[ ] https://jenkins.scylladb.com/job/scylla-staging/job/dtest-pytest-gating/136/ (Enterprise)
[ ] https://jenkins.scylladb.com/job/scylla-staging/job/dtest-pytest-gating/137/ (Enterprise + raft)

fruch commented 8 months ago

@bhalevy

giving it a go:

[ ] https://jenkins.scylladb.com/job/scylla-staging/job/dtest-pytest-gating/134/ (OSS)

[ ] https://jenkins.scylladb.com/job/scylla-staging/job/dtest-pytest-gating/135/ (OSS + raft)

[ ] https://jenkins.scylladb.com/job/scylla-staging/job/dtest-pytest-gating/136/ (Enterprise)

[ ] https://jenkins.scylladb.com/job/scylla-staging/job/dtest-pytest-gating/137/ (Enterprise + raft)

@bhalevy seems like we are getting None as host_id all over the place

bhalevy commented 8 months ago

@bhalevy seems like we are getting None as host_id all over the place

should be fixed now (at least it passes for me locally, for example update_cluster_layout_tests.py::TestUpdateClusterLayout::test_simple_removenode_3[api-True])

fruch commented 8 months ago

@bhalevy seems like we are getting None as host_id all over the place

should be fixed now (at least it passes for me locally, for example update_cluster_layout_tests.py::TestUpdateClusterLayout::test_simple_removenode_3[api-True])

running gating again:

[ ] https://jenkins.scylladb.com/job/scylla-staging/job/dtest-pytest-gating/138/ (OSS)
[ ] https://jenkins.scylladb.com/job/scylla-staging/job/dtest-pytest-gating/139/ (OSS + raft)
[ ] https://jenkins.scylladb.com/job/scylla-staging/job/dtest-pytest-gating/140/ (Enterprise)
[ ] https://jenkins.scylladb.com/job/scylla-staging/job/dtest-pytest-gating/141/ (Enterprise + raft)

bhalevy commented 8 months ago

scylla_node: start_scylla: reset node_hostid

so to force reload of the host_id after restart as the node may wake up with a different host_id (after being wiped)

bhalevy commented 8 months ago

https://jenkins.scylladb.com/job/scylla-staging/job/dtest-pytest-gating/142/

bhalevy commented 8 months ago

Depends on https://github.com/scylladb/scylla-dtest/pull/3716

bhalevy commented 8 months ago

In 54332a9:

scylla_node: hostid: handle errors
scylla_node: watch_rest_for_alive: check response.status_code
scylla_node: watch_rest_for_alive: wait for others to be considered normal token owners:
- extends current function by waiting on tokens first, then on nrmal token ownership
scylla_node: watch_rest_for_alive: add wait_normal_token_owner param
- for backward compatibility
- applied also from ScyllaCluster::start_nodes

fruch commented 8 months ago

giving one last run, this is some that gonna happen on every single test:

[ ] https://jenkins.scylladb.com/job/scylla-staging/job/dtest-pytest-gating/143/ (OSS)
[ ] https://jenkins.scylladb.com/job/scylla-staging/job/dtest-pytest-gating/144/ (OSS + raft)
[ ] https://jenkins.scylladb.com/job/scylla-staging/job/dtest-pytest-gating/145/ (Enterprise)
[ ] https://jenkins.scylladb.com/job/scylla-staging/job/dtest-pytest-gating/146/ (Enterprise + raft)

@bhalevy also keep in mind it might break non gating tests... we'll need to keep that in mind following this merge

bhalevy commented 8 months ago

@bhalevy also keep in mind it might break non gating tests... we'll need to keep that in mind following this merge

Right. That's why I added the wait_normal_token_owner flag.

fruch commented 8 months ago

@bhalevy also keep in mind it might break non gating tests... we'll need to keep that in mind following this merge

Right. That's why I added the wait_normal_token_owner flag.

And which tests are gonna use that ? Only one we will be unstable?

bhalevy commented 8 months ago

@bhalevy also keep in mind it might break non gating tests... we'll need to keep that in mind following this merge

Right. That's why I added the wait_normal_token_owner flag.

And which tests are gonna use that ? Only one we will be unstable?

Yes. This category of tests monitor starting nodes' log looking for particular events. In most cases we already call start with wait_other_notice=False, wait_for_binary_protocol=False, but in cases we didn't do that we'll have to figure out what's the right thing to do.

bhalevy commented 8 months ago

giving one last run, this is some that gonna happen on every single test:

[ ] https://jenkins.scylladb.com/job/scylla-staging/job/dtest-pytest-gating/143/ (OSS)

[ ] https://jenkins.scylladb.com/job/scylla-staging/job/dtest-pytest-gating/144/ (OSS + raft)

[ ] https://jenkins.scylladb.com/job/scylla-staging/job/dtest-pytest-gating/145/ (Enterprise)

[ ] https://jenkins.scylladb.com/job/scylla-staging/job/dtest-pytest-gating/146/ (Enterprise + raft)

There's only https://jenkins.scylladb.com/job/scylla-staging/job/dtest-pytest-gating/146/testReport/junit/cql_tests/TestTruncate/FullDtest___full_split000___test_cql_query_filtering_without_indexes/ which seems unrelated. Need to analyse it further

bhalevy commented 8 months ago

This is a test issue: https://github.com/scylladb/scylla-dtest/issues/3717

bhalevy commented 7 months ago

@nyh since you're the last to change watch_rest_for_alive, any comments before we merge this change?

bhalevy commented 7 months ago

@nyh please review (since you were the last one to change this area)

bhalevy commented 7 months ago

In 5fb352a:

scylla_node: override hostid() using the REST api
- used timeout arg for requests.get
scylla_node: start_scylla: reset node_hostid
- added code comment explaining why we do that
scylla_node: hostid: handle errors
- dropped pass statement
scylla_node: watch_rest_for_alive: wait for others to be considered normal token owners
- trim tofind as nodes are found
- but maintain cummulative found set and found_host_id_map dict for debugging purposes
scylla_node: watch_rest_for_alive: add wait_normal_token_owner param
- Document wait_normal_token_owner param also in watch_rest_for_alive

bhalevy commented 7 months ago

@nyh please re-review. I addressed your review comments.

bhalevy commented 7 months ago

@nyh please re-review. I addressed your review comments.

@nyh ping, when do you think you can re-review?

bhalevy commented 6 months ago

fixed spelling mistake and rebased @fruch please merge when CI passed

bhalevy commented 6 months ago

All checks have passed, please merge

bhalevy commented 6 months ago

@fruch can you please backport to 5.4 and 2024.1?

mykaul commented 6 months ago

This kinda reminds a (very) long discussion with the operator team, where they did not know when can they safely do a rolling restart of Scylla pods - and the request was a simple API to let them know 'yes, it's real, this node is really up and serving traffic and is part of the cluster and you can safely move to the next node and restart it'. See https://github.com/scylladb/scylladb/issues/8275

scylladb / scylla-ccm

scylla_node: watch_rest_for_alive: wait for others to be considered normal token owners #523