Closed (fruch closed 2 years ago)
@asias I guess that if we wait like this:
node3.watch_log_for("FatClient .* has been silent for .*ms, removing from gossip")
The problem is with scylla-operator: it is the user that would have to wait for those messages. It can react to different actions, but it needs to check gossip before running any commands.
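To make "check gossip before running any commands" concrete, here is a minimal sketch of such a check. It assumes the gossip state is available as text in the general shape of `nodetool gossipinfo` output (an endpoint line starting with `/`, followed by indented `KEY:value` lines). The sample text, the addresses other than 127.0.60.33, the host IDs, and the helper names `parse_gossipinfo`, `endpoints_not_normal` and `duplicate_host_ids` are all hypothetical and only serve the illustration; the real output format may differ between versions.

```python
# Hypothetical sample in the general shape of `nodetool gossipinfo` output.
# All values are made up for illustration; real output may differ.
SAMPLE = """\
/127.0.60.1
  generation:1660000000
  heartbeat:120
  STATUS:NORMAL,-1
  HOST_ID:aaaaaaaa-0000-0000-0000-000000000001
/127.0.60.2
  generation:1660000001
  heartbeat:0
  STATUS:shutdown,true
  HOST_ID:aaaaaaaa-0000-0000-0000-000000000002
/127.0.60.33
  generation:1660000002
  heartbeat:15
  STATUS:LEFT,-2,1660000100
  HOST_ID:aaaaaaaa-0000-0000-0000-000000000002
"""


def parse_gossipinfo(text):
    """Return {endpoint: {KEY: value}} parsed from gossipinfo-style text."""
    endpoints = {}
    current = None
    for line in text.splitlines():
        if line.startswith("/"):
            # A new endpoint section starts with "/<address>".
            current = line[1:].strip()
            endpoints[current] = {}
        elif current is not None and ":" in line:
            # Split on the first colon only; values may contain colons.
            key, _, value = line.strip().partition(":")
            endpoints[current][key] = value
    return endpoints


def endpoints_not_normal(endpoints):
    """Return the sorted endpoints whose STATUS does not start with NORMAL."""
    return sorted(
        ep for ep, state in endpoints.items()
        if not state.get("STATUS", "").startswith("NORMAL")
    )


def duplicate_host_ids(endpoints):
    """Return {host_id: [endpoints]} for host IDs seen on more than one endpoint."""
    by_id = {}
    for ep, state in endpoints.items():
        host_id = state.get("HOST_ID")
        if host_id:
            by_id.setdefault(host_id, []).append(ep)
    return {hid: sorted(eps) for hid, eps in by_id.items() if len(eps) > 1}


if __name__ == "__main__":
    eps = parse_gossipinfo(SAMPLE)
    print("not NORMAL:", endpoints_not_normal(eps))
    print("duplicate host IDs:", duplicate_host_ids(eps))
```

A caller such as an operator could refuse to proceed while either helper returns a non-empty result. Whether that is the right policy is a design decision this sketch does not settle.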
Commit 5cd97c964c3b2b4bb11cc2252ef576358447068c fails as shown below. It is odd that nodetool gossipinfo fails, but I don't think it is related to the bug.
update_cluster_layout_tests.py:2155:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../scylla-ccm/ccmlib/scylla_node.py:684: in nodetool
    return super().nodetool(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <ccmlib.scylla_node.ScyllaNode object at 0x7f0f67675f30>, cmd = 'gossipinfo', capture_output = False, wait = True, timeout = None
    def nodetool(self, cmd, capture_output=True, wait=True, timeout=None):
        """
        Setting wait=False makes it impossible to detect errors,
        if capture_output is also False. wait=False allows us to return
        while nodetool is still running.
        When wait=True, timeout may be set to a number, in seconds,
        to limit how long the function will wait for nodetool to complete.
        """
        if capture_output and not wait:
            raise common.ArgumentError("Cannot set capture_output while wait is False.")
        env = self.get_env()
        if self.is_scylla() and not self.is_docker():
            host = self.address()
        else:
            host = 'localhost'
        nodetool = self.get_tool('nodetool')
        if not isinstance(nodetool, list):
            nodetool = [nodetool]
        # see https://www.oracle.com/java/technologies/javase/8u331-relnotes.html#JDK-8278972
        nodetool.extend(['-h', host, '-p', str(self.jmx_port), '-Dcom.sun.jndi.rmiURLParsing=legacy'])
        nodetool.extend(cmd.split())
        if capture_output:
            p = subprocess.Popen(nodetool, universal_newlines=True, env=env, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            stdout, stderr = p.communicate(timeout=timeout)
        else:
            p = subprocess.Popen(nodetool, env=env, universal_newlines=True)
            stdout, stderr = None, None
        if wait:
            exit_status = p.wait(timeout=timeout)
            if exit_status != 0:
>               raise NodetoolError(" ".join(nodetool), exit_status, stdout, stderr)
E               ccmlib.node.NodetoolError: Nodetool command '/home/asias/src/cloudius-systems/scylla/resources/cassandra/bin/nodetool -h 127.0.60.33 -p 7199 -Dcom.sun.jndi.rmiURLParsing=legacy gossipinfo' failed; exit status: 1
../scylla-ccm/ccmlib/node.py:795: NodetoolError
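One reason the error above carries only an exit status: the call was made with capture_output = False, and in that branch the function sets stdout and stderr to None before raising. The following is a minimal sketch of the difference between the two branches, using only the standard `subprocess` module. It does not call nodetool: `FAILING_CMD` is a stand-in child process with a made-up error message, and `run` is a hypothetical helper that loosely mirrors the branching in the traceback.

```python
import subprocess
import sys

# Stand-in for a failing command: a child Python process that writes a
# made-up message to stderr and exits with status 1. This is NOT nodetool.
FAILING_CMD = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('simulated failure\\n'); sys.exit(1)",
]


def run(cmd, capture_output=True, timeout=None):
    """Run cmd and return (exit_status, stdout, stderr).

    With capture_output=True the streams are read through communicate();
    otherwise the child inherits the parent's streams and both values stay
    None, so the caller only learns the exit status.
    """
    if capture_output:
        p = subprocess.Popen(
            cmd,
            universal_newlines=True,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
        stdout, stderr = p.communicate(timeout=timeout)
    else:
        p = subprocess.Popen(cmd, universal_newlines=True)
        stdout, stderr = None, None
    exit_status = p.wait(timeout=timeout)
    return exit_status, stdout, stderr


if __name__ == "__main__":
    print(run(FAILING_CMD, capture_output=True))
    print(run(FAILING_CMD, capture_output=False))
```

Going by the signature shown in the traceback, rerunning the failing step with capture_output=True would presumably pass nodetool's stderr into the NodetoolError, which may help explain why gossipinfo exits with status 1.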
With PR https://github.com/scylladb/scylladb/pull/11361, there is no need to wait for the removal of the old node from gossip.
@slivne FYI, this is an issue related to IP changes and topology changes.
@asias what do you think about backporting this? Is it safe, or should we wait for 5.2 soak time?
I guess it's safe, since master and 5.2 have had this for a long time.
All live versions have this, removing label.
While building a reproducer for scylladb/scylla-operator#982 and #11302, we ran into this case:
Adding node4 fails with the following:
All 3 nodes show 4 nodes in gossip, and 2 of those entries have the exact same host_id, one in shutdown state and one in LEFT state: