scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0
57 stars 94 forks source link

Hydra fails to connect with SSH - no attribute 'update_default_time_range' #9023

Closed CodeLieutenant closed 1 week ago

CodeLieutenant commented 1 week ago

Explanation:

When running SCT on a bit slower connection, test cannot establish SSH connection, and then fails on AttributeError: 'NoneType' object has no attribute 'update_default_time_range'. This issue can only be reproduced on networks with a latency 200ms+

Running command

hydra run-test longevity_test.LongevityTest.test_custom_time --backend aws --config test-cases/gemini/gemini-basic-3h.yaml
< t:2024-10-21 13:24:11,580 f:ec2_client.py   l:176  c:sdcm.ec2_client      p:INFO  > Waiting for spot instances...
< t:2024-10-21 13:24:16,826 f:ec2_client.py   l:295  c:sdcm.ec2_client      p:INFO  > Spot instances: ['i-0676ed1ff48135e0c', 'i-0574312c7455940a3', 'i-0f0691afaeaaecd01']
< t:2024-10-21 13:33:05,796 f:wait.py         l:79   c:sdcm.wait            p:ERROR > Wait for: gemini-basic-3h-dmalusev-db-node-788ffc47-1: Waiting for SSH to be up: timeout - 500 seconds - expired
< t:2024-10-21 13:33:05,797 f:wait.py         l:83   c:sdcm.wait            p:ERROR > last error: RetryError(<Future at 0x703ccbb6edd0 state=finished returned bool>)
< t:2024-10-21 13:33:05,802 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR > 2024-10-21 13:33:05.797: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=87689964-f79f-4ea3-bc82-9d08437b8183, source=LongevityTest.SetUp()
< t:2024-10-21 13:33:05,802 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR > exception=Wait for: gemini-basic-3h-dmalusev-db-node-788ffc47-1: Waiting for SSH to be up: timeout - 500 seconds - expired
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR > Exception in setUp. Will call tearDown < t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR > Exception in setUp. Will call tearDown
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR > Traceback (most recent call last):
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >   File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/wait.py", line 70, in wait_for
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >     res = retry(func, **kwargs)
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >   File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 404, in __call__
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >     do = self.iter(retry_state=retry_state)
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >   File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 360, in iter
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >     raise retry_exc.reraise()
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >   File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 194, in reraise
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >     raise self
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR > tenacity.RetryError: RetryError[<Future at 0x703ccbb6edd0 state=finished returned bool>]
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR > The above exception was the direct cause of the following exception:
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR > Traceback (most recent call last):
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >   File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/tester.py", line 187, in wrapper
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >     return method(*args, **kwargs)
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >   File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/utils/decorators.py", line 121, in inner
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >     res = func(*args, **kwargs)
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >   File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/tester.py", line 966, in setUp
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >     self.init_resources()
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >   File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/tester.py", line 1937, in init_resources
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >     self.get_cluster_aws(loader_info=loader_info, db_info=db_info,
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >   File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/tester.py", line 1521, in get_cluster_aws
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >     self.db_cluster = create_cluster('scylla')
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >   File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/tester.py", line 1479, in create_cluster
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >     return ScyllaAWSCluster(
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >   File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/cluster_aws.py", line 899, in __init__
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >     super().__init__(
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >   File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/cluster.py", line 4008, in __init__
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >     super().__init__(*args, **kwargs)
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >   File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/cluster_aws.py", line 111, in __init__
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >     super().__init__(cluster_uuid=cluster_uuid,
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >   File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/cluster.py", line 3258, in __init__
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >     self.add_nodes(nodes_per_az[az_index], dc_idx=dc_idx, rack=rack,
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >   File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/cluster_aws.py", line 937, in add_nodes
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >     added_nodes = super().add_nodes(
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >   File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/cluster_aws.py", line 410, in add_nodes
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >     node = self._create_node(instance, self._ec2_ami_username, self.node_prefix,
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >   File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/cluster_aws.py", line 429, in _create_node
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >     node.init()
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >   File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/cluster_aws.py", line 539, in init
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >     super().init()
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >   File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/cluster.py", line 373, in init
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >     self.wait_ssh_up(verbose=True)
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >   File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/cluster.py", line 1230, in wait_ssh_up
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >     wait.wait_for(func=self.remoter.is_up, step=10, text=text, timeout=timeout, throw_exc=True)
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >   File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/wait.py", line 86, in wait_for
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR >     raise raising_exc from ex
< t:2024-10-21 13:33:05,798 f:tester.py       l:194  c:sdcm.tester          p:ERROR > sdcm.exceptions.WaitForTimeoutError: Wait for: gemini-basic-3h-dmalusev-db-node-788ffc47-1: Waiting for SSH to be up: timeout - 500 seconds - expired
< t:2024-10-21 13:33:05,806 f:tester.py       l:2959 c:LongevityTest        p:INFO  > Post test validators are starting...
< t:2024-10-21 13:33:05,806 f:base.py         l:22   c:sdcm.teardown_validators.base p:INFO  > scrub validator is disabled
< t:2024-10-21 13:33:05,806 f:base.py         l:22   c:sdcm.teardown_validators.base p:INFO  > test_error_events validator is disabled
< t:2024-10-21 13:33:05,806 f:tester.py       l:2962 c:LongevityTest        p:INFO  > TearDown is starting...
< t:2024-10-21 13:33:05,807 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-10-21 13:33:05.805: (InfoEvent Severity.NORMAL) period_type=not-set event_id=485e907e-1a20-4d09-a499-72e26dd11752: message=TEST_END
< t:2024-10-21 13:33:06,810 f:capacity_reservation.py l:227  c:sdcm.provision.aws.capacity_reservation p:INFO  > No capacity reservations to cancel.
E
======================================================================
ERROR: test_custom_time (longevity_test.LongevityTest)
Run cassandra-stress with params defined in data_dir/scylla.yaml
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/wait.py", line 70, in wait_for
    res = retry(func, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 404, in __call__
    do = self.iter(retry_state=retry_state)
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 360, in iter
    raise retry_exc.reraise()
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 194, in reraise
    raise self
tenacity.RetryError: RetryError[<Future at 0x703ccbb6edd0 state=finished returned bool>]

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/tester.py", line 187, in wrapper
    return method(*args, **kwargs)
  File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/utils/decorators.py", line 121, in inner
    res = func(*args, **kwargs)
  File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/tester.py", line 966, in setUp
    self.init_resources()
  File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/tester.py", line 1937, in init_resources
    self.get_cluster_aws(loader_info=loader_info, db_info=db_info,
  File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/tester.py", line 1521, in get_cluster_aws
    self.db_cluster = create_cluster('scylla')
  File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/tester.py", line 1479, in create_cluster
    return ScyllaAWSCluster(
  File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/cluster_aws.py", line 899, in __init__
    super().__init__(
  File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/cluster.py", line 4008, in __init__
    super().__init__(*args, **kwargs)
  File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/cluster_aws.py", line 111, in __init__
    super().__init__(cluster_uuid=cluster_uuid,
  File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/cluster.py", line 3258, in __init__
    self.add_nodes(nodes_per_az[az_index], dc_idx=dc_idx, rack=rack,
  File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/cluster_aws.py", line 937, in add_nodes
    added_nodes = super().add_nodes(
  File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/cluster_aws.py", line 410, in add_nodes
    node = self._create_node(instance, self._ec2_ami_username, self.node_prefix,
  File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/cluster_aws.py", line 429, in _create_node
    node.init()
  File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/cluster_aws.py", line 539, in init
    super().init()
  File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/cluster.py", line 373, in init
    self.wait_ssh_up(verbose=True)
  File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/cluster.py", line 1230, in wait_ssh_up
    wait.wait_for(func=self.remoter.is_up, step=10, text=text, timeout=timeout, throw_exc=True)
  File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/wait.py", line 86, in wait_for
    raise raising_exc from ex
sdcm.exceptions.WaitForTimeoutError: Wait for: gemini-basic-3h-dmalusev-db-node-788ffc47-1: Waiting for SSH to be up: timeout - 500 seconds - expired

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/tester.py", line 195, in wrapper
    args[0].tearDown()
  File "/home/dmalusev/Code/ScyllaDB/scylla-cluster-tests/sdcm/tester.py", line 2978, in tearDown
    self.monitors.update_default_time_range(self.start_time, time.time())
AttributeError: 'NoneType' object has no attribute 'update_default_time_range'

----------------------------------------------------------------------
soyacz commented 1 week ago

@CodeLieutenant I see you ran it locally.=, are you sure you placed correct public ip configuration when running the test? By default public ip's are blocked and this shows up usually like this.

CodeLieutenant commented 1 week ago

@CodeLieutenant I see you ran it locally.=, are you sure you placed correct public ip configuration when running the test? By default public ip's are blocked and this shows up usually like this.

I have not set public ip. Can you point me where to set it?

soyacz commented 1 week ago

@CodeLieutenant I see you ran it locally.=, are you sure you placed correct public ip configuration when running the test? By default public ip's are blocked and this shows up usually like this.

I have not set public ip. Can you point me where to set it?

Should be enough to add configurations/network_config/test_communication_public.yaml to list of config files