scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0
52 stars 83 forks source link

Azure timeout caused nemesis, logs collection and resources cleanup failure #7130

Open soyacz opened 5 months ago

soyacz commented 5 months ago

Issue description

It started with failure of restarting VM, where we run reboot command on Azure:

2024-01-21 03:24:25.009: (DisruptionEvent Severity.ERROR) period_type=end event_id=c90608a7-d509-4426-9aae-b1ef967f205b duration=3m2s: nemesis_name=DecommissionStreamingErr target_node=Node longevity-tls-1tb-7d-master-db-node-bd04be7c-eastus-3 [172.174.255.172 | 10.0.0.7] (seed: True) errors=functools.partial(<bound method Nemesis._call_disrupt_func_after_expression_logged of <sdcm.nemesis.SisyphusMonkey object at 0x7f4117bb44c0>>, log_follower=<sdcm.utils.file.ReiterableGenerator object at 0x7f40bab4d630>, disrupt_func=<bound method Nemesis.reboot_node of <sdcm.nemesis.SisyphusMonkey object at 0x7f4117bb44c0>>, disrupt_func_kwargs={'target_node': <sdcm.cluster_azure.AzureNode object at 0x7f411e5c1a80>, 'hard': True, 'verify_ssh': True}, delay=0):
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 472, in run
result = future.result(time_out)
File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 451, in result
return self.__get_result()
File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 448, in inner
return_val = fun(*args, **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 509, in <lambda>
return self.run(lambda x: x(), ignore_exceptions=ignore_exceptions)
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3737, in _call_disrupt_func_after_expression_logged
disrupt_func(**disrupt_func_kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/sct_events/group_common_events.py", line 318, in inner_func
return func(*args, **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3688, in reboot_node
target_node.reboot(hard=hard, verify_ssh=verify_ssh)
File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 990, in reboot
self.hard_reboot()
File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster_azure.py", line 127, in hard_reboot
self._instance.reboot(wait=True, hard=True)
File "/home/ubuntu/scylla-cluster-tests/sdcm/provision/provisioner.py", line 94, in reboot
self._provisioner.reboot_instance(self.name, wait, hard)
File "/home/ubuntu/scylla-cluster-tests/sdcm/provision/azure/provisioner.py", line 157, in reboot_instance
self._vm_provider.reboot(name, wait, hard)
File "/home/ubuntu/scylla-cluster-tests/sdcm/provision/azure/virtual_machine_provider.py", line 150, in reboot
self.run_command(name, f"reboot {flags}")
File "/home/ubuntu/scylla-cluster-tests/sdcm/provision/azure/virtual_machine_provider.py", line 232, in run_command
result = self._azure_service.compute.virtual_machines.begin_run_command(
File "/usr/local/lib/python3.10/site-packages/azure/mgmt/compute/v2021_07_01/operations/_virtual_machines_operations.py", line 2492, in begin_run_command
raw_result = self._run_command_initial(
File "/usr/local/lib/python3.10/site-packages/azure/mgmt/compute/v2021_07_01/operations/_virtual_machines_operations.py", line 2441, in _run_command_initial
pipeline_response = self._client._pipeline.run(request, stream=False, **kwargs)
File "/usr/local/lib/python3.10/site-packages/azure/core/pipeline/_base.py", line 230, in run
return first_node.send(pipeline_request)
File "/usr/local/lib/python3.10/site-packages/azure/core/pipeline/_base.py", line 86, in send
response = self.next.send(request)
File "/usr/local/lib/python3.10/site-packages/azure/core/pipeline/_base.py", line 86, in send
response = self.next.send(request)
File "/usr/local/lib/python3.10/site-packages/azure/core/pipeline/_base.py", line 86, in send
response = self.next.send(request)
[Previous line repeated 2 more times]
File "/usr/local/lib/python3.10/site-packages/azure/mgmt/core/policies/_base.py", line 46, in send
response = self.next.send(request)
File "/usr/local/lib/python3.10/site-packages/azure/core/pipeline/policies/_redirect.py", line 197, in send
response = self.next.send(request)
File "/usr/local/lib/python3.10/site-packages/azure/core/pipeline/policies/_retry.py", line 553, in send
raise err
File "/usr/local/lib/python3.10/site-packages/azure/core/pipeline/policies/_retry.py", line 531, in send
response = self.next.send(request)
File "/usr/local/lib/python3.10/site-packages/azure/core/pipeline/policies/_authentication.py", line 126, in send
response = self.next.send(request)
File "/usr/local/lib/python3.10/site-packages/azure/core/pipeline/_base.py", line 86, in send
response = self.next.send(request)
File "/usr/local/lib/python3.10/site-packages/azure/core/pipeline/_base.py", line 86, in send
response = self.next.send(request)
File "/usr/local/lib/python3.10/site-packages/azure/core/pipeline/_base.py", line 86, in send
response = self.next.send(request)
[Previous line repeated 2 more times]
File "/usr/local/lib/python3.10/site-packages/azure/core/pipeline/_base.py", line 119, in send
self._sender.send(request.http_request, **request.context.options),
File "/usr/local/lib/python3.10/site-packages/azure/core/pipeline/transport/_requests_basic.py", line 386, in send
raise error
azure.core.exceptions.ServiceRequestError: (<urllib3.connection.HTTPSConnection object at 0x7f411d157820>, 'Connection to management.azure.com timed out. (connect timeout=300)')
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5056, in wrapper
result = method(*args[1:], **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3881, in disrupt_decommission_streaming_err
self.start_and_interrupt_decommission_streaming()
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3802, in start_and_interrupt_decommission_streaming
ParallelObject(objects=[trigger, watcher], timeout=terminate_pattern.timeout).call_objects()
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 509, in call_objects
return self.run(lambda x: x(), ignore_exceptions=ignore_exceptions)
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 488, in run
raise ParallelObjectException(results=results)
sdcm.utils.common.ParallelObjectException: functools.partial(<bound method Nemesis._call_disrupt_func_after_expression_logged of <sdcm.nemesis.SisyphusMonkey object at 0x7f4117bb44c0>>, log_follower=<sdcm.utils.file.ReiterableGenerator object at 0x7f40bab4d630>, disrupt_func=<bound method Nemesis.reboot_node of <sdcm.nemesis.SisyphusMonkey object at 0x7f4117bb44c0>>, disrupt_func_kwargs={'target_node': <sdcm.cluster_azure.AzureNode object at 0x7f411e5c1a80>, 'hard': True, 'verify_ssh': True}, delay=0):
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 472, in run
result = future.result(time_out)
File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 451, in result
return self.__get_result()
File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 448, in inner
return_val = fun(*args, **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 509, in <lambda>
return self.run(lambda x: x(), ignore_exceptions=ignore_exceptions)
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3737, in _call_disrupt_func_after_expression_logged
disrupt_func(**disrupt_func_kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/sct_events/group_common_events.py", line 318, in inner_func
return func(*args, **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3688, in reboot_node
target_node.reboot(hard=hard, verify_ssh=verify_ssh)
File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 990, in reboot
self.hard_reboot()
File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster_azure.py", line 127, in hard_reboot
self._instance.reboot(wait=True, hard=True)
File "/home/ubuntu/scylla-cluster-tests/sdcm/provision/provisioner.py", line 94, in reboot
self._provisioner.reboot_instance(self.name, wait, hard)
File "/home/ubuntu/scylla-cluster-tests/sdcm/provision/azure/provisioner.py", line 157, in reboot_instance
self._vm_provider.reboot(name, wait, hard)
File "/home/ubuntu/scylla-cluster-tests/sdcm/provision/azure/virtual_machine_provider.py", line 150, in reboot
self.run_command(name, f"reboot {flags}")
File "/home/ubuntu/scylla-cluster-tests/sdcm/provision/azure/virtual_machine_provider.py", line 232, in run_command
result = self._azure_service.compute.virtual_machines.begin_run_command(
File "/usr/local/lib/python3.10/site-packages/azure/mgmt/compute/v2021_07_01/operations/_virtual_machines_operations.py", line 2492, in begin_run_command
raw_result = self._run_command_initial(
File "/usr/local/lib/python3.10/site-packages/azure/mgmt/compute/v2021_07_01/operations/_virtual_machines_operations.py", line 2441, in _run_command_initial
pipeline_response = self._client._pipeline.run(request, stream=False, **kwargs)
File "/usr/local/lib/python3.10/site-packages/azure/core/pipeline/_base.py", line 230, in run
return first_node.send(pipeline_request)
File "/usr/local/lib/python3.10/site-packages/azure/core/pipeline/_base.py", line 86, in send
response = self.next.send(request)
File "/usr/local/lib/python3.10/site-packages/azure/core/pipeline/_base.py", line 86, in send
response = self.next.send(request)
File "/usr/local/lib/python3.10/site-packages/azure/core/pipeline/_base.py", line 86, in send
response = self.next.send(request)
[Previous line repeated 2 more times]
File "/usr/local/lib/python3.10/site-packages/azure/mgmt/core/policies/_base.py", line 46, in send
response = self.next.send(request)
File "/usr/local/lib/python3.10/site-packages/azure/core/pipeline/policies/_redirect.py", line 197, in send
response = self.next.send(request)
File "/usr/local/lib/python3.10/site-packages/azure/core/pipeline/policies/_retry.py", line 553, in send
raise err
File "/usr/local/lib/python3.10/site-packages/azure/core/pipeline/policies/_retry.py", line 531, in send
response = self.next.send(request)
File "/usr/local/lib/python3.10/site-packages/azure/core/pipeline/policies/_authentication.py", line 126, in send
response = self.next.send(request)
File "/usr/local/lib/python3.10/site-packages/azure/core/pipeline/_base.py", line 86, in send
response = self.next.send(request)
File "/usr/local/lib/python3.10/site-packages/azure/core/pipeline/_base.py", line 86, in send
response = self.next.send(request)
File "/usr/local/lib/python3.10/site-packages/azure/core/pipeline/_base.py", line 86, in send
response = self.next.send(request)
[Previous line repeated 2 more times]
File "/usr/local/lib/python3.10/site-packages/azure/core/pipeline/_base.py", line 119, in send
self._sender.send(request.http_request, **request.context.options),
File "/usr/local/lib/python3.10/site-packages/azure/core/pipeline/transport/_requests_basic.py", line 386, in send
raise error
azure.core.exceptions.ServiceRequestError: (<urllib3.connection.HTTPSConnection object at 0x7f411d157820>, 'Connection to management.azure.com timed out. (connect timeout=300)')

later still problem was persisting failing to list resources with the same error which led to missing logs and failure in resources cleanup. Nothing specific was present in RG activity log.

Impact

test failure and missing logs and leave resources up.

How frequently does it reproduce?

First time seen

Installation details

Kernel Version: 5.15.0-1053-azure Scylla version (or git commit hash): 5.5.0~dev-20240119.b1ba904c4977 with build-id 7a5829efb1f6ef7b467d2dc837300abcc0b739c8

Cluster size: 4 nodes (Standard_L16s_v3)

Scylla Nodes used in this run:

OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/SCYLLA-IMAGES/providers/Microsoft.Compute/images/scylla-5.5.0-dev-x86_64-2024-01-20T02-21-36 (azure: undefined_region)

Test: longevity-1tb-5days-azure-test Test id: bd04be7c-8fd7-4d51-b683-57c4491dcae3 Test name: scylla-master/longevity/longevity-1tb-5days-azure-test Test config file(s):

Logs and commands - Restore Monitor Stack command: `$ hydra investigate show-monitor bd04be7c-8fd7-4d51-b683-57c4491dcae3` - Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=bd04be7c-8fd7-4d51-b683-57c4491dcae3) - Show all stored logs command: `$ hydra investigate show-logs bd04be7c-8fd7-4d51-b683-57c4491dcae3` ## Logs: *No logs captured during this run.* [Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/longevity/job/longevity-1tb-5days-azure-test/34/) [Argus](https://argus.scylladb.com/test/d03f9d55-942a-4f66-ae48-9d35d59a59e2/runs?additionalRuns[]=bd04be7c-8fd7-4d51-b683-57c4491dcae3)
fruch commented 5 months ago

@soyacz

Could be that we somehow going via the external API endpoint ?

And there's a way to force the client to connect locally ?

See: https://learn.microsoft.com/en-us/azure/api-management/troubleshoot-response-timeout-and-errors