scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

failed to generate snapshot inside `@latency_calculator_decorator` [bind: address already in use] #7320

Open fruch opened 7 months ago

fruch commented 7 months ago

failed to generate snapshot inside @latency_calculator_decorator

2024-04-03 21:58:08.277: (DisruptionEvent Severity.ERROR) period_type=end event_id=7640cf11-680b-4b2f-8a36-6456bdb4a4b9 duration=1h13m56s: nemesis_name=MgmtCorruptThenRepair target_node=Node longevity-100gb-4h-tablets-db-node-9d1765f1-5 [52.19.17.254 | 10.4.11.143] (seed: True) errors=500 Server Error for http+docker://ssh/v1.43/containers/f40847543b57c6abb442bbb2c602f02dcbd0163531381a32f6f44b54171d5971/start: Internal Server Error ("driver failed programming external connectivity on endpoint longevity-100gb-4h-tablets-monitor-node-9d1765f1-1-webdriver (79ce546652edac50e797a24306640a3b84609f57c4b98fd0783859d7559d9e93): Error starting userland proxy: listen tcp4 0.0.0.0:32917: bind: address already in use")
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/docker/api/client.py", line 268, in _raise_for_status
response.raise_for_status()
File "/usr/local/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http+docker://ssh/v1.43/containers/f40847543b57c6abb442bbb2c602f02dcbd0163531381a32f6f44b54171d5971/start
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5117, in wrapper
result = method(*args[1:], **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2989, in disrupt_mgmt_corrupt_then_repair
self._mgmt_repair_cli()
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 187, in wrapped
screenshots = args[0].monitoring_set.get_grafana_screenshots(node=monitor, test_start_time=start)
File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 5605, in get_grafana_screenshots
screenshot_files = screenshot_collector.collect(node, self.logdir)
File "/home/ubuntu/scylla-cluster-tests/sdcm/logcollector.py", line 534, in collect
return self.get_grafana_screenshot(node, local_dst)
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 70, in inner
return func(*args, **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/logcollector.py", line 529, in get_grafana_screenshot
self.close_browser()
File "/home/ubuntu/scylla-cluster-tests/sdcm/logcollector.py", line 458, in close_browser
self.remote_browser.quit()
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/remotewebbrowser.py", line 114, in quit
self.browser.quit()
File "/usr/local/lib/python3.10/functools.py", line 981, in __get__
val = self.func(instance)
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/remotewebbrowser.py", line 72, in browser
ContainerManager.run_container(self.node, "web_driver")
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/docker_utils.py", line 225, in run_container
container.start()
File "/usr/local/lib/python3.10/site-packages/docker/models/containers.py", line 404, in start
return self.client.api.start(self.id, **kwargs)
File "/usr/local/lib/python3.10/site-packages/docker/utils/decorators.py", line 19, in wrapped
return f(self, resource_id, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/docker/api/container.py", line 1111, in start
self._raise_for_status(res)
File "/usr/local/lib/python3.10/site-packages/docker/api/client.py", line 270, in _raise_for_status
raise create_api_error_from_http_exception(e)
File "/usr/local/lib/python3.10/site-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
raise cls(e, response=response, explanation=explanation)
docker.errors.APIError: 500 Server Error for http+docker://ssh/v1.43/containers/f40847543b57c6abb442bbb2c602f02dcbd0163531381a32f6f44b54171d5971/start: Internal Server Error ("driver failed programming external connectivity on endpoint longevity-100gb-4h-tablets-monitor-node-9d1765f1-1-webdriver (79ce546652edac50e797a24306640a3b84609f57c4b98fd0783859d7559d9e93): Error starting userland proxy: listen tcp4 0.0.0.0:32917: bind: address already in use")

Jenkins run:

https://jenkins.scylladb.com/job/scylla-staging/job/karol_baryla/job/longevity-100gb-4h-cql-stress-test-tablets/4/

Logs

fruch commented 7 months ago

1. I think we should ignore those errors by default.
2. We might want to skip this code completely in non-perf tests; it wastes time generating snapshots we never look at on all longevity runs.
3. We still might need to look into the issue, since it's not clear what is holding this port or how we got into this situation (it shouldn't happen).
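A minimal sketch of points 1 and 2, assuming the screenshot step can be wrapped in a decorator. The names `best_effort` and `collect_screenshots` are hypothetical and are not part of scylla-cluster-tests; the real change would presumably live in or near `latency_calculator_decorator`.

```python
import functools
import logging

LOGGER = logging.getLogger(__name__)


def best_effort(enabled=True, default=None):
    """Run the wrapped callable, but never let its failure propagate.

    enabled=False skips the call entirely (e.g. for non-perf tests).
    Hypothetical helper, shown for illustration only.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if not enabled:
                return default
            try:
                return func(*args, **kwargs)
            except Exception as exc:  # deliberately broad: screenshots are optional
                LOGGER.warning("%s failed, continuing without it: %s", func.__name__, exc)
                return default
        return wrapper
    return decorator


@best_effort(default=[])
def collect_screenshots():
    # Stand-in for the real Grafana screenshot call, which may raise.
    raise RuntimeError("bind: address already in use")


@best_effort(enabled=False, default=[])
def collect_screenshots_non_perf():
    # Never executed, because the decorator is disabled.
    raise AssertionError("should have been skipped")


print(collect_screenshots())           # failure is logged as a warning, not raised
print(collect_screenshots_non_perf())  # skipped without calling the function
```

With something like this, a failed snapshot would show up as a warning in the log instead of turning the nemesis into an ERROR-severity event.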

enaydanov commented 7 months ago

I did some investigation several months ago: https://github.com/scylladb/scylla-cluster-tests/issues/6827

fruch commented 7 months ago

> I did some investigation several months ago: #6827

But something is off here: this shouldn't happen. It should pick a random port (i.e. use port 0) and then check it.

Maybe our Docker Python package is too old and has issues; it might be a good time to update it.
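For reference, this is roughly what "use 0" means at the socket level: binding to port 0 asks the kernel for a currently unused port. The sketch below uses only the Python standard library; `pick_free_port` and `is_port_free` are hypothetical helper names, and this is not a claim about how Docker or docker-py allocate ports internally.

```python
import socket


def pick_free_port(host="127.0.0.1"):
    """Ask the kernel for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


def is_port_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently bind to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically EADDRINUSE, the same condition as in the traceback.
            return False
        return True


port = pick_free_port()
print(port, is_port_free(port))

# While another socket holds the port, the check reports it as busy.
holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
holder.bind(("127.0.0.1", 0))
held_port = holder.getsockname()[1]
print(held_port, is_port_free(held_port))
holder.close()
```

Note that a port picked this way is only known to be free at the moment of the check. If it is released and bound again later, another process can grab it in between, which is one way to end up with `bind: address already in use`.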

enaydanov commented 7 months ago

AFAIR from my investigation, docker-py's `containers.run()` has not changed since then.

My guess is that it can happen under heavy load. As a workaround, we can start the remote browser Docker container before the monitoring stack setup, while the monitoring node is still mostly idle.