scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

loader setup failing [Connection reset by peer][download.docker.com:443] #7537

Open fruch opened 1 month ago

fruch commented 1 month ago

Issue description

During setup of the loader node, the connection to download.docker.com:443 fails:

2024-06-01 03:18:43.368: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=b1902a93-bf35-4df7-a37f-93462f19c321, source=UpgradeTest.SetUp()
exception=[Node rolling-upgrade--ubuntu-focal-loader-node-68e21400-1 [52.214.80.33 | 10.4.1.238]] NodeSetupFailed: Encountered a bad command exit code!
Command: 'sudo bash -cxe "\ncurl -fsSL get.docker.com --retry 5 --retry-max-time 300 -o get-docker.sh\nsh get-docker.sh\nsystemctl enable docker\nsystemctl start docker\n"'
Exit code: 35
Stdout:
# Executing docker install script, commit: 6d9743e9656cc56f699a64800b098d5ea5a60020
Stderr:
If you installed the current Docker package using this script and are using it
again to update Docker, you can safely ignore this message.
You may press Ctrl+C now to abort this script.
+ sleep 20
+ sh -c apt-get update -qq >/dev/null
+ sh -c DEBIAN_FRONTEND=noninteractive apt-get install -y -qq apt-transport-https ca-certificates curl >/dev/null
+ sh -c install -m 0755 -d /etc/apt/keyrings
+ sh -c curl -fsSL "https://download.docker.com/linux/ubuntu/gpg" -o /etc/apt/keyrings/docker.asc
curl: (35) OpenSSL SSL_connect: Connection reset by peer in connection to download.docker.com:443
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 3772, in node_setup
cl_inst.node_setup(_node, **setup_kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 5032, in node_setup
node.remoter.run('sudo bash -cxe "%s"' % docker_install)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 614, in run
result = _run()
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 70, in inner
return func(*args, **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 605, in _run
return self._run_execute(cmd, timeout, ignore_status, verbose, new_session, watchers)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 538, in _run_execute
result = connection.run(**command_kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 620, in run
return self._complete_run(channel, exception, timeout_reached, timeout, result, warn, stdout, stderr)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 655, in _complete_run
raise UnexpectedExit(result)
sdcm.remote.libssh2_client.exceptions.UnexpectedExit: Encountered a bad command exit code!
Command: 'sudo bash -cxe "\ncurl -fsSL get.docker.com --retry 5 --retry-max-time 300 -o get-docker.sh\nsh get-docker.sh\nsystemctl enable docker\nsystemctl start docker\n"'
Exit code: 35
Stdout:
# Executing docker install script, commit: 6d9743e9656cc56f699a64800b098d5ea5a60020
Stderr:
If you installed the current Docker package using this script and are using it
again to update Docker, you can safely ignore this message.
You may press Ctrl+C now to abort this script.
+ sleep 20
+ sh -c apt-get update -qq >/dev/null
+ sh -c DEBIAN_FRONTEND=noninteractive apt-get install -y -qq apt-transport-https ca-certificates curl >/dev/null
+ sh -c install -m 0755 -d /etc/apt/keyrings
+ sh -c curl -fsSL "https://download.docker.com/linux/ubuntu/gpg" -o /etc/apt/keyrings/docker.asc
curl: (35) OpenSSL SSL_connect: Connection reset by peer in connection to download.docker.com:443

Installation details

Cluster size: 4 nodes (im4gn.2xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0a50b549066a0f6e8 (aws: undefined_region)

Test: rolling-upgrade-ami-arm-test
Test id: 68e21400-3b56-416b-95dc-fc98a690d901
Test name: scylla-master/rolling-upgrade/rolling-upgrade-ami-arm-test
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 68e21400-3b56-416b-95dc-fc98a690d901`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=68e21400-3b56-416b-95dc-fc98a690d901)
- Show all stored logs command: `$ hydra investigate show-logs 68e21400-3b56-416b-95dc-fc98a690d901`

## Logs:

- **db-cluster-68e21400.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/68e21400-3b56-416b-95dc-fc98a690d901/20240601_032343/db-cluster-68e21400.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/68e21400-3b56-416b-95dc-fc98a690d901/20240601_032343/db-cluster-68e21400.tar.gz)
- **sct-runner-events-68e21400.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/68e21400-3b56-416b-95dc-fc98a690d901/20240601_032343/sct-runner-events-68e21400.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/68e21400-3b56-416b-95dc-fc98a690d901/20240601_032343/sct-runner-events-68e21400.tar.gz)
- **sct-68e21400.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/68e21400-3b56-416b-95dc-fc98a690d901/20240601_032343/sct-68e21400.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/68e21400-3b56-416b-95dc-fc98a690d901/20240601_032343/sct-68e21400.log.tar.gz)
- **loader-set-68e21400.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/68e21400-3b56-416b-95dc-fc98a690d901/20240601_032343/loader-set-68e21400.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/68e21400-3b56-416b-95dc-fc98a690d901/20240601_032343/loader-set-68e21400.tar.gz)
- **monitor-set-68e21400.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/68e21400-3b56-416b-95dc-fc98a690d901/20240601_032343/monitor-set-68e21400.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/68e21400-3b56-416b-95dc-fc98a690d901/20240601_032343/monitor-set-68e21400.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/rolling-upgrade/job/rolling-upgrade-ami-arm-test/111/)
[Argus](https://argus.scylladb.com/test/8e5f044d-57c2-43ad-8fe8-ac75a11a422d/runs?additionalRuns[]=68e21400-3b56-416b-95dc-fc98a690d901)

fruch commented 1 month ago

This also happened in:

2024-06-01 03:20:04.646: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=d703ff5f-89f9-4fdb-b14f-fd8005dad068, source=LongevityTest.SetUp()
exception=[Node longevity-parallel-topology-schema--loader-node-60b61409-2 [46.51.151.197 | 10.4.10.88]] NodeSetupFailed: Encountered a bad command exit code!
Command: 'sudo bash -cxe "\ncurl -fsSL get.docker.com --retry 5 --retry-max-time 300 -o get-docker.sh\nsh get-docker.sh\nsystemctl enable docker\nsystemctl start docker\n"'
Exit code: 35
Stdout:
Stderr:
+ curl -fsSL get.docker.com --retry 5 --retry-max-time 300 -o get-docker.sh
curl: (35) OpenSSL SSL_connect: Connection reset by peer in connection to get.docker.com:443
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 3801, in node_setup
cl_inst.node_setup(_node, **setup_kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 5063, in node_setup
node.remoter.run('sudo bash -cxe "%s"' % docker_install)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 614, in run
result = _run()
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 70, in inner
return func(*args, **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 605, in _run
return self._run_execute(cmd, timeout, ignore_status, verbose, new_session, watchers)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 538, in _run_execute
result = connection.run(**command_kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 620, in run
return self._complete_run(channel, exception, timeout_reached, timeout, result, warn, stdout, stderr)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 655, in _complete_run
raise UnexpectedExit(result)
sdcm.remote.libssh2_client.exceptions.UnexpectedExit: Encountered a bad command exit code!
Command: 'sudo bash -cxe "\ncurl -fsSL get.docker.com --retry 5 --retry-max-time 300 -o get-docker.sh\nsh get-docker.sh\nsystemctl enable docker\nsystemctl start docker\n"'
Exit code: 35
Stdout:
Stderr:
+ curl -fsSL get.docker.com --retry 5 --retry-max-time 300 -o get-docker.sh
curl: (35) OpenSSL SSL_connect: Connection reset by peer in connection to get.docker.com:443

Installation details

Cluster size: 5 nodes (i4i.2xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0562d0d356244521c (aws: undefined_region)

Test: longevity-schema-topology-changes-12h-test
Test id: 60b61409-5344-4ba1-96c8-24e981396149
Test name: scylla-6.0/tier1/longevity-schema-topology-changes-12h-test
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 60b61409-5344-4ba1-96c8-24e981396149`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=60b61409-5344-4ba1-96c8-24e981396149)
- Show all stored logs command: `$ hydra investigate show-logs 60b61409-5344-4ba1-96c8-24e981396149`

## Logs:

- **db-cluster-60b61409.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/60b61409-5344-4ba1-96c8-24e981396149/20240601_032903/db-cluster-60b61409.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/60b61409-5344-4ba1-96c8-24e981396149/20240601_032903/db-cluster-60b61409.tar.gz)
- **sct-runner-events-60b61409.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/60b61409-5344-4ba1-96c8-24e981396149/20240601_032903/sct-runner-events-60b61409.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/60b61409-5344-4ba1-96c8-24e981396149/20240601_032903/sct-runner-events-60b61409.tar.gz)
- **sct-60b61409.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/60b61409-5344-4ba1-96c8-24e981396149/20240601_032903/sct-60b61409.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/60b61409-5344-4ba1-96c8-24e981396149/20240601_032903/sct-60b61409.log.tar.gz)
- **loader-set-60b61409.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/60b61409-5344-4ba1-96c8-24e981396149/20240601_032903/loader-set-60b61409.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/60b61409-5344-4ba1-96c8-24e981396149/20240601_032903/loader-set-60b61409.tar.gz)
- **monitor-set-60b61409.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/60b61409-5344-4ba1-96c8-24e981396149/20240601_032903/monitor-set-60b61409.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/60b61409-5344-4ba1-96c8-24e981396149/20240601_032903/monitor-set-60b61409.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-6.0/job/tier1/job/longevity-schema-topology-changes-12h-test/9/)
[Argus](https://argus.scylladb.com/test/70e5a5dc-4553-4788-8943-c6da495a730f/runs?additionalRuns[]=60b61409-5344-4ba1-96c8-24e981396149)

fruch commented 1 month ago

We can probably add more retries, and introduce them where they are missing.
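
A minimal sketch of what such a retry could look like on the Python side, assuming the whole install command is simply re-run on failure. `retry_call` and its parameters are hypothetical names for illustration, not existing SCT helpers. Re-running the whole command would also cover the `curl` calls inside `get-docker.sh`, which carry no retry flags of their own in the first log. Note that curl's `--retry` by default only retries errors it considers transient (timeouts and certain HTTP status codes), which would explain why exit code 35 was not retried in the second failure; that reading is based on curl's documented behaviour and was not verified on these nodes.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_call(func: Callable[[], T],
               attempts: int = 5,
               delay: float = 10.0,
               backoff: float = 2.0,
               exceptions: Tuple[Type[BaseException], ...] = (Exception,),
               sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func() until it succeeds or `attempts` calls have failed.

    Hypothetical helper: waits `delay` seconds after the first failure and
    multiplies the wait by `backoff` after every further failure.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            # Out of attempts: let the last error propagate unchanged.
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= backoff
```

In the setup path this might wrap the call seen in the traceback, for example `retry_call(lambda: node.remoter.run('sudo bash -cxe "%s"' % docker_install), exceptions=(UnexpectedExit,))`. Which exception types are worth retrying, and how long to wait between attempts, are judgment calls.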

soyacz commented 1 month ago

These two failures occurred at around the same time, so get.docker.com possibly had a temporary issue.

fruch commented 1 month ago

> These two failures occurred at around the same time, so get.docker.com possibly had a temporary issue.

I agree; this one is low severity.