scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

`cl_inst.node_setup` failed with `HTTP Error 404 - Not Found` during `sudo yum install -y wget` #6317

Closed fgelcer closed 10 months ago

fgelcer commented 1 year ago

Description

During node setup we install a few dependencies, and one of the installations failed:

2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR > Command: 'sudo yum install -y wget'
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR > 
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR > Exit code: 1
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR > 
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR > Stdout:
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR > 
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR > Loaded plugins: fastestmirror
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR > Determining fastest mirrors
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR >  * base: download.cf.centos.org
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR >  * elrepo: ftp.nluug.nl
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR >  * epel: ftp.nluug.nl
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR >  * extras: download.cf.centos.org
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR >  * updates: download.cf.centos.org
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR > 
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR > Stderr:
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR > 
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR > http://repos.mia.lax-noc.com/elrepo/elrepo/el7/x86_64/repodata/2b277ad8223f29a45ba67d50cf2eafe4219e51ee1ff6501a34c4a84dbced36d5-primary.sqlite.bz2: [Errno 14] HTTP Error 404 - Not Found
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR > http://ftp.yz.yamagata-u.ac.jp/pub/linux/RPMS/elrepo/elrepo/el7/x86_64/repodata/2b277ad8223f29a45ba67d50cf2eafe4219e51ee1ff6501a34c4a84dbced36d5-primary.sqlite.bz2: [Errno 14] HTTP Error 404 - Not Found
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR > http://ftp.osuosl.org/pub/elrepo/elrepo/el7/x86_64/repodata/2b277ad8223f29a45ba67d50cf2eafe4219e51ee1ff6501a34c4a84dbced36d5-primary.sqlite.bz2: [Errno 14] HTTP Error 404 - Not Found
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR > http://syd.mirror.rackspace.com/elrepo/elrepo/el7/x86_64/repodata/2b277ad8223f29a45ba67d50cf2eafe4219e51ee1ff6501a34c4a84dbced36d5-primary.sqlite.bz2: [Errno 14] HTTP Error 404 - Not Found
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR > http://ftp.ne.jp/Linux/RPMS/elrepo/elrepo/el7/x86_64/repodata/2b277ad8223f29a45ba67d50cf2eafe4219e51ee1ff6501a34c4a84dbced36d5-primary.sqlite.bz2: [Errno 14] HTTP Error 404 - Not Found
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR > http://elrepo.mirror.angkasa.id/elrepo/elrepo/el7/x86_64/repodata/2b277ad8223f29a45ba67d50cf2eafe4219e51ee1ff6501a34c4a84dbced36d5-primary.sqlite.bz2: [Errno 14] HTTP Error 404 - Not Found
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR > https://mirrors.tuna.tsinghua.edu.cn/elrepo/elrepo/el7/x86_64/repodata/2b277ad8223f29a45ba67d50cf2eafe4219e51ee1ff6501a34c4a84dbced36d5-primary.sqlite.bz2: [Errno 14] HTTPS Error 404 - Not Found
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR > http://elrepo.mirrors.arminco.com/elrepo/el7/x86_64/repodata/2b277ad8223f29a45ba67d50cf2eafe4219e51ee1ff6501a34c4a84dbced36d5-primary.sqlite.bz2: [Errno 14] curl#7 - "Failed to connect to 2001:4d00:10:7::144: Network is unreachable"
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR > http://mirrors.colocall.net/elrepo/elrepo/el7/x86_64/repodata/2b277ad8223f29a45ba67d50cf2eafe4219e51ee1ff6501a34c4a84dbced36d5-primary.sqlite.bz2: [Errno 12] Timeout on http://mirrors.colocall.net/elrepo/elrepo/el7/x86_64/repodata/2b277ad8223f29a45ba67d50cf2eafe4219e51ee1ff6501a34c4a84dbced36d5-primary.sqlite.bz2: (28, 'Connection timed out after 30001 milliseconds')
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR > http://ftp.nluug.nl/os/Linux/distr/elrepo/elrepo/el7/x86_64/repodata/2b277ad8223f29a45ba67d50cf2eafe4219e51ee1ff6501a34c4a84dbced36d5-primary.sqlite.bz2: [Errno 14] HTTP Error 404 - Not Found
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR > 
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR > 
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR > Traceback (most recent call last):
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 3659, in node_setup
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR >     cl_inst.node_setup(_node, **setup_kwargs)
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 4703, in node_setup
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR >     node_exporter_setup.install(node)
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/node_exporter_setup.py", line 10, in install
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR >     node.install_package('wget')
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 67, in inner
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR >     return func(*args, **kwargs)
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 1733, in install_package
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR >     self.remoter.sudo(f'{pkg_cmd} install -y {package_name}')
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/base.py", line 123, in sudo
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR >     return self.run(cmd=cmd,
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 613, in run
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR >     result = _run()
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 67, in inner
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR >     return func(*args, **kwargs)
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 604, in _run
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR >     return self._run_execute(cmd, timeout, ignore_status, verbose, new_session, watchers)
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 537, in _run_execute
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR >     result = connection.run(**command_kwargs)
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 620, in run
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR >     return self._complete_run(channel, exception, timeout_reached, timeout, result, warn, stdout, stderr)
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 654, in _complete_run
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR >     raise UnexpectedExit(result)
2023-06-19 13:56:54,049 f:tester.py       l:489  c:PerformanceRegressionTest p:ERROR > sdcm.remote.libssh2_client.exceptions.UnexpectedExit: Encountered a bad command exit code!

Steps to Reproduce

Not sure why it happened. It could be an external problem (with the repo we tried to access), a local network issue, or something else, and I have no idea how to reproduce it.

Expected behavior: the test should retry a few times when it gets a 404 error (or other errors) instead of failing immediately, if that is possible at such an early stage.
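To make the requested behavior concrete, here is a minimal sketch of a retry wrapper with backoff. It is not the SCT implementation: `retry_command`, `CommandFailed`, and the default attempt count and delays are all hypothetical names and values chosen for illustration, with `CommandFailed` standing in for an error like the `UnexpectedExit` in the traceback above.

```python
import time
from typing import Callable, Optional


class CommandFailed(Exception):
    """Hypothetical stand-in for a remote command error such as UnexpectedExit."""


def retry_command(run: Callable[[], str],
                  attempts: int = 3,
                  delay: float = 5.0,
                  backoff: float = 2.0,
                  sleep: Callable[[float], None] = time.sleep) -> str:
    """Call `run` until it succeeds or `attempts` calls have failed.

    The wait between attempts starts at `delay` seconds and is multiplied
    by `backoff` after every failure. The last error is re-raised if no
    attempt succeeds.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    last_error: Optional[CommandFailed] = None
    for attempt in range(1, attempts + 1):
        try:
            return run()
        except CommandFailed as exc:
            last_error = exc
            # Do not sleep after the final failed attempt.
            if attempt < attempts:
                sleep(delay)
                delay *= backoff
    assert last_error is not None
    raise last_error
```

In this sketch the install call would be passed in as `run`, for example a lambda wrapping the existing package install. Injecting `sleep` keeps the helper testable without real waiting.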

fruch commented 1 year ago

This type of failure happens quite a lot with centos repos, especially centos7, and I'm not sure for how long it's going to be maintained.

I think we can add retries to all the install commands we have (a lot of them already do).

fgelcer commented 1 year ago

> This type of failure happens quite a lot with centos repos, especially centos7, and I'm not sure for how long it's going to be maintained.
>
> I think we can add retries to all the install commands we have (a lot of them already do).

Usually we ignore such failures and just re-run, but @mykaul asked for this issue, so I will try, this sprint or next, to go over the installation commands and add some retries.

fruch commented 10 months ago

Seems like we are not installing EPEL on centos7 (and oel76) before installing Scylla.

And we might get an old reference to EPEL that doesn't exist.

We should try installing EPEL on centos7 and see if it improves the situation.

In general, we should check whether EPEL is really needed, and maybe disable it by default and enable it only for the packages that need it.
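As a rough illustration of the "disabled by default, enabled per package" idea, the sketch below builds a `yum install` command line using yum's `--disablerepo` and `--enablerepo` options. The function name `build_yum_install_cmd` and its defaults are hypothetical, and the repo ids are assumptions: the comment above talks about EPEL, while the failing URLs in the log above are `elrepo` mirrors, so the right id to disable would need checking on the actual image.

```python
import shlex
from typing import Iterable


def build_yum_install_cmd(package: str,
                          disable_repos: Iterable[str] = ("epel",),
                          enable_repos: Iterable[str] = ()) -> str:
    """Build a `yum install` command with per-call repo toggles.

    Repos listed in `enable_repos` are dropped from the disable list, so
    the result never depends on the order in which yum applies the two
    options. `sudo` is left to the caller.
    """
    enabled = list(enable_repos)
    disabled = [repo for repo in disable_repos if repo not in enabled]
    parts = ["yum", "install", "-y"]
    parts += [f"--disablerepo={repo}" for repo in disabled]
    parts += [f"--enablerepo={repo}" for repo in enabled]
    parts.append(package)
    # Quote every argument so unusual package names stay shell-safe.
    return " ".join(shlex.quote(part) for part in parts)
```

With these defaults, `build_yum_install_cmd("wget")` never touches the EPEL metadata, while a package that does need EPEL can opt in with `enable_repos=("epel",)`.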

fruch commented 10 months ago

Here is one example:

How frequently does it reproduce?

On offline artifacts tests it's happening once in a few days

Installation details

Cluster size: 1 nodes (i3.large)

Scylla Nodes used in this run:

OS / Image: ami-0d6a24fe35fdf5dc4 (aws: undefined_region)

Test: artifacts-oel76-test

Test id: d5cbbc5a-d737-4803-8caa-87a999a6e5d3

Test name: scylla-master/artifacts-offline-install/artifacts-oel76-test

Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor d5cbbc5a-d737-4803-8caa-87a999a6e5d3`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=d5cbbc5a-d737-4803-8caa-87a999a6e5d3)
- Show all stored logs command: `$ hydra investigate show-logs d5cbbc5a-d737-4803-8caa-87a999a6e5d3`

## Logs:

- **db-cluster-d5cbbc5a.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/d5cbbc5a-d737-4803-8caa-87a999a6e5d3/20231108_052645/db-cluster-d5cbbc5a.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/d5cbbc5a-d737-4803-8caa-87a999a6e5d3/20231108_052645/db-cluster-d5cbbc5a.tar.gz)
- **sct-runner-events-d5cbbc5a.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/d5cbbc5a-d737-4803-8caa-87a999a6e5d3/20231108_052645/sct-runner-events-d5cbbc5a.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/d5cbbc5a-d737-4803-8caa-87a999a6e5d3/20231108_052645/sct-runner-events-d5cbbc5a.tar.gz)
- **sct-d5cbbc5a.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/d5cbbc5a-d737-4803-8caa-87a999a6e5d3/20231108_052645/sct-d5cbbc5a.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/d5cbbc5a-d737-4803-8caa-87a999a6e5d3/20231108_052645/sct-d5cbbc5a.log.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/artifacts-offline-install/job/artifacts-oel76-test/270/)

[Argus](https://argus.scylladb.com/test/84715b90-86ee-40e7-aa89-2d00d1597bf8/runs?additionalRuns[]=d5cbbc5a-d737-4803-8caa-87a999a6e5d3)

roydahan commented 10 months ago

Hopefully these issues will no longer be relevant once CentOS 7 reaches EOL. I don't want to start tailoring a dedicated solution; if we can work around it with 10 retries (or even more), I would try that first.
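If retries are the workaround, it may help to retry only on failures that look like mirror or network trouble, so that real errors still fail fast. The sketch below is one possible way to do that, not what SCT does: `is_transient_yum_failure` and `TRANSIENT_PATTERNS` are hypothetical names, and the patterns are taken only from the stderr lines quoted in the issue description (plus 5xx responses, which is an added assumption).

```python
import re

# Patterns based on the yum stderr lines quoted in this issue.
# Matching 5xx responses as well is an assumption, not something seen in the log.
TRANSIENT_PATTERNS = (
    re.compile(r"\[Errno 14\] HTTPS? Error (404|5\d\d)"),  # stale or broken mirror
    re.compile(r"\[Errno 14\] curl#\d+"),                  # connection-level failure
    re.compile(r"\[Errno 12\] Timeout"),                   # mirror timed out
)


def is_transient_yum_failure(stderr: str) -> bool:
    """Return True if the stderr text looks like a mirror or network failure."""
    return any(pattern.search(stderr) for pattern in TRANSIENT_PATTERNS)
```

A retry loop could call this on the captured stderr and give up immediately when it returns `False`, for example on a genuinely missing package.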

fruch commented 10 months ago

Once we have a solution, try it for a whole day (i.e. 20-30 runs).

juliayakovlev commented 10 months ago

Running here: https://jenkins.scylladb.com/job/scylla-staging/job/yulia/job/artifacts-oel76-test/

fruch commented 10 months ago

> Running here: https://jenkins.scylladb.com/job/scylla-staging/job/yulia/job/artifacts-oel76-test/

Can you explain what's there, or point to the PR/code?

juliayakovlev commented 10 months ago

I added retries in two places, just to see whether the issue reproduces again. I will open a draft PR.

juliayakovlev commented 10 months ago

> > Running here: https://jenkins.scylladb.com/job/scylla-staging/job/yulia/job/artifacts-oel76-test/
>
> Can you explain what's there, or point to the PR/code?

https://github.com/scylladb/scylla-cluster-tests/pull/6800/files

fruch commented 10 months ago

https://github.com/scylladb/scylla-cluster-tests/pull/6809 should be fixing this one