scylladb / scylla-cluster-tests

Tests for Scylla Clusters

monitoring failed setup [Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 3930 (unattended-upgr)] #6563

Closed: fruch closed this issue 1 year ago

fruch commented 1 year ago

Issue description

The monitor node seems to fail during its setup/installation because the dpkg lock is held by process 3930 (unattended-upgr):

2023-09-03 06:15:43.580: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=09513ba8-cb66-4ce7-8fff-d78f557c0acc, source=LongevityTest.SetUp()
exception=[<sdcm.cluster_aws.MonitorSetAWS object at 0x7ff7b83235e0>]:
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 428, in run
result = future.result(time_out)
File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 458, in result
return self.__get_result()
File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 404, in inner
return_val = fun(*args, **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 881, in <lambda>
func=(lambda m: m.wait_for_init()),
File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 3845, in wrapper
verify_node_setup(start_time)
File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 3783, in verify_node_setup
raise NodeSetupFailed(node=node, error_msg=setup_exception[0], traceback_str=setup_exception[1])
sdcm.cluster.NodeSetupFailed: [Node longevity-100gb-4h-master-monitor-node-62931180-1 [44.197.245.243 | 10.12.0.233] (seed: False)] NodeSetupFailed: Encountered a bad command exit code!
Command: 'sudo bash -ce \'\ncurl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -\nsudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"\nsudo apt-get update\nsudo apt-get install -y docker docker.io\napt-get install -y software-properties-common\napt-get install -y python3 python3-dev\napt-get install -y python-setuptools unzip wget\napt-get install -y python3-pip\npython3 -m pip install pyyaml\npython3 -m pip install -I -U psutil\nsystemctl start docker\n\''
Exit code: 100
Stdout:
Adding repository.
Adding deb entry to /etc/apt/sources.list.d/archive_uri-https_download_docker_com_linux_ubuntu-jammy.list
Adding disabled deb-src entry to /etc/apt/sources.list.d/archive_uri-https_download_docker_com_linux_ubuntu-jammy.list
Hit:1 http://us-east-1.ec2.archive.ubuntu.com/ubuntu jammy InRelease
Hit:2 http://us-east-1.ec2.archive.ubuntu.com/ubuntu jammy-updates InRelease
Get:3 http://us-east-1.ec2.archive.ubuntu.com/ubuntu jammy-backports InRelease [109 kB]
Hit:4 https://download.docker.com/linux/ubuntu jammy InRelease
Hit:5 http://security.ubuntu.com/ubuntu jammy-security InRelease
Fetched 109 kB in 1s (142 kB/s)
Reading package lists...
Stderr:
Warning: apt-key is deprecated. Manage keyring files in trusted.gpg.d instead (see apt-key(8)).
W: https://download.docker.com/linux/ubuntu/dists/jammy/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
W: https://download.docker.com/linux/ubuntu/dists/jammy/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 3930 (unattended-upgr)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 3767, in node_setup
cl_inst.node_setup(_node, **setup_kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 5227, in node_setup
self.install_scylla_monitoring(node)
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 119, in inner
res = func(*args, **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 5579, in install_scylla_monitoring
self.install_scylla_monitoring_prereqs(node)
File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 5336, in install_scylla_monitoring_prereqs
node.remoter.run(cmd="sudo bash -ce '%s'" % prereqs_script)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 613, in run
result = _run()
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 70, in inner
return func(*args, **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 604, in _run
return self._run_execute(cmd, timeout, ignore_status, verbose, new_session, watchers)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 537, in _run_execute
result = connection.run(**command_kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 620, in run
return self._complete_run(channel, exception, timeout_reached, timeout, result, warn, stdout, stderr)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 655, in _complete_run
raise UnexpectedExit(result)
sdcm.remote.libssh2_client.exceptions.UnexpectedExit: Encountered a bad command exit code!
Command: 'sudo bash -ce \'\ncurl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -\nsudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"\nsudo apt-get update\nsudo apt-get install -y docker docker.io\napt-get install -y software-properties-common\napt-get install -y python3 python3-dev\napt-get install -y python-setuptools unzip wget\napt-get install -y python3-pip\npython3 -m pip install pyyaml\npython3 -m pip install -I -U psutil\nsystemctl start docker\n\''
Exit code: 100
Stdout:
Adding repository.
Adding deb entry to /etc/apt/sources.list.d/archive_uri-https_download_docker_com_linux_ubuntu-jammy.list
Adding disabled deb-src entry to /etc/apt/sources.list.d/archive_uri-https_download_docker_com_linux_ubuntu-jammy.list
Hit:1 http://us-east-1.ec2.archive.ubuntu.com/ubuntu jammy InRelease
Hit:2 http://us-east-1.ec2.archive.ubuntu.com/ubuntu jammy-updates InRelease
Get:3 http://us-east-1.ec2.archive.ubuntu.com/ubuntu jammy-backports InRelease [109 kB]
Hit:4 https://download.docker.com/linux/ubuntu jammy InRelease
Hit:5 http://security.ubuntu.com/ubuntu jammy-security InRelease
Fetched 109 kB in 1s (142 kB/s)
Reading package lists...
Stderr:
Warning: apt-key is deprecated. Manage keyring files in trusted.gpg.d instead (see apt-key(8)).
W: https://download.docker.com/linux/ubuntu/dists/jammy/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
W: https://download.docker.com/linux/ubuntu/dists/jammy/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 3930 (unattended-upgr)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?
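
One possible mitigation, sketched here only as an illustration (the current prereqs script does not do this), is to wait for the dpkg/apt lock to be released instead of failing immediately when unattended-upgrades holds it:

```bash
# Hypothetical guard before the monitoring prereqs script.
# Option 1: let apt itself wait for the lock (DPkg::Lock::Timeout is supported
# by the apt versions shipped with Ubuntu 20.04/22.04).
sudo apt-get -o DPkg::Lock::Timeout=600 update

# Option 2: poll the dpkg lock files until whoever holds them
# (here unattended-upgrades) finishes.
while sudo fuser /var/lib/dpkg/lock-frontend /var/lib/dpkg/lock >/dev/null 2>&1; do
    echo "dpkg lock is held by another process, waiting..."
    sleep 5
done
```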

How frequently does it reproduce?

Seen only once, so far

Installation details

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-006f6d3d045731dd6 (aws: undefined_region)

Test: longevity-100gb-4h-test
Test id: 62931180-dfdd-4aa9-b1cd-476dcd8d3600
Test name: scylla-master/longevity/longevity-100gb-4h-test
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 62931180-dfdd-4aa9-b1cd-476dcd8d3600`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=62931180-dfdd-4aa9-b1cd-476dcd8d3600)
- Show all stored logs command: `$ hydra investigate show-logs 62931180-dfdd-4aa9-b1cd-476dcd8d3600`

Logs:

- **db-cluster-62931180.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/62931180-dfdd-4aa9-b1cd-476dcd8d3600/20230903_061734/db-cluster-62931180.tar.gz
- **sct-runner-events-62931180.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/62931180-dfdd-4aa9-b1cd-476dcd8d3600/20230903_061734/sct-runner-events-62931180.tar.gz
- **sct-62931180.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/62931180-dfdd-4aa9-b1cd-476dcd8d3600/20230903_061734/sct-62931180.log.tar.gz
- **loader-set-62931180.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/62931180-dfdd-4aa9-b1cd-476dcd8d3600/20230903_061734/loader-set-62931180.tar.gz
- **monitor-set-62931180.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/62931180-dfdd-4aa9-b1cd-476dcd8d3600/20230903_061734/monitor-set-62931180.tar.gz
- **parallel-timelines-report-62931180.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/62931180-dfdd-4aa9-b1cd-476dcd8d3600/20230903_061734/parallel-timelines-report-62931180.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/longevity/job/longevity-100gb-4h-test/676/)
[Argus](https://argus.scylladb.com/test/f05fea04-eb74-4961-94fb-f71c67df52cb/runs?additionalRuns[]=62931180-dfdd-4aa9-b1cd-476dcd8d3600)
fruch commented 1 year ago

It looks like we are calling disable_daily_triggered_services only for db nodes, so the distro can still decide it wants to run upgrades while the test wants to install things.

fruch commented 1 year ago

I think we should consider calling disable_daily_triggered_services across all the node setups we are doing.
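
For reference, on Ubuntu images this usually amounts to stopping and disabling the apt timers and unattended-upgrades, along the lines of the sketch below (an illustration of the idea, not necessarily the exact commands disable_daily_triggered_services runs):

```bash
# Hedged sketch: stop and disable the units that periodically grab the dpkg
# lock on Ubuntu images, so node setup does not race against them.
sudo systemctl stop unattended-upgrades.service apt-daily.timer apt-daily-upgrade.timer
sudo systemctl disable unattended-upgrades.service apt-daily.timer apt-daily-upgrade.timer
sudo systemctl mask apt-daily.service apt-daily-upgrade.service
```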

juliayakovlev commented 1 year ago

Received a similar failure in https://jenkins.scylladb.com/job/enterprise-2023.1/job/artifacts/job/artifacts-ubuntu2004-fips-test/

2023-10-01 18:13:15.735: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=4c82d8a6-21f5-4c3a-8b57-04488277f413, source=ArtifactsTest.SetUp()
exception=[Node artifacts-ubuntu2004-fips-jenkins-db-node-d289941e-1 [44.199.236.142 | 10.12.0.192] (seed: True)] NodeSetupFailed: Encountered a bad command exit code!
Command: 'sudo bash -cxe "\nexport DEBIAN_FRONTEND=noninteractive\napt-get install software-properties-common -y\n"'
Exit code: 100
Stdout:
Stderr:
+ export DEBIAN_FRONTEND=noninteractive
+ DEBIAN_FRONTEND=noninteractive
+ apt-get install software-properties-common -y
E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 3901 (apt-get)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?
Traceback (most recent call last):
File "/tmp/jenkins/workspace/enterprise-2023.1/artifacts/artifacts-ubuntu2004-fips-test/scylla-cluster-tests/sdcm/cluster.py", line 3711, in node_setup
cl_inst.node_setup(_node, **setup_kwargs)
File "/tmp/jenkins/workspace/enterprise-2023.1/artifacts/artifacts-ubuntu2004-fips-test/scylla-cluster-tests/sdcm/cluster.py", line 4395, in node_setup
self._scylla_install(node)
File "/tmp/jenkins/workspace/enterprise-2023.1/artifacts/artifacts-ubuntu2004-fips-test/scylla-cluster-tests/sdcm/cluster.py", line 4477, in _scylla_install
node.install_scylla(scylla_repo=self.params.get('scylla_repo'))
File "/tmp/jenkins/workspace/enterprise-2023.1/artifacts/artifacts-ubuntu2004-fips-test/scylla-cluster-tests/sdcm/cluster.py", line 1945, in install_scylla
self.remoter.run('sudo bash -cxe "%s"' % install_prereqs)
File "/tmp/jenkins/workspace/enterprise-2023.1/artifacts/artifacts-ubuntu2004-fips-test/scylla-cluster-tests/sdcm/remote/remote_base.py", line 613, in run
result = _run()
File "/tmp/jenkins/workspace/enterprise-2023.1/artifacts/artifacts-ubuntu2004-fips-test/scylla-cluster-tests/sdcm/utils/decorators.py", line 67, in inner
return func(*args, **kwargs)
File "/tmp/jenkins/workspace/enterprise-2023.1/artifacts/artifacts-ubuntu2004-fips-test/scylla-cluster-tests/sdcm/remote/remote_base.py", line 604, in _run
return self._run_execute(cmd, timeout, ignore_status, verbose, new_session, watchers)
File "/tmp/jenkins/workspace/enterprise-2023.1/artifacts/artifacts-ubuntu2004-fips-test/scylla-cluster-tests/sdcm/remote/remote_base.py", line 537, in _run_execute
result = connection.run(**command_kwargs)
File "/tmp/jenkins/workspace/enterprise-2023.1/artifacts/artifacts-ubuntu2004-fips-test/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 620, in run
return self._complete_run(channel, exception, timeout_reached, timeout, result, warn, stdout, stderr)
File "/tmp/jenkins/workspace/enterprise-2023.1/artifacts/artifacts-ubuntu2004-fips-test/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 654, in _complete_run
raise UnexpectedExit(result)
sdcm.remote.libssh2_client.exceptions.UnexpectedExit: Encountered a bad command exit code!
Command: 'sudo bash -cxe "\nexport DEBIAN_FRONTEND=noninteractive\napt-get install software-properties-common -y\n"'
Exit code: 100
Stdout:
Stderr:
+ export DEBIAN_FRONTEND=noninteractive
+ DEBIAN_FRONTEND=noninteractive
+ apt-get install software-properties-common -y
E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 3901 (apt-get)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?
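
In this run the lock is held by another apt-get process (3901) rather than unattended-upgrades, which on a freshly booted AMI is often first-boot provisioning. A hedged sketch of waiting for that to finish before the install prerequisites, assuming a cloud-init based image:

```bash
# Hypothetical pre-step: wait for first-boot provisioning and any in-flight
# apt/dpkg operation to finish before installing packages.
cloud-init status --wait || true
while sudo fuser /var/lib/dpkg/lock-frontend >/dev/null 2>&1; do
    sleep 5
done
```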


Installation details

Cluster size: 1 node (i3.large)

Scylla Nodes used in this run:

OS / Image: ami-03cf7ddd346310b5f (aws: us-east-1)

Test: artifacts-ubuntu2004-fips-test
Test id: d289941e-9de9-4693-9f01-3b836a0dc602
Test name: enterprise-2023.1/artifacts/artifacts-ubuntu2004-fips-test
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor d289941e-9de9-4693-9f01-3b836a0dc602`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=d289941e-9de9-4693-9f01-3b836a0dc602)
- Show all stored logs command: `$ hydra investigate show-logs d289941e-9de9-4693-9f01-3b836a0dc602`

Logs:

- **db-cluster-d289941e.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/d289941e-9de9-4693-9f01-3b836a0dc602/20231001_181345/db-cluster-d289941e.tar.gz
- **sct-runner-events-d289941e.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/d289941e-9de9-4693-9f01-3b836a0dc602/20231001_181345/sct-runner-events-d289941e.tar.gz
- **sct-d289941e.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/d289941e-9de9-4693-9f01-3b836a0dc602/20231001_181345/sct-d289941e.log.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/enterprise-2023.1/job/artifacts/job/artifacts-ubuntu2004-fips-test/5/)
[Argus](https://argus.scylladb.com/test/d59aa4df-b8e1-4ff1-abfa-42078cd1b062/runs?additionalRuns[]=d289941e-9de9-4693-9f01-3b836a0dc602)
roydahan commented 1 year ago

@fruch why is it in "Waiting for Review"? What review? :)

fruch commented 1 year ago

> @fruch why is it in "Waiting for Review"? What review? :)

It's a mix

This is solved in: https://github.com/scylladb/scylla-cluster-tests/pull/6759