Manager tests fail on monitor node setup stage

mikliapko commented 3 weeks ago

Issue description

Manager tests fail on monitor node setup stage:

2024-05-07 08:57:41.177: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=7a81e8f0-25db-4e6d-90dc-70ed55094a35, source=MgmtCliTest.SetUp()
exception=[<sdcm.cluster_aws.MonitorSetAWS object at 0x7f62620ae230>]:
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 482, in run
result = future.result(time_out)
File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 458, in result
return self.__get_result()
File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 458, in inner
return_val = fun(*args, **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 943, in <lambda>
func=(lambda m: m.wait_for_init()),
File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 3909, in wrapper
verify_node_setup_or_startup(start_time, setup_queue, setup_results)
File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 3857, in verify_node_setup_or_startup
raise NodeSetupFailed(
sdcm.cluster.NodeSetupFailed: [Node manager-regression-master-monitor-node-65c87002-1 [54.237.151.44 | 10.12.2.68] (dc name: us-east-1)] NodeSetupFailed: Encountered a bad command exit code!
Command: "bash -ce '\nrm -rf /home/centos/sct-monitoring\nmkdir -p /home/centos/sct-monitoring\ncd /home/centos/sct-monitoring\nwget https://github.com/scylladb/scylla-monitoring/archive/None.zip\nrm -rf ./tmp /home/centos/sct-monitoring/scylla-monitoring-src 2>/dev/null\nunzip None.zip -d ./tmp\nmv ./tmp/scylla-monitoring-None/ /home/centos/sct-monitoring/scylla-monitoring-src\nrm -rf ./tmp 2>/dev/null\n'"
Exit code: 8
Stdout:
Stderr:
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/scylladb/scylla-monitoring/zip/None [following]
--2024-05-07 08:57:35--  https://codeload.github.com/scylladb/scylla-monitoring/zip/None
Resolving codeload.github.com (codeload.github.com)... 140.82.113.9
Connecting to codeload.github.com (codeload.github.com)|140.82.113.9|:443... connected.
HTTP request sent, awaiting response... 404 Not Found

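I suspect the issue was introduced by commit https://github.com/scylladb/scylla-cluster-tests/commit/aaca6347260cc8543431ce9761e36ffbe82e465d, which sets monitor_branch: null. The download URL in the failing command ends in None.zip, which is consistent with an unset branch.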
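The None.zip in the failing command is what one would expect if an unset branch value is interpolated into the archive URL as text. A minimal sketch of that failure mode follows; build_monitoring_archive_url is a hypothetical helper written for illustration and is not SCT's actual code:

```python
from typing import Optional

# URL pattern taken from the failing wget command in the log above.
ARCHIVE_URL = "https://github.com/scylladb/scylla-monitoring/archive/{branch}.zip"


def build_monitoring_archive_url(branch: Optional[str]) -> str:
    """Hypothetical helper (not SCT's real API): build the archive URL for a branch."""
    # str.format() renders None as the text "None", so an unset branch
    # silently yields a syntactically valid URL that points at nothing.
    return ARCHIVE_URL.format(branch=branch)


# monitor_branch: null in YAML is loaded as None in Python.
print(build_monitoring_archive_url(None))
# prints: https://github.com/scylladb/scylla-monitoring/archive/None.zip
```

Nothing fails at the point where the URL is built, so the error only surfaces later as the 404 from wget.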

Impact

All manager tests run from the master branch are broken. This needs to be looked at ASAP.

How frequently does it reproduce?

Always.

Installation details

SCT Version: 3ab8f43225bb69ee9f0a3eddadf6f5e7bc29b469

Logs

Argus: https://argus.scylladb.com/workspace?state=WyI1NDI5NjNlYS1lZGYwLTQ0YmItOGQzNy1iYzc4OGFkMDgxMTMiXQ (execution 1151)

karol-kokoszka commented 3 weeks ago

This is a blocker for manager releases.

fruch commented 3 weeks ago

The test is hardcoding:

ami_monitor_user: 'centos'
ami_id_monitor: 'ami-02eac2c0129f6376b' # Official CentOS Linux 7 x86_64 HVM EBS ENA 1901_01

1) That image is CentOS 7, which is deprecated.

2) On AWS/GCP, if the monitor image is replaced with an image that isn't a monitor image, you now need to specify monitor_branch. That can be a workaround for the manager jobs.
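The rule in point 2 can be sketched as an early configuration check. This is only an illustration under assumptions: validate_monitor_config, the way a "non-monitor" image is detected, and the placeholder formal image ID are all hypothetical and do not reflect how SCT actually decides this.

```python
from typing import Mapping, Optional


def validate_monitor_config(config: Mapping[str, Optional[str]],
                            formal_monitor_image: str) -> None:
    """Hypothetical check: a custom monitor image must come with a monitor_branch."""
    image = config.get("ami_id_monitor")
    # Assumption: any image other than the formal monitor image counts as custom.
    uses_custom_image = image is not None and image != formal_monitor_image
    if uses_custom_image and not config.get("monitor_branch"):
        raise ValueError(
            f"ami_id_monitor={image!r} is not the formal monitor image; "
            "monitor_branch must be set explicitly"
        )


# Values from the manager config quoted above; the formal image ID is a placeholder.
manager_config = {
    "ami_monitor_user": "centos",
    "ami_id_monitor": "ami-02eac2c0129f6376b",
    "monitor_branch": None,
}

try:
    validate_monitor_config(manager_config, formal_monitor_image="ami-formal-placeholder")
except ValueError as exc:
    print(f"config rejected: {exc}")
```

A check like this would fail at configuration time rather than during node setup, when the download has already been attempted.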

mikliapko commented 3 weeks ago

Would it help if I removed the hardcoded ami_monitor_user and ami_id_monitor values from the manager YAML configs? Would the test then follow the formal_monitor_image flow?

@fruch

fruch commented 3 weeks ago

Yes, but it won't be CentOS anymore; it will be Ubuntu. (So you might need to rename the job and update the triggers to pass a .list file rather than a .repo file.)

In the long run, we should probably split the manager server from the monitoring node.

mikliapko commented 3 weeks ago

Yes, but it won't be CentOS anymore; it will be Ubuntu. (So you might need to rename the job and update the triggers to pass a .list file rather than a .repo file.)

Thanks, got it, I'll prepare the fixes then.

In the long run, we should probably split the manager server from the monitoring node.

Yep, it would be good to do it.

mikliapko commented 3 weeks ago

@fruch Could you please take a look?

I adjusted the configuration files and ran the job with these changes. It now fails at the monitor node setup stage with the error NodeSetupFailed: Wait for: manager-regression-manager--monitor-node-3f072230-1: Waiting for manager server to be up: timeout - 300 seconds - expired. At the same time, I can access the node via SSH from my local PC.

Argus - https://argus.scylladb.com/workspace?state=WyI2YWI1YjgzNy1mMDUxLTQyNjgtODI3Ni01NWU0NGEyYTUxN2QiXQ (see the latest run)

fruch commented 3 weeks ago

It says it's waiting for the manager to be up. Check the logs to see why it's not up.

karol-kokoszka commented 3 weeks ago

May 07 15:41:58 manager-regression-manager--monitor-node-154fed3a-1 scylla-manager[23054]: STARTUP ERROR: configuration ["/etc/scylla-manager/scylla-manager.yaml"]: yaml: unmarshal errors:
May 07 15:41:58 manager-regression-manager--monitor-node-154fed3a-1 scylla-manager[23054]:   line 1: field config_cache not found in type server.Config

@mikliapko The error in the manager log says that it read a config_cache entry from the scylla-manager config, but that field does not exist in the config type it is parsed into (server.Config). The config_cache entry in scylla-manager.yaml is something I introduced to the SCT repo on my extend_sleep_TLS_en branch (it's not merged upstream). It should work if you test against this manager build: https://jenkins.scylladb.com/view/scylla-manager/job/manager-master/job/manager-build/675/artifact/00-Build.txt
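The startup error is the usual outcome of strict config parsing: a key in the file that has no matching field in the target type is rejected instead of being ignored. A rough Python analogy is sketched below; the manager itself is not written in Python, and ServerConfig with its two fields is an illustrative stand-in rather than the real server.Config definition.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict


@dataclass
class ServerConfig:
    """Illustrative stand-in for the manager's config type (fields are assumptions)."""
    http: str = ""
    https: str = ""


def load_strict(raw: Dict[str, Any]) -> ServerConfig:
    """Build a ServerConfig, rejecting any key the type does not declare."""
    known = {f.name for f in fields(ServerConfig)}
    unknown = sorted(set(raw) - known)
    if unknown:
        # Mirrors the shape of the message in the manager log above.
        raise ValueError(f"field {unknown[0]} not found in type ServerConfig")
    return ServerConfig(**raw)


# A config written for a newer build, loaded by a build that predates the field.
try:
    load_strict({"http": "127.0.0.1:5080", "config_cache": {}})
except ValueError as exc:
    print(f"STARTUP ERROR: {exc}")
```

Under this reading, the config file and the manager binary have to come from compatible versions: either the build knows about config_cache, or the entry must not be written to the file.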

mikliapko commented 3 weeks ago

@karol-kokoszka I think this job used the build you mentioned (https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/mikita/job/sct/job/manager-ubuntu-22-sanity/4/parameters/) - 2024-05-02T21:58:02Z.

karol-kokoszka commented 3 weeks ago

This is what I see in parameters:

https://downloads.scylladb.com/manager/deb/unstable/unified-deb/master/latest/scylla-manager.list

It points to master latest.

mikliapko commented 3 weeks ago

As far as I can see, the latest version in master is downloads.scylladb.com/manager/deb/unstable/unified-deb/master/2024-05-02T21:58:02Z, which is the same version specified in the 00-Build.txt file you mentioned, isn't it?

karol-kokoszka commented 3 weeks ago

...maybe :) Let me trigger the build from the expected branch once again, so that it will be from today.

mikliapko commented 3 weeks ago

Good, please let me know when it's ready, and I'll restart the job with the fresh build.

karol-kokoszka commented 3 weeks ago

I just triggered a manager-master build pointing to the feature-brach_config_cache_service branch. This is the branch I must validate with SCT before merging.

https://jenkins.scylladb.com/view/scylla-manager/job/manager-master/job/manager-build/676/

karol-kokoszka commented 3 weeks ago

Here are the builds https://jenkins.scylladb.com/view/scylla-manager/job/manager-master/job/manager-build/676/artifact/00-Build.txt

mikliapko commented 3 weeks ago

I've triggered the job: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/mikita/job/sct/job/manager-ubuntu-22-sanity/7/
