gluster / project-infrastructure

Issues related to GlusterFS infrastructure components.

test case ./tests/bugs/distribute/bug-882278.t is continuously failing #170

Open mohit84 opened 2 years ago

mohit84 commented 2 years ago

It seems the test case (./tests/bugs/distribute/bug-882278.t) is failing continuously because it cannot resolve a hostname. I think we need to update an infra VM to run it successfully.

For more details, please refer to https://build.gluster.org/job/gh_centos7-regression/2729/consoleFull
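For reference, a minimal check one could run on the affected builder (a sketch, assuming the test only needs the VM's own hostname to resolve; the exact name the test looks up may differ):

# Does the builder's own hostname resolve? getent returns non-zero if not.
getent hosts "$(hostname)" || echo "hostname $(hostname) does not resolve"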

mscherer commented 2 years ago

Ok, it seems 3 builders have been reinstalled (not sure when) or had their SSH keys changed, and our Ansible deployment was blocked.

Since the rest was working, I do not understand exactly what happened. I am fixing it and running again, and will report back.
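One likely step on the control node (a sketch, assuming stale host keys are what blocks Ansible; the hostname is the one from the failure shown below):

# Drop the old host key for a reinstalled builder so SSH stops refusing it.
ssh-keygen -R builder-c7-2.aws.gluster.org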

mscherer commented 2 years ago

ok so that was quick:

fatal: [builder-c7-2.aws.gluster.org]: FAILED! => {"changed": false, "msg": "No package matching 'dbench' found available, installed or updated", "rc": 126, "results": ["git-1.8.3.1-23.el7_8.x86_64 providing git is already installed", "sudo-1.8.23-3.el7.x86_64 providing sudo is already installed", "No package matching 'dbench' found available, installed or updated"]}

I guess we need to run another playbook first.
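For what it's worth, a manual version of what that playbook presumably does for the missing package (a sketch: on CentOS 7, dbench normally ships in EPEL, so that repo has to be enabled first):

# Enable EPEL, then install dbench on the builder (run as root or via sudo).
yum install -y epel-release
yum install -y dbench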

mscherer commented 2 years ago

So, we have 8 builders on AWS instead of 4, and they have all been started by Ansible, one day after another.

This kinda messed up the automation, so I will clean it up (e.g., remove all instances and re-run the playbook).
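A rough outline of that cleanup (a sketch with assumed names: the tag filter and the playbook/limit names are guesses, not the actual ones used by the infra):

# List the duplicated builder instances to confirm which ones to drop.
aws ec2 describe-instances --filters "Name=tag:Name,Values=builder-c7-*" --query "Reservations[].Instances[].[InstanceId,LaunchTime]" --output table
# Terminate the extras, then re-run the provisioning playbook.
# aws ec2 terminate-instances --instance-ids <ids from the list above>
# ansible-playbook builders.yml --limit aws_builders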

mscherer commented 2 years ago

ok should be good now

mohit84 commented 2 years ago

Thanks Michael!!

mscherer commented 2 years ago

So now, it fails with:

08:27:44 not ok  21 [    126/ 120124] <  83> '0 check_common_secret_file' -> 'Got "1" instead of "0"'
08:27:44 cat: /var/lib/glusterd/geo-replication/primary_secondary_common_secret.pem.pub: No such file or directory

I wonder why it suddenly fails.
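For context, a rough manual check of what that assertion appears to look at (a sketch, assuming the failure means the geo-replication setup step never created the common secret on the primary node):

# The .pem.pub file is produced by the gsec_create step of geo-rep setup;
# if it is missing, that step (or the ssh/pem handling on the builder) failed.
ls -l /var/lib/glusterd/geo-replication/*common_secret.pem.pub
# gluster system:: execute gsec_create   # re-runs the key generation step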

mohit84 commented 2 years ago

Some other test case was also failing, specifically with "No space left on device".
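A quick way to triage that on a builder (a sketch; which directories actually accumulate regression data is a guess):

# Overall root filesystem usage, then the largest top-level directories.
df -h /
du -xsh /* 2>/dev/null | sort -h | tail -n 10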

mscherer commented 2 years ago

So in the meantime, I think I found why the servers got reinstalled. Our automation detected that the AMI changed (it did change, but the notification email went to an inbox other than mine) and so decided to reinstall the servers, but without adjusting Ansible (or not adjusting it properly), which in turn left them half deployed.