mgmt-sa-tiger-team / skylight

Windows focused Ansible workshop
GNU General Public License v3.0

TASK [ansible-tower : run the tower installer] fails #79

Open brianstinehart opened 5 years ago

brianstinehart commented 5 years ago

Hi all,

I'm consistently running into the issue below, which occurs both with Engine running on a local Mac and with Tower running on RHEL 7.4; the two environments are completely isolated from one another and run independently. I have had many successful runs in the past, with my most recent success on October 3. I cannot think of anything that has changed across both of the systems mentioned...

fatal: [s2-tower]: FAILED! => {"changed": true, "cmd": "./setup.sh", "delta": "0:01:04.793110", "end": "2019-10-14 07:38:02.368272", "msg": "non-zero return code", "rc": 2, "start": "2019-10-14 07:36:57.575162", "stderr": "", "stderr_lines": [], "stdout": "Using /etc/ansible/ansible.cfg as config file\n [WARNING]: Could not match supplied host pattern, ignoring: instancegroup\n\n\nPLAY [tower:database:instancegroup] *****\n\nTASK [check_config_static : Ensure expected variables are defined] *\nskipping: [localhost] => (item=tower_package_name) => {\"ansible_loop_var\": \"item\", \"changed\": false, \"item\": \"tower_package_name\", \"skip_reason\": \"Conditional result was False\"}\nskipping: [localhost] => (item=tower_package_version) => {\"ansible_loop_var\": \"item\", \"changed\": false, \"item\": \"tower_package_version\", \"skip_reason\": \"Conditional result was False\"}\nskipping: [localhost] => (item=tower_package_release) => {\"ansible_loop_var\": \"item\", \"changed\": false, \"item\": \"tower_package_release\", \"skip_reason\": \"Conditional result was False\"}\n\nTASK [check_config_static : Detect unsupported HA inventory file] **\nskipping: [localhost]...

Attached below is the massive error message; it includes 116 skipped tasks, a handful of changed tasks, and the single failure.

error.txt

Any ideas?

cigamit commented 5 years ago

I have seen this occasionally with EC2 RHEL images: on some VMs the system doesn't properly subscribe to all the repos (possibly because it isn't properly registered?). You deployed multiple Tower servers and some of them got past this step while one did not, so it's not an issue with every VM (and it seems to work when I do it). We already have a task for enabling this repo, so the only real thing we could probably do is move it all to CentOS instead.

Parsed out the error for easier reading...

Repository 'rhui-REGION-rhel-server-extras' is missing name in configuration, using id
Failed to get region name from EC2

One of the configured repositories failed (Unknown), and yum doesn't have enough cached data to continue. At this point the only safe thing yum can do is fail. There are a few ways to work "fix" this:

  1. Contact the upstream for the repository and get them to fix the problem.
  2. Reconfigure the baseurl/etc. for the repository, to point to a working upstream. This is most often useful if you are using a newer distribution release than is supported by the repository (and the packages for the previous distribution release still work).
  3. Run the command with the repository temporarily disabled: yum --disablerepo=<repoid> ...
  4. Disable the repository permanently, so yum won't use it by default. Yum will then just ignore the repository until you permanently enable it again or use --enablerepo for temporary usage: yum-config-manager --disable <repoid> or subscription-manager repos --disable=<repoid>
  5. Configure the failing repository to be skipped, if it is unavailable. Note that yum will try to contact the repo when it runs most commands, so it will have to try and fail each time (and thus yum will be much slower). If it is a very temporary problem though, this is often a nice compromise: yum-config-manager --save --setopt=<repoid>.skip_if_unavailable=true

Cannot find a valid baseurl for repo: rhui-REGION-rhel-server-extras
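
For anyone who wants to see what the failing instance is actually configured with before re-running the installer, a quick diagnostic (a sketch, not part of the role; run it against the Tower host) is to list every repo and compare the IDs with the name in the error above:

- name: List every repo the instance knows about, enabled or not
  command: yum repolist all
  become: true
  register: repolist
  changed_when: false

- name: Show the repo IDs for comparison with rhui-REGION-rhel-server-extras
  debug:
    var: repolist.stdout_lines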

brianstinehart commented 5 years ago

Thanks, Jimmy.

It's worth noting all Tower servers are failing at this step for me (both s2-tower and s1-tower, in the example; a previous run failed on all 50+ Tower servers), so none of the machines are getting past this step and the entire run is failing. I'm not able to deploy Skylight with the repo as is.

The missing RHEL repo is definitely the problem, though. We were able to complete the deployment by replacing the RHEL-7.6_HVM_GA* lookup entry in the vars/main.yml file with a preexisting image that we knew had the repos available.
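
For anyone needing to do the same, the change amounts to pointing the image lookup at a specific known-good AMI instead of the name pattern. The variable name below is illustrative only; the actual key lives in vars/main.yml:

# Hypothetical sketch only -- check vars/main.yml for the real key names.
# Before: look up the newest RHEL 7.6 GA image by name pattern.
# image_name_filter: "RHEL-7.6_HVM_GA*"
# After: pin a pre-existing image that is known to have working RHUI repos.
image_name_filter: "name-of-a-known-good-image"   # placeholder value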

Considering it works when you deploy it, I'm guessing this is a quirk with the image available in our region (ap-southeast-2). Any idea why the VMs are not properly subscribing to all the repos? You mentioned it may be that the systems are not registering properly. Why would that be? Is that something I can look into on my end?

oatakan commented 5 years ago

@brianstinehart you can try using centos7. See this PR for how to do this: https://github.com/mgmt-sa-tiger-team/skylight/pull/81

You can use the 'centos7-ec2-support' branch in the meantime (until it's merged into the develop branch).

cigamit commented 5 years ago

Appears to be caused by AWS: https://github.com/ansible/workshops/pull/498/files

brianstinehart commented 5 years ago

This is now failing in us-east-1 as well, and I've not been able to successfully introduce the fixes from the link above into Skylight. I've spent quite a bit of time trying to update the RHUI client and use the new repos, but I've run into all sorts of issues, much of it to do with the CDS load balancers in the region.

@cigamit are things still deploying correctly for you?

@oatakan I wasn't able to get the CentOS deployment working in the develop branch either. Problems with SSH to the servers, though I haven't spent heaps of time trying to resolve that issue.

oatakan commented 5 years ago

@brianstinehart it works for me using these in extra_vars with the develop branch (https://github.com/mgmt-sa-tiger-team/skylight/pull/81):

ec2_docs_instance_ami_type: centos7
ec2_gitlab_instance_ami_type: centos7
ec2_tower_instance_ami_type: centos7
root_user: centos

Did you try using these variables?
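
If it helps, a minimal sketch of how these can be fed in as an extra-vars file; the playbook name in the comment is a placeholder, not necessarily what skylight uses:

# extra_vars.yml -- values taken from the comment above
ec2_docs_instance_ami_type: centos7
ec2_gitlab_instance_ami_type: centos7
ec2_tower_instance_ami_type: centos7
root_user: centos
# invoked with something like: ansible-playbook provision.yml -e @extra_vars.yml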

cigamit commented 5 years ago

We might change the defaults to those until we resolve the issues with the RHEL images.

brianstinehart commented 5 years ago

@oatakan root_user was my previous issue. I'm deploying into AWS and was still using ec2-user.

Working perfectly with the develop branch using those vars; I hosted 3 workshops this week. Cheers for sorting me out :)

rmahroua commented 5 years ago

Hey guys, I faced the same issue this morning and did some research. The issue is actually a repository naming mismatch between what the rh-amazon-rhui-client package configures and what the Tower installer is looking for.

In the Tower installer role that deploys the prerequisites, there are three lists that define the repository names - one of which is used in EC2 deployments.

The fix consists of the following:

1. Create an environment file that overrides the repository names.
2. Run the setup.sh installer with this environment file using the -e flag.

Here's what I added in the roles/ansible-tower/tasks/setup.yml playbook:

- name: Create the environment file (fixes the incorrect repository lookup)
  copy:
    dest: /tmp/ansible-tower-setup-{{ towerversion }}/install_vars.yml
    content: |
      redhat_aws_rhui_repos:
        - rhel-server-rhui-rhscl-7-rpms
        - rhel-7-server-rhui-extras-rpms

- name: Run the tower installer
  shell: ./setup.sh -e "@install_vars.yml"
  args:
    chdir: /tmp/ansible-tower-setup-{{ towerversion }}
  when: towerchk not in towerversion

These are the default values that the Tower installer uses (they are currently incorrect):

redhat_aws_rhui_repos:
  - rhui-REGION-rhel-server-extras
  - rhui-REGION-rhel-server-rhscl
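
Before re-running setup.sh, a quick way to confirm that the corrected repo IDs actually resolve on the instance (a sketch, not part of the installer) could be:

- name: Confirm the corrected RHUI repo IDs can fetch metadata
  command: yum makecache --disablerepo="*" --enablerepo={{ item }}
  become: true
  changed_when: false
  loop:
    - rhel-server-rhui-rhscl-7-rpms
    - rhel-7-server-rhui-extras-rpms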

I am currently testing this fix and will report here if this works.

Cheers

rmahroua commented 5 years ago

UPDATE: I confirm that this fix works. I was able to proceed with the installation:

[root@tower ansible-tower-setup-3.5.2-1]# cat install_vars.yml
redhat_aws_rhui_repos:
  - rhel-server-rhui-rhscl-7-rpms
  - rhel-7-server-rhui-extras-rpms
(Screenshot of the successful installation run attached.)

oatakan commented 5 years ago

@rmahroua this is great. This sounds like an issue that should be fixed in the Tower installer itself. I just pushed a PR to move the Tower node to RHEL 8 on EC2, and I recommend that going forward we switch to RHEL 8 on EC2 by default: https://github.com/mgmt-sa-tiger-team/skylight/pull/82

Let's document your findings in the troubleshooting section for now. If the issue persists with the next version of Tower (3.6.x) and there is a compelling reason to keep supporting RHEL 7, we can add this fix with the appropriate conditions then. If that sounds good, would you be willing to submit a PR against the README? Thanks.

rmahroua commented 5 years ago

Sounds good @oatakan, I will submit a PR for the troubleshooting section to the dev branch.

kenmoini commented 5 years ago

I can confirm that the fix provided by @rmahroua also resolves the issue at the Tower install task. Great job, thank you.

brianstinehart commented 5 years ago

Great work @rmahroua!! Thank you so much for taking the time to unpack this.

Confirming this worked for me as well, but I did run into a destination directory (/tmp/ansible-tower-setup-{{ towerversion }}/) problem in my last run:

TASK [ansible-tower : Create the environment file (fixes the incorrect repository lookup)] *** fatal: [s1-tower]: FAILED! => ... "msg": "Destination directory /tmp/ansible-tower-setup-3.5.2-1 does not exist"

Updating the tasks to the below worked well:


- name: Create the environment file (fixes the incorrect repository lookup)
  copy:
    dest: /tmp/ansible-tower-setup-bundle-{{ towerversion }}.el{{ ansible_distribution_major_version }}/install_vars.yml
    content: |
      redhat_aws_rhui_repos:
        - rhel-server-rhui-rhscl-7-rpms
        - rhel-7-server-rhui-extras-rpms

- name: run the tower installer
  shell: ./setup.sh -e "@install_vars.yml"
  args:
    chdir: /tmp/ansible-tower-setup-bundle-{{ towerversion }}.el{{ ansible_distribution_major_version }}
  when: towerchk not in towerversion
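
If the extraction directory name keeps varying between the bundle and non-bundle installers, a more defensive variant (a sketch, not taken from the repo) is to discover the directory at runtime instead of hard-coding it:

- name: Locate the extracted Tower setup directory (bundle or non-bundle naming)
  find:
    paths: /tmp
    patterns: "ansible-tower-setup*{{ towerversion }}*"
    file_type: directory
  register: tower_setup_dir

- name: Create the environment file in the discovered directory
  copy:
    dest: "{{ tower_setup_dir.files[0].path }}/install_vars.yml"
    content: |
      redhat_aws_rhui_repos:
        - rhel-server-rhui-rhscl-7-rpms
        - rhel-7-server-rhui-extras-rpms

- name: run the tower installer from the discovered directory
  shell: ./setup.sh -e "@install_vars.yml"
  args:
    chdir: "{{ tower_setup_dir.files[0].path }}"
  when: towerchk not in towerversion
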
rmahroua commented 5 years ago

Thanks for the update @brianstinehart. It looks like the extraction directory did not have the same name on your system (the bundle installer extracts to a differently named directory).