pulibrary / princeton_ansible

Ansible Roles and Playbooks for Princeton University Library

Failing/unreachable inventory hosts in staging group on Tower #4018

Closed acozine closed 5 months ago

acozine commented 1 year ago

If I run the key-update playbook from Tower, I get 5 unreachable hosts:

fatal: [prds-dataspace-staging-endpoint1]: UNREACHABLE! => {. . . Permission denied (publickey). . .}
fatal: [staging-bastion.pulcloud.io]: UNREACHABLE! => {. . . Could not resolve hostname staging-bastion.pulcloud.io: Name or service not known . . . }
fatal: [smoketest.docnow.io]: UNREACHABLE! => { . . . Permission denied (publickey). . . }
fatal: [bastion-staging.pulcloud.io]: UNREACHABLE! => {. . . Connection timed out . . .}
fatal: [oawaiver-staging1.princeton.edu]: UNREACHABLE! => {. . . Connection timed out . . .}

We should either configure working connections for these machines or move them out of the staging group in our inventory.

acozine commented 1 year ago

Related to #3897.

acozine commented 1 year ago

As of today we have some Staging VMs that fail the Patch Tuesday playbook and more that are unreachable:

Failing:

fatal: [dpul-staging2.princeton.edu]: FAILED! => {"changed": true, "cmd": ["/usr/bin/apt-key", "adv", "--refresh-keys", "--keyserver", "keyserver.ubuntu.com"], "delta": "0:00:03.096369", "end": "2023-08-14 11:28:37.643770", "msg": "non-zero return code", "rc": 2, "start": "2023-08-14 11:28:34.547401", "stderr": "Warning: apt-key output should not be parsed (stdout is not a terminal)\ngpg: connecting dirmngr at '/tmp/apt-key-gpghome.rB7Rc24RgJ/S.dirmngr' failed: IPC connect call failed\ngpg: keyserver refresh failed: No dirmngr", "stderr_lines": ["Warning: apt-key output should not be parsed (stdout is not a terminal)", "gpg: connecting dirmngr at '/tmp/apt-key-gpghome.rB7Rc24RgJ/S.dirmngr' failed: IPC connect call failed", "gpg: keyserver refresh failed: No dirmngr"], "stdout": "Executing: /tmp/apt-key-gpghome.rB7Rc24RgJ/gpg.1.sh --refresh-keys --keyserver keyserver.ubuntu.com", "stdout_lines": ["Executing: /tmp/apt-key-gpghome.rB7Rc24RgJ/gpg.1.sh --refresh-keys --keyserver keyserver.ubuntu.com"]}
fatal: [figgy-staging2.princeton.edu]: FAILED! => {"changed": false, "msg": "'/usr/bin/apt-get dist-upgrade ' failed: E: Sub-process /usr/bin/dpkg returned an error code (1)\n", "rc": 100, "stdout": "Reading package lists...\nBuilding dependency tree...\nReading state information...\nCalculating upgrade...\nThe following packages were automatically installed and are no longer required:\n  libtinyxml2-6 linux-modules-4.15.0-29-generic\nUse 'sudo apt autoremove' to remove them.\n0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.\n1 not fully installed or removed.\nAfter this operation, 0 B of additional disk space will be used.\nSetting up rabbitmq-server (3.12.2-1) ...\r\nJob for rabbitmq-server.service failed because the control process exited with error code.\r\nSee \"systemctl status rabbitmq-server.service\" and \"journalctl -xe\" for details.\r\ninvoke-rc.d: initscript rabbitmq-server, action \"restart\" failed.\r\n● rabbitmq-server.service - RabbitMQ broker\r\n   Loaded: loaded (/lib/…
fatal: [figgy-web-staging-1.princeton.edu]: FAILED! => {"changed": false, "msg": "Failed to update apt cache: W:https://dl.yarnpkg.com/debian/dists/stable/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details., W:http://ppa.launchpad.net/ubuntugis/ppa/ubuntu/dists/jammy/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details., W:https://oss-binaries.phusionpassenger.com/apt/passenger/dists/jammy/Release.gpg: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details., W:http://ppa.launchpad.net/rabbitmq/rabbitmq-erlang/ubuntu/dists/jammy/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details., E:Repository 'http://ppa.launchpad.net/rabbitmq/rabbitmq-erlang/ubuntu jammy InRelease' changed its 'Label' value from…

Unreachable:

fatal: [lib-postgres-staging2.princeton.edu]: UNREACHABLE! => {. . . "ssh: Could not resolve hostname lib-postgres-staging2.princeton.edu: Name or service not known\r\n" . . . }
fatal: [prds-dataspace-staging-endpoint1]: UNREACHABLE! => {. . . "ssh: pulsys@###.###.###.###: Permission denied (publickey).\r\n" . . . }
fatal: [bastion-staging.pulcloud.io]: UNREACHABLE! => {. . . "ssh: connect to host bastion-staging.pulcloud.io port 22: Connection timed out\r\n" . . .}
fatal: [staging-bastion.pulcloud.io]: UNREACHABLE! => {. . . "Could not resolve hostname staging-bastion.pulcloud.io: Name or service not known\r\n" . . . }
fatal: [smoketest.docnow.io]: UNREACHABLE! => {. . . "ssh: pulsys@smoketest.docnow.io: Permission denied (publickey).\r\n" . . .}
fatal: [fpul-staging2.princeton.edu]: UNREACHABLE! => {. . . "ssh: connect to host fpul-staging2.princeton.edu port 22: Connection timed out\r\n" . . .}

acozine commented 1 year ago

Updated the playbook in #4145. The Tower job now runs against 86 hosts, with 7 unreachable and 8 failed. The additional hosts are:

unreachable:

fatal: [cdh-dev-labs1.princeton.edu]: UNREACHABLE! => {. . . "ssh: Could not resolve hostname cdh-dev-labs1.princeton.edu: Name or service not known\r\n". . .}

and failed:

fatal: [fpul-staging1.princeton.edu]: FAILED! => {"changed": false, "elapsed": 635, "msg": "Timed out waiting for last boot time check (timeout=600)", "rebooted": true}
fatal: [oawaiver-staging1.princeton.edu]: FAILED! => {"changed": false, "msg": "Failed to update apt cache: unknown reason"}
fatal: [ojs-staging1.princeton.edu]: FAILED! => {"changed": false, "msg": "Failed to update apt cache: unknown reason"}
fatal: [openbooks-staging1.princeton.edu]: FAILED! => {"changed": false, "msg": "Failed to update apt cache: unknown reason"}
fatal: [ouranos-staging1.princeton.edu]: FAILED! => {"changed": false, "msg": "Failed to update apt cache: unknown reason"}

acozine commented 1 year ago

For hosts that require access through a jump host, we could try an approach like this: https://www.jeffgeerling.com/blog/2022/using-ansible-playbook-ssh-bastion-jump-host
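A minimal sketch of that jump-host approach as an inventory excerpt. The group name `gcp_staging` and the `ansible_host` address are hypothetical placeholders, and routing through `bastion-staging.pulcloud.io` as the `pulsys` user is an assumption rather than something confirmed in this thread. `ansible_ssh_common_args` is the standard Ansible variable for extra SSH options, and `ProxyJump` is the standard OpenSSH option.

```yaml
# Hypothetical inventory excerpt: group name and address are placeholders.
all:
  children:
    gcp_staging:
      hosts:
        gcp_dataspace_staging1:
          # Placeholder private address, reachable only from the bastion.
          ansible_host: 10.0.0.10
      vars:
        ansible_user: pulsys
        # Route every SSH connection for this group through the jump host.
        # Assumes the controller's key is authorized on the bastion as well
        # as on the target hosts.
        ansible_ssh_common_args: "-o ProxyJump=pulsys@bastion-staging.pulcloud.io"
```

Keeping the option in group vars rather than in a personal `~/.ssh/config` means Tower and a laptop would pick up the same connection settings from the repo.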

kayiwa commented 7 months ago

This link on cdh-ansible allows connection to the cloud VMs.

acozine commented 7 months ago

Current failing hosts - staging:

bastion-staging.pulcloud.io
pdc-globus-staging-postcuration   
pdc-globus-staging-precuration   
prds-dataspace-staging-endpoint1 

prod:

bastion-prod.pulcloud.io
cdh-derrida1.princeton.edu
cdh-derrida-crawl1.princeton.edu
community.docnow.io
lib-approvals-prod1.princeton.edu
libimages2.princeton.edu
pdc-globus-prod-postcuration
pdc-globus-prod-precuration
prds-dataspace-dtn1
prod.pulcloud.io
pulmirror.princeton.edu

acozine commented 7 months ago

In qa, the only warning I got was [WARNING]: Could not match supplied host pattern, ignoring: cdh_shared_qa

acozine commented 7 months ago

With the most recent changes to the repo, and nothing in my .ssh/config file relating to using the bastion hosts, Ansible can now connect to GCP hosts from my laptop. The playbook fails, but I don't get unreachable any more:

% ansible-playbook playbooks/os_updates.yml --limit dspace_staging

PLAY [update the Operating System packages] *****************************************************************************************************

TASK [Gathering Facts] **************************************************************************************************************************
ok: [gcp_oar_staging1]
ok: [gcp_dataspace_staging1]

TASK [Ubuntu | refresh keys] ********************************************************************************************************************
changed: [gcp_oar_staging1]
changed: [gcp_dataspace_staging1]

TASK [Ubuntu | Upgrade all packages] ************************************************************************************************************
fatal: [gcp_dataspace_staging1]: FAILED! => {"changed": false, "msg": "Failed to update apt cache: unknown reason"}
fatal: [gcp_oar_staging1]: FAILED! => {"changed": false, "msg": "Failed to update apt cache: unknown reason"}

PLAY RECAP **************************************************************************************************************************************
gcp_dataspace_staging1     : ok=2    changed=1    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0   
gcp_oar_staging1           : ok=2    changed=1    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0

However, from Tower those same hosts still report UNREACHABLE - see https://ansible-tower.princeton.edu/#/jobs/playbook/98/output.

acozine commented 5 months ago

Current failures in Tower:

production: https://ansible-tower.princeton.edu/#/jobs/playbook/865/output
staging: https://ansible-tower.princeton.edu/#/jobs/playbook/861/output

VickieKarasic commented 5 months ago

This PR updates much of our inventory: https://github.com/pulibrary/princeton_ansible/pull/4844/

acozine commented 5 months ago

A combination of @VickieKarasic's PR and manually adding the Tower keys to a few strays fixed this. All hosts in the staging, production, and qa groups are now reachable from Tower.
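For reference, adding the Tower key to a stray host could be sketched as a small play like the one below. This is an illustration, not the playbook that was actually run: the key file path `files/tower_key.pub` and the `--limit` target are hypothetical, and the `pulsys` user is taken from the SSH errors earlier in this issue. It relies on the `authorized_key` module from the `ansible.posix` collection.

```yaml
# Hypothetical play: the key file path is a placeholder.
- name: authorize the Tower public key on stray hosts
  hosts: staging
  become: true
  tasks:
    - name: add the Tower public key for the pulsys user
      ansible.posix.authorized_key:
        user: pulsys
        state: present
        # Reads the public key from a file on the control machine.
        key: "{{ lookup('file', 'files/tower_key.pub') }}"
```

Because Tower cannot reach these hosts until its key is in place, a play like this would have to be run from a machine that already has SSH access, limited to the affected hosts with `--limit`.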