hashicorp / packer-plugin-ansible

Packer plugin for Ansible Provisioner
https://www.packer.io/docs/provisioners/ansible
Mozilla Public License 2.0

Installing dnf packages causes connection loss #133

Closed max-wittig closed 1 year ago

max-wittig commented 1 year ago

Overview of the Issue

When installing dnf packages from an Ansible playbook run by Packer, Ansible loses its SSH connection to the instance.

Reproduction Steps

  1. Use the latest Fedora AMI with packer on AWS
  2. Try to install packages inside an Ansible playbook
  3. Observe the error "Shared connection to 127.0.0.1 closed."

Plugin and Packer version

From packer version

Docker image: hashicorp/packer:1.8.3 or 1.8.5

Simplified Packer Buildfile

build {
  sources = ["source.amazon-ebs.x86_64", "source.amazon-ebs.aarch64"]

  provisioner "ansible" {
    playbook_file = "01-osbuild.yml"
    user          = "fedora"
    ansible_ssh_extra_args = [
      "-oHostKeyAlgorithms=+ssh-rsa -oPubkeyAcceptedKeyTypes=+ssh-rsa"
    ] # https://github.com/hashicorp/packer-plugin-ansible/issues/69
    extra_arguments = [
      "--scp-extra-args", "'-O'", # https://github.com/eschercloudai/image-builder/commit/81fa794be4a9f353ebb8c2424143c5fb9d7cd5cf
    ]
  }
}

- hosts: all
  become: yes
  tasks:
    - name: Machine Setup | Install general packages
      dnf:
        name:
          - vim-enhanced # updated and improved version of the vi editor
          - curl # tool to transfer data from or to a server
          - policycoreutils # policy core utilities
          - patch # apply a diff file to an original
          - rsync # File-Syncing
          - iotop # Monitor io activity
          - ncdu # Show disk usage interactively
          - htop # An interactive process viewer for Unix
          - lsof # list open files
          - jq # commandline JSON processor
          - chkconfig # Fedora systemd shim
          - iptables # Tools for managing Linux kernel packet filtering capabilities
          - openssl # Needed for certificate conversion in code-root-ca role
        state: installed

# .gitlab-ci.yml file
osbuild:
  image:
    entrypoint: [""]
    name: hashicorp/packer:${packer_version}
  before_script:
    - apk add --no-cache --quiet git ansible openssh openssh-sftp-server git-lfs
    - packer build .

Operating system and Environment details

Fedora 36 (x86_64 & aarch64), built with the Packer Docker image inside GitLab CI

Log Fragments and crash.log files

    amazon-ebs.aarch64: TASK [Gathering Facts] *********************************************************
    amazon-ebs.aarch64: task path: /builds/code-ops/gitlab-ci-4-linux/02-osbuild/01-osbuild.yml:2
    amazon-ebs.aarch64: <127.0.0.1> ESTABLISH SSH CONNECTION FOR USER: fedora

    amazon-ebs.aarch64: TASK [../code/roles/code-common : Machine Setup | Install general packages] ****
    amazon-ebs.aarch64: task path: /builds/code-ops/gitlab-ci-4-linux/code/roles/code-common/tasks/machine-setup.yml:53
    amazon-ebs.aarch64: <127.0.0.1> ESTABLISH SSH CONNECTION FOR USER: fedora
    amazon-ebs.aarch64: <127.0.0.1> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o Port=43179 -o 'IdentityFile="/tmp/ansible-key20 -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="fedora"' -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ConnectionAttempts=50 -o ControlMaster=auto -o ControlPersist=600s -oHostKeyAlgorithms=+ssh-rsa -oPubkeyAcceptedKeyTypes=+ssh-rsa -o 'ControlPath="/root/.ansible/cp/15638aec04"' 127.0.0.1 '/bin/sh -c '"'"'echo ~fedora && sleep 0'"'"''
    amazon-ebs.aarch64: <127.0.0.1> (0, b'/home/fedora\n', b'')
    amazon-ebs.aarch64: <127.0.0.1> ESTABLISH SSH CONNECTION FOR USER: fedora
    amazon-ebs.aarch64: <127.0.0.1> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o Port=43179 -o 'IdentityFile="/tmp/ansible-key2056693916"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="fedora"' -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ConnectionAttempts=50 -o ControlMaster=auto -o ControlPersist=600s -oHostKeyAlgorithms=+ssh-rsa -oPubkeyAcceptedKeyTypes=+ssh-rsa -o 'ControlPath="/root/.ansible/cp/15638aec04"' 127.0.0.1 '/bin/sh -c '"'"'( umask 77 && mkdir -p "` echo /home/fedora/.ansible/tmp `"&& mkdir "` echo /home/fedora/.ansible/tmp/ansible-tmp-1673600072.1333485-545-108410158722502 `" && echo ansible-tmp-1673600072.1333485-545-108410158722502="` echo /home/fedora/.ansible/tmp/ansible-tmp-1673600072.1333485-545-108410158722502 `" ) && sleep 0'"'"''
    amazon-ebs.aarch64: <127.0.0.1> (0, b'ansible-tmp-1673600072.1333485-545-108410158722502=/home/fedora/.ansible/tmp/ansible-tmp-1673600072.1333485-545-108410158722502\n', b'')
    amazon-ebs.aarch64: Using module file /usr/lib/python3.10/site-packages/ansible/modules/dnf.py
    amazon-ebs.aarch64: <127.0.0.1> PUT /root/.ansible/tmp/ansible-local-504ji6_o1sx/tmpy6ag3ari TO /home/fedora/.ansible/tmp/ansible-tmp-1673600072.1333485-545-108410158722502/AnsiballZ_dnf.py
    amazon-ebs.aarch64: <127.0.0.1> SSH: EXEC scp -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o Port=43179 -o 'IdentityFile="/tmp/ansible-key2056693916"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="fedora"' -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ConnectionAttempts=50 -o ControlMaster=auto -o ControlPersist=600s -O -o 'ControlPath="/root/.ansible/cp/15638aec04"' /root/.ansible/tmp/ansible-local-504ji6_o1sx/tmpy6ag3ari '[127.0.0.1]:/home/fedora/.ansible/tmp/ansible-tmp-1673600072.1333485-545-108410158722502/AnsiballZ_dnf.py'
    amazon-ebs.aarch64: <127.0.0.1> (0, b'', b'')
    amazon-ebs.aarch64: <127.0.0.1> ESTABLISH SSH CONNECTION FOR USER: fedora
    amazon-ebs.aarch64: <127.0.0.1> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o Port=43179 -o 'IdentityFile="/tmp/ansible-key2056693916"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="fedora"' -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ConnectionAttempts=50 -o ControlMaster=auto -o ControlPersist=600s -oHostKeyAlgorithms=+ssh-rsa -oPubkeyAcceptedKeyTypes=+ssh-rsa -o 'ControlPath="/root/.ansible/cp/15638aec04"' 127.0.0.1 '/bin/sh -c '"'"'chmod u+x /home/fedora/.ansible/tmp/ansible-tmp-1673600072.1333485-545-108410158722502/ /home/fedora/.ansible/tmp/ansible-tmp-1673600072.1333485-545-108410158722502/AnsiballZ_dnf.py && sleep 0'"'"''
    amazon-ebs.aarch64: <127.0.0.1> (0, b'', b'')
    amazon-ebs.aarch64: <127.0.0.1> ESTABLISH SSH CONNECTION FOR USER: fedora
    amazon-ebs.aarch64: <127.0.0.1> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o Port=43179 -o 'IdentityFile="/tmp/ansible-key2056693916"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="fedora"' -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ConnectionAttempts=50 -o ControlMaster=auto -o ControlPersist=600s -oHostKeyAlgorithms=+ssh-rsa -oPubkeyAcceptedKeyTypes=+ssh-rsa -o 'ControlPath="/root/.ansible/cp/15638aec04"' -tt 127.0.0.1 '/bin/sh -c '"'"'sudo -H -S -n  -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-ruuiwqxzngdntznebneegdbojgjhimen ; /usr/bin/python3 /home/fedora/.ansible/tmp/ansible-tmp-1673600072.1333485-545-108410158722502/AnsiballZ_dnf.py'"'"'"'"'"'"'"'"' && sleep 0'"'"''
    amazon-ebs.aarch64: Escalation succeeded
    amazon-ebs.aarch64: <127.0.0.1> (137, b'', b'Shared connection to 127.0.0.1 closed.\r\n')
    amazon-ebs.aarch64: <127.0.0.1> Failed to connect to the host via ssh: Shared connection to 127.0.0.1 closed.
    amazon-ebs.aarch64: <127.0.0.1> ESTABLISH SSH CONNECTION FOR USER: fedora
    amazon-ebs.aarch64: <127.0.0.1> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o Port=43179 -o 'IdentityFile="/tmp/ansible-key2056693916"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="fedora"' -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ConnectionAttempts=50 -o ControlMaster=auto -o ControlPersist=600s -oHostKeyAlgorithms=+ssh-rsa -oPubkeyAcceptedKeyTypes=+ssh-rsa -o 'ControlPath="/root/.ansible/cp/15638aec04"' 127.0.0.1 '/bin/sh -c '"'"'rm -f -r /home/fedora/.ansible/tmp/ansible-tmp-1673600072.1333485-545-108410158722502/ > /dev/null 2>&1 && sleep 0'"'"''
    amazon-ebs.aarch64: <127.0.0.1> (0, b'', b'')
    amazon-ebs.aarch64: fatal: [default]: FAILED! => {
    amazon-ebs.aarch64:     "changed": false,
    amazon-ebs.aarch64:     "module_stderr": "Shared connection to 127.0.0.1 closed.\r\n",
    amazon-ebs.aarch64:     "module_stdout": "",
    amazon-ebs.aarch64:     "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error",
    amazon-ebs.aarch64:     "rc": 137
    amazon-ebs.aarch64: }
    amazon-ebs.aarch64:
    amazon-ebs.aarch64: PLAY RECAP *********************************************************************
    amazon-ebs.aarch64: default                    : ok=3    changed=2    unreachable=0    failed=1    skipped=2    rescued=0    ignored=0
    amazon-ebs.aarch64:
==> amazon-ebs.aarch64: Provisioning step had errors: Running the cleanup provisioner, if present...

We've tried a lot to work around the issue and haven't found anyone else with this problem, so maybe we're doing something wrong here.

/cc @dlouzan

max-wittig commented 1 year ago

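We finally solved this problem after days of debugging. It turns out that Fedora 36 now has a more aggressive out-of-memory killer, which killed our dnf install every time. That is consistent with the rc 137 in the log above (128 + 9, meaning the process was terminated with SIGKILL).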

See also: https://bugzilla.redhat.com/show_bug.cgi?id=1941170

We're still not sure how this change could have happened, given that we even tried a Fedora 36 AMI from May 2022.

The workaround for now was simply to use a larger instance size for building the image.
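For readers hitting the same failure, a minimal sketch of that workaround might look like the fragment below. It assumes the `amazon-ebs` sources referenced in the build block above; the instance types are purely illustrative and are not the values used in the original build. Pick any size with more memory than the one that failed.

```hcl
# Sketch only: fragment of the source definitions, not a complete template.
# The instance types are hypothetical examples chosen for illustration.
source "amazon-ebs" "x86_64" {
  # existing AMI, region, and SSH settings stay unchanged
  instance_type = "t3.medium" # assumption: more RAM than the previous size
}

source "amazon-ebs" "aarch64" {
  # existing AMI, region, and SSH settings stay unchanged
  instance_type = "t4g.medium" # assumption: more RAM than the previous size
}
```

Only the build instance needs the extra memory. Instances launched later from the resulting AMI can use whatever size suits the workload.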