mitogen-hq / mitogen

Distributed self-replicating programs in Python
https://mitogen.networkgenomics.com/
BSD 3-Clause "New" or "Revised" License

delegate_facts from a container's host to a container creates a cycle #584

Open jimmymccrory opened 5 years ago

jimmymccrory commented 5 years ago

I'm helping with the OpenStack-Ansible integration, and we're currently running into an error when a host tries to delegate its facts to a container that it's hosting.

I've been able to reproduce this with a much simpler playbook and inventory than an OpenStack-Ansible deployment would be using.

playbook

---
- hosts: localhost
  gather_facts: no
  tasks:
    - name: install lxc
      apt:
        name: python-lxc,lxc,lxc-templates
        state: present
    - name: create a container
      lxc_container:
        name: test-container
        state: started
        template: ubuntu
        template_options: --packages python
    - name: gather and delegate facts
      setup:
        gather_subset: '!all:hardware'
      delegate_to: test-container
      delegate_facts: true

inventory

all:
  hosts:
    test-container:
      ansible_hostname: test-container
      mitogen_container_name: test-container
      mitogen_kind: lxc
      mitogen_via: localhost
      ansible_connection: setns

TASK [gather and delegate facts] *******************************************************************************************************************************************************************************************************
task path: /tmp/test.yml:15
[task 26394] 11:27:07.272509 D mitogen: unix.connect(path='/tmp/mitogen_unix_etAT4g.sock')
[task 26394] 11:27:07.273311 D mitogen: unix.connect(): local ID is 1004, remote is 0
[mux  26323] 11:27:07.273296 D mitogen: mitogen.unix.Listener('/tmp/mitogen_unix_etAT4g.sock'): accepted mitogen.core.Stream('unix_client.26394')
[task 26394] 11:27:07.278025 D mitogen: mitogen.core.Stream('unix_listener.26323').on_disconnect()
[mux  26323] 11:27:07.278400 D mitogen: mitogen.core.Stream('unix_client.26394').on_disconnect()
[task 26394] 11:27:07.278428 D mitogen: Waker(Broker(0x7f21e3ea6dd0) rfd=9, wfd=10).on_disconnect()
[task 26394] 11:27:07.278958 D mitogen: Router(Broker(0x7f21e3ea6dd0)): stats: 0 module requests in 0 ms, 0 sent (0 ms minify time), 0 negative responses. Sent 0.0 kb total, 0.0 kb avg.
fatal: [localhost]: UNREACHABLE! => {
    "changed": false,
    "msg": "mitogen_via=None of localhost creates a cycle (localhost -> localhost)",
    "unreachable": true
}
    to retry, use: --limit @/tmp/test.retry

Both the host and container are running Ubuntu 18.04.2 and Python 2.7.15rc1

ansible 2.7.10
  config file = /root/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python2.7/dist-packages/ansible
  executable location = /usr/local/bin/ansible
  python version = 2.7.15rc1 (default, Nov 12 2018, 14:31:15) [GCC 7.3.0]

jrosser commented 3 years ago

@s1113950 do you have any ideas on this? We've been trying every so often to get openstack-ansible and mitogen working together, and this looks like one of the final remaining issues we have.

s1113950 commented 3 years ago

At first glance I'm not sure, but here's how I run Mitogen using delegate_to pointing at a container that I created in the playbook run (note that I don't use lxc_container though): https://github.com/s1113950/mitogen-test/blob/7a39ef020712e8ff225a3343d72b56c96d71382a/roles/run_test/tasks/main.yml#L64. If possible, you could also try upgrading to one of the new Mitogen tags (https://github.com/mitogen-hq/mitogen/releases/tag/v0.2.10-rc.0 if you need Ansible 2.7 support) to see if the issue still exists.

GeorginaShippey commented 3 years ago

I've recreated this issue using ansible 2.10.6 and mitogen v0.3.0rc1 with docker, using a mishmash of the playbook jimmymccrory provided and the mitogen-test repo linked above.

In OpenStack-Ansible we have CI jobs that essentially spin up a small cloud on a single physical host, with the various cloud services running in lxc containers on that host. That is our use case for testing mitogen_via pointing at localhost.

The code I have been looking at most closely is the _stack_from_spec method in connection.py, specifically the lines around cycle detection. https://github.com/mitogen-hq/mitogen/blob/cc8f9a016965876bcd9ec390d53035d6ed842b07/ansible_mitogen/connection.py#L734

Having added a couple of debug log statements, we can see more detailed output of what happens when we try to delegate a task to a container on the localhost using mitogen_via=localhost.

TASK [gather and delegate facts] ***************************************************************************************************************************
task path: /home/ubuntu/mitogen-delegation-bug/docker-reproduce-bug.yml:36
[task 123955] 12:36:29.856573 D ansible_mitogen.affinity: CPU mask for WorkerProcess: 0x000001
[task 123955] 12:36:29.863105 D ansible_mitogen.connection: In _stack_from_spec, spec.inventory_name: localhost, seen_names: (), spec.mitogen_via: localhost
[task 123955] 12:36:29.863280 D ansible_mitogen.connection: Calling _stack_from_spec(spec_from_via))
[task 123955] 12:36:29.867411 D ansible_mitogen.connection: In _stack_from_spec, spec.inventory_name: localhost, seen_names: ('localhost',), spec.mitogen_via: None
[task 123955] 12:36:29.867745 D ansible_mitogen.mixins: _remove_tmp_path(None)
fatal: [localhost]: UNREACHABLE! => {
    "changed": false,
    "msg": "mitogen_via=None of localhost creates a cycle (localhost -> localhost)",
    "unreachable": true
}

If mitogen_via is set, _stack_from_spec is called a second time with the current inventory_name added to seen_names. https://github.com/mitogen-hq/mitogen/blob/cc8f9a016965876bcd9ec390d53035d6ed842b07/ansible_mitogen/connection.py#L749 In this case the inventory_name is localhost on both calls, so on the second call localhost is already in seen_names and a cycle is detected.

We feel this does not account for the fact that the task is being delegated to a container. Perhaps this is a unique case in which delegation needs to be detected and the cycle allowed, but I'm unsure how that should be done, as I don't think we have access to that information here.
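
The behaviour described above can be modelled with a small standalone sketch. This is not the code from ansible_mitogen/connection.py: the dict-based specs, the stack_from_spec function, the lookup_via callback and CycleError are all hypothetical stand-ins, and the only things taken from this issue are the field names in the debug output and the wording of the error message. It assumes, as the debug output suggests, that the delegated spec reports localhost as its inventory_name while its mitogen_via is also localhost.

```python
class CycleError(Exception):
    """Hypothetical stand-in for the connection failure Ansible reports."""


def stack_from_spec(spec, lookup_via, seen_names=()):
    """Simplified model of the recursion: resolve mitogen_via hops first,
    refusing any inventory name that has already been visited."""
    name = spec["inventory_name"]
    if name in seen_names:
        chain = " -> ".join(reversed(seen_names + (name,)))
        raise CycleError(
            "mitogen_via=%s of %s creates a cycle (%s)"
            % (spec["mitogen_via"], name, chain)
        )

    stack = ()
    if spec["mitogen_via"]:
        # Second call: the current name is appended to seen_names.
        stack = stack_from_spec(
            lookup_via(spec["mitogen_via"]), lookup_via, seen_names + (name,)
        )
    return stack + (spec["transport"],)


# Hypothetical inventory: localhost has no mitogen_via of its own.
specs = {
    "localhost": {
        "inventory_name": "localhost",
        "mitogen_via": None,
        "transport": "local",
    },
}

# Assumption: with delegation, the spec carries the container's mitogen_via
# but still reports the play host (localhost) as its inventory_name.
delegated = {
    "inventory_name": "localhost",
    "mitogen_via": "localhost",
    "transport": "docker",
}

try:
    stack_from_spec(delegated, specs.__getitem__)
    message = None
except CycleError as exc:
    message = str(exc)

# If the spec reported the container's own name, no cycle would be seen.
named = dict(delegated, inventory_name="docker-test-container")
ok_stack = stack_from_spec(named, specs.__getitem__)
```

In this model the first spec trips the check purely because its reported name matches the name of the host it is routed via, whereas the same spec under the container's own name resolves to a two-hop stack.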

Having removed the cycle detection the play runs through smoothly.

Here are my Ansible playbook and Dockerfile, in case they are of any use:

Playbook

---
- hosts: localhost
  gather_facts: no
  vars:
    ansible_python_interpreter: /usr/bin/python3
  tasks:
    - name: stopping any old test container
      docker_container:
        name: docker-test-container
        state: absent
      vars:
        ansible_python_interpreter: /usr/bin/python3

    - name: Wait for container to be stopped
      pause:
        seconds: 2

    - name: create a container
      docker_container:
        name: docker-test-container
        state: started
        image: test-docker-image:latest

    - name: add container to inventory
      add_host:
        name: docker-test-container
        ansible_user: test
        ansible_password: test
        ansible_ssh_port: 22
        ansible_connection: docker
        #ansible_connection: setns
        mitogen_kind: docker
        mitogen_via: localhost

    - name: gather and delegate facts
      setup:
        gather_subset: '!all:hardware'
      delegate_to: docker-test-container
      delegate_facts: true

    - name: delegate a task
      debug:
        msg: "Ansible_host: {{ ansible_host }}"
      delegate_to: docker-test-container
      delegate_facts: true

Dockerfile

FROM ubuntu:20.04
RUN apt-get update && apt-get install -y python3 openssh-server sudo

RUN useradd -rm -d /home/ubuntu -s /bin/bash -g root -G sudo -u 1000 test

RUN  echo 'test:test' | chpasswd

EXPOSE 22

RUN service ssh start

CMD ["/usr/sbin/sshd","-D"]

pmyjavec commented 3 years ago

A little bit of extra feedback on this issue: we were having a similar problem using the LXD connection plugin, and removing the cycle detection fixes it for us too. With the following if statement commented out, our playbooks work perfectly.

https://github.com/mitogen-hq/mitogen/blob/9d404e0b32af87815a3dfe5ed014d0c5a4b6e07b/ansible_mitogen/connection.py#L734