mitogen-hq / mitogen

Distributed self-replicating programs in Python
https://mitogen.networkgenomics.com/
BSD 3-Clause "New" or "Revised" License

Mitogen (Ansible) completely broken on CentOS 7 #680

Open Gaibhne opened 4 years ago

Gaibhne commented 4 years ago

Sorry for the somewhat sensational title, but there's really no other way to put it. It simply does not work, and gives no helpful error, nor anything I could debug. On all our targets, I get the following when trying to run Ansible with Mitogen enabled:

fatal: [hostname]: UNREACHABLE! => {"changed": false, "msg": "EOF on stream; last 100 lines received:\nbash: auto_silent: command not found", "unreachable": true}

Google gives nothing for the error. I have tried rebooting.

ansible 2.9.2 and mitogen-0.2.9.

No.

Not that I know of.

I don't know. It is not clear to me how I would go about building it; there seems to be no documentation, and I am not familiar with the ecosystem used. I have tried linking to the contained ansible_mitogen/plugins/strategy but that did not fix the issue.

No, other than that all my target hosts are CentOS 7 machines. I have included the debug output of mitogen_get_stack that includes the 'auto_silent' string, but I don't understand the significance:

- /stuff/ansible · master * →  ANSIBLE_STRATEGY=mitogen_linear ansible -m mitogen_get_stack -b -i production.yml horus --private-key sshkeys/ansible.key
horus | CHANGED => {
    "changed": true,
    "result": [
        {
            "kwargs": {
                "check_host_keys": "enforce",
                "compression": true,
                "connect_timeout": 10,
                "hostname": "<internal ip>",
                "identities_only": false,
                "identity_file": "sshkeys/ansible.key",
                "keepalive_count": 10,
                "keepalive_interval": 30,
                "password": null,
                "port": null,
                "python_path": [
                    "auto_silent"
                ],
                "remote_name": null,
                "ssh_args": [
                    "-C",
                    "-o",
                    "ControlMaster=auto",
                    "-o",
                    "ControlPersist=60s"
                ],
                "ssh_debug_level": null,
                "ssh_path": "ssh",
                "username": "ansible"
            },
            "method": "ssh"
        },
        {
            "enable_lru": true,
            "kwargs": {
                "connect_timeout": 10,
                "password": null,
                "python_path": [
                    "auto_silent"
                ],
                "remote_name": null,
                "sudo_args": [
                    "-H",
                    "-S",
                    "-n"
                ],
                "sudo_path": "sudo",
                "username": "root"
            },
            "method": "sudo"
        }
    ]
}

CentOS 7, host and targets.

Python 2.7.5 on all machines

s1113950 commented 4 years ago

auto_silent is an ansible_python_interpreter value related to interpreter discovery, which Mitogen currently doesn't support. @Gaibhne can you please try my patch here: https://github.com/dw/mitogen/pull/658 and let me know if it works for you? It supports auto_silent as the python interpreter.
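
For context, `auto_silent` is one of the interpreter-discovery modes Ansible 2.8+ accepts for ansible_python_interpreter (the others being `auto`, `auto_legacy` and `auto_legacy_silent`). The `python_path` entries in the mitogen_get_stack output above show the mode name passed through as if it were an executable, which is consistent with the "bash: auto_silent: command not found" message. Below is a minimal sketch of a check for that condition. The helper names are hypothetical and are not part of Ansible or Mitogen.

```python
# Interpreter-discovery mode names accepted by Ansible 2.8+ in place of a
# real interpreter path. These are sentinels, not executables.
DISCOVERY_MODES = frozenset([
    "auto",
    "auto_legacy",
    "auto_silent",
    "auto_legacy_silent",
])


def is_discovery_mode(value):
    """Return True if value is a discovery sentinel rather than a path."""
    return value in DISCOVERY_MODES


def needs_explicit_interpreter(python_path):
    """Check a python_path list shaped like the one in mitogen_get_stack.

    Hypothetical helper: returns True when any entry is a discovery
    sentinel, meaning something would try to execute e.g. 'auto_silent'.
    """
    return any(is_discovery_mode(entry) for entry in python_path)


if __name__ == "__main__":
    print(needs_explicit_interpreter(["auto_silent"]))
    print(needs_explicit_interpreter(["/usr/bin/python"]))
```

If such a check returns True on a Mitogen version without discovery support, setting ansible_python_interpreter to a concrete path in the inventory is one way to sidestep the problem.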

Gaibhne commented 4 years ago

@s1113950 thanks for your reply. I tried it, and it worked ... somewhat. It generally works, but I get errors on about 25% of my hosts, seemingly at random: the same host will sometimes work and sometimes crash. I am attaching crash logs from two different servers (both of which worked fine in other runs). The problem seems to occur mostly (always?) in the 'Gathering Facts' phase.

ERROR! [mux  16864] 09:43:27.685369 E mitogen: <Stream ssh.omnibus.company.com #fcd0> crashed
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/mitogen-0.2.9-py2.7.egg/mitogen/core.py", line 3481, in _call
    func(self)
  File "/usr/lib/python2.7/site-packages/mitogen-0.2.9-py2.7.egg/mitogen/core.py", line 1719, in on_transmit
    self.protocol.on_transmit(broker)
  File "/usr/lib/python2.7/site-packages/mitogen-0.2.9-py2.7.egg/mitogen/core.py", line 2167, in on_transmit
    self._writer.on_transmit(broker)
  File "/usr/lib/python2.7/site-packages/mitogen-0.2.9-py2.7.egg/mitogen/core.py", line 1907, in on_transmit
    written = self._protocol.stream.transmit_side.write(buf)
  File "/usr/lib/python2.7/site-packages/mitogen-0.2.9-py2.7.egg/mitogen/core.py", line 2033, in write
    written, disconnected = io_op(os.write, self.fd, s)
  File "/usr/lib/python2.7/site-packages/mitogen-0.2.9-py2.7.egg/mitogen/core.py", line 553, in io_op
    return func(*args), None
OSError: [Errno 11] Resource temporarily unavailable
fatal: [omnibus]: UNREACHABLE! => {"changed": false, "msg": "Mitogen was disconnected from the remote environment while a call was in-progress. If you feel this is in error, please file a bug. Original error was: the respondent Context has disconnected", "unreachable": true}

And:

ERROR! [mux  17684] 09:54:50.278387 E mitogen: <Stream ssh.10.100.1.60 #cfd0> crashed
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/mitogen-0.2.9-py2.7.egg/mitogen/core.py", line 3481, in _call
    func(self)
  File "/usr/lib/python2.7/site-packages/mitogen-0.2.9-py2.7.egg/mitogen/core.py", line 1719, in on_transmit
    self.protocol.on_transmit(broker)
  File "/usr/lib/python2.7/site-packages/mitogen-0.2.9-py2.7.egg/mitogen/core.py", line 2167, in on_transmit
    self._writer.on_transmit(broker)
  File "/usr/lib/python2.7/site-packages/mitogen-0.2.9-py2.7.egg/mitogen/core.py", line 1907, in on_transmit
    written = self._protocol.stream.transmit_side.write(buf)
  File "/usr/lib/python2.7/site-packages/mitogen-0.2.9-py2.7.egg/mitogen/core.py", line 2033, in write
    written, disconnected = io_op(os.write, self.fd, s)
  File "/usr/lib/python2.7/site-packages/mitogen-0.2.9-py2.7.egg/mitogen/core.py", line 553, in io_op
    return func(*args), None
OSError: [Errno 11] Resource temporarily unavailable
fatal: [alkoholix]: UNREACHABLE! => {"changed": false, "msg": "Mitogen was disconnected from the remote environment while a call was in-progress. If you feel this is in error, please file a bug. Original error was: the respondent Context has disconnected", "unreachable": true}
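
A note on the tracebacks above: on Linux, errno 11 is EAGAIN, the error a write to a non-blocking file descriptor returns when the kernel buffer is full. The sketch below only reproduces that error class on a plain pipe. It does not diagnose why Mitogen's stream hit it, and the function name is made up for illustration. It targets Python 3, where the error surfaces as BlockingIOError (a subclass of OSError). On the Python 2.7 in this thread it appears as a bare OSError.

```python
import errno
import os


def fill_pipe_nonblocking(chunk=b"x" * 65536):
    """Write to a non-blocking pipe until the kernel buffer is full.

    Nobody reads from the pipe, so the buffer eventually fills and
    os.write() fails instead of blocking. Returns (bytes_written, errno).
    """
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)  # same mode as an event-loop-driven fd
    total = 0
    try:
        while True:
            try:
                total += os.write(write_fd, chunk)
            except BlockingIOError as exc:
                # The write would have blocked; a caller would normally
                # wait for the fd to become writable and retry.
                return total, exc.errno
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    written, err = fill_pipe_nonblocking()
    print(written, errno.errorcode[err])
```

Code that drives non-blocking descriptors usually treats EAGAIN as "retry once the descriptor is writable" rather than as a fatal error.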

s1113950 commented 4 years ago

Ok! It's a start 🤔 Can you dump more output from a run with -vvv? It's not immediately clear to me why it would work sometimes and not other times.

s1113950 commented 4 years ago

I made more tweaks to my patch to make interpreter discovery smarter. Can you try it again and see if it works 100% of the time @Gaibhne ?

s1113950 commented 4 years ago

I think I closed this prematurely, sorry about that. @Gaibhne please give the latest master of Mitogen another try and see if your issue still persists.

Gaibhne commented 4 years ago

Unfortunately you are right. I just tried with a5fe4a9fac5561511b676fe61ed127b732be5b12, which is the current master, and got the following result during fact gathering. It now seems all hosts are completely broken. Additionally, the process hangs forever after all the errors:

ERROR! [mux  775] 12:25:54.972593 E mitogen: <Stream ssh.omnibus.company.com #24d0> crashed
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/mitogen-0.2.9-py2.7.egg/mitogen/core.py", line 3481, in _call
    func(self)
  File "/usr/lib/python2.7/site-packages/mitogen-0.2.9-py2.7.egg/mitogen/core.py", line 1719, in on_transmit
    self.protocol.on_transmit(broker)
  File "/usr/lib/python2.7/site-packages/mitogen-0.2.9-py2.7.egg/mitogen/core.py", line 2167, in on_transmit
    self._writer.on_transmit(broker)
  File "/usr/lib/python2.7/site-packages/mitogen-0.2.9-py2.7.egg/mitogen/core.py", line 1907, in on_transmit
    written = self._protocol.stream.transmit_side.write(buf)
  File "/usr/lib/python2.7/site-packages/mitogen-0.2.9-py2.7.egg/mitogen/core.py", line 2033, in write
    written, disconnected = io_op(os.write, self.fd, s)
  File "/usr/lib/python2.7/site-packages/mitogen-0.2.9-py2.7.egg/mitogen/core.py", line 553, in io_op
    return func(*args), None
OSError: [Errno 11] Resource temporarily unavailable
fatal: [omnibus]: UNREACHABLE! => {"changed": false, "msg": "Mitogen was disconnected from the remote environment while a call was in-progress. If you feel this is in error, please file a bug. Original error was: the respondent Context has disconnected", "unreachable": true}

s1113950 commented 4 years ago

Can you try with Ansible version 2.8.8? Ansible 2.9+ isn't fully supported yet.

Gaibhne commented 4 years ago

Just tried with 2.8.8, no joy:

ERROR! [mux  1618] 21:16:44.610331 E mitogen: <Stream ssh.omnibus.company.com #1750> crashed
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/mitogen-0.2.9-py2.7.egg/mitogen/core.py", line 3481, in _call
    func(self)
  File "/usr/lib/python2.7/site-packages/mitogen-0.2.9-py2.7.egg/mitogen/core.py", line 1719, in on_transmit
    self.protocol.on_transmit(broker)
  File "/usr/lib/python2.7/site-packages/mitogen-0.2.9-py2.7.egg/mitogen/core.py", line 2167, in on_transmit
    self._writer.on_transmit(broker)
  File "/usr/lib/python2.7/site-packages/mitogen-0.2.9-py2.7.egg/mitogen/core.py", line 1907, in on_transmit
    written = self._protocol.stream.transmit_side.write(buf)
  File "/usr/lib/python2.7/site-packages/mitogen-0.2.9-py2.7.egg/mitogen/core.py", line 2033, in write
    written, disconnected = io_op(os.write, self.fd, s)
  File "/usr/lib/python2.7/site-packages/mitogen-0.2.9-py2.7.egg/mitogen/core.py", line 553, in io_op
    return func(*args), None
OSError: [Errno 11] Resource temporarily unavailable
fatal: [omnibus]: UNREACHABLE! => {"changed": false, "msg": "Mitogen was disconnected from the remote environment while a call was in-progress. If you feel this is in error, please file a bug. Original error was: the respondent Context has disconnected", "unreachable": true}

s1113950 commented 4 years ago

Shoot, ok. Can you post a minimal reproducible playbook for me to play with? I test inside CentOS 7 Docker images and things work for me.

s1113950 commented 4 years ago

Also, are you sure it's not related to a proxy or anything? Can you connect to all the machines manually?

Gaibhne commented 4 years ago

Since it happens even during the "Gathering Facts" step, I suspect no aspects of the playbooks themselves are responsible. I have two more observations that I think may be helpful:

I have output from a run with -vv, which still produced the problems, as well as from one with -vvv, which ran without problems, but I don't feel comfortable posting so much data in public. Is there a non-public mechanism to provide you with the dumps?

s1113950 commented 4 years ago

Hmmm, interesting. I remember that happening to me before as well for something unrelated (where I added a different number of -v flags and it worked).

You can make a private repo of the dump and then invite me to it :)

awerner25 commented 4 years ago

I have the same issue with CentOS 7 servers. The problem is not in my playbooks.

If I just launch ansible all -i hosts -m ping -v:

ansible 2.7.10
  config file = /usr/local/ansible/ansible.cfg
  configured module search path = [u'/home/user/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Aug  7 2019, 00:51:29) [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)]

I have the latest version of Mitogen.

My ansible.cfg:

[defaults]
inventory      = hosts
debug = dark gray
gathering = smart
fact_caching = jsonfile
fact_caching_connection = ./tmp/
strategy_plugins = ./mitogen/ansible_mitogen/plugins/strategy
strategy = mitogen_linear
callback_plugins=/usr/lib/python2.7/site-packages/ara/plugins/callbacks
action_plugins=/usr/lib/python2.7/site-packages/ara/plugins/actions

[ara]
ARA_DIR=/DATA/ara
ARA_HOST=0.0.0.0

[colors]
verbose = bright blue

[ssh_connection]
scp_if_ssh = True
transfer_method = scp
sftp_batch_mode = False

fauust commented 4 years ago

Hi! I am not sure if this helps, and I am a bit late to the thread, but I had the same random problem and resolved it by specifying ansible_python_interpreter in my hosts file:

[all:vars]
ansible_python_interpreter=/usr/bin/python3

[db]
mariadb01 ansible_python_interpreter=/usr/bin/python
mariadb02

[www]
www01

Hope that helps!

awerner25 commented 4 years ago

Hello,

I tried your workaround, fauust, but it changed nothing for me.

Gaibhne commented 4 years ago

I can finally contribute something new, @s1113950! Today I tried a fresh pull, since I figured maybe something committed in the meantime had fixed it. Lo and behold, it worked just fine! However, just as I was about to close this ticket, I had to Ctrl-C a running playbook, and right after that it started happening again.

Hopefully that little detail helps narrow down the scope of the issue.

Gaibhne commented 4 years ago

I have recently switched to a new CentOS 8 VM and it is not happening there. If nothing else, this proves that the problem is definitely not on the target computers, as I am using the same playbooks and inventory. As I can no longer reproduce this issue, I'm leaving it up to the maintainers to close it if they want.