mitogen-hq / mitogen

Distributed self-replicating programs in Python
https://mitogen.networkgenomics.com/
BSD 3-Clause "New" or "Revised" License

SSH connect failures on Mitogen 0.2.9 on WSL Ubuntu 18.04 #681

Open gchaix opened 4 years ago

gchaix commented 4 years ago

I'm seeing consistent failures when trying to connect via SSH when multiple hosts are specified in the inventory:

TASK [Gathering Facts] **********************************************************************************************************************************************
ERROR! [mux  15260] 10:54:20.330539 E mitogen: <Stream ssh.stage-web1 #6e10> crashed
Traceback (most recent call last):
  File "/home/gchaix/repos/xxx/ansible/plugins/mitogen-0.2.9/mitogen/core.py", line 3481, in _call
    func(self)
  File "/home/gchaix/repos/xxx/ansible/plugins/mitogen-0.2.9/mitogen/core.py", line 1719, in on_transmit
    self.protocol.on_transmit(broker)
  File "/home/gchaix/repos/xxx/ansible/plugins/mitogen-0.2.9/mitogen/core.py", line 2167, in on_transmit
    self._writer.on_transmit(broker)
  File "/home/gchaix/repos/xxx/ansible/plugins/mitogen-0.2.9/mitogen/core.py", line 1907, in on_transmit
    written = self._protocol.stream.transmit_side.write(buf)
  File "/home/gchaix/repos/xxx/ansible/plugins/mitogen-0.2.9/mitogen/core.py", line 2033, in write 
    written, disconnected = io_op(os.write, self.fd, s)
  File "/home/gchaix/repos/xxx/ansible/plugins/mitogen-0.2.9/mitogen/core.py", line 553, in io_op
    return func(*args), None
OSError: [Errno 11] Resource temporarily unavailable
fatal: [stage-web1]: UNREACHABLE! => {"changed": false, "msg": "Mitogen was disconnected from the remote environment while a call was in-progress. If you feel this is in error, please file a bug. Original error was: the respondent Context has disconnected", "unreachable": true}
ok: [stage-web2] 

One host connects; all of the other host connections fail. If there are more than two hosts in the inventory, all but one fail with the same errors. Repeated runs show that which hosts fail appears to be random.

PLAY RECAP **********************************************************************************************************************************************************
prod-solr1 : ok=0    changed=0    unreachable=1    failed=0
prod-solr2 : ok=0    changed=0    unreachable=1    failed=0
prod-solr3 : ok=0    changed=0    unreachable=1    failed=0
prod-util1 : ok=8    changed=0    unreachable=0    failed=0
prod-web1  : ok=0    changed=0    unreachable=1    failed=0
prod-web2  : ok=0    changed=0    unreachable=1    failed=0
prod-web3  : ok=0    changed=0    unreachable=1    failed=0

Environment:
Mitogen 0.2.9
Windows 10 Pro, V. 1809, OS build 17763.914
WSL Ubuntu 18.04.3 LTS
ansible 2.7.11
  config file = /home/gchaix/repos/xxx/ansible/ansible.cfg
  configured module search path = [u'/home/gchaix/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /home/gchaix/.local/lib/python2.7/site-packages/ansible
  executable location = /home/gchaix/.local/bin/ansible
  python version = 2.7.15+ (default, Oct 7 2019, 17:39:04) [GCC 7.4.0]

Host target OS is generally CentOS 7.x, but this also appears to be happening with other distros (Ubuntu, etc.)

No patches on Ansible or Mitogen. I tried running it with Mitogen current master, same behavior. This feels like it might be related to #319 but I'm not familiar enough with the internals of WSL to really say for certain. Interestingly, running Ansible with -vvv seems to bypass the issue, as all host connections succeed, whereas running with just --verbose produces failure and the output above.
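
For readers unfamiliar with the error in the traceback: Errno 11 is EAGAIN on Linux, which os.write raises when a non-blocking file descriptor cannot accept more data right now. The following is a minimal sketch that reproduces that errno on an ordinary pipe. It is not Mitogen code, the function name is made up for illustration, and it does not reproduce the WSL-specific behavior reported here.

```python
import errno
import os


def fill_pipe_until_eagain(chunk_size=4096):
    """Hypothetical demo: write to a non-blocking pipe until the kernel refuses.

    Returns (errno_value, bytes_written_before_failure).
    """
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)  # same mode an event loop would use
    total = 0
    try:
        while True:
            # Nobody reads from read_fd, so the pipe buffer eventually fills.
            total += os.write(write_fd, b"x" * chunk_size)
    except OSError as exc:
        # On Python 3 this is BlockingIOError, a subclass of OSError.
        return exc.errno, total
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    err, written = fill_pipe_until_eagain()
    print(errno.errorcode[err], "after", written, "bytes")
```

On a regular Linux system the write only fails once the buffer is full. The reports in this thread suggest that on WSL the same error surfaces almost immediately after connecting, which is what makes it look like a platform issue.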

atoom commented 4 years ago

Hi,

We are experiencing the exact same issue when running a playbook in WSL with Ubuntu over multiple hosts. There are no issues when running a playbook with a single host or when running with -vvv over multiple hosts.

Edit: Running with MITOGEN_ROUTER_DEBUG=1 also "solves" the problem without having to use -vvv but leaves a log file behind on each target host.
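
The environment-variable workaround described above can be scripted. This is a small sketch only: the helper name, the playbook name, and the inventory path are placeholders, and the only facts taken from this thread are the MITOGEN_ROUTER_DEBUG=1 variable and the ansible-playbook command.

```python
import os


def build_invocation(playbook, inventory, base_env=None):
    """Hypothetical helper: return (command, environment) for a playbook run
    with Mitogen router debugging enabled, as reported to mask the failure."""
    env = dict(os.environ if base_env is None else base_env)
    env["MITOGEN_ROUTER_DEBUG"] = "1"  # workaround reported in this thread
    cmd = ["ansible-playbook", "-i", inventory, playbook]
    return cmd, env


# Example usage (placeholder file names):
#   import subprocess
#   cmd, env = build_invocation("site.yml", "hosts.ini")
#   subprocess.run(cmd, env=env, check=True)
```

As noted in the comment above, this leaves a log file behind on each target host, so it is a stopgap rather than a fix.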

I would gladly help out with additional troubleshooting but I need some pointers on where to start.

Environment:
WSL/Ubuntu: Ubuntu 18.04.1 LTS
Windows 10 V. 1809, OS build 18363.592
Ansible: 2.9.4
Mitogen: 0.2.9

konstantin-kornienko commented 4 years ago

Same thing (

kevinvalk commented 4 years ago

Same here, single connection works fine (--limit single host), else I get the same error.

Using WSL1 Debian Buster

s1113950 commented 4 years ago

Could someone try latest master again? I don't have a WSL env to test with unfortunately :( I have noticed other unrelated tasks have failed though with different amounts of -v applied; perhaps it's a bigger issue than specifically WSL-related 🤔

gchaix commented 4 years ago

I'm still seeing failures on master @ a5fe4a9f

ansible-playbook 2.9.6
  config file = /home/gchaix/repos/project/ansible/ansible.cfg
  configured module search path = [u'/home/gchaix/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /home/gchaix/.local/lib/python2.7/site-packages/ansible
  executable location = /home/gchaix/.local/bin/ansible-playbook
  python version = 2.7.17 (default, Apr 15 2020, 17:20:14) [GCC 7.5.0]
Using /home/gchaix/repos/project/ansible/ansible.cfg as config file
TASK [Gathering Facts] *******************************************************************************************************************************************************************
ERROR! [mux  734] 12:05:11.015470 E mitogen: <Stream ssh.stage-web2.bak #1050> crashed
Traceback (most recent call last):
  File "/home/gchaix/repos/project/ansible/plugins/mitogen-head/mitogen/core.py", line 3481, in _call
    func(self)
  File "/home/gchaix/repos/project/ansible/plugins/mitogen-head/mitogen/core.py", line 1719, in on_transmit
    self.protocol.on_transmit(broker)
  File "/home/gchaix/repos/project/ansible/plugins/mitogen-head/mitogen/core.py", line 2167, in on_transmit
    self._writer.on_transmit(broker)
  File "/home/gchaix/repos/project/ansible/plugins/mitogen-head/mitogen/core.py", line 1907, in on_transmit
    written = self._protocol.stream.transmit_side.write(buf)
  File "/home/gchaix/repos/project/ansible/plugins/mitogen-head/mitogen/core.py", line 2033, in write
    written, disconnected = io_op(os.write, self.fd, s)
  File "/home/gchaix/repos/project/ansible/plugins/mitogen-head/mitogen/core.py", line 553, in io_op
    return func(*args), None
OSError: [Errno 11] Resource temporarily unavailable
fatal: [stage-web2.bak]: UNREACHABLE! => {"changed": false, "msg": "Mitogen was disconnected from the remote environment while a call was in-progress. If you feel this is in error, please file a bug. Original error was: the respondent Context has disconnected", "unreachable": true}
ERROR! [mux  734] 12:05:11.303791 E mitogen: <Stream ssh.stage-web1.bak #b8d0> crashed
Traceback (most recent call last):
  File "/home/gchaix/repos/project/ansible/plugins/mitogen-head/mitogen/core.py", line 3481, in _call
    func(self)
  File "/home/gchaix/repos/project/ansible/plugins/mitogen-head/mitogen/core.py", line 1719, in on_transmit
    self.protocol.on_transmit(broker)
  File "/home/gchaix/repos/project/ansible/plugins/mitogen-head/mitogen/core.py", line 2167, in on_transmit
    self._writer.on_transmit(broker)
  File "/home/gchaix/repos/project/ansible/plugins/mitogen-head/mitogen/core.py", line 1907, in on_transmit
    written = self._protocol.stream.transmit_side.write(buf)
  File "/home/gchaix/repos/project/ansible/plugins/mitogen-head/mitogen/core.py", line 2033, in write
    written, disconnected = io_op(os.write, self.fd, s)
  File "/home/gchaix/repos/project/ansible/plugins/mitogen-head/mitogen/core.py", line 553, in io_op
    return func(*args), None
OSError: [Errno 11] Resource temporarily unavailable
fatal: [stage-web1.bak]: UNREACHABLE! => {"changed": false, "msg": "Mitogen was disconnected from the remote environment while a call was in-progress. If you feel this is in error, please file a bug. Original error was: the respondent Context has disconnected", "unreachable": true}
ok: [prod-util1.bak]

s1113950 commented 4 years ago

Does anyone know if there's a way to get a WSL machine to test with? We use Azure Devops to test but afaik there's no WSL env we can enable

arnemorten commented 4 years ago

> Does anyone know if there's a way to get a WSL machine to test with? We use Azure Devops to test but afaik there's no WSL env we can enable

You can probably run the azure devops agent inside a WSL instance and use that as the agent pool in your devops pipeline.

rdghickman commented 4 years ago

Also reproducible for me most of the time, it seems much more prone to doing it on "copy" tasks for some reason.

I'm surprised because it was all working fine a while ago, so I suspect WSL has updated or something.

If I can help with any debug details let me know and I will try.

s1113950 commented 4 years ago

> > Does anyone know if there's a way to get a WSL machine to test with? We use Azure Devops to test but afaik there's no WSL env we can enable
>
> You can probably run the azure devops agent inside a WSL instance and use that as the agent pool in your devops pipeline.

We'd need a WSL instance for that right? 🤔 is there an OSS-supported test env (like Travis, Circle, Azure devops, etc) that offer WSL instances?

s1113950 commented 4 years ago

> Also reproducible for me most of the time, it seems much more prone to doing it on "copy" tasks for some reason.
>
> I'm surprised because it was all working fine a while ago, so I suspect WSL has updated or something.
>
> If I can help with any debug details let me know and I will try.

I wonder if WSL added a timeout on connection or something? 🤔 The error "the respondent Context has disconnected" indicates that the connection was broken somehow. Did it work for WSL1 but not WSL2?

gchaix commented 4 years ago

I'm still on WSL1 and definitely seeing the problem. Sadly, I don't know of any test envs that provide WSL instances to test.

s1113950 commented 4 years ago

Could it be due to an ssh timeout error maybe? I found https://www.reddit.com/r/bashonubuntuonwindows/comments/bj617c/how_to_keep_wsl_shell_open_when_ssh_session/ . Wild shot in the dark but if it used to work with the same code and now doesn't then maybe WSL changed their default ssh session connection time?

gchaix commented 4 years ago

I'll dig through the linked post and do some experimenting, but on an initial look it doesn't seem to apply, as there is no delay at all between the success and the failures. One, and only one, random machine always succeeds and the others fail immediately. It feels more like it is trying to open a bunch of SSH connections in parallel but only one is allowed, and the rest are immediately rejected by the underlying subsystems (networking maybe?). It's important to note that, for me at least, I'm not sure it ever worked properly. I don't think I tried connecting to an inventory with multiple hosts on WSL before encountering this problem.
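
One way to probe the "immediately rejected" theory outside of Mitogen is a write loop that treats EAGAIN as "try again once the descriptor is writable" instead of as a fatal error. The sketch below is an assumption-laden illustration, not Mitogen's implementation and not a confirmed fix: the function name, the timeout, and the retry limit are all invented for this example.

```python
import errno
import os
import select


def write_with_retry(fd, data, timeout=5.0, max_consecutive_eagain=100):
    """Hypothetical helper: write all of `data` to a non-blocking fd,
    waiting for writability whenever os.write reports EAGAIN."""
    view = memoryview(data)
    sent = 0
    eagain_streak = 0
    while sent < len(view):
        try:
            sent += os.write(fd, view[sent:])
            eagain_streak = 0  # progress was made, reset the streak
        except OSError as exc:
            if exc.errno not in (errno.EAGAIN, errno.EWOULDBLOCK):
                raise  # a real failure, do not mask it
            eagain_streak += 1
            if eagain_streak > max_consecutive_eagain:
                raise  # give up rather than spin forever
            # Block until the kernel says the fd can take more data.
            _, writable, _ = select.select([], [fd], [], timeout)
            if not writable:
                raise  # still not writable after the timeout
    return sent
```

If a loop like this succeeds on WSL where a single os.write fails, that would point at the descriptor being reported writable before it really is, rather than at SSH or networking.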

s1113950 commented 4 years ago

Ok. I'm not too sure why the underlying subsystems would be rejecting the other connections 😞 maybe @dw knows? He fixed WSL stuff last time: https://github.com/dw/mitogen/commit/22bab87821a02ed8cb6b3eb4b52c766a8f5cfee7 and https://github.com/dw/mitogen/commit/56943d3141c95a25b376d4dcfe01741d22f78bdf . I do see other ssh-related WSL issues have been filed in the past: https://github.com/microsoft/WSL/issues/3503, not sure if relevant though.

rdghickman commented 4 years ago

Just as an additional point, I am seeing the failures and I am only targeting a single host. I agree it seems like a very quick failure.

rdghickman commented 3 years ago

Anyone tried WSL2 yet with this?

asantoni commented 3 years ago

Just to chime in with a possible workaround: disabling the Windows Defender firewall made the failure go away for me. I'm not sure why that solves it. All prior steps in the playbook execute successfully. I can also confirm that the LAN IP the playbook was run against is accessible with the firewall both on and off.

The task in the playbook is:

- name: Upload redacted package
  copy:
    dest: "/tmp/"
    src: "{{ latest_redacted_builds[ansible_distribution][ansible_distribution_major_version] }}"
    backup: yes
    owner: root
    group: root
  register: redacted_upload
  tags: [config, redacted-binary]

And the backtrace from the failed execution of the task is:

TASK [redacted: Upload redacted package] ****************************************************************************************************
ERROR! [mux  4321] 13:30:16.461182 E mitogen: <Stream ssh.192.168.122.236 #7c10> crashed
Traceback (most recent call last):
  File "/mnt/c/ansible/plugins/mitogen-0.2.10-rc.0/mitogen/core.py", line 3481, in _call
    func(self)
  File "/mnt/c/ansible/plugins/mitogen-0.2.10-rc.0/mitogen/core.py", line 1719, in on_transmit
    self.protocol.on_transmit(broker)
  File "/mnt/c/ansible/plugins/mitogen-0.2.10-rc.0/mitogen/core.py", line 2167, in on_transmit
    self._writer.on_transmit(broker)
  File "/mnt/c/ansible/plugins/mitogen-0.2.10-rc.0/mitogen/core.py", line 1907, in on_transmit
    written = self._protocol.stream.transmit_side.write(buf)
  File "/mnt/c/ansible/plugins/mitogen-0.2.10-rc.0/mitogen/core.py", line 2033, in write
    written, disconnected = io_op(os.write, self.fd, s)
  File "/mnt/c/ansible/plugins/mitogen-0.2.10-rc.0/mitogen/core.py", line 553, in io_op
    return func(*args), None
OSError: [Errno 11] Resource temporarily unavailable
fatal: [192.168.122.236]: UNREACHABLE! => {
    "changed": false,
    "unreachable": true
}

My platform is WSL1 with Ubuntu 18.04.3 LTS, on Windows 10 1904.985.

ginolegigot commented 2 years ago

Hello, Same issue here on a more recent config with WSL 1 and Ubuntu 20.04. Tested with mitogen tag v2.10rc1 (also tested 0.2.9 unsuccessfully). An example of error message here:

[Screenshot: bugmitogen1]

As for the others, running with -vvv works, but without it Mitogen completes the Ansible tasks on only one host. Hope it helps.