mitogen-hq / mitogen

Distributed self-replicating programs in Python
https://mitogen.networkgenomics.com/
BSD 3-Clause "New" or "Revised" License
2.32k stars 197 forks source link

ssh_connection retries seems not used #671

Open jcosmao opened 4 years ago

jcosmao commented 4 years ago

It seems mitogen does not retry ssh connection if a host is unreachable. Trying to reproduce issue with an unreachable host and ssh retries setup to 5:

ansible.cfg

[ssh_connection]
retries = 5

Play

- name: test                        
   hosts: host_test
   # strategy: mitogen_linear        
   strategy: linear                  
   gather_facts: False               
   tasks:                            
     - file:                         
         path="/tmp/test"            
         state=touch  

Logs

with strategy: linear

PLAY [test] *******************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************
META: ran handlers    

TASK [file] *******************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************
task path: /tmp/test.yml:40    
Thursday 28 November 2019  18:20:31 +0100 (0:00:00.041)       0:00:00.041 *****     
Using module file /usr/lib/python3.5/site-packages/ansible/modules/files/file.py    
Pipelining is enabled.    
<host_test> ESTABLISH SSH CONNECTION FOR USER: root    
<host_test> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="root"' -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/4b4c99a891 host_test '/bin/sh -c '"'"'/usr/bin/python3 && sleep 0'"'"''
<host_test> ssh_retry: attempt: 1, ssh return code is 255. cmd ([b'ssh', b'-C', b'-o', b'ControlMaster=auto', b'-o', b'ControlPersist=60s', b'-o', b'StrictHostKeyChecking=no', b'-o', b'KbdInteractiveAuthentication=no', b'-o', b'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', b'-o', b'PasswordAuthentication=no', b'-o', b'User="root"', b'-o', b'ConnectTimeout=10', b'-o', b'ControlPath=/root/.ansible/cp/4b4c99a891', b'host_test', b"/bin/sh -c '/usr/bin/python3 && sleep 0'"]...), pausing for 0 seconds
<host_test> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="root"' -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/4b4c99a891 host_test '/bin/sh -c '"'"'/usr/bin/python3 && sleep 0'"'"''
<host_test> ssh_retry: attempt: 2, ssh return code is 255. cmd ([b'ssh', b'-C', b'-o', b'ControlMaster=auto', b'-o', b'ControlPersist=60s', b'-o', b'StrictHostKeyChecking=no', b'-o', b'KbdInteractiveAuthentication=no', b'-o', b'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', b'-o', b'PasswordAuthentication=no', b'-o', b'User="root"', b'-o', b'ConnectTimeout=10', b'-o', b'ControlPath=/root/.ansible/cp/4b4c99a891', b'host_test', b"/bin/sh -c '/usr/bin/python3 && sleep 0'"]...), pausing for 1 seconds
<host_test> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="root"' -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/4b4c99a891 host_test '/bin/sh -c '"'"'/usr/bin/python3 && sleep 0'"'"''
<host_test> ssh_retry: attempt: 3, ssh return code is 255. cmd ([b'ssh', b'-C', b'-o', b'ControlMaster=auto', b'-o', b'ControlPersist=60s', b'-o', b'StrictHostKeyChecking=no', b'-o', b'KbdInteractiveAuthentication=no', b'-o', b'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', b'-o', b'PasswordAuthentication=no', b'-o', b'User="root"', b'-o', b'ConnectTimeout=10', b'-o', b'ControlPath=/root/.ansible/cp/4b4c99a891', b'host_test', b"/bin/sh -c '/usr/bin/python3 && sleep 0'"]...), pausing for 3 seconds
<host_test> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="root"' -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/4b4c99a891 host_test '/bin/sh -c '"'"'/usr/bin/python3 && sleep 0'"'"''
<host_test> ssh_retry: attempt: 4, ssh return code is 255. cmd ([b'ssh', b'-C', b'-o', b'ControlMaster=auto', b'-o', b'ControlPersist=60s', b'-o', b'StrictHostKeyChecking=no', b'-o', b'KbdInteractiveAuthentication=no', b'-o', b'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', b'-o', b'PasswordAuthentication=no', b'-o', b'User="root"', b'-o', b'ConnectTimeout=10', b'-o', b'ControlPath=/root/.ansible/cp/4b4c99a891', b'host_test', b"/bin/sh -c '/usr/bin/python3 && sleep 0'"]...), pausing for 7 seconds
<host_test> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="root"' -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/4b4c99a891 host_test '/bin/sh -c '"'"'/usr/bin/python3 && sleep 0'"'"''
<host_test> ssh_retry: attempt: 5, ssh return code is 255. cmd ([b'ssh', b'-C', b'-o', b'ControlMaster=auto', b'-o', b'ControlPersist=60s', b'-o', b'StrictHostKeyChecking=no', b'-o', b'KbdInteractiveAuthentication=no', b'-o', b'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', b'-o', b'PasswordAuthentication=no', b'-o', b'User="root"', b'-o', b'ConnectTimeout=10', b'-o', b'ControlPath=/root/.ansible/cp/4b4c99a891', b'host_test', b"/bin/sh -c '/usr/bin/python3 && sleep 0'"]...), pausing for 15 seconds
<host_test> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="root"' -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/4b4c99a891 host_test '/bin/sh -c '"'"'/usr/bin/python3 && sleep 0'"'"''
fatal: [host_test]: UNREACHABLE! => changed=false     
  msg: |-    
    Data could not be sent to remote host "host_test". Make sure this host can be reached over ssh: ssh: connect to host host_test port 22: Connection refused    
  unreachable: true    

with strategy: mitogen_linear

PLAYBOOK: test.yml **************************************************************************************************************************************************************************************************************************    
1 plays in /tmp/test.yml    

PLAY [test] *********************************************************************************************************************************************************************************************************************************    
[top  4658] 18:18:40.276317 D ansible_mitogen.affinity: CPU mask for Ansible top-level process: 0x000002    
[top  4658] 18:18:40.277332 D ansible_mitogen.process: inherited open file limits: soft=1024 hard=1048576    
[top  4658] 18:18:40.277877 D ansible_mitogen.process: raised soft open file limit from 1024 to 1048576    
[mux  4666] 18:18:40.284712 D ansible_mitogen.affinity: CPU mask for MuxProcess 0: 0x000001    
[mux  4666] 18:18:40.301380 D mitogen.service: Pool(62b0, size=32, th='MainThread'): initialized    
[mux  4666] 18:18:40.303930 D ansible_mitogen.process: Service pool configured: size=32    
META: ran handlers    

TASK [file] *********************************************************************************************************************************************************************************************************************************    
task path: /tmp/test.yml:40    
Thursday 28 November 2019  18:18:40 +0100 (0:00:00.098)       0:00:00.098 *****    
[task 4700] 18:18:40.332803 D ansible_mitogen.affinity: CPU mask for WorkerProcess: 0x000004    
[task 4700] 18:18:40.359046 D ansible_mitogen.process: will use multiplexer 0 (/tmp/mitogen_unix_peubdg0l.sock) to connect to "host_test"    
[task 4700] 18:18:40.359485 D mitogen.unix: client: connecting to /tmp/mitogen_unix_peubdg0l.sock    
[task 4700] 18:18:40.360498 D mitogen.unix: client: local ID is 1, remote is 0    
[mux  4666] 18:18:40.360818 D mitogen.unix: listener: accepted connection from PID 4700: unix_client.4700    
[mux  4666] 18:18:40.366643 D mitogen.parent: creating connection to context 2 using mitogen.ssh    
[mux  4666] 18:18:40.519155 D mitogen.parent: command line for Connection(None): ssh -o "LogLevel ERROR" -l root -o "Compression yes" -o "ServerAliveInterval 30" -o "ServerAliveCountMax 10" -o "BatchMode yes" -o "StrictHostKeyChecking no" -o "UserKnownHostsFile /dev/null" -o "GlobalKnownHostsFile /dev/null" -C -o ControlMaster=auto -o ControlPersist=60s host_test /usr/bin/python3 -c "'import codecs,os,sys;_=codecs.decode;exec(_(_(\"eNqFkcFLwzAUxs/rX9FbEha6dJMOCgFlB/EgQhF30CHtkm7BLglp2jr/et86ce08eHs/3vfe9+Ulo2tu6sgqKzEJHO0GpMoQoDTuA5M0mEAtGjvHjMaMkQtndEgOuvGZt5WpJc6G4IawHkIHAIb1Eeyr3IPrIeQ8RCJ3ndIozLXom/JTbhufF5Xs27OmdrNC6Zk9+r3RCHJOrmRT3g+20tXK6Nd0seltpW6VA0Z32f0LQxs+HjtrACs8btAxThE+KG92UqfOGH+bi4PSi2hXmSKvIiHb0/saEZl2H2np05skSQgiAWzvnPISxxQ9Pjw/McbeNIJUWyPg+CRY8Xd8Or8wVmo4OnIFIpGTucDxks2XhKIvZWFTaflFt6aoK9DpR0r7Y7Dq6/OVr9Tdf+q/KeNxyt//m5Nvc4q5XA==\".encode(),\"base64\"),\"zip\"))'"
[mux  4666] 18:18:40.531862 D mitogen.parent: child for Connection(None) started: pid:4703 stdin:77 stdout:77 stderr:79    
[mux  4666] 18:18:40.618116 D mitogen.parent: ssh.host_test: (unrecognized): ssh: connect to host host_test port 22: Connection refused    
[mux  4666] 18:18:40.619316 D mitogen: <Side of ssh.host_test fd 79>: empty read, disconnecting    
[mux  4666] 18:18:40.621030 D mitogen: SetupProtocol(ssh.host_test): disconnecting    
[mux  4666] 18:18:40.622383 D mitogen: <Side of ssh.host_test fd 77>: empty read, disconnecting    
[mux  4666] 18:18:40.623188 D mitogen.parent: failing connection ssh.host_test due to EofError('EOF on stream; last 100 lines received:\nssh: connect to host host_test port 22: Connection refused\r',)    
[mux  4666] 18:18:40.623931 D mitogen.parent: PopenProcess ssh.host_test pid 4703: exited with return code 255    
[mux  4666] 18:18:40.624371 D mitogen: BootstrapProtocol(ssh.host_test): disconnecting    
[mux  4666] 18:18:40.625550 D mitogen: BootstrapProtocol(ssh.host_test): disconnecting    
[task 4700] 18:18:40.627460 D mitogen: MitogenProtocol(unix_listener.4666): disconnecting    
[mux  4666] 18:18:40.628798 D mitogen: <Side of unix_client.4700 fd 76>: empty read, disconnecting
[task 4700] 18:18:40.628844 D mitogen: Waker(fd=11/12): disconnecting
[mux  4666] 18:18:40.629390 D mitogen: MitogenProtocol(unix_client.4700): disconnecting
[task 4700] 18:18:40.630188 D mitogen: Router(Broker(9898)): stats: 0 module requests in 0 ms, 0 sent (0 ms minify time), 0 negative responses. Sent 0.0 kb total, 0.0 kb avg.
fatal: [host_test]: UNREACHABLE! => changed=false 
  msg: |-
    EOF on stream; last 100 lines received:
    ssh: connect to host host_test port 22: Connection refused
  unreachable: true
mbunkus commented 4 years ago

I just ran into this, too. I'd really like to see support for SSH retries. I often run playbooks with tons of hosts, and often enough the first SSH connection to some of them fails (for whatever reason). Using Ansible's default SSH connection plugin with retries solves this as the second connection attempt works for those few hosts, but with the mitogen plugin no retries are made.

I'd be very grateful if support for ANSIBLE_SSH_RETRIES / [ssh_connection] retries=… was implemented.

lancefriday945 commented 3 years ago

Have there been any work arounds or updates on this issue? In our use case we have connection issues based on ping and would like to retry some of those hosts as well. Prior to using mitogen, we weren't experiencing this issue.

nhooey commented 3 years ago

I'm seeing the same thing with none of this configuration not being used:

[ssh_connection]
ansible_ssh_timeout = 60s
ssh_args=-o PasswordAuthentication=no -o ConnectTimeout=60s -o ControlMaster=auto -o ControlPersist=60s -o ControlPath=/tmp/ansible-ssh-%h-%p-%r

Nevermind, that configuration does get used, but I get this error:

[mux  787] 03:25:53.163826 D mitogen.parent: failing connection ssh.importer.staging.rsrchx.local
due to TimeoutError('Failed to setup connection after 10.00 seconds')

How do you set the Connection.options.connect_timeout value that's used in Connection._on_timer_expired() as shown in the above error message?