Closed berenddeschouwer closed 5 years ago
This is some fabulous debugging on your part, thanks so much! It definitely looks like some kind of early startup crash, because there are no messages after _ec0_received() suggesting e.g. that the child started importing modules. The 'DiagLogStream.on_disconnect()' means the bootstrap had completed entirely, and only the main process remained on the remote. There is no debug log meaning "we registered the new stream with the IO loop"; you can only see the disconnect messages generated by the IO loop.
Do you have strace available on one of those machines? Is there any possibility you could try out this trick? Otherwise I will start combing through the work done in November that you have isolated.
Currently working on some other stuff, but will get to this ASAP. Thanks again
Ah, doh, forget that. The strace won't work because strace can't trace a setuid app like sudo; you'll just get a completely different failure instead. If you're feeling brave, you could instead attach the same 'strace -ff' to the top-level sshd on the box. When it forks to accept the Ansible client, the tracer will eventually be installed on every process, including the Python interpreter that is dying. This will produce a ton of files; the easiest way to find a relevant one is `grep -E 'execve.*python' /tmp/....`
The commit you flagged was painful; if I remember, it broke a few things. Will find time to look at all the setup code it touches; it's very possibly the culprit.
Note that ticket was also about cleaning up many FD leaks during connect. One case in particular that I definitely recall was having my process killed by the kernel because the controlling side of the child's PTY ended up owned by the parent. This code touched the order in which FDs got opened and closed, and it could be somehow related to the parent's interaction with the master side of the PTY.
````
[pid 25347] 11:38:38.448557 D mitogen.ctx.ssh.example.com: mitogen: mitogen.sudo.Stream(u'sudo.root'): PID 20978 exited due to signal 1 (SIGHUP)
````
Hangup! That is very TTY-related. Okay, working theory, just need to find time to look into this. Again thanks a ton for the bisecting, it helps so much
I'm starting to recall a variant of this very issue, because I had the sudoers manpage open scanning for FD-related options very recently. Do you have any strangeish 'Defaults' lines in your sudoers, or maybe an SELinux policy loaded? It would be nice to know why I'm not seeing this here
The only reason sudo could be getting SIGHUP from the kernel is if the parent process closed the master side of the TTY, so I don't think this can be explained by any sudoers options; sudo doesn't even receive a copy of the relevant FD.
No selinux.
The only potentially relevant sudo option -- I think -- is `!requiretty`.
It's an old machine. LTS, but old, so it could be a kernel thing. It's also Python 2.6, so I'm trying to see if I can get a different version side-by-side. I want a later Python (3.x) for other reasons anyway; but it's going to take a while.
btw, sudo will quit if run under strace. I did try sudo debug logging, but I got minimal information, so I'd have to compile a debug version.
It's fine, don't worry about strace; SIGHUP really is the smoking gun. The only place it gets sent from in this scenario is the kernel, and only when the Python bits in the parent close the master FD while any child still has the slave end as its controlling TTY.
Another possibility is that the master FD is being overwritten. There are existing tickets open about dup2() safety in the code, and it has bitten a few times before. Those bugs tend to reveal themselves at strange times, and always manifest as an unrelated component failing.
FWIW when I'm debugging this stuff, I find it useful to litter the parent.py connect functions with `os.system('ls -l /proc/%d/fd' % os.getpid())` :) Total hack, but the only way to get useful diagnostic info out at the exact moment it's meaningful.
Another idea is that for whatever reason on your config, the sudo invocation is closing the last FD connected to the slave PTY. That causes DiagLogStream.on_disconnect() to fire, which in turn causes the master to close the master side. That would trigger the SIGHUP.
If you're feeling daring, comment out these two lines in parent.py (line ~1330):

````
if self.diag_stream:
    self._router.broker.start_receive(self.diag_stream)
````
That would prevent it from trying to read from the master TTY, and thus it would never notice slave disconnect. The master FD would still be cleaned up as normal during the data FD's on_disconnect().
Hi Berend :)
Just a few questions to narrow this a bit:
* Was the bisect 'bang on the money'? There were quite some interesting cleanups that day
* Does this still manifest with latest master?
* What's the target OS / Python version / sudo version? I may need to set up a VM
Yes, I checked forwards/backwards twice. Yes, it's still a problem with master.
It happens reliably on older machines, but not newer machines, so it's possibly kernel/python/sudo version.
I'm collecting more data on where it's a problem, ie. bisect OS.
* Broken: Python 2.6.8 / old kernel / sudo 1.6.8 (RHEL patched)
* Working: same Python / same kernel / sudo 1.7.2 (RHEL patched)
* Working: Python 2.6.6 / new(er) kernel / new(er) sudo 1.7.4 (RHEL patched)
* Working: Python 2.6.6 / new(er) kernel / new(er) sudo 1.8.6 (RHEL patched)
* Working: brand new
I think I found my solution. I shouldn't be running sudo 1.6.8 anywhere anyway. 1.7 still received critical fixes.
Hi Berend,
I'd like to reproduce this before release, could you tell me which precise version of RHEL/CentOS the sudo comes from? Otherwise I can try with CentOS 5.11
I have reproduced it on 5.0
I think I have an idea of what's going on. This bizarre old version of sudo doesn't hang around after it starts the child process; in fact, it overwrites itself with Python.
Python then forks into a 'first stage', which keeps the TTY FD around, while the parent half of the fork re-execs Python to clean the inherited argv.
Parent half setup completes, which includes overwriting the inherited TTY FD with an IoLogger (to capture spurious output generated by subprocesses), and then the first stage exits -- closing the last remaining TTY FD that is open.
Thanks a ton for persisting with this!
This is now on the master branch and will make it into the next release. To be notified when a new release is made, subscribe to https://networkgenomics.com/mail/mitogen-announce/
Thanks for reporting this!
Ansible using mitogen_linear results in a cryptic ansible error
"Channel was disconnected while connection attempt was in progress; this may be caused by an abnormal Ansible exit, or due to an unreliable target."
This only happens with --become. It happens after MITO000 is received, so login and the initial Python run happen fine.
This started with commit 802de6a8d585fbc24434a993aa0e2bba02920ce1 (using git bisect) for issue #406.
Log
````
ansible 2.7.5
  config file = /home/berend/.ansible.cfg
  configured module search path = ['/home/berend/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python3.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 3.7.2 (default, Jan 3 2019, 09:14:01) [GCC 8.2.1 20181215 (Red Hat 8.2.1-6)]
Using /home/berend/.ansible.cfg as config file
/home/berend/Source/ansible-conf/inventory/acme did not meet host_list requirements, check plugin documentation if this is unexpected
/home/berend/Source/ansible-conf/inventory/acme did not meet script requirements, check plugin documentation if this is unexpected
Parsed /home/berend/Source/ansible-conf/inventory/acme inventory source with ini plugin
/home/berend/Source/ansible-conf/inventory/acme-stores did not meet host_list requirements, check plugin documentation if this is unexpected
/home/berend/Source/ansible-conf/inventory/acme-stores did not meet script requirements, check plugin documentation if this is unexpected
Parsed /home/berend/Source/ansible-conf/inventory/acme-stores inventory source with ini plugin
... inventory snipped ...
Parsed /home/berend/Source/ansible-conf/inventory/test-stores inventory source with ini plugin
[pid 25347] 11:38:35.537271 D mitogen: mitogen.service.Pool(0x7f3a7a1a3828, size=16, th='MainThread'): initialized
[pid 25347] 11:38:35.538238 D ansible_mitogen.process: Service pool configured: size=16
META: ran handlers
[pid 25366] 11:38:35.608042 D mitogen: unix.connect(path='/tmp/mitogen_unix_zbe0aump')
[pid 25366] 11:38:35.608895 D mitogen: unix.connect(): local ID is 1, remote is 0
[pid 25347] 11:38:35.611661 D mitogen: mitogen.ssh.Stream('default').connect()
[pid 25347] 11:38:35.762189 D mitogen: hybrid_tty_create_child() pid=25369 stdio=59, tty=13, cmd: /home/berend/Source/ansible-conf/bin/timedssh -o "LogLevel ERROR" -l ansible -o "Compression yes" -o "ServerAliveInterval 15" -o "ServerAliveCountMax 3" -o "StrictHostKeyChecking yes" -C -o ControlMaster=auto -o ControlPersist=30m -o ConnectTimeout=60 -o PasswordAuthentication=no example.com /usr/bin/python2.6 -c "'import codecs,os,sys;_=codecs.decode;exec(_(_(\"eNqFkVFLwzAUhZ/XX9G3JCzb0k0RCwVlD+KDCEXcgw5pm1sX7JKQtovbr/euE9bOB9/uxz0353CS8lVi6qlVFigLHPc9UmWIUBr3RVkcjHCWrZ1TwSMh2JlT3ieH2+jERWVqoGkfXB9WffAIaFjv0b7KGnTdhkkSEpk5rzQJMy27JXxD0TZZXkG3nrW1m+VKz+y+2RhNMOfoQjZOusMduFoZ/RYv1p0t6J1yyOQ+fXgVZJ0Mz04axIoOF3yIY0K3qjGfoOMcHGh5VyvYwURCXWxM68FNCjM5ZPH8enF1wwgL8F3vVAM04uTp8eVZCPGuCeYpjMTaWbBMPuixeGksaKybuJywqYNM0mgxvxWMk4Oy+FJpk7NuxYnPyfEvSvtrsOzmU78Xav+f+m/KaJDyB1uNs44=\".encode(),\"base64\"),\"zip\"))'"
[pid 25347] 11:38:35.762929 D mitogen: mitogen.ssh.Stream('local.25369').connect(): stdin=59, stdout=62, diag=13
[pid 25347] 11:38:36.560717 D mitogen: mitogen.ssh.Stream('local.25369'): received b'MITO000\n'
[pid 25347] 11:38:36.561234 D mitogen: mitogen.ssh.Stream('local.25369')._ec0_received()
[pid 25347] 11:38:36.649091 D mitogen: CallChain(Context(2, 'ssh.example.com')).call_async(): ansible_mitogen.target.init_child(log_level=10, candidate_temp_dirs=['~/.ansible/tmp', '/var/tmp', '/tmp'])
[pid 25347] 11:38:36.656095 D mitogen: _build_tuple('/usr/lib/python3.7/site-packages/ansible/__init__.py', 'ansible') -> ['cli', 'compat', 'config', 'constants', 'errors', 'executor', 'galaxy', 'inventory', 'module_utils', 'modules', 'parsing', 'playbook', 'plugins', 'release', 'template', 'utils', 'vars']
[pid 25347] 11:38:36.658741 D mitogen: _send_load_module(mitogen.ssh.Stream('ssh.example.com'), 'ansible.release')
[pid 25347] 11:38:36.659362 D mitogen: _send_load_module(mitogen.ssh.Stream('ssh.example.com'), 'ansible')
[pid 25347] 11:38:36.661716 D mitogen: _build_tuple('/usr/lib/python3.7/site-packages/ansible/module_utils/__init__.py', 'ansible.module_utils') -> ['_text', 'acme', 'ansible_release', 'ansible_tower', 'api', 'aws', 'azure_rm_common', 'azure_rm_common_rest', 'basic', 'cloud', 'cloudscale', 'cloudstack', 'common', 'compat', 'connection', 'crypto', 'database', 'digital_ocean', 'dimensiondata', 'docker_common', 'ec2', 'exoscale', 'f5_utils', 'facts', 'firewalld', 'gcdns', 'gce', 'gcp', 'gcp_utils', 'gitlab', 'heroku', 'ibm_sa_utils', 'infinibox', 'influxdb', 'ipa', 'ismount', 'json_utils', 'k8s', 'keycloak', 'known_hosts', 'ldap', 'lxd', 'manageiq', 'memset', 'mysql', 'net_tools', 'netapp', 'netapp_elementsw_module', 'netapp_module', 'network', 'oneandone', 'oneview', 'online', 'opennebula', 'openstack', 'ovirt', 'parsing', 'postgres', 'powershell', 'pure', 'pycompat24', 'rax', 'redfish_utils', 'redhat', 'remote_management', 'scaleway', 'service', 'six', 'splitter', 'storage', 'univention_umc', 'urls', 'vca', 'vmware', 'vmware_rest_client', 'vultr', 'yumdnf']
[pid 25347] 11:38:36.661983 D mitogen: _send_load_module(mitogen.ssh.Stream('ssh.example.com'), 'ansible.module_utils')
[pid 25347] 11:38:36.722858 D mitogen: _get_module_via_sys_modules('select') ->
````