Open kdelee opened 6 years ago
@mdvickst this is one of the issues I'm filing in response to our recent conversation about possible causes of a scan hanging
I actually witnessed this happening on a troubleshooting call this afternoon. It was the virt.num_guests fact that was hanging on CentOS bare-metal servers, which the customer said were hosting Docker containers.
It appears this is a pretty common issue as described here. I'm sure there are other examples but this one happened just today.
This issue is quite tricky to solve. It would be nice to solve it at the ssh layer, but as far as I can tell, ssh doesn't support a per-command timeout (based on https://linux.die.net/man/5/ssh_config).
Next option would be to have Ansible provide the timeout, but Ansible doesn't support a timeout in the raw module, which is what we use (based on http://docs.ansible.com/ansible/latest/raw_module.html).
We can't do it in rho, because rho doesn't have enough hooks into Ansible to cancel a single command and move on to the next one without cancelling the whole playbook.
One option that might conceivably work would be to set Ansible's ssh_executable configuration (http://docs.ansible.com/ansible/latest/intro_configuration.html#ssh-executable) to a bash script that runs timeout 10 ssh "$@" or something like that, but this is extremely hacky. (It would also require Ansible 2.2 or greater.)
To be clear, I think the best thing to do for the long run is to make the fix in Ansible. But until that happens, the hack above might possibly work.
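For illustration, the wrapper hack described above might look roughly like the sketch below. It assumes GNU coreutils timeout is installed on the control machine. The names run_with_timeout, SSH_WRAPPER_TIMEOUT, and SSH_WRAPPER_BIN are invented for this example and are not part of rho or Ansible.

```shell
#!/usr/bin/env bash
# Hypothetical ssh wrapper: caps how long any single ssh invocation may run.
# SSH_WRAPPER_TIMEOUT and SSH_WRAPPER_BIN are made-up knobs for this sketch.

# Run a command under coreutils `timeout`.
# `timeout` sends SIGTERM once the limit expires and then exits with 124.
run_with_timeout() {
    local limit="$1"
    shift
    timeout "$limit" "$@"
}

# Only forward to ssh when arguments were passed (Ansible always passes some),
# so the file can also be sourced without side effects.
if [ "$#" -gt 0 ]; then
    # Quoting "$@" preserves arguments that contain spaces.
    run_with_timeout "${SSH_WRAPPER_TIMEOUT:-10}" "${SSH_WRAPPER_BIN:-ssh}" "$@"
    exit $?
fi
```

Pointing ssh_executable at a script like this would make every ssh call subject to the same limit, so the value would have to be larger than the slowest legitimate task. A hung command would then surface as a failed task (exit code 124) rather than a stalled scan.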
Specify type:
Priority:
Description:
While individual instances of this bug have been encountered and resolved, for example the pagination of systemctl output, I think this has the potential of coming up over and over again, especially on alternate OSes and login shells that we are not targeting in our test scenarios.
While our main concern is RHEL-flavored linux systems, we will invariably run into oddballs with strange shells and unconventional setups, and we need to be resilient to these occurrences and not allow encountering these systems to prevent us from gleaning information from the systems we can properly access.
An easy example to imagine: a command that is not available on a system is executed, but instead of just giving us a non-zero exit code, the shell prompts us with, "A package is available that provides that command, would you like to install it [N/y]?"
Bug Report
Version of rho:
[ 0.0.28, 0.0.29, 0.0.30, 0.0.31 ]
Expected behavior:
I expect all tasks will time out after some period of time and not hang indefinitely.
Actual behavior:
It is entirely possible for a task to hang indefinitely, and we have been seeing many permutations of this.
Steps to reproduce:
Since we have seen this arise in various scenarios, and I also want to prevent future occurrences, I recommend reproducing this by creating a task that intentionally takes a very long time on your test machine. For example,
for rhel 5/6
for rhel 7/recent fedora
Possible solution:
As I was researching the problem of the scan hanging if a host is lost in the middle of a long-running task, I came across the async/poll feature of Ansible. See this issue comment.
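To make the idea concrete, here is a plain-shell analogy of what an async limit plus a poll interval gives us: start the task, check on it periodically, and give up once the limit is reached. This is not Ansible's implementation, and run_async_poll is a hypothetical helper name used only for this sketch.

```shell
#!/usr/bin/env bash
# Hypothetical helper illustrating the async/poll idea in plain shell.
# Usage: run_async_poll LIMIT_SECONDS POLL_SECONDS command [args...]
# Both time values must be whole seconds.
run_async_poll() {
    local limit="$1" poll="$2" waited=0
    shift 2

    "$@" &                          # start the task in the background
    local pid=$!

    # Poll until the task exits or the overall limit is reached.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$limit" ]; then
            kill "$pid" 2>/dev/null     # give up on the hung task
            wait "$pid" 2>/dev/null     # reap it quietly
            return 124                  # same code coreutils timeout uses
        fi
        sleep "$poll"
        waited=$((waited + poll))
    done

    wait "$pid"                     # propagate the task's own exit status
}
```

For example, run_async_poll 60 5 some_slow_command would check every 5 seconds and abandon the command after roughly 60 seconds, letting the caller move on to the next task.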