mitogen-hq / mitogen

Distributed self-replicating programs in Python
https://mitogen.networkgenomics.com/
BSD 3-Clause "New" or "Revised" License
2.34k stars 199 forks source link

mitogen.core.StreamError: invalid message: cannot unpickle Context: bad input #689

Open madrover opened 4 years ago

madrover commented 4 years ago

Hi When trying to execute any ansible command, either adhoc or on playbook, against some hosts it fais with the following error:

[mux  28801] 23:45:24.171948 D mitogen.[fork.12844]: Recovered sys.executable: '/usr/bin/python'
[mux  28801] 23:45:24.172174 D ansible_mitogen.target.[ssh.server1.sudo.root]: Selected temp directory: u'/root/.ansible/tmp' (from [u'/root/.ansible/tmp', u'/var/tmp', u'/tmp', '/tmp', '/var/tmp', '/usr/tmp', '/home/centos'])
[mux  28801] 23:45:24.172361 D mitogen.[ssh.server1.sudo.root]: Dispatcher: Message(4, 0, 0, 101, 1002, '\x80\x02(NX\x16\x00\x00\x00ansible_mitogen.targetq\x00NX\n\x00\x00\x00init_childq'..186) -> {u'fork_context': Context(1003, u'fork.12844'), u'home_dir': u'/root', u'good_temp_dir': u'/root/.ansible/tmp'}
ERROR! [task 28836] 23:45:24.173411 E mitogen: raw pickle was: b'\x80\x02}q\x00(X\x07\x00\x00\x00contextq\x01cmitogen.core\n_unpickle_context\nq\x02K\x04Xi\x00\x00\x00ssh.server1.sudo.rootq\x03\x86q\x04Rq\x05X\x03\x00\x00\x00viaq\x06h\x02K\x02X_\x00\x00\x00ssh.server1q\x07\x86q\x08Rq\tX\x11\x00\x00\x00init_child_resultq\n}q\x0b(X\x0c\x00\x00\x00fork_contextq\x0ch\x02M\xeb\x03X\n\x00\x00\x00fork.12844q\r\x86q\x0eRq\x0fX\x08\x00\x00\x00home_dirq\x10X\x05\x00\x00\x00/rootq\x11X\r\x00\x00\x00good_temp_dirq\x12X\x12\x00\x00\x00/root/.ansible/tmpq\x13uX\x03\x00\x00\x00msgq\x14Nu.'
[task 28836] 23:45:24.174361 D ansible_mitogen.mixins: _remove_tmp_path(None)
[task 28836] 23:45:24.178439 D mitogen: MitogenProtocol(unix_listener.28801): disconnecting
[task 28836] 23:45:24.179229 D mitogen: Waker(fd=11/12): disconnecting
[mux  28801] 23:45:24.179270 D mitogen: <Side of unix_client.28836 fd 76>: empty read, disconnecting
[task 28836] 23:45:24.180078 D mitogen: Router(Broker(4780)): stats: 0 module requests in 0 ms, 0 sent (0 ms minify time), 0 negative responses. Sent 0.0 kb total, 0.0 kb avg.
[mux  28801] 23:45:24.180053 D mitogen: MitogenProtocol(unix_client.28836): disconnecting
The full traceback is:
Traceback (most recent call last):
  File "/home/user/ansible/.venv/lib/python3.6/site-packages/mitogen/core.py", line 952, in unpickle
    obj = unpickler.load()
  File "/home/user/ansible/.venv/lib/python3.6/site-packages/mitogen/core.py", line 822, in _unpickle_context
    return _unpickle_context(context_id, name, router=self.router)
  File "/home/user/ansible/.venv/lib/python3.6/site-packages/mitogen/core.py", line 2307, in _unpickle_context
    raise TypeError('cannot unpickle Context: bad input')
TypeError: cannot unpickle Context: bad input

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/ansible/.venv/lib/python3.6/site-packages/ansible/executor/task_executor.py", line 146, in run
    res = self._execute()
  File "/home/user/ansible/.venv/lib/python3.6/site-packages/ansible/executor/task_executor.py", line 645, in _execute
    result = self._handler.run(task_vars=variables)
  File "/home/user/ansible/.venv/lib/python3.6/site-packages/ansible_mitogen/mixins.py", line 121, in run
    return super(ActionModuleMixin, self).run(tmp, task_vars)
  File "/home/user/ansible/.venv/lib/python3.6/site-packages/ansible/plugins/action/gather_facts.py", line 79, in run
    res = self._execute_module(module_name=fact_module, module_args=mod_args, task_vars=task_vars, wrap_async=False)
  File "/home/user/ansible/.venv/lib/python3.6/site-packages/ansible_mitogen/mixins.py", line 351, in _execute_module
    self._set_temp_file_args(module_args, wrap_async)
  File "/home/user/ansible/.venv/lib/python3.6/site-packages/ansible_mitogen/mixins.py", line 330, in _set_temp_file_args
    self._connection.get_good_temp_dir()
  File "/home/user/ansible/.venv/lib/python3.6/site-packages/ansible_mitogen/connection.py", line 775, in get_good_temp_dir
    self._connect()
  File "/home/user/ansible/.venv/lib/python3.6/site-packages/ansible_mitogen/connection.py", line 797, in _connect
    self._connect_stack(stack)
  File "/home/user/ansible/.venv/lib/python3.6/site-packages/ansible_mitogen/connection.py", line 748, in _connect_stack
    stack=mitogen.utils.cast(list(stack)),
  File "/home/user/ansible/.venv/lib/python3.6/site-packages/mitogen/service.py", line 120, in call
    return call_context.call_service(service_name, method_name, **kwargs)
  File "/home/user/ansible/.venv/lib/python3.6/site-packages/mitogen/core.py", line 2276, in call_service
    return recv.get().unpickle()
  File "/home/user/ansible/.venv/lib/python3.6/site-packages/mitogen/core.py", line 959, in unpickle
    raise StreamError('invalid message: %s', e)
mitogen.core.StreamError: invalid message: cannot unpickle Context: bad input
fatal: [server1]: FAILED! => {
    "msg": "Unexpected failure during module execution.",
    "stdout": ""
}

This error is reproducible only when using become, otherwise it seems to work.

Thanks!

madrover commented 4 years ago

Can I have some guidance on this one, please?

s1113950 commented 4 years ago

Please try Ansible 2.8.8; Ansible 2.9 isn't fully supported yet. Also if you want you can try my patch that fixes some things as well: https://github.com/dw/mitogen/pull/658 It mainly fixes ansible_python_interpreter discovery but touches some other stuff too. After my patch lands I hope to spend some time getting Ansible 2.9 supported fully.

madrover commented 4 years ago

The misbehavior can be reproduced in Ansible 2.8.8. However, I have determined what's causing the issue; we have a role which includes a library which implements a task that we use to validate a license that we use to deploy a component.

The module is called this way:

- name: Test that license has not expired
license_check:
  license_file: "{{ license_file }}"
delegate_to: localhost

And the library content is:

#!/usr/bin/python

from ansible.module_utils.basic import AnsibleModule
from ansible.utils.display import Display
from datetime import datetime
from datetime import timedelta
import re
import os

def main():

  result = dict(
    changed = False,
    warnings = [],
    renewal_date = ''
  )

  module = AnsibleModule(
    argument_spec = dict(
      license_file = dict(required=True),
      warning_weeks = dict(required=False, type='int', default=12)
    ),
    supports_check_mode=True
  )

  pattern = re.compile("^key.renewal=(.*)")

  try:
    for line in open(module.params['license_file']):
      for match in re.finditer(pattern, str(line)):
        expiry_date = datetime.strptime(match.group(1), '%Y%m%d')
        result['renewal_date'] = expiry_date.strftime('%Y-%m-%d')
        expiry_window = expiry_date - timedelta(weeks=module.params['warning_weeks'])
        now = datetime.now()

        if expiry_date < now:
          fail_msg = 'The license expired on ' + expiry_date.strftime('%d %b %Y') + '.'
          module.fail_json(msg = fail_msg)

        elif expiry_window < now:
          fail_msg = 'The license is due to expire on ' + expiry_date.strftime('%d %b %Y') + '.'
          result['warnings'].append(fail_msg)

  except IOError:
    module.fail_json(msg = 'Could not open the requested file: ' + module.params['license_file'])

  module.exit_json(**result)

if __name__ == '__main__':
  main();

If we remove the call to that module then the playbook executes without problems.

s1113950 commented 4 years ago

Thanks for the info 👍 I should have time in the next few days to repro on my end with your custom module and see if I can fix your issue

s1113950 commented 4 years ago

I'm working on this here: https://github.com/s1113950/mitogen-test/commit/6029980ad0fbada29c33afc219f0fd494a409e4f , running with make custom-lib-unpickle-test. I get a different error than you do though, possibly because I'm not targeting a centos6 machine yet. Don't have time to work on this anymore today, but plan on tackling it more soon.

s1113950 commented 4 years ago

Do you have a specific warning_weeks value or license_file I should use to repro?

s1113950 commented 4 years ago

@madrover Try removing this import from your license_file:

from ansible.utils.display import Display

It works for me targeting a centos6 docker host from my Mac (closest setup I could make):

TASK [run_test : Create license file on docker host] *************************************************************************************************************
task path: pathHere
changed: [localhost -> testMitogen] => {"ansible_facts": {}, "changed": true, "cmd": "echo \"LICENSE\" > /tmp/license_file.txt", "delta": "0:00:00.003377", "end":
 "2020-03-26 07:09:30.639485", "rc": 0, "start": "2020-03-26 07:09:30.636108", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

TASK [run_test : Test that license has not expired] **************************************************************************************************************
task path: pathHere
ok: [localhost -> testMitogen] => {"ansible_facts": {}, "changed": false, "renewal_date": ""}

via

- name: Create license file on docker host
  shell: echo "LICENSE" > /tmp/license_file.txt

- name: Test that license has not expired
  license_check:
    license_file: /tmp/license_file.txt

with a delegate_to pointing to my centos6 container.

I tested with a license file that just contained the word LICENSE in it since I'm not sure what kind of license file you're using.

If that doesn't work for you then I'll need a sample license_file and warning_weeks value to get a more precise test :)

Tristan971 commented 6 months ago

While I'm unsure if it's the same issue, the same message was raised for me on Python 3.11.6 and fixed with minor version bumps, so if someone else runs into it: