lucasheld / ansible-uptime-kuma

Ansible collection of modules to configure Uptime Kuma
GNU General Public License v3.0

Occasional failure to `loginByToken` produces hang/timeout #17

Open zacwest opened 1 year ago

zacwest commented 1 year ago

Thanks for the library, it's made things a lot easier! I'm running into an issue where invocations end up being timed out by Ansible after some kind of internal failure. My setup is somewhat simple: Uptime-Kuma is running in a docker container on fly.io.

For example, running a command like:

- name: Get Uptime Kuma push monitor info
  delegate_to: 127.0.0.1
  become: false
  throttle: 1
  lucasheld.uptime_kuma.monitor_info:
    api_url: "{{ uptime_kuma_url }}"
    api_token: "{{ uptime_kuma_api_token }}"
    name: "{{ monitor_name }}"

I've traced this back to a timeout occurring in socketio (the output below is from repeatedly running the ansible-generated python script manually to try to induce the failure) and a raised exception going uncaught:

Traceback (most recent call last):
  File "/Users/zac/Servers/ovh/./test.py", line 107, in <module>
    _ansiballz_main()
  File "/Users/zac/Servers/ovh/./test.py", line 99, in _ansiballz_main
    invoke_module(zipped_mod, temp_path, ANSIBALLZ_PARAMS)
  File "/Users/zac/Servers/ovh/./test.py", line 47, in invoke_module
    runpy.run_module(mod_name='ansible_collections.lucasheld.uptime_kuma.plugins.modules.monitor_info', init_globals=dict(_module_fqn='ansible_collections.lucasheld.uptime_kuma.plugins.modules.monitor_info', _modlib_path=modlib_path),
  File "/opt/homebrew/Cellar/python@3.10/3.10.10/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 224, in run_module
    return _run_module_code(code, init_globals, run_name, mod_spec)
  File "/opt/homebrew/Cellar/python@3.10/3.10.10/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 96, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/opt/homebrew/Cellar/python@3.10/3.10.10/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/var/folders/1y/9pbgc3zx1kb_1mgvd5m97xc40000gn/T/ansible_lucasheld.uptime_kuma.monitor_info_payload_ha8zoka_/ansible_lucasheld.uptime_kuma.monitor_info_payload.zip/ansible_collections/lucasheld/uptime_kuma/plugins/modules/monitor_info.py", line 404, in <module>
  File "/var/folders/1y/9pbgc3zx1kb_1mgvd5m97xc40000gn/T/ansible_lucasheld.uptime_kuma.monitor_info_payload_ha8zoka_/ansible_lucasheld.uptime_kuma.monitor_info_payload.zip/ansible_collections/lucasheld/uptime_kuma/plugins/modules/monitor_info.py", line 381, in main
  File "/Users/zac/Servers/ovh/.venv/lib/python3.10/site-packages/uptime_kuma_api/api.py", line 2552, in login_by_token
    return self._call('loginByToken', token)
  File "/Users/zac/Servers/ovh/.venv/lib/python3.10/site-packages/uptime_kuma_api/api.py", line 480, in _call
    r = self.sio.call(event, data)
  File "/Users/zac/Servers/ovh/.venv/lib/python3.10/site-packages/socketio/client.py", line 471, in call
    raise exceptions.TimeoutError()
socketio.exceptions.TimeoutError

I added some logging around the call site in api.py:

https://github.com/lucasheld/uptime-kuma-api/blob/master/uptime_kuma_api/api.py#L478-L484
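For anyone trying to reproduce this outside of Ansible, here's a minimal sketch of the same call path with timing/logging around the call. It only assumes the public interface visible in the traceback (login_by_token, plus a disconnect method and the UptimeKumaApi entry point) and uses a placeholder URL and token, so treat it as illustrative rather than the exact logging I added:

import logging
import time

from uptime_kuma_api import UptimeKumaApi

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("repro")

api = UptimeKumaApi("https://uptime.example.com")  # placeholder URL
start = time.monotonic()
try:
    api.login_by_token("<api token>")  # placeholder token; this is the call that times out
    log.debug("loginByToken returned after %.1fs", time.monotonic() - start)
except Exception:
    log.exception("loginByToken failed after %.1fs", time.monotonic() - start)
finally:
    api.disconnect()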

What appears to be happening is that the loginByToken call is attempted but times out. Weirdly, I do see the request coming through on the Uptime Kuma side:

2023-02-18T17:27:19Z app[21342cbf] sjc [info]2023-02-18T09:27:19-08:00 [AUTH] INFO: Login by token. IP=<snip>
2023-02-18T17:27:19Z app[21342cbf] sjc [info]2023-02-18T09:27:19-08:00 [AUTH] INFO: Username from JWT: <snip>
2023-02-18T17:27:19Z app[21342cbf] sjc [info]2023-02-18T09:27:19-08:00 [AUTH] INFO: Successfully logged in user <snip>. IP=<snip>

When this occurs, I see the _send call begin, but it never returns until the exception is raised, and that exception doesn't appear to be caught anywhere. The end result is that the python script hangs indefinitely and ends up being killed by Ansible after its timeout, rather than the error being sent up the stack.

So perhaps 2 things here:

  1. If this error occurs, it should be caught and raised to Ansible so it can run its own retry logic rather than timing out, which I think is unretryable? (See the sketch after this list.)
  2. Something on the Uptime Kuma side, the Python API side, or the invocation by the Ansible library is failing to handle the response to the login call, but I haven't had a moment to stick another reverse proxy in front of Uptime Kuma to see if it is actually sending an HTTP response.
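
For point 1, a rough sketch of what catching the timeout inside a module could look like. Everything beyond login_by_token and socketio.exceptions.TimeoutError (which appear in the traceback) is assumed: the UptimeKumaApi entry point, the argument names, and the module layout are illustrative, not the collection's actual code:

from ansible.module_utils.basic import AnsibleModule
from socketio.exceptions import TimeoutError as SioTimeoutError
from uptime_kuma_api import UptimeKumaApi

def main():
    module = AnsibleModule(argument_spec=dict(
        api_url=dict(type="str", required=True),
        api_token=dict(type="str", required=True, no_log=True),
    ))
    api = UptimeKumaApi(module.params["api_url"])
    try:
        api.login_by_token(module.params["api_token"])
    except SioTimeoutError:
        # fail fast so the task's own retries/until logic can take over
        module.fail_json(msg="loginByToken timed out talking to Uptime Kuma")
    finally:
        api.disconnect()
    module.exit_json(changed=False)

if __name__ == "__main__":
    main()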
lucasheld commented 1 year ago

Thank you for the research. Retries on timeouts would be useful, and also for other temporary network errors. Logging would also help to debug these problems.

I think I should refactor the code first so that there is less duplicate code and the two things can be implemented more easily.
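
One possible shape for the retry part, purely as a sketch (not the planned refactor): a small helper that retries a call a few times when socket.io times out, which the modules could wrap their API calls in:

import time

from socketio.exceptions import TimeoutError as SioTimeoutError

def call_with_retries(func, *args, retries=3, delay=2, **kwargs):
    # retry the wrapped call on socket.io timeouts before giving up
    for attempt in range(1, retries + 1):
        try:
            return func(*args, **kwargs)
        except SioTimeoutError:
            if attempt == retries:
                raise
            time.sleep(delay)

# hypothetical usage: call_with_retries(api.login_by_token, token)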

neilbags commented 1 year ago

Thanks for this excellent ansible integration. It works as expected, right out of the box.

I am however seeing this issue. When adding ~40 monitors, at least a few of them will fail every time. It doesn't appear to be related to server load, and it shouldn't be a network problem - the sites I am monitoring are all on the same server ~20ms away and the connection is solid.

It doesn't appear to matter whether you use a token or a username/password.

I can't see any errors in docker logs or nginx's error.log

Using throttle:1 has no effect, nor does forks = 1.

I suspect #20 may be a symptom of the same issue as I saw this behaviour initially as well.

I can reproduce this every time so can do testing if you can think of anything that will help

neilbags commented 1 year ago

Just one more bit of info:

Sometimes the monitor is added even when ansible says it's failed, but sometimes it isn't.

namvan commented 1 year ago

Did you guys find a workaround for this? I am completely stuck with frozen runs.

namvan commented 1 year ago

Just a quick note to you all: for me it seems to be an issue with the reverse proxy. I am using haproxy. Pointing uptime_url directly at the app worked perfectly, although of course it's then unsecured.

exismys commented 1 year ago

Just one more bit of info:

Sometimes the monitor is added even when ansible says it's failed, but sometimes it isn't.

I faced the same issue. Ansible reports an error (screenshot attached), but the monitor has been successfully created on the uptime-kuma side (no errors in docker logs; screenshot attached).

etpedro commented 1 year ago

Hi!

I'm having the same issue. I'm currently hosting Uptime Kuma on an Azure Web App, and the Ansible playbook hangs every time while executing different tasks.

Any idea on how to overcome this?

derekkddj commented 5 months ago

I have the same problem.

invisibleninja06 commented 5 months ago

Same issue here too. It's really annoying, and retrying when it occurs has so far not worked. It makes using the module rather unstable; I need to rerun playbooks over and over until everything is created.

invisibleninja06 commented 5 months ago

Ok, one thing to help people is to add retries to the uptime kuma tasks,

something like:

register: task_results
retries: 5
until: task_results.rc | default(0) == 0
ignore_errors: true

This will set the return code to 0 if it is not defined, and retry if it is anything other than 0. When it hits those timeouts the return code (rc) is 1, so it will trigger a retry. ignore_errors is set to true so that the exception doesn't stop the playbook in its tracks.
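
Put together with the monitor_info example from the top of this issue, a full task would look roughly like this (variable names come from that example; the delay value is just a guess):

- name: Get Uptime Kuma push monitor info
  delegate_to: 127.0.0.1
  become: false
  lucasheld.uptime_kuma.monitor_info:
    api_url: "{{ uptime_kuma_url }}"
    api_token: "{{ uptime_kuma_api_token }}"
    name: "{{ monitor_name }}"
  register: task_results
  retries: 5
  delay: 5
  until: task_results.rc | default(0) == 0
  ignore_errors: true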

Hope this helps someone hitting the same issue