grafana / oncall

Developer-friendly incident response with brilliant Slack integration
GNU Affero General Public License v3.0

Telegram polling doesn't add buttons for incidents e.g. "acknowledge", "resolve", "mute forever" randomly #3055

Open EsDmitrii opened 11 months ago

EsDmitrii commented 11 months ago

What went wrong?

What happened: Stats:

Issue: Telegram polling randomly stops adding buttons to incident messages in the Telegram channel. Example: we get an incident message plus a separate message with buttons. This is the expected behaviour.

Screenshot 2023-09-22 at 11 43 21

Here, as you can see, we got an incident message without the additional message with buttons. This is not the expected behaviour.

Screenshot 2023-09-22 at 11 42 20

The pod's logs say this:

source=engine:app google_trace_id=none logger=engine.management.commands.start_telegram_polling Update from Telegram: {'message': {'message_id': 429, 'entities': [{'offset': 0, 'length': 1, 'url': 'https://grafana.my.awesome.domain/a/grafana-oncall-app/?oncall-uuid=8ec126df-c5fc-4c72-8093-a39f8d767df8', 'type': 'text_link'}, {'offset': 98, 'length': 81, 'type': 'url'}], 'delete_chat_photo': False, 'date': 1695371212, 'caption_entities': [], 'photo': [], 'supergroup_chat_created': False, 'is_automatic_forward': True, 'forward_signature': 'oncall_bot', 'text': '\u200d🔴 #137, [firing:2] InstanceDown \nFiring, alerts: 1\nSource: [Alertmanager] DevOps - Alertmanager\nhttps://grafana.my.awesome.domain/a/grafana-oncall-app/alert-groups/IR9L65GXWL5EK\n\nSummary: \nSeverity: critical 🚨\nStatus: firing 🔥 (on the source)\nFiring alerts – 2\nResolved alerts – 0\n___\nDescription: localhost:8081 of job node has been down for more than 1 minute.\n- group: production\n- instance: localhost:8081\n___\nDescription: localhost:8082 of job node has been down for more than 1 minute.\n- group: canary\n- instance: localhost:8082\n___\nCommonLabels:\n- job: node\nView in AlertManager', 'forward_date': 1695371209, 'group_chat_created': False, 'forward_from_message_id': 139, 'chat': {'title': 'OnCall - Chat', 'id': MASKED, 'type': 'supergroup'}, 'channel_chat_created': False, 'new_chat_members': [], 'forward_from_chat': {'title': 'OnCall', 'id': MASKED, 'type': 'channel'}, 'sender_chat': {'title': 'OnCall', 'id': MASKED, 'type': 'channel'}, 'new_chat_photo': [], 'from': {'id': 777000, 'first_name': 'Telegram', 'is_bot': False}}, 'update_id': 662485161}
source=engine:app google_trace_id=none logger=telegram.ext.dispatcher An uncaught error was raised while handling the error.
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
psycopg2.OperationalError: SSL SYSCALL error: EOF detected

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/telegram/ext/dispatcher.py", line 569, in process_update
    self.dispatch_error(update, exc)
  File "/usr/local/lib/python3.11/site-packages/telegram/ext/dispatcher.py", line 813, in dispatch_error
    callback(update, context)
  File "/etc/app/engine/management/commands/start_telegram_polling.py", line 36, in error_handler
    raise context.error
  File "/usr/local/lib/python3.11/site-packages/telegram/ext/dispatcher.py", line 557, in process_update
    handler.handle_update(update, self, check, context)
  File "/usr/local/lib/python3.11/site-packages/telegram/ext/handler.py", line 199, in handle_update
    return self.callback(update, context)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/etc/app/engine/management/commands/start_telegram_polling.py", line 44, in handle_message
    UpdateManager.process_update(update)
  File "/etc/app/apps/telegram/updates/update_manager.py", line 25, in process_update
    cls._update_entity_names(update)
  File "/etc/app/apps/telegram/updates/update_manager.py", line 56, in _update_entity_names
    cls._update_channel_and_group_names(update)
  File "/etc/app/apps/telegram/updates/update_manager.py", line 70, in _update_channel_and_group_names
    ).update(channel_name=channel_name, discussion_group_name=discussion_group_name)
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/django/db/models/query.py", line 783, in update
    rows = query.get_compiler(self.db).execute_sql(CURSOR)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/django/db/models/sql/compiler.py", line 1559, in execute_sql
    cursor = super().execute_sql(result_type)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/django/db/models/sql/compiler.py", line 1175, in execute_sql
    cursor.execute(sql, params)
  File "/usr/local/lib/python3.11/site-packages/django/db/backends/utils.py", line 66, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers
    return executor(sql, params, many, context)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/django/db/backends/utils.py", line 79, in _execute
    with self.db.wrap_database_errors:
  File "/usr/local/lib/python3.11/site-packages/django/db/utils.py", line 90, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/usr/local/lib/python3.11/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
django.db.utils.OperationalError: SSL SYSCALL error: EOF detected

As a temporary workaround (to start receiving buttons again), you need to kill the pod (restart the container).
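For illustration only: the traceback above shows the database error propagating out of the update handler, and the pod restart presumably helps because it forces a fresh connection. A minimal sketch of the same idea inside the process is to drop the broken connection and retry once. Everything here is hypothetical and not taken from the OnCall code base: `ConnectionLost` stands in for `django.db.utils.OperationalError`, and `reset_connection` stands in for whatever closes the stale connection (in a Django process that might be `django.db.close_old_connections()`, but whether that is the right fix here is an assumption).

```python
class ConnectionLost(Exception):
    """Stand-in for django.db.utils.OperationalError in this sketch."""


def run_with_reconnect(operation, reset_connection, retries=1):
    """Call operation(); on ConnectionLost, reset the connection and retry.

    Re-raises the error once `retries` reconnect attempts are used up.
    """
    attempt = 0
    while True:
        try:
            return operation()
        except ConnectionLost:
            if attempt >= retries:
                raise
            attempt += 1
            reset_connection()


class FakeDb:
    """Hypothetical database handle whose connection starts out broken."""

    def __init__(self):
        self.alive = False
        self.resets = 0

    def reset(self):
        # Simulates opening a fresh connection.
        self.alive = True
        self.resets += 1

    def update_names(self):
        if not self.alive:
            raise ConnectionLost("SSL SYSCALL error: EOF detected")
        return 1  # number of rows updated


db = FakeDb()
rows = run_with_reconnect(db.update_names, db.reset)
print(rows, db.resets)
```

The retry count is capped so that a database that is really down still surfaces as an error instead of looping forever.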

How do we reproduce it?

It happens randomly, so I don't know how to reproduce it.

Grafana OnCall Version

v1.3.37

Product Area

Alert Flow & Configuration, Chatops, Helm

Grafana OnCall Platform?

Kubernetes

User's Browser?

No response

Anything else to add?

No response

Sammyant commented 10 months ago

We experienced the same issue with the Telegram polling pod (v1.3.45).

Adding DB options doesn't help

externalPostgresql:
  ....
  options: >-
    keepalives=1
    keepalives_idle=30
    keepalives_interval=10
    keepalives_count=5

  File "/usr/local/lib/python3.11/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
django.db.utils.OperationalError: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
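For readers unfamiliar with these settings: `keepalives`, `keepalives_idle`, `keepalives_interval` and `keepalives_count` are libpq connection parameters. A rough sketch of how such a space-separated options string could be turned into keyword arguments, and what the values imply, follows. The helper name `parse_db_options` is hypothetical, and how the Helm chart actually passes `options` to the database driver is an assumption here.

```python
def parse_db_options(options: str) -> dict:
    """Split 'key=value key=value' into a dict, converting digits to int."""
    kwargs = {}
    for token in options.split():
        key, sep, value = token.partition("=")
        if not sep or not key:
            raise ValueError(f"malformed option: {token!r}")
        kwargs[key] = int(value) if value.isdigit() else value
    return kwargs


opts = parse_db_options(
    "keepalives=1 keepalives_idle=30 keepalives_interval=10 keepalives_count=5"
)

# Approximate time for TCP keepalive to declare an idle, unresponsive peer
# dead: the idle period plus one interval per unanswered probe.
detect_after = opts["keepalives_idle"] + (
    opts["keepalives_interval"] * opts["keepalives_count"]
)
print(opts, detect_after)
```

With the values above, a dead peer would be noticed after roughly 80 seconds of silence.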

Sammyant commented 10 months ago

I don't think this is a Grafana OnCall bug. It looks more related to HAProxy in front of the Patroni cluster: https://github.com/zalando/patroni/issues/820. If we take HAProxy out of the path for testing, the issue doesn't reproduce.