openedx / event-bus-kafka

Kafka implementation for Open edX event bus
GNU Affero General Public License v3.0

Better error message when consumer not authorized for topic #226

Open rgraber opened 7 months ago

rgraber commented 7 months ago

When we adjusted ACLs for some Kafka topics, a consumer started failing with a misleading error message (Missing ce_type header on message, cannot determine signal) that caused us to think there was a malformed message at the start of the topic that was blocking consumption.

The real error (either Broker: Topic authorization failed or Group authorization failed) was buried in the context data; we should figure out how to surface that error instead. This might involve checking for a None offset or other error indicators before we try inspecting the message headers.
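A minimal sketch of what such a pre-check might look like, not the library's actual code. It assumes the message object exposes `error()` and `offset()` the way `confluent_kafka.Message` does, and that the error object has a `str()` method like `confluent_kafka.KafkaError`. `BrokerReportedError`, `raise_if_error_event`, and the two `Fake*` classes are hypothetical names used only for illustration.

```python
class BrokerReportedError(Exception):
    """Hypothetical exception: Kafka delivered an error event, not a real message."""


def raise_if_error_event(msg):
    """Raise before any header inspection if msg looks like an error event.

    Assumes msg has error() and offset() methods, as confluent_kafka.Message does.
    """
    err = msg.error()
    if err is not None:
        # err.str() is the human-readable text, e.g. "Broker: Topic authorization failed"
        raise BrokerReportedError(
            f"Kafka reported an error instead of an event: {err.str()}"
        )
    if msg.offset() is None:
        # Secondary indicator mentioned in this ticket: real events carry an offset.
        raise BrokerReportedError(
            "Message has no offset, so it is probably not a real event"
        )


# Stand-ins for demonstration only; real code would get messages from the consumer.
class FakeKafkaError:
    def __init__(self, text):
        self._text = text

    def str(self):
        return self._text


class FakeMessage:
    def __init__(self, error=None, offset=None):
        self._error = error
        self._offset = offset

    def error(self):
        return self._error

    def offset(self):
        return self._offset


try:
    raise_if_error_event(
        FakeMessage(error=FakeKafkaError("Broker: Topic authorization failed"))
    )
except BrokerReportedError as exc:
    print(exc)
```

Calling a check like this before determining the signal would let the authorization failure reach the log directly, rather than the unrelated missing-header complaint.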

Original description

An error in the discovery consumer:

2024-01-26 14:00:27,100 ERROR 1 [edx_event_bus_kafka.internal.consumer] consumer.py:555 - Error consuming event from Kafka: UnusableMessageError('Missing ce_type header on message, cannot determine signal') in context full_topic='prod-course-authoring-xblock-lifecycle', consumer_group='course_discovery_prod' -- event details: {'partition': 0, 'offset': None, 'headers': None, 'key': None, 'value': b'Subscribed topic not available: prod-course-authoring-xblock-lifecycle: Broker: Topic authorization failed'}
Traceback (most recent call last):
  File "/edx/app/discovery/venvs/discovery/lib/python3.8/site-packages/edx_event_bus_kafka/internal/consumer.py", line 312, in _consume_indefinitely
    signal = self.determine_signal(msg)
  File "/edx/app/discovery/venvs/discovery/lib/python3.8/site-packages/edx_event_bus_kafka/internal/consumer.py", line 405, in determine_signal
    event_type = self._get_event_type_from_message(msg)
  File "/edx/app/discovery/venvs/discovery/lib/python3.8/site-packages/edx_event_bus_kafka/internal/consumer.py", line 426, in _get_event_type_from_message
    raise UnusableMessageError(
edx_event_bus_kafka.internal.consumer.UnusableMessageError: Missing ce_type header on message, cannot determine signal

It's unclear why the consumer is not able to move past this error.

robrap commented 7 months ago

Ideally, if the entire topic is not reachable and we can't get to any messages:

  1. The error should be more clear, and
  2. We should have alerting to immediately detect this, whether it is alerting that goes to the owner or us, or some combo (e.g. safety net).

dianakhuang commented 6 months ago

We believe this was caused by a misconfigured ACL, which has now been corrected. We should have better reporting on when this sort of thing happens so we can fix it.

robrap commented 6 months ago

@timmc-edx will look into rewriting this ticket, potentially splitting into two parts (error message and alerting).

timmc-edx commented 6 months ago

I've updated this ticket, and there are already a couple of tickets to cover the alerting side of things:

dianakhuang commented 6 months ago

After investigating this issue on DataDog, it seems like the consumer lag metric wasn't being recorded for this topic at all before we fixed the ACL. We will probably need to make alerts for this sort of thing based on logs (once we get logs in DataDog, probably).
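As a rough illustration of log-based detection, the sketch below filters log lines for the two authorization failures named in this ticket. It is an assumption about how such a check could be prototyped locally, not a Datadog monitor definition; `AUTH_FAILURE` and `find_authorization_failures` are hypothetical names.

```python
import re

# Matches the two error texts mentioned in this ticket.
AUTH_FAILURE = re.compile(r"(Topic|Group) authorization failed")


def find_authorization_failures(log_lines):
    """Return the log lines that mention a topic or group authorization failure."""
    return [line for line in log_lines if AUTH_FAILURE.search(line)]


sample = [
    "ERROR Error consuming event from Kafka: ... Broker: Topic authorization failed",
    "INFO Consumer started",
]
print(find_authorization_failures(sample))
```

Because a blocked consumer may emit no lag metric at all, matching on the error text itself avoids depending on that metric being present.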