Subset of Tracking Log Events on Message Bus

jmakowski1123 commented 1 year ago

To extend our event handling capabilities we would like to be able to put tracking logs onto the event bus. This epic is to track that work, and the first consumer of those logs - event-routing-backends optionally using it instead of the Celery-based async backend.

Existing task that should be completed as a prerequisite to this work: https://github.com/openedx/openedx-events/issues/210

High level tasks to be groomed:

[x] Create a single “tracking log” OpenEdxPublicSignal that can be used for all tracking log events: https://github.com/openedx/openedx-events/issues/215
[x] Modify event-tracking to have an event bus “backend” similar to the existing Celery async one for publishing tracking log events to the bus, make sure this backend can take an "allow list" and a "deny list" of specific events that operators can configure: https://github.com/openedx/event-tracking/pull/246
[x] Create an event bus consumer of the “tracking log” OpenEdxPublicSignal in event-routing-backends that operates similar to the way it does now, but instead of creating a celery task for every transformed event it will batch them before sending to Ralph. Something like “send every 100 events or 2 seconds”: https://github.com/openedx/event-routing-backends/pull/361 and https://github.com/openedx/event-routing-backends/pull/388

bmtcril commented 1 year ago

@Ian2012 this is as far as I've gotten today on this work, still need to write the other 2 tickets

mariajgrimaldi commented 1 year ago

I have a few comments regarding Create a single “tracking log” OpenEdxPublicSignal:

Each Open edX Event has its own data class. In case of having a single tracking log event, we should use a generic data class capable of accepting anything inside the event field which is the one that varies among events -- please let me know if I'm wrong. I imagine our data class would look something like this:

@attr.s(frozen=True)
class ContextTrackingLogBase:
    course_id = attr.ib(type=str)
    org_id = attr.ib(type=str)
    path = attr.ib(type=str)
    user_id = attr.ib(type=str)

Class ContextTrackingLog(ContextTrackingLogBase):
    course_user_tags = attr.ib(type=Optional[dict])
    module = attr.ib(type=Optional[dict])

@attr.s(frozen=True)
class TrackingLogData:
    accept_language = attr.ib(type=str)
    agent = attr.ib(type=str)
    context = attr.ib(type=ContextTrackingLog)
    event = attr.ib(type=dict)
    event_source = attr.ib(type=str)
    event_type = attr.ib(type=str)
    host = attr.ib(type=str)
    ip = attr.ib(type=str)
    name = attr.ib(type=str)
    page = attr.ib(type=str)
    referer = attr.ib(type=str)
    session = attr.ib(type=str)
    time = attr.ib(type=str)
    username = attr.ib(type=str)

Now, the event field varies according to the event we're sending. So, we could only validate the tracking log common data using the data class we just created (which happens automatically when sending the event). But we couldn't do much with the info inside the event field -- at least as it is right now.

All Open edX Events are namespaced by an architectural subdomain. For example, events related to a student's enrollment belong to the subdomain learning, so they live inside the /learning submodule and have learning as part of their event_type:

org.openedx.learning.course.enrollment.created.v1

Each event_type is created according to this ADR. Now, how would this look like with a single tracking log event? 🤔 I see in the documentation of the events that some events are categorized as student or course. I don't know if this could be used to create more than one event so we can continue following the naming convention.

Also, tracking logs have their own event_type, which differs from the Open edX Events event_type mentioned before. I hope the tracking logs event_type does not confuse the user.
When producing Open edX Events to the event bus, we'll need this info. That means we must include it in the tracking log data class and specify it while sending the event -- at least the first three fields since the other ones won't change among events.

mariajgrimaldi commented 1 year ago

About subdomains

After reviewing architectural subdomains again, specifically after seeing this image: Found here. I see that there's a bounded context called Analytics, which is traversal to all subdomains, now I wonder if we should create a tracking log Open edX event for each subdomain. 🤔

We might need to ask an expert 😉. What do you think @ormsbee?

bmtcril commented 1 year ago

I think this is all reasonable. Sadly a lot of the time various fields on the tracking log entries will be empty strings, and if you haven't I think it would make sense to confirm these are the only keys we're going to see. There exists some possibility that people will want to replay very old tracking logs which may have different or broken formats. You all are probably in a better position than I am to check that, though. :)

For the arch domain I personally think a single cross-cutting Analytics subdomain makes the most sense as shown in the image above and this one:

But it's a loosely held opinion and I could be convinced to go a different direction.

Ian2012 commented 1 year ago

Proper PRs has been created for this requirement:

https://github.com/openedx/openedx-events/pull/230

In the proposed approach, the data format was updated to match the schema defined here because data and context are dynamic variables I think is not important to define a rigid schema for the tracking log data:

event = {
    'name': name or UNKNOWN_EVENT_TYPE,
    'timestamp': datetime.now(UTC),
    'data': data or {},
    'context': self.resolve_context()
}

https://github.com/openedx/event-tracking/pull/225

bmtcril commented 1 year ago

We had a reminder from @robrap that the volume of these events may require us to look at making some of the event bus audit logging configurable to avoid performance issues related to log spam. As part of testing these PRs we should make sure that the associated logging is of an acceptable volume.

Ian2012 commented 1 year ago

The current state of this work is the following:

The signal is defined here (with a helper signal explained below): https://github.com/openedx/openedx-events/pull/230
The signal is emitted and captured here: https://github.com/openedx/event-tracking/pull/246. It's also sent to the event-bus but it's sending a second signal so that it doesn't enters in a infinite loop receiving/producing the same event.
The consumer code is here: https://github.com/openedx/event-routing-backends/pull/361

This can be tested by defining the following settings in your environment:

      EVENT_TRACKING_BACKENDS = {
        'event_bus': {
          'ENGINE':  'eventtracking.backends.event_bus.EventBusRoutingBackend',
        },
      }
      EVENT_BUS_PRODUCER = 'edx_event_bus_redis.create_producer'
      EVENT_BUS_REDIS_CONNECTION_URL = 'redis://@redis:6379/'
      EVENT_BUS_TOPIC_PREFIX = 'dev'
      EVENT_BUS_CONSUMER = 'edx_event_bus_redis.RedisEventConsumer'

@robrap can you take a look at the changes here: https://github.com/openedx/event-tracking/pull/246

The workaround is basically to avoid an inifite loop there by re-emitting the signals. ERB can't be converted in an IDA as ERB depends on accesing the same database as the LMS, so it's better to consume the events in the same service.

External services still can access to the event bus data by using TRACKING_EVENT_EMITTED, but TRACKING_EVENT_EMITTED_TO_BUS is used here internally.

Also, this can be easily run as:

tutor dev exec lms ./manage.py lms consume_events -t analytics -g event_routing_backends --extra '{"consumer_name": "aspects"}'

cc @felipemontoya @mariajgrimaldi @bmtcril

robrap commented 1 year ago

I have not reviewed any code, but simply providing https://github.com/openedx/openedx-events/issues/79 with some past discussion around avoiding infinite loops.

bmtcril commented 8 months ago

The infinite loop problem has been addressed here: https://github.com/openedx/openedx-events/pull/312 While there is still an open PR for batching events, I think the work completed is sufficient to close this out.

openedx / wg-data

Subset of Tracking Log Events on Message Bus #28

About subdomains