openedx / wg-data

Tracking work and progress of the Open edX Data Working Group

Subset of Tracking Log Events on Message Bus #28

Closed jmakowski1123 closed 8 months ago

jmakowski1123 commented 1 year ago

To extend our event handling capabilities we would like to be able to put tracking logs onto the event bus. This epic is to track that work, and the first consumer of those logs - event-routing-backends optionally using it instead of the Celery-based async backend.

Existing task that should be completed as a prerequisite to this work: https://github.com/openedx/openedx-events/issues/210

High level tasks to be groomed:

bmtcril commented 1 year ago

@Ian2012 this is as far as I've gotten today on this work, still need to write the other 2 tickets

mariajgrimaldi commented 1 year ago

I have a few comments regarding Create a single “tracking log” OpenEdxPublicSignal:

from typing import Optional

import attr


# Context keys expected on every tracking log entry.
@attr.s(frozen=True)
class ContextTrackingLogBase:
    course_id = attr.ib(type=str)
    org_id = attr.ib(type=str)
    path = attr.ib(type=str)
    user_id = attr.ib(type=str)


# Context keys that only some entries carry.
@attr.s(frozen=True)
class ContextTrackingLog(ContextTrackingLogBase):
    course_user_tags = attr.ib(type=Optional[dict])
    module = attr.ib(type=Optional[dict])


# Top-level fields common to all tracking log entries.
@attr.s(frozen=True)
class TrackingLogData:
    accept_language = attr.ib(type=str)
    agent = attr.ib(type=str)
    context = attr.ib(type=ContextTrackingLog)
    event = attr.ib(type=dict)  # free-form payload, varies per event
    event_source = attr.ib(type=str)
    event_type = attr.ib(type=str)
    host = attr.ib(type=str)
    ip = attr.ib(type=str)
    name = attr.ib(type=str)
    page = attr.ib(type=str)
    referer = attr.ib(type=str)
    session = attr.ib(type=str)
    time = attr.ib(type=str)
    username = attr.ib(type=str)

Now, the event field varies according to the event we're sending. So, we could only validate the tracking log common data using the data class we just created (which happens automatically when sending the event). But we couldn't do much with the info inside the event field -- at least as it is right now.
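To make the limitation concrete, here is a minimal sketch of what checking only the common fields looks like. It is not the openedx-events validation code: it uses stdlib dataclasses instead of attrs to stay dependency-free, and `TrackingLogSketch` and `check_common_fields` are hypothetical names with a cut-down set of fields.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TrackingLogSketch:
    """Cut-down, hypothetical stand-in for TrackingLogData."""

    name: str
    event_type: str
    username: str
    event: dict  # free-form payload; its shape depends on the event


def check_common_fields(log):
    """Return the names of top-level fields whose value has the wrong type."""
    return [
        f.name
        for f in fields(log)
        if not isinstance(getattr(log, f.name), f.type)
    ]


enrollment = TrackingLogSketch(
    name="edx.course.enrollment.activated",
    event_type="edx.course.enrollment.activated",
    username="alice",
    event={"course_id": "course-v1:edX+DemoX+Demo", "mode": "audit"},
)
video = TrackingLogSketch(
    name="play_video",
    event_type="play_video",
    username="alice",
    event={"id": "abc123", "currentTime": 12.5},
)
broken = TrackingLogSketch(
    name="play_video", event_type="play_video", username=None, event={}
)

# Two very different payloads both pass, because only the top-level
# field types are inspected; the contents of `event` are never checked.
print(check_common_fields(enrollment))
print(check_common_fields(video))
# A wrong top-level type is caught.
print(check_common_fields(broken))
```

Anything stricter about the `event` payload would need a per-event schema, which is the trade-off raised above.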

org.openedx.learning.course.enrollment.created.v1

Each event_type is created according to this ADR. Now, what would this look like with a single tracking log event? 🤔 I see in the events documentation that some events are categorized as student or course. I don't know if this could be used to create more than one event so we can continue following the naming convention.
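As a rough illustration of the naming question, the sketch below splits an event type into its parts. It assumes the layout is a two-label reverse DNS prefix, then subdomain, subject, action, and version, which is my reading of the example above rather than a quote from the ADR. `split_event_type` is a hypothetical helper, and the second event type is made up for illustration only.

```python
from typing import NamedTuple


class EventTypeParts(NamedTuple):
    reverse_dns: str
    subdomain: str
    subject: str
    action: str
    version: str


def split_event_type(event_type: str) -> EventTypeParts:
    """Split a dotted event type, assuming a two-label reverse DNS prefix."""
    parts = event_type.split(".")
    if len(parts) < 6:
        raise ValueError(f"too few segments in event type: {event_type!r}")
    return EventTypeParts(
        reverse_dns=".".join(parts[:2]),
        subdomain=parts[2],
        subject=".".join(parts[3:-2]),  # the subject may span several labels
        action=parts[-2],
        version=parts[-1],
    )


existing = split_event_type("org.openedx.learning.course.enrollment.created.v1")
print(existing)

# Hypothetical name for a single tracking log event in an Analytics subdomain.
single = split_event_type("org.openedx.analytics.tracking.event.emitted.v1")
print(single.subdomain, single.subject, single.action)
```

With a single event, the subdomain segment is the only place where the student/course distinction could surface, which is why one event per category would mean one event type per subdomain.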

mariajgrimaldi commented 1 year ago

About subdomains

After reviewing the architectural subdomains again, and specifically this image (image, found here), I see that there's a bounded context called Analytics that cuts across all subdomains. Now I wonder whether we should create a tracking log Open edX event for each subdomain. 🤔

We might need to ask an expert 😉. What do you think @ormsbee?

bmtcril commented 1 year ago

I think this is all reasonable. Sadly, various fields on the tracking log entries will often be empty strings, and if you haven't already, I think it would make sense to confirm these are the only keys we're going to see. There is some possibility that people will want to replay very old tracking logs, which may have different or broken formats. You all are probably in a better position than I am to check that, though. :)

For the arch domain I personally think a single cross-cutting Analytics subdomain makes the most sense as shown in the image above and this one: image

But it's a loosely held opinion and I could be convinced to go a different direction.

Ian2012 commented 1 year ago

PRs have been created for this requirement:

In the proposed approach, the data format was updated to match the schema defined here. Because data and context are dynamic, I don't think it is important to define a rigid schema for the tracking log data:

event = {
    'name': name or UNKNOWN_EVENT_TYPE,
    'timestamp': datetime.now(UTC),
    'data': data or {},
    'context': self.resolve_context()
}

bmtcril commented 1 year ago

We had a reminder from @robrap that the volume of these events may require us to look at making some of the event bus audit logging configurable to avoid performance issues related to log spam. As part of testing these PRs we should make sure that the associated logging is of an acceptable volume.

Ian2012 commented 1 year ago

The current state of this work is the following:

This can be tested by defining the following settings in your environment:

      EVENT_TRACKING_BACKENDS = {
        'event_bus': {
          'ENGINE':  'eventtracking.backends.event_bus.EventBusRoutingBackend',
        },
      }
      EVENT_BUS_PRODUCER = 'edx_event_bus_redis.create_producer'
      EVENT_BUS_REDIS_CONNECTION_URL = 'redis://@redis:6379/'
      EVENT_BUS_TOPIC_PREFIX = 'dev'
      EVENT_BUS_CONSUMER = 'edx_event_bus_redis.RedisEventConsumer'

@robrap can you take a look at the changes here: https://github.com/openedx/event-tracking/pull/246

The workaround is basically to avoid an infinite loop there by re-emitting the signals. ERB can't be converted into an IDA because it depends on accessing the same database as the LMS, so it's better to consume the events in the same service.

External services can still access the event bus data by using TRACKING_EVENT_EMITTED, but TRACKING_EVENT_EMITTED_TO_BUS is used here internally.

Also, this can be easily run as:

tutor dev exec lms ./manage.py lms consume_events -t analytics -g event_routing_backends --extra '{"consumer_name": "aspects"}'

cc @felipemontoya @mariajgrimaldi @bmtcril

robrap commented 1 year ago

I have not reviewed any code; I am simply providing https://github.com/openedx/openedx-events/issues/79, which has some past discussion around avoiding infinite loops.

bmtcril commented 8 months ago

The infinite loop problem has been addressed here: https://github.com/openedx/openedx-events/pull/312 While there is still an open PR for batching events, I think the work completed is sufficient to close this out.
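
For readers unfamiliar with the loop being discussed: when the service that publishes a signal to the bus also consumes it, re-sending the consumed signal locally would publish it again. The toy sketch below shows one common way to break that cycle, by marking signals that came from the bus. It is not the code from the linked PR; `ToySignal`, `publish_to_bus`, `consume_from_bus`, and the `from_event_bus` flag are all illustrative assumptions, and a plain list stands in for the bus topic.

```python
class ToySignal:
    """Minimal stand-in for a Django-style signal."""

    def __init__(self, name):
        self.name = name
        self.receivers = []

    def connect(self, receiver):
        self.receivers.append(receiver)

    def send(self, **kwargs):
        for receiver in self.receivers:
            receiver(signal=self, **kwargs)


bus = []  # stand-in for an event bus topic
TRACKING_EVENT = ToySignal("tracking_event")


def publish_to_bus(signal, payload, from_event_bus=False, **kwargs):
    """Producer-side receiver: forward locally emitted signals to the bus."""
    if from_event_bus:
        # The signal was re-emitted by the consumer; publishing it again
        # would feed the consumer forever.
        return
    bus.append(payload)


def consume_from_bus():
    """Consumer: re-emit each bus message locally, marked as bus-originated."""
    delivered = 0
    while bus:
        payload = bus.pop(0)
        TRACKING_EVENT.send(payload=payload, from_event_bus=True)
        delivered += 1
    return delivered


TRACKING_EVENT.connect(publish_to_bus)
TRACKING_EVENT.send(payload={"name": "play_video"})  # local emit, goes to bus
delivered = consume_from_bus()
print(delivered, bus)
```

Without the `from_event_bus` check, `consume_from_bus` would never drain the list in this toy setup, since every re-emitted signal would be appended again.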