mitodl / ol-data-platform

Pipeline definitions for managing data flows to power analytics at MIT Open Learning
BSD 3-Clause "New" or "Revised" License

edx.org retired users issue #672

Closed rachellougee closed 1 year ago

rachellougee commented 1 year ago

Description

Some retired users are dropping out of int__edxorg__mitx_users because their username and email have been replaced with values such as retired__user_86b9376060f9e5xxxx@retired.invalid in either mitx_person_course or email_opt_in.

Currently, int__edxorg__mitx_users is generated from user records in three source tables: mitx_person_course, mitx_user_info_combo, and email_opt_in (for the latest email). However, when a user is marked as retired in edxorg, their email and username are replaced by a hash value. The user is then removed from int__edxorg__mitx_users because their username and email can no longer be matched across the three source tables.

Because we inner join int__edxorg__mitx_users in the enrollments and certificates models, the enrollments and certificates of these removed users are also dropped from our models.
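The drop described above can be reproduced in miniature. The sketch below uses Python's built-in sqlite3 with two hypothetical, heavily simplified stand-ins for the source tables (the real schemas and the real dbt SQL are not shown in this issue). It assumes the match is made on username and that one table has already received the retired placeholder while the other has not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, simplified stand-ins for the real source tables.
    CREATE TABLE person_course (username TEXT, course_id TEXT);
    CREATE TABLE user_info_combo (username TEXT, email TEXT);

    -- 'bob' is already retired in person_course but not yet in user_info_combo.
    INSERT INTO person_course VALUES
        ('alice', 'course-A'),
        ('retired__user_abc123', 'course-A');
    INSERT INTO user_info_combo VALUES
        ('alice', 'alice@example.com'),
        ('bob', 'bob@example.com');
""")

# Inner join on username: the retired row finds no partner and disappears.
inner_rows = conn.execute("""
    SELECT pc.username, pc.course_id
    FROM person_course AS pc
    INNER JOIN user_info_combo AS u ON pc.username = u.username
    ORDER BY pc.username
""").fetchall()

# Left join: the row survives, with NULL for the unmatched user attributes.
left_rows = conn.execute("""
    SELECT pc.username, pc.course_id, u.email
    FROM person_course AS pc
    LEFT JOIN user_info_combo AS u ON pc.username = u.username
    ORDER BY pc.username
""").fetchall()

print(inner_rows)
print(left_rows)
```

The left join is shown only to contrast with the inner join; it is not necessarily how the fix for this issue was implemented.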

Expected Behavior

Retired users should remain in int__edxorg__mitx_users, marked with is_active = false or some other indication that the user is retired. They should not be dropped from the user table, and their enrollments and certificates should remain in the corresponding models.
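One possible indication that a user is retired is a flag derived from the placeholder pattern quoted in the description. The following is a minimal sketch, assuming retired usernames start with retired__user_ and retired emails end with @retired.invalid. The function name looks_retired is hypothetical and is not taken from the project's code.

```python
def looks_retired(username, email):
    """Heuristic: True if either value carries the retired placeholder.

    Assumes the placeholder format quoted in the issue description;
    None values are treated as empty strings.
    """
    username = (username or "").lower()
    email = (email or "").lower()
    return username.startswith("retired__user_") or email.endswith("@retired.invalid")


print(looks_retired("retired__user_86b9", None))
print(looks_retired("alice", "alice@example.com"))
```

A False result here only means the placeholder was not seen. Rows that have not yet been rewritten upstream would not be flagged, so this cannot serve as a reliable is_active signal on its own.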

Actual Behavior

Retired users are removed from int__edxorg__mitx_users, and their enrollments and certificates no longer exist in the corresponding models.

rachellougee commented 1 year ago

After talking to Jon about retired users: the data inconsistency must be related to synchronization. The email_opt_in table is refreshed frequently, while mitx_user_info_combo and person_course are refreshed less frequently, so username and email are not in sync between these 3 tables. We would expect users to be retired in all 3 tables eventually. In short, we still need a way to ensure we don't drop retired users because of data inconsistency.

rachellougee commented 1 year ago

I didn't add is_active to edxorg users, as there is no solid way to determine whether users are active based on the data in mitx_user_info_combo and email_opt_in. Users who are marked as retired__user_xxx@retired.invalid have been retired by edX, but IRx doesn't reprocess data for very old courses, so those rows remain unchanged and their usernames are not marked as retired. Other than that, https://github.com/mitodl/ol-data-platform/pull/689 addresses the issue where newly retired users drop out of the edxorg user table.