
My Learning Analytics (MyLA)

Issue in cron when same user has multiple sis_names #1559

Open jonespm opened 9 months ago

jonespm commented 9 months ago

Thank you for contributing to this project!

Describe the bug (Tell us what happens instead of the expected behavior):

It was noted by Unizin that when running their cron, some user_ids can come back with multiple sis_names set. This could have implications for application behavior as well as impersonation possibilities. I didn't see anything like this on our local instance, but that doesn't mean it can't happen. This behavior was introduced in the UDP switch in #1443, because Kaltura only stores the email address and not other identifiers, and the UDP isn't storing non-roster user info.

So for example these users

| user_id | sis_name | course_id | enrollment_type |
| --- | --- | --- | --- |
| 10100000001234567 | matt.jones@gmail.com | 101000000000001 | StudentEnrollment |
| 10160000001234567 | matt.jones@example.edu | 101000000000001 | StudentEnrollment |

are both changed by the cron to be this:

| user_id | sis_name | course_id | enrollment_type |
| --- | --- | --- | --- |
| 10100000001234567 | matt.jones | 101000000000001 | StudentEnrollment |
| 10160000001234567 | matt.jones | 101000000000001 | StudentEnrollment |

It was also noted that this table somehow contains duplicate rows. This table really should be distinct across all 4 columns.
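
As a sanity check, duplicates like that could be surfaced with a query along these lines (a sketch only; `myla_user` is a placeholder for whatever the actual MyLA user table is called):

```sql
-- Sketch: list (user_id, sis_name, course_id, enrollment_type) combinations
-- that appear more than once. "myla_user" is a placeholder table name.
select user_id, sis_name, course_id, enrollment_type, count(*) as row_count
from myla_user
group by user_id, sis_name, course_id, enrollment_type
having count(*) > 1;
```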

So I think we need a better fix here. I believe the best solution might be to store both the sis_name "as-is" along with the as the email_address "as-is" rather than trying to manipulate it in the query to support Kaltura. Then in the code we can handle the special case for Kaltura if one of the value is blank.
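
A rough sketch of how the select list could look under that approach, using the pe/p2 aliases from the temporary-fix snippet below (the separate email_address output column is a proposed addition, not something that exists today):

```sql
-- Sketch: keep both identifiers as-is instead of collapsing them into a
-- single sis_name. These two expressions would replace the CASE expression
-- in the cron query's select list; email_address is a proposed new column.
p2.sis_ext_id           as sis_name,       -- SIS identifier, unmodified
lower(pe.email_address) as email_address,  -- email for the Kaltura match, may be null
```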

Kaltura will either have

`{"id": "https://aakaf.mivideo.it.umich.edu/caliper/info/user/jonespm", "type": "Person", "name": "Matthew Jones", "dateCreated": "2023-11-30T22:03:05.000Z", "dateModified": "2023-12-12T15:29:38.000Z"}`

or

`{"id": "https://aakaf.mivideo.it.umich.edu/caliper/info/user/jonespm+gmail.com", "type": "Person", "name": "Matthew Jones", "dateCreated": "2023-11-30T22:03:05.000Z", "dateModified": "2023-12-12T15:29:38.000Z"}`

A temporary fix identified for this was to change the query in the cron so that it no longer strips the domain from the email address:

case
   -- prefer the full (lowercased) email address when the UDP has one, so two
   -- different addresses no longer collapse to the same sis_name
   when pe.email_address is not null then lower(pe.email_address)
   -- otherwise fall back to the SIS external ID
   else p2.sis_ext_id end as sis_name,

We need to fully test and identify where this ID is used to ensure there are no regressions.

Steps to Reproduce:

  1. TBD - Will probably involve adding multiple sis_names for a user and enrolling them in a class to try to reproduce this.
jonespm commented 8 months ago

So I've spent a lot of time looking at this, and I don't see any way we can easily fix this using the UDP and our current cron process.

The current cron process removes the user table, and the needed information isn't fully available in the UDP for certain users, as indicated in this issue comment. These users include non-instructional users (like local accounts and staff) who have a record in the UDP, but that record matches their Canvas SIS ID. The problem is that the value needs to be the email address or login name for Kaltura. AFAIK this is nowhere in the UDP, only in CD2 in the canvas.pseudonyms table and in the LTI launch.

There are a few ways forward with this.

1) Because the lti_new captures the user credentials at login, we could preserve this value and not remove it on every refresh of the cron. However, the user table is removed every run and is overloaded to contain information about the grades, so it would very likely need to be refactored first. Then we could still update the grade data but not the user names. The downside of this would be that a user wouldn't see any MiVideo events until the next day, so would we want to communicate that?

2) If we switch over to BigQuery, we can use CD2 for the pseudonym user information directly in the cron. Maybe we could also use CD2 for more of the data here; see the sketch below.
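
If we did go the CD2 route, the pseudonym lookup might look roughly like this. This is only a sketch; I haven't verified the exact CD2 schema, so the column names here (unique_id, sis_user_id, workflow_state) are assumptions to check against canvas.pseudonyms in BigQuery.

```sql
-- Sketch: pull the login/email identifier from CD2 instead of deriving it
-- in the UDP query. Column names are assumptions to verify against the
-- canvas.pseudonyms schema.
select
    p.user_id,
    p.unique_id,    -- login name / email address used at Canvas login
    p.sis_user_id   -- SIS identifier
from canvas.pseudonyms p
where p.workflow_state = 'active';
```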

Either solution would do it; they don't really go together, though the user table refactor could be part of both. I don't think this is necessary to fix now. I believe we could commit the temporary fix, but I'm not 100% sure that it doesn't have any side effects for us. I think we should remove this from the next release for now, as either of these 2 tasks is a pretty big effort.