openedx / tutor-contrib-aspects

The Open Analytics Reference System - Tutor plugin

Connect the external_user_ids to the User Retirement Process #449

Closed felipemontoya closed 10 months ago

felipemontoya commented 1 year ago

Now that we have an optional sink to transfer external_user_ids to ClickHouse, we would need to keep this table up to date. When a user is retired in the LMS, the view destroys the PII for that user: https://github.com/openedx/edx-platform/blob/a313b3fd20b4e12d3d1b0b35a0c5eaef1d0cc771/openedx/core/djangoapps/user_api/accounts/views.py#L1175

However, the update should also go through to ClickHouse, even if the delete operation might be costly.

This issue came to my attention thanks to the Spanish universities when discussing Aspects and GDPR.

felipemontoya commented 1 year ago

fyi @Ian2012

bmtcril commented 1 year ago

We should discuss the desired outcomes of such an operation. The default system doesn't store this information for this reason, but consistency of downstream data will need to be considered and decisions about where it's ok to keep aggregated or obfuscated data will need to be made before implementing anything.

felipemontoya commented 1 year ago

I think that is very reasonable. Discuss and decide before implementing anything.

In my view, the fact that the system does not store PII is a great starting point. All info from the xAPI statements is safe to store, as it is already anonymized.

For on-campus usage, however, having only anonymized data is not enough: teachers mostly want to understand trends in the course so they can still have an effect on individual students who are at risk of failing. To achieve this we can turn on the optional event_sink.external_id table from PR#43. This sink dumps data from the ExternalId model into ClickHouse. According to the code it copies:

            "external_user_id",
            "external_id_type",
            "username",
            "user_id",
            "dump_id",
            "time_last_dumped",

Now, the thing is that even if teachers get access to de-anonymized data for campus usage, this should not go on forever. Especially if a user deletes their account, the PII should no longer be accessible. I understand this as simply deleting the row in the sink table in ClickHouse (the LMS retirement process already does more than this), but since ClickHouse mostly appends new data, as it works now the old records would not be deleted when the row in the LMS is deleted.

The only desired outcome I propose in this issue is that the corresponding row in the sink table is deleted or obfuscated. Any other xAPI statement or dbt-processed facts table should remain untouched.
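
A minimal sketch of that outcome as a ClickHouse mutation, using the sink table named above (the user_id value is hypothetical):

    -- Mutations rewrite the affected data parts in the background rather
    -- than deleting rows in place, which is why they can be costly.
    ALTER TABLE event_sink.external_id
        DELETE WHERE user_id = 42;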

bmtcril commented 1 year ago

Right, the problems I can foresee with this are:

- We will very likely want to combine the data from the external id table in ClickHouse into downstream tables for performance reasons, meaning more tables will need to be touched (deleted from / updated / rebuilt) to clean up the PII
- If not carefully considered, operations on those downstream tables may cause data to change, confusing users
- If any downstream tables need to be rebuilt (for example if a new column is added in an update) and they inner join back to the external id table which no longer has rows for retired users, the data contained in them will change, confusing users

I'm sure there are other small issues, but I want to make sure that you're aware that just deleting those rows isn't a sufficient fix on its own. I don't see the delete itself as a huge problem unless there are a lot of users in the system. Currently at most I think we would have 3 rows per user in the LMS users database. If a lot of users retired in a short period of time it might cause performance issues, but probably limited to the tables / reports that reference the external id or tables downstream of it directly.

felipemontoya commented 1 year ago

> We will very likely want to combine the data from the external id table in ClickHouse into downstream tables for performance reasons, meaning more tables will need to be touched (deleted from / updated / rebuilt) to clean up the PII

I had not considered this. I imagined that since we already had the data anonymized, we would do all calculations with the actor_id, and if and when someone wanted to de-anonymize, that would be done by joining the profile table right in the graph query.

Right now I'm thinking about two possible paths worth exploring. The first would be to double down and build an obfuscation transformation so that we can alter the user_profile table as well. In this case I'm no longer sure if removing the actor_id to user_id mapping is necessary. That would probably also have limitations, as there might be cases where the obfuscation does not propagate well into the downstream tables.

The other path I can imagine at the moment would be to split the user_profile table into the PII fields and the non-PII fields. Then we could propose guidelines recommending that PII data is not combined into downstream tables, while maintaining the capability to further process things like language, gender or country.

> - If not carefully considered, operations on those downstream tables may cause data to change, confusing users
> - If any downstream tables need to be rebuilt (for example if a new column is added in an update) and they inner join back to the external id table which no longer has rows for retired users, the data contained in them will change, confusing users

These two are related to confusing users, which I think can be mitigated if we do obfuscation instead of deleting the rows.
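
As a sketch, obfuscation could be an in-place mutation that overwrites the PII columns while keeping the row, so downstream joins still resolve (table and column names taken from the sink above; the placeholder scheme is an assumption):

    -- Overwrite PII with stable placeholder values instead of deleting
    -- the row, so inner joins against this table keep returning a match.
    -- Assumes username is not part of the table's sort key, since key
    -- columns cannot be mutated.
    ALTER TABLE event_sink.external_id
        UPDATE username = concat('retired__', toString(user_id))
        WHERE user_id = 42;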

> If a lot of users retired in a short period of time it might cause performance issues

Do you think this risk warrants doing some rate limiting in the LMS? For instance, in the task that is dispatched when the user_profile changes.

Ian2012 commented 1 year ago

We can mitigate some of the impact on database performance by creating a command to re-generate the entire MVs, or by updating the records in downstream tables in batches covering multiple users if needed.
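
Batching could be as simple as issuing one mutation that covers many retired users instead of one mutation per user (IDs hypothetical):

    -- A single mutation for a batch of users triggers one background
    -- rewrite of the affected parts rather than one rewrite per user.
    ALTER TABLE event_sink.external_id
        DELETE WHERE user_id IN (42, 43, 44);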

bmtcril commented 1 year ago

I'm talking with Altinity about this now to get their advice. Rebuilding MV tables could result in lengthy downtimes once the database has significant data in it. One possibility is to use a DICTIONARY type to cache PII data in memory like we do for course data. That should be performant enough to join on, and act as a buffer to updates on the base PII tables, at the cost of using more ClickHouse memory.

I'll let you know what Altinity suggests.

bmtcril commented 1 year ago

So based on their feedback I think this is our best option:

  1. Update the ASPECTS_EVENT_SINK_USER_PROFILE_TABLE to partition data by user_id % 100 (creating 100 partitions) so that when users are being obfuscated or deleted, ClickHouse only modifies one partition instead of the whole table.
  2. Do the same for the anonymous id table. I'm not sure where you're getting it from, it doesn't seem to be in Aspects.
  3. Make a dictionary (probably complex_key_sparse_hashed) combining those tables that does lookups into the PII with the anonymous id as the key, and a reasonable LIFETIME, which will cause the entire dataset to reload (non-blocking) in the background every X seconds
  4. Any access to the PII must go through that dictionary; it gets joined in at report query time and should never be stored in a materialized view or downstream table
  5. Now when we get retirement requests we can use update mutations on the base tables with PII without severe impact on performance. The tables will only be queried at dictionary refresh time, and mutations will only be working on one partition at a time, minimizing the potential time blocking dictionary access to 1% of the data. (A sketch of steps 1 and 3 follows this list.)
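
A minimal sketch of steps 1 and 3, assuming hypothetical table/database names (the real ones come from the ASPECTS_EVENT_SINK_USER_PROFILE_TABLE setting) and a reduced column set:

    -- Step 1: partition the PII table so an obfuscation or delete
    -- mutation only rewrites 1/100th of the data.
    CREATE TABLE event_sink.user_profile
    (
        user_id UInt64,
        external_user_id String,
        username String,
        time_last_dumped DateTime
    )
    ENGINE = ReplacingMergeTree
    PARTITION BY user_id % 100
    ORDER BY (user_id, time_last_dumped);

    -- Step 3: an in-memory dictionary keyed on the anonymous id, reloaded
    -- in the background every 15 minutes (LIFETIME is in seconds). Shown
    -- reading from a single table for brevity; the real one would combine
    -- the profile and external id tables.
    CREATE DICTIONARY event_sink.user_pii_dict
    (
        external_user_id String,
        username String
    )
    PRIMARY KEY external_user_id
    SOURCE(CLICKHOUSE(DB 'event_sink' TABLE 'user_profile'))
    LAYOUT(COMPLEX_KEY_SPARSE_HASHED())
    LIFETIME(MIN 0 MAX 900);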

This makes a few tradeoffs:

  1. A dictionary with millions of users will take up significant RAM in ClickHouse, but joining on it will be very fast
  2. Not as fast as pre-baking the PII into a downstream table would be, though
  3. New user data will be delayed by up to the lifetime of the dictionary, so the setting of the lifetime is important to balance freshness vs. performance; I would probably start with 15 minutes and see how it feels
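
To illustrate step 4 above, a report query could then resolve PII at query time like this (the xAPI table name and actor_id column are assumptions):

    -- Complex-key dictionaries take the key as a tuple. The username is
    -- looked up per group key and never stored downstream.
    SELECT
        actor_id,
        dictGet('event_sink.user_pii_dict', 'username', tuple(actor_id)) AS username,
        count(*) AS events
    FROM xapi.xapi_events_all
    GROUP BY actor_id;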

I would guess that the retirement steps would live in event-sink-clickhouse and be run as part of LMS retirement. It would have to run the mutation logic in ClickHouse, which could take a very long time. I'm not sure off hand if there's a way to fire those off async, or if we'd rather make Celery tasks for them. There are definitely concerns about Celery tasks failing invisibly and users not being retired.
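
For what it's worth, ClickHouse mutations already run asynchronously by default (the mutations_sync setting makes the ALTER wait instead), and pending or failed mutations can be monitored from system.mutations, which could help catch retirements that never applied:

    -- Mutations that have not finished yet; latest_fail_reason stays
    -- empty unless a mutation has hit an error.
    SELECT mutation_id, command, is_done, latest_fail_reason
    FROM system.mutations
    WHERE is_done = 0;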

pomegranited commented 1 year ago

I'm thinking of tackling this with the following steps.

Comments/corrections/suggestions @bmtcril @felipemontoya @Ian2012 ?

ClickHouse/Alembic migrations in https://github.com/openedx/tutor-contrib-aspects/pull/529:

Retire user with openedx-event-sink-clickhouse:

Monitor event sink/user retirements:

bmtcril commented 1 year ago

This all sounds great. My expectation is that if the ASPECTS_ENABLE_PII flag is off, the table and dictionary will be empty, making my main concern about memory usage mostly moot. I think we will want a separate flag for whether retirement runs, defaulting to the same as ASPECTS_ENABLE_PII, just in case someone turns the flag on and then back off for some reason and still needs to retire data. Or perhaps it's just always on; I just want to minimize the chances of there being orphaned PII in the database.

pomegranited commented 12 months ago

@bmtcril I'm leaning towards leaving the "retire user" event sink on all the time, since User Retirement is in the platform by default. If ASPECTS_ENABLE_PII is false, then there won't be any data to delete, so that step should run quickly. And if someone does toggle ASPECTS_ENABLE_PII on and off, then user retirement will still happen.

bmtcril commented 12 months ago

Makes sense to me, and it's one less setting to manage. 👍

pomegranited commented 11 months ago

@bmtcril I think this issue can be resolved?

bmtcril commented 11 months ago

I believe so, @felipemontoya does the completed work meet your needs?

bmtcril commented 10 months ago

I believe this is set now, closing.