fyi @Ian2012
We should discuss the desired outcomes of such an operation. The default system doesn't store this information for this reason, but consistency of downstream data will need to be considered and decisions about where it's ok to keep aggregated or obfuscated data will need to be made before implementing anything.
I think that is very reasonable. Discuss and decide before implementing anything.
In my view, the fact that the system does not store PII is a great starting point. All information from the xAPI statements is already safe to store, since it is anonymized.
For on-campus usage, however, having only anonymized data is not enough: teachers mostly want to be able to understand trends in the course so they can still have an effect on individual students that are at risk of failing it. To achieve this we can turn on the optional event_sink.external_id table from PR #43. This sink dumps data from the ExternalId model into ClickHouse. According to the code it copies:
"external_user_id",
"external_id_type",
"username",
"user_id",
"dump_id",
"time_last_dumped",
Now, even if for campus usage the teachers get access to de-anonymized data, this should not go on forever. Especially if a user deletes their account, the PII should no longer be accessible. I understand this as simply deleting the corresponding row in the sink table in ClickHouse (the LMS retirement process already does more than this), but since ClickHouse mostly appends new data, as it works today the old records would not be deleted when the row in the LMS is deleted.
The only desired outcome I propose in this issue is that the corresponding row in the sink table is deleted or obfuscated; see the sketch below. Any other xAPI statement or DBT-processed facts table should remain untouched.
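For illustration, the delete could be a single ClickHouse mutation. This is just a minimal sketch, assuming the sink table is named event_sink.external_id and keeps the columns listed above; the real table name and schema may differ:

```sql
-- Minimal sketch: remove a retired user's rows from the sink table.
-- ALTER TABLE ... DELETE is a ClickHouse mutation: it rewrites the affected
-- data parts in the background instead of deleting rows in place.
ALTER TABLE event_sink.external_id
DELETE WHERE user_id = 12345;
```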
Right, the problems I can foresee with this are:

- We will very likely want to combine the data from the external id table in ClickHouse into downstream tables for performance reasons, meaning more tables will need to be touched (deleted from / updated / rebuilt) to clean up the PII
- If not carefully considered, operations on those downstream tables may cause data to change, confusing users
- If any downstream tables need to be rebuilt (for example if a new column is added in an update) and they inner join back to the external id table, which no longer has rows for retired users, the data contained in them will change, confusing users

I'm sure there are other small issues, but I want to make sure that you're aware that just deleting those rows isn't a sufficient fix on its own. I don't see the delete itself as a huge problem unless there are a lot of users in the system. Currently I think we would have at most 3 rows per user in the LMS users database. If a lot of users retired in a short period of time it might cause performance issues, but probably limited to the tables / reports that reference the external id or tables downstream of it directly.
> We will very likely want to combine the data from the external id table in ClickHouse into downstream tables for performance reasons, meaning more tables will need to be touched (deleted from / updated / rebuilt) to clean up the PII
I had not considered this. I imagined that we already had the data anonymized, so we would do all calculations with the actor_id, and if and when someone wanted to de-anonymize, that would be done by joining the profile table right in the graph query.
Right now I'm thinking about two possible paths worth exploring. The first would be to double down and make an obfuscation transformation so that we can alter the user_profile table as well. In this case I'm no longer sure if removing the actor_id to user_id mapping would still be necessary. That path probably also has limitations, as there might be cases where the obfuscation does not propagate well into the downstream tables.
The other path I can imagine at the moment would be to separate the user_profile table into the PII fields and the non-PII fields. Then we could propose guidelines recommending that PII data is never combined into downstream tables, while maintaining the capability to further process things like language, gender or country. A rough sketch of what that split could look like is below.
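To make the second path a bit more concrete, here is a hypothetical sketch of the split; every table and column name here is invented for illustration and is not how Aspects currently defines the sink:

```sql
-- Non-PII half: safe to combine into downstream/aggregate tables.
CREATE TABLE event_sink.user_profile_demographics
(
    user_id UInt64,
    language String,
    gender String,
    country String,
    time_last_dumped DateTime
)
ENGINE = ReplacingMergeTree(time_last_dumped)
ORDER BY user_id;

-- PII half: joined only at query time, so retirement can purge one table.
CREATE TABLE event_sink.user_profile_pii
(
    user_id UInt64,
    username String,
    name String,
    email String,
    time_last_dumped DateTime
)
ENGINE = ReplacingMergeTree(time_last_dumped)
ORDER BY user_id;
```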
> - If not carefully considered, operations on those downstream tables may cause data to change, confusing users
> - If any downstream tables need to be rebuilt (for example if a new column is added in an update) and they inner join back to the external id table which no longer has rows for retired users, the data contained in them will change, confusing users
These two are related to confusing users, which I think can be mitigated if we do obfuscation instead of deleting the rows.
> If a lot of users retired in a short period of time it might cause performance issues
Do you think this risk warrants doing some rate limiting in the LMS? For instance, in the task that is dispatched when the user_profile changes.
We can mitigate some of the impact on database performance by creating a command to re-generate the entire MVs, or by updating the records in downstream tables in batches, covering multiple users at once if needed.
I'm talking with Altinity about this now to get their advice. Rebuilding MV tables could result in lengthy downtimes once the database has significant data in it. One possibility is to use a DICTIONARY type to cache PII data in memory like we do for course data. That should be performant enough to join on, and act as a buffer to updates on the base PII tables, at the cost of using more ClickHouse memory.
I'll let you know what Altinity suggests.
So based on their feedback I think this is our best option:

- Alter the ASPECTS_EVENT_SINK_USER_PROFILE_TABLE to partition data by user_id % 100 (creating 100 partitions), so that when users are being obfuscated or deleted ClickHouse only modifies one partition instead of the whole table.
- Create a DICTIONARY (layout complex_key_sparse_hashed) combining those tables, which does lookups into the PII with the anonymous id as the key and has a reasonable LIFETIME, causing the entire dataset to reload (non-blocking) in the background every X seconds.

This makes a few tradeoffs: the dictionary uses more ClickHouse memory, and retirements only become visible after it next reloads. A sketch of what this could look like follows.
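A minimal sketch of those two pieces, assuming hypothetical table, column, and dictionary names (the actual Aspects migrations may look different):

```sql
-- 1. Partition the profile sink so a retirement only rewrites 1/100th
--    of the table's data parts.
CREATE TABLE event_sink.user_profile
(
    user_id UInt64,
    username String,
    name String,
    email String,
    time_last_dumped DateTime
)
ENGINE = ReplacingMergeTree(time_last_dumped)
PARTITION BY user_id % 100
ORDER BY user_id;

-- 2. In-memory dictionary over the PII, keyed by the anonymous external id,
--    reloading (non-blocking) in the background every 15 minutes.
CREATE DICTIONARY event_sink.user_pii_dict
(
    external_user_id String,
    user_id UInt64,
    username String,
    name String,
    email String
)
PRIMARY KEY external_user_id
SOURCE(CLICKHOUSE(QUERY '
    SELECT ex.external_user_id, p.user_id, p.username, p.name, p.email
    FROM event_sink.external_id AS ex
    INNER JOIN event_sink.user_profile AS p ON p.user_id = ex.user_id
'))
LAYOUT(COMPLEX_KEY_SPARSE_HASHED())
LIFETIME(MIN 900 MAX 900);
```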
I would guess that the retirement steps would live in event-sink-clickhouse and be run as part of LMS retirement. It would have to run the mutation logic in ClickHouse, which could take a very long time. I'm not sure offhand if there's a way to fire those off async, or if we'd rather make Celery tasks for them. There are definitely concerns about Celery tasks failing invisibly and users not being retired.
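On the async question: ClickHouse mutations are asynchronous by default, so the ALTER TABLE ... DELETE statement returns immediately and the rewrite happens in the background. Their progress can be watched in the system.mutations table; a small example, using the illustrative database name from the sketch above:

```sql
-- List retirement mutations that are still running, with any failure reason.
SELECT command, parts_to_do, is_done, latest_fail_reason
FROM system.mutations
WHERE database = 'event_sink' AND NOT is_done;
```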
I'm thinking of tackling this with the following steps.
Comments/corrections/suggestions @bmtcril @felipemontoya @Ian2012 ?
ClickHouse/Alembic migrations in https://github.com/openedx/tutor-contrib-aspects/pull/529:

- Partition the ASPECTS_EVENT_SINK_USER_PROFILE_TABLE by user_id % 100 (configurable)
- Partition the external_ids table (https://github.com/openedx/tutor-contrib-aspects/pull/521/) by user_id % 100
- Create the PII dictionary keyed on external_id, which refreshes every 15m (configurable)
- Join to this dictionary for all external_id or PII dataset/chart queries (worth an ADR); see the query sketch after this list

Retire user with openedx-event-sink-clickhouse:

- On the USER_RETIRE_LMS_MISC signal, queue a request to retire the user's data in ClickHouse
- Records will be deleted from the external_id and user_profile tables; the dictionary will refresh with these changes automatically
- Optimize the affected tables + partitions - ?
- Management command provided to manually retire a given user

Monitor event sink/user retirements.
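As a rough sketch of the dictionary join and the optimize step, reusing the hypothetical names from the dictionary sketch above (the events table name is likewise illustrative):

```sql
-- Resolve PII through the in-memory dictionary rather than joining the
-- raw sink tables (complex-key dictionaries take a tuple as the key).
SELECT
    actor_id,
    dictGetOrDefault('event_sink.user_pii_dict', 'username',
                     tuple(actor_id), '') AS username,
    count() AS events
FROM xapi.xapi_events_all
GROUP BY actor_id, username;

-- After a retirement mutation, compact just the touched partition;
-- with PARTITION BY user_id % 100 the partition id is user_id mod 100.
OPTIMIZE TABLE event_sink.user_profile PARTITION 45 FINAL;
```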
This all sounds great. My expectation is that if the ASPECTS_ENABLE_PII flag is off, the table and dictionary will be empty, making my main concern about memory usage mostly moot. I think we will want a separate flag for whether retirement runs, defaulting to the same value as ASPECTS_ENABLE_PII, just in case someone turns the flag on and then back off for some reason and still needs to retire data. Or perhaps it's just always on; I just want to minimize the chances of there being orphaned PII in the database.
@bmtcril I'm leaning towards leaving the "retire user" event sink on all the time, since User Retirement is in the platform by default. If ASPECTS_ENABLE_PII is false, then there won't be any data to delete, but that should run quickly. And if someone does toggle ASPECTS_ENABLE_PII on and off, then user retirement will still happen.
Makes sense to me, and it's one less setting to manage. 👍
@bmtcril I think this issue can be resolved?
I believe so, @felipemontoya does the completed work meet your needs?
I believe this is set now, closing.
Now that we have an optional sink to transfer external_user_ids to ClickHouse, we would need to keep this table up to date. When a user is retired in the LMS, the view destroys the PII information for the user: https://github.com/openedx/edx-platform/blob/a313b3fd20b4e12d3d1b0b35a0c5eaef1d0cc771/openedx/core/djangoapps/user_api/accounts/views.py#L1175
However, the update should also go through to ClickHouse, even if the delete operation might be costly.
This issue came to my attention thanks to the Spanish universities when discussing Aspects and GDPR.