briangrossman opened this issue 3 years ago
Sentry issue: MICROMASTERS-4P7
@annagav can you confirm my understanding:
Do you have any theory about why the error rates for refreshing tokens would increase to more than 10,000 per day, for a few days?
We use Redis to store the failed credentials for users. Maybe the thing Sar pointed out a few days ago is related: Redis is at 99% capacity.
@pdpinch I have run that task plenty of times and did not face any issue. It's not reproducible, so it is likely associated with the Redis memory max limit. We can't ignore it, but it's better to test it after fixing that memory issue.
Sorry @annagav, I asked a question on the other ticket that you already answered here.
> We use Redis to store the failed credentials for users
Why do we do that? How long do we store them for? Is it normal behavior for there to be so many?
@pdpinch I have suggested the solution on the other ticket to resolve that memory issue. Hopefully, it will fix it too alongside. I am just waiting for @shaidar to try it out on our expected server. https://github.com/mitodl/micromasters/issues/4908
@pdpinch There is a task that runs every 6 hours called batch_update_user_data. It tries to refresh users' data, but it checks whether authentication for the user failed the last three times the task ran, and if so it will not try to refresh that user again.
We also do the same for the freeze final grade task: we store all user ids that failed authentication so that we can complete the freeze process.
So basically there is a map of user_id to the number of times authentication failed, and another list of users not to update. So the size of each should not be more than the number of users in MM.
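The skip logic described above can be sketched as follows. This is a minimal illustration with hypothetical function names; a plain dict and set stand in for the Redis hash and set so the example is self-contained.

```python
# Hypothetical sketch of the failure tracking described above.
# In the real task these structures live in Redis; here a dict/set stand in.

FAILURE_THRESHOLD = 3  # skip a user after 3 consecutive auth failures

failure_counts = {}          # maps user_id -> consecutive 401 count
users_not_to_update = set()  # user ids the periodic task should skip

def record_auth_failure(user_id):
    """Increment the per-user failure count; skip the user after the threshold."""
    failure_counts[user_id] = failure_counts.get(user_id, 0) + 1
    if failure_counts[user_id] >= FAILURE_THRESHOLD:
        users_not_to_update.add(user_id)

def should_update(user_id):
    """The 6-hourly task skips users whose auth failed the last three runs."""
    return user_id not in users_not_to_update

record_auth_failure(42)
record_auth_failure(42)
assert should_update(42)      # only 2 failures so far, still updated
record_auth_failure(42)
assert not should_update(42)  # third failure: user is now skipped
```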
Is Redis the only place we store the users with failed authentication?
So when we drop the data in redis, we end up building it up again?
1) As far as I know, yes. 2) Yes, for currently failed users.
> So the size of each should not be more than the number of users in MM.
In production, we have 132,654 users. Is that enough to blow out Redis? Are we duplicating records between batch_update_user_data and freeze final grades?
So when the Redis cache is flushed, aren't we back to square one as far as populating the list of user ID's and checking if authentication for the user failed the last three times?
Yes, the maps for batch_update_user_data and freeze_final_grades are different, which means we are duplicating them. But freeze final grades only runs once a semester, and only for users enrolled in that semester's courses. Also, when freezing completes we call delete, so that memory should get freed up.
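The per-course freeze bookkeeping and its cleanup can be sketched like this. The key format is modeled on CACHE_KEY_FAILED_USERS_BASE_STR from the findings below; the function names are hypothetical, and a dict stands in for Redis.

```python
# Hypothetical sketch: per-course failed-user sets for grade freezing,
# deleted when freezing completes so the memory is freed.

cache = {}  # stands in for Redis

def failed_users_key(edx_course_key):
    # modeled on CACHE_KEY_FAILED_USERS_BASE_STR = "failed_users_{course_key}"
    return "failed_users_{}".format(edx_course_key)

def record_freeze_failure(edx_course_key, user_id):
    """Remember a user whose authentication failed during grade freezing."""
    cache.setdefault(failed_users_key(edx_course_key), set()).add(user_id)

def finish_freeze(edx_course_key):
    """When freezing completes, delete the key so the memory is released."""
    cache.pop(failed_users_key(edx_course_key), None)

record_freeze_failure("course-v1:MITx+1.0+2020", 7)
assert failed_users_key("course-v1:MITx+1.0+2020") in cache
finish_freeze("course-v1:MITx+1.0+2020")
assert failed_users_key("course-v1:MITx+1.0+2020") not in cache
```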
As per my findings, we are keeping the following sets of data for the users:

- CACHE_KEY_FAILED_USERS_NOT_TO_UPDATE = "failed_cache_update_users_not_to_update": the users we skip when updating; used for user updates with edX.
- CACHE_KEY_FAILURE_NUMS_BY_USER = "update_cache_401_failure_numbers": the total number of failures by all users.
- FIELD_USER_ID_BASE_STR = "user_{0}": the total number of failures against each user.
- CACHE_KEY_FAILED_USERS_BASE_STR = "failed_users_{course_run.edx_course_key}": this one has multiple copies, one per course, and is used for grading purposes with edX. CACHE_KEY_FAILED_USERS_BASE_STR is also used in the check_final_grade_freeze_status management command.

So, with those multiple copies against each user, our Redis memory hits its limit. If we keep adding courses, Redis memory consumption will keep growing.
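A rough back-of-the-envelope model shows why the per-course copies dominate: the global structures scale with the number of users, but the per-course sets scale with users times courses. The function below is purely illustrative arithmetic, not the project's code.

```python
# Illustrative worst-case entry count, assuming every user can appear
# in every per-course failed-users set (a deliberate upper bound).

def worst_case_entries(num_users, num_courses):
    # one global skip set + one failure-count field per user
    global_entries = num_users * 2
    # one failed_users_{course_key} set per course, up to num_users each
    per_course_entries = num_users * num_courses
    return global_entries + per_course_entries

# With a fixed user base, memory grows linearly as courses are added:
assert worst_case_entries(132_654, 1) < worst_case_entries(132_654, 10)
```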
Would it be hard to keep just one list of failed authentications, and stop tracking the failed authentications per course?
I have other thoughts about this, but they involve major changes to the APIs we are using to gather user data. I’m hoping that a change in the redis storage would be quicker.
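The consolidation being proposed can be sketched as a single global set, regardless of course. This is a hypothetical design sketch of the suggestion, not the project's actual implementation.

```python
# Hypothetical sketch of the proposal: one global set of failed
# authentications instead of one set per course.

failed_auth_users = set()  # single set for all courses

def record_failure(user_id):
    failed_auth_users.add(user_id)

def clear_failure(user_id):
    # e.g. after a successful re-authentication
    failed_auth_users.discard(user_id)

record_failure(7)
record_failure(7)  # repeats collapse: memory stays O(users), not O(users * courses)
assert len(failed_auth_users) == 1
```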
As per my findings, we are keeping two sets of data for users: failed-authentication users as a whole, and failed-authentication users against each course (failed_users_{course_run.edx_course_key}), which has multiple copies, one per course. With those multiple copies against each user, our Redis memory hits its limit.
The main cause of that error occurs while updating users in the Celery tasks: the task tries to refresh the expired token for user authentication, but it fails to authenticate some users, which might be because of either invalid credentials or the user having been updated on the edX site.
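The failure path just described can be sketched as follows. The function and parameter names are hypothetical; in the real task the refresh is an HTTP call to the edX OAuth token endpoint that can reject stale or revoked credentials.

```python
# Hypothetical sketch of the refresh-then-fail path behind the
# InvalidCredentialStored errors seen in Sentry.

class InvalidCredentialStored(Exception):
    """Raised when the stored refresh token is no longer valid on edX."""

def refresh_user_token(user_id, refresh_ok):
    # `refresh_ok` stands in for the edX response; the real call can
    # return 401 for invalid credentials or a user changed on the edX side.
    if not refresh_ok:
        raise InvalidCredentialStored("401 refreshing token for user {}".format(user_id))
    return "new-access-token"

raised = False
try:
    refresh_user_token(7, refresh_ok=False)
except InvalidCredentialStored:
    raised = True  # this is the error Sentry aggregates
assert raised
```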
> Would it be hard to keep just one list of failed authentications, and stop tracking the failed authentications per course? …
I had updated my findings comment even after your comment, so it would be nice if you could take another look. I am going to spend some more time on the next working day to figure out how we can optimize this, keeping your comment in mind.
Yes, that's possible. We can remove that course-specific data.
In the past four days, we've seen about 75,000 InvalidCredentialStored errors from dashboard.tasks.batch_update_user_data_subtasks in Sentry (link).