mitodl / micromasters

Portal for learners and course teams to access MITx Micromasters® programs
https://mm.mit.edu
BSD 3-Clause "New" or "Revised" License

Sentry issue: InvalidCredentialStored dashboard.tasks.batch_update_user_data_subtasks #4904

Open briangrossman opened 3 years ago

briangrossman commented 3 years ago

In the past four days, we've seen about 75,000 InvalidCredentialStored dashboard.tasks.batch_update_user_data_subtasks errors in Sentry (link)

Steps

Stacktrace

InvalidCredentialStored
Received a 400 status code from the OAUTH server

------------------------------------------------------------------
backends/utils.py in _send_refresh_request at line 23

    strategy = load_strategy()
    try:
        user_social.refresh_token(strategy)
    except HTTPError as exc:
        if exc.response.status_code in (400, 401,):
            raise InvalidCredentialStored(   <============
                message='Received a {} status code from the OAUTH server'.format(
                    exc.response.status_code),
                http_status_code=exc.response.status_code
            )
        raise
------------------------------------------------------------------
backends/utils.py in refresh_user_token at line 47
------------------------------------------------------------------
dashboard/api.py in refresh_user_data at line 732
------------------------------------------------------------------
sentry-io[bot] commented 3 years ago

Sentry issue: MICROMASTERS-4P7

pdpinch commented 3 years ago

@annagav can you confirm my understanding:

  1. This error is originating with the batch_update_user_data task
  2. After 3 failed attempts to update a user's data, we stop trying?

Do you have any theory about why the error rates for refreshing tokens would increase to more than 10,000 per day, for a few days?

annagav commented 3 years ago

We use Redis to store the failed credentials for users. Maybe the thing Sar pointed out a few days ago is related: Redis is at 99% capacity.

umarmughal824 commented 3 years ago

@pdpinch I have run that task plenty of times and did not face any issue. It's not reproducible, so it is likely associated with the Redis memory limit. We can't ignore it, but it's better to test it again after fixing that memory issue.

pdpinch commented 3 years ago

Sorry @annagav, I asked a question on the other ticket that you already answered here.

> We use Redis to store the failed credentials for users

Why do we do that? How long do we store them for? Is it normal behavior for there to be so many?

umarmughal824 commented 3 years ago

@pdpinch I have suggested a solution on the other ticket to resolve that memory issue. Hopefully, it will fix this one too. I am just waiting for @shaidar to try it out on the relevant server. https://github.com/mitodl/micromasters/issues/4908

annagav commented 3 years ago

@pdpinch There is a task that runs every 6 hours called batch_update_user_data. It tries to refresh users' data, but first it checks whether authentication for a given user failed the last three times the task ran; if so, it will not try to refresh that user again. We do the same for the freeze final grades task: we store all user ids that failed authentication so that we can complete the freeze process. So basically there is a map from user_id to the number of times authentication failed, and another list of users not to update (see the sketch below). The size of each should not be more than the number of users in MM.
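
For illustration, a minimal sketch of that bookkeeping, assuming redis-py, a hash for the per-user failure counts, and a set for the skip list; the helper functions are hypothetical, and the key names are the ones reported later in this thread:

    import redis

    # Key names as reported later in this thread; the helpers are hypothetical.
    CACHE_KEY_FAILURE_NUMS_BY_USER = "update_cache_401_failure_numbers"
    CACHE_KEY_FAILED_USERS_NOT_TO_UPDATE = "failed_cache_update_users_not_to_update"
    FIELD_USER_ID_BASE_STR = "user_{0}"
    MAX_FAILURES = 3  # "failed the last three times the task ran"

    con = redis.Redis()

    def record_auth_failure(user_id):
        # Bump this user's failure count in the shared hash.
        field = FIELD_USER_ID_BASE_STR.format(user_id)
        failures = con.hincrby(CACHE_KEY_FAILURE_NUMS_BY_USER, field, 1)
        if failures >= MAX_FAILURES:
            # After three failures, park the user in the skip set.
            con.sadd(CACHE_KEY_FAILED_USERS_NOT_TO_UPDATE, user_id)

    def should_skip(user_id):
        # batch_update_user_data would skip anyone in the skip set.
        return con.sismember(CACHE_KEY_FAILED_USERS_NOT_TO_UPDATE, user_id)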

pdpinch commented 3 years ago

Is Redis the only place we store the users with failed authentication?

So when we drop the data in redis, we end up building it up again?

annagav commented 3 years ago

1) As far as I know, yes. 2) Yes, for currently failed users.

pdpinch commented 3 years ago

> So the size of each should not be more than the number of users in MM.

In production, we have 132,654 users. Is that enough to blow out redis? Are we duplicating records between the batch_update_user_data and freeze final grades?
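
A rough back-of-envelope on that question, assuming ~100 bytes of Redis overhead per hash field or set member (an assumed figure, not a measurement):

    # Rough estimate only; 100 bytes per entry is an assumption, not measured.
    users = 132_654
    bytes_per_entry = 100
    per_structure_mb = users * bytes_per_entry / 1e6   # ~13 MB per user-keyed structure

    # The multiplier is the per-course copies: one failed_users_<course_key>
    # set per course run (the course-run count here is hypothetical).
    course_runs = 50
    worst_case_mb = course_runs * per_structure_mb     # ~660 MB if every user is in every set

On that estimate, a single user-keyed structure is cheap; it is the per-course duplication discussed below that multiplies the footprint.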

shaidar commented 3 years ago

So when the Redis cache is flushed, aren't we back to square one as far as populating the list of user IDs and checking if authentication for the user failed the last three times?

annagav commented 3 years ago

Yes, the maps for batch_update_user_data and freeze_final_grades are different, which means we are duplicating users between them. But freeze final grades only runs once a semester, and only for users enrolled in courses that semester. Also, when freezing completes we call delete, so that memory should get freed up.

umarmughal824 commented 3 years ago

As per my findings, we are keeping the following sets of data for users.

  1. CACHE_KEY_FAILED_USERS_NOT_TO_UPDATE = "failed_cache_update_users_not_to_update": the users whose authentication failed, as a whole. This is used when updating users against edX; we skip the users stored in this Redis cache.
  2. CACHE_KEY_FAILURE_NUMS_BY_USER = "update_cache_401_failure_numbers": the failure counts across all users.
  3. FIELD_USER_ID_BASE_STR = "user_{0}": the number of failures for each individual user.
  4. CACHE_KEY_FAILED_USERS_BASE_STR = "failed_users_{course_run.edx_course_key}": the users that failed per course (updating the edX grades cache failed, or the user does not have any grade). This means multiple copies, one copy per course. It is used for grading purposes with edX.

CACHE_KEY_FAILED_USERS_BASE_STR is used in the check_final_grade_freeze_status management command too.

So, with those multiple copies against each user, our Redis memory hits its limit. And if we keep adding courses, Redis memory consumption will keep growing.
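
To make that growth concrete, a hedged illustration of listing the per-course failure sets with redis-py (key names as above; every existing course run would contribute one key):

    import redis

    con = redis.Redis()
    # One failed_users_<course_key> set exists per course run, so the number
    # of these keys, and the memory they hold, grows with the course catalog.
    for key in con.scan_iter(match="failed_users_*"):
        print(key, con.scard(key))  # set name and how many user ids it holds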

pdpinch commented 3 years ago

Would it be hard to keep just one list of failed authentications, and stop tracking the failed authentications per course?

I have other thoughts about this, but they involve major changes to the APIs we are using to gather user data. I’m hoping that a change in the redis storage would be quicker.

> As per my findings, we are keeping two sets of data for users: failed authentication users as a whole, and failed authentication users against each course (failed_users_{course_run.edx_course_key}), which means multiple copies, one copy per course. So, with those multiple copies against each user, our Redis memory hits its limit.
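
A minimal sketch of the consolidation proposed here, assuming the per-course sets could be folded into one global set; the failed_users_all key and the whole migration are hypothetical:

    import redis

    con = redis.Redis()
    GLOBAL_KEY = b"failed_users_all"  # hypothetical consolidated key

    # Fold every per-course failed_users_<course_key> set into one global
    # set, then drop the per-course copy to reclaim memory.
    for key in con.scan_iter(match="failed_users_*"):
        if key != GLOBAL_KEY:
            con.sunionstore(GLOBAL_KEY, GLOBAL_KEY, key)
            con.delete(key)

The trade-off is that this loses the per-course information that the check_final_grade_freeze_status management command relies on.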

umarmughal824 commented 3 years ago

The main cause of the error: while updating users in the Celery tasks, we try to refresh the expired authentication token for each user. For some users the refresh is not authenticating, which might be because of either invalid credentials or the user having been changed on the edX site.

umarmughal824 commented 3 years ago

> Would it be hard to keep just one list of failed authentications, and stop tracking the failed authentications per course? I have other thoughts about this, but they involve major changes to the APIs we are using to gather user data. I’m hoping that a change in the redis storage would be quicker.
>
> As per my findings, we are keeping two sets of data for users: failed authentication users as a whole, and failed authentication users against each course (failed_users_{course_run.edx_course_key}), which means multiple copies, one copy per course. So, with those multiple copies against each user, our Redis memory hits its limit.

I had updated my findings comment after your comment, so it would be nice if you could take another look. I am going to spend some more time on the next working day to find the best way to optimize this, keeping your comment in mind.

umarmughal824 commented 3 years ago

> Would it be hard to keep just one list of failed authentications, and stop tracking the failed authentications per course?

Yes, that's possible. We can remove that course-specific data.