Race condition in getting app access token leads to BadCredentialsException

sentry-io[bot] commented 4 years ago

BadCredentialsException: 401 {'message': 'Bad credentials', 'documentation_url': 'https://developer.github.com/v3'}
(5 additional frame(s) were not displayed)
...
  File "packit/config/package_config.py", line 311, in get_package_config_from_repo
    path=config_file_name, ref=ref
  File "ogr/services/github/project.py", line 400, in get_file_content
    path=path, ref=ref
  File "github/Repository.py", line 1514, in get_contents
    parameters=url_parameters
  File "github/Requester.py", line 276, in requestJsonAndCheck
    return self.__check(*self.requestJson(verb, url, parameters, headers, input, self.__customConnection(url)))
  File "github/Requester.py", line 287, in __check
    raise self.__createException(status, responseHeaders, output)

This can be reliably reproduced by running the following script in 4 parallel processes:

import sys
import github

private_key_file = sys.argv[1]
app_id = int(sys.argv[2])

with open(private_key_file, 'r') as fp:
    private_key = fp.read()

def get_instance():
    integration = github.GithubIntegration(app_id, private_key)
    inst_id = integration.get_installation("packit-service", "hello-world").id.value
    integration = github.GithubIntegration(app_id, private_key)
    inst_id = integration.get_installation("packit-service", "hello-world").id.value
    inst_auth = integration.get_access_token(inst_id)
    print(inst_auth.expires_at)
    instance = github.Github(login_or_token=inst_auth.token)
    return instance

def get_repo(instance):
    repo = instance.get_repo(full_name_or_id="packit-service/hello-world")

while True:
    inst = get_instance()
    get_repo(inst)

csomh commented 4 years ago

I'll look into this.

csomh commented 4 years ago

With the PR above this should happen less often, but we'll need to wait for a time when more builds are happening at once.

I plan to revisit this issue after the architecture discussions next week.

sentry-io[bot] commented 4 years ago

Sentry issue: RED-HAT-0P-2J0

stale[bot] commented 4 years ago

This issue has been marked as stale because it hasn't seen any activity for the last 60 days.

Stale issues are closed after 14 days, unless the label is removed by a maintainer or someone comments on it.

This is done in order to ensure that open issues are still relevant.

Thank you for your contribution! :unicorn: :rocket: :robot:

(Note: issues labeled with pinned, security, bug or EPIC are never marked as stale.)

csomh commented 4 years ago

@TomasTomecek I moved this for us to discuss the possible solutions next Tue and to tackle it next sprint. Breaking up the tasks lead to the access token being requested more frequently, and so these error became more frequent. fyi @dhodovsk

@packit-service/the-packit-team This issue would need some attention and ideas for how to solve it.

TomasTomecek commented 4 years ago

oh, so this is the for so many 401s in sentry

yup, we should absolutely prioritize this

csomh commented 4 years ago

Some thoughts on this:

Tokens to access repositories are retrieved in ogr. This happens once for every GithubProject object. Most (all?) of the Celery tasks create their GithubProject object to interact with the repositories.

There is one token per repository.

With multiple workers picking tasks from the queue, there can be multiple tasks interacting with the same repository (not necessarily with the same PR).

Whenever a new token is retrieved for a repository the previous token is invalidated. If not made invalid, tokens are valid for an hour.

The race condition happens when:

Task A asks for and receives a token t1 for repository X.
Task B asks for and receives a token t2 for repository X. Token t1 becomes invalid.
Task A uses token t1 to interact with repository X.
Task A fails with BadCredentialsException.

To solve this we have the following options:

Create a central place to retrieve, store and renew tokens. Tasks should get the tokens for repositories they work with from this place, and ogr should be able to use this token when present instead of requesting a new one.
any other ideas? :thinking:

TomasTomecek commented 4 years ago

+1 I'd store them in DB (redis/psql, doesn't matter) and let p-s handle the token management. I'd say this is secure-ish since they last only an hour as you say.

csomh commented 4 years ago

tokman is deployed to prod, so this should happen less.

packit / packit-service

Race condition in getting app access token leads to BadCredentialsException #595