Race conditions at scale leading to security issues.

parikls commented 2 months ago

My use case:

Users are logging in via an SSO, and after that I'm using the access token of the logged-in user to obtain additional information from the provider. In my view I'm getting the access token using the provider property, e.g. provider.access_token. But when I'm running the code on a big scale (we have hundreds of thousands users) - sometimes one user can obtain the access token of the other user, which leads to serious security issues. This happens rarely, and only if multiple users are logging-in at the same time (microseconds differences). After investigating the source codes of the fastapi-sso, I've figured out that the problem is actually with the provider implementation. SSOBase class is stateful and a process_login coro has multiple awaits after fetching the token, so when there are multiple simultaneous users are logging in via the sso - this leads to a race condition, and the first user will obtain the token of the last user.

My code which I'm currently using.

lifespan.py


def create_my_sso_provider(app: FastAPI):
    discovery = {
        "authorization_endpoint": '...',
        "token_endpoint": '...',
        "userinfo_endpoint": '...',
    }
    provider = create_provider(
        name='provider,
        discovery_document=discovery,
        response_convertor=openid_from_response
    )

    app.teachable_sso = provider(
        client_id=...,
        client_secret=...,
        redirect_uri=...,
        scope=...,
    )

api.py

async def my_view(...):
    sso_user: OpenID = await app.teachable_sso.verify_and_process(request, redirect_uri=...)
    access_token, refresh_token = teachable_sso.access_token, teachable_sso.refresh_token
    await do_request_to_sso_provider_to_fetch_more_data(access_token)

I've wrote a very quick and dirty test case for the fastapi-sso to prove that this is actually an issue:

import asyncio
from unittest.mock import Mock, patch

from starlette.datastructures import URL

from fastapi_sso import create_provider

async def test__race_condition():
    discovery = {
        "authorization_endpoint": "https://authorization_endpoint",
        "token_endpoint": "https://token_endpoint",
        "userinfo_endpoint": "https://userinfo_endpoint",
    }
    factory = create_provider(name='provider', discovery_document=discovery)

    provider = factory(client_id='client_id', client_secret='client_secret')

    # mock for the response. will return a token which we've set
    class Response:

        def __init__(self, token: str):
            self.token = token

        def json(self):
            return {
                'refresh_token': self.token,
                'access_token': self.token,
                'id_token': self.token,
            }

    # mock of the httpx client
    class AsyncClient:

        post_responses = []  # list of the responses which a client will return for the `POST` requests

        def __init__(self):
            self.headers = {}

        async def __aenter__(self):
            return self

        async def __aexit__(self, exc_type, exc_val, exc_tb):
            pass

        async def post(self, *args, **kwargs):
            await asyncio.sleep(0)  # simulate a loop switch cos it'll happen during a real HTTP request
            return self.post_responses.pop()

        # we actually don't care what GET will return for this particular test,
        # but this method is required to fully run the `process_login` code
        async def get(self, *args, **kwargs):
            await asyncio.sleep(0)
            return Response(token='')

    with patch('fastapi_sso.sso.base.httpx') as httpx:
        httpx.AsyncClient = AsyncClient

        first_response = Response(token='first_token')
        second_response = Response(token='second_token')

        AsyncClient.post_responses = [second_response, first_response]  # reversed order because of `pop`

        async def process_login():
            # this coro will be executed concurrently.
            # completely not caring about the params
            request = Mock()
            request.url = URL('https://url.com?state=state&code=code')
            await provider.process_login(
                code='code',
                request=request,
                params=dict(state='state'),
                convert_response=False
            )
            return provider.access_token

        # process login concurrently twice
        tasks = [
            process_login(),
            process_login()
        ]
        results = await asyncio.gather(*tasks)

        # we would want to get the first and second tokens,
        # but we see that the first request actually obtained the second token as well
        assert results == [first_response.token, second_response.token]

tomasvotava commented 2 months ago

Hi @parikls, thank you so much for reporting this, I was foolishly convinced this was resolved by introducing the context manager (you should be using with provider: in your case), but I never honestly thought about requests happening asynchronously like this. I am currently AFK, but tomorrow I'll fix it and finally make the sso class state-less, which was my will all along. Thanks!

parikls commented 2 months ago

@tomasvotava the context manager won't help in this case unfortunately. Currently the workaround which works - is to create a new provider for each new request. Luckily I don't see a big difference in a performance/mem usage of our API containers after introducing this approach (it's live for 2 days already). But definitely would be better to solve this properly, instead of creating a new instances of provider, oauthlib classes, etc.

def create_my_provider():
    provider = create_provider(
        name=...,
        discovery_document=...,
        response_convertor=...
    )

    return provider(
        client_id=...,
        client_secret=...,
        redirect_uri=...,
        scope=...,
    )

async def my_view(sso_provider: Depends(create_my_provider)):
    ...

Thanks in advance!

P.S. I think I can confirm that the above approach solved the issue completely. At least we had the same amount of requests during the weekend, and 0 errors in a sentry.

parikls commented 2 months ago

@tomasvotava UPD: faced the same issue yesterday even with the provider creation per request. Probably some more shared state exist somewhere? Maybe on a class level? I haven't yet investigated this deeply, but want to keep you updated

tomasvotava commented 2 months ago

@parikls thanks a lot for keeping me up to date, I am afraid I cannot post a fix without changing the behavior of how access_token is retrieved (it cannot be on the instance if the instance is to be reusable), and so I will have to make some changes. If you are using the generic sso provider, it should always create a whole new class when you call create_provider, it seems weird to me that there should be any shared state and if there is, this bug actually goes even deeper :fearful: I double-checked the static attributes set on BaseSSO and the oauth client really gets recreated for each instance, how exactly do you catch this bug in a sentry?

parikls commented 2 months ago

@tomasvotava my SSO provider has a /me endpoint which allows me to get the user info, e.g. email, for the provided access token. So I'm using the access token to fetch the user email, and comparing that email with the OpenID.email (returned from verify_and_process). Simplified code looks like this:

sso_user: OpenID = await verify_and_process(
        request,
        redirect_uri=...
    )
access_token, refresh_token = my_sso.access_token, my_sso.refresh_token

me = await custom_provider_api.get_me(access_token=access_token)

if me['email'] != sso_user.email:
    logger.error('... error goes here ...')

tomasvotava commented 2 months ago

Ok, thanks for clarifying that, I was really hoping there was an error on your side, to be honest :smile: I know security isn't laughable thing, I assure you I take it seriously, I am just trying to gather as much intel as I can before moving on to solution, so that I can be sure that I actually resolve it properly this time. You are being very, very helpful, thanks!

parikls commented 2 months ago

@tomasvotava np =) Right now I'm going to utilize a simple asyncio lock, so we'll process an SSO callbacks on each container one by one, e.g.

from asyncio import Lock

LOGIN_LOCK = Lock()

async def my_view(...):
    async with LOGIN_LOCK:
        sso_user: OpenID = await verify_and_process(request, redirect_uri=...)
        access_token, refresh_token = teachable_sso.access_token, teachable_sso.refresh_token
    ...

And will post an update in a day or two

parikls commented 2 months ago

@tomasvotava FYI after adding a lock - issue is fully resolved

tomasvotava commented 2 months ago

@parikls thanks for letting me know, my planned solution does basically the same, but it will require everyone to change from using with provider: to async with provider:, I hope it's going to be alright for everyone, but I'd say security is more important than convenience.

tomasvotava / fastapi-sso

Race conditions at scale leading to security issues. #186