supabase / supavisor

A cloud-native, multi-tenant Postgres connection pooler.
https://supabase.github.io/supavisor/
Apache License 2.0

Supavisor brings down infrastructure after role modification #319

Open kdaniel21 opened 4 months ago

kdaniel21 commented 4 months ago

Bug report

Describe the bug

After modifying a role that was previously used to connect to the database through Supavisor, Supavisor enters an infinite loop, effectively blocking all new connections. Seemingly, every new connection starts another infinite loop, bringing the vast majority of the Supabase infrastructure down. As far as I have observed, there is no path to recovery as long as anything keeps retrying against the blocked parts, because each attempt starts the infinite loop again. If there are no attempts to use any of the blocked roles, the retry appears to shut down after an undefined amount of time.

This problem in Supavisor seems to have already been recognized and fixed. However, I haven't found a way to interact with Supabase's Supavisor instance to use the API mentioned in the PR.

To make matters worse, the problem can occur even if one revokes/alters a role and makes no new connections with that role. This is highly speculative, but it seems that after a certain timeout Supavisor tries to destroy the (idle) connection/pool, detects that the role does not work anymore**, and enters the exact same infinite loop.

*Interestingly, not all roles are blocked from connecting (or enter the infinite loop when doing so). In our case, on top of the role that was modified, postgres and supabase_storage_admin got "blocked", but e.g. PostgREST kept functioning, so possibly anon and authenticated were still working. That assumes PostgREST connects through Supavisor rather than directly, which I doubt.

**I don't have deep enough knowledge of Supavisor to know exactly why this happened, but the following logs led me to think that this is what is happening.

Logs

```
2024-03-07T19:46:48.429Z: DbHandler: Terminating with reason :client_termination when state was :idle
2024-03-07T19:46:48.432Z: DbHandler: Error auth response ["SFATAL", "VFATAL", "C28P01", "Mpassword authentication failed for user \"\"", "Fauth.c", "L326", "Rauth_failed"]
2024-03-07T19:46:48.434Z: DbHandler: Connection closed when state was authentication
```

After this, the infinite retry started (unfortunately, I didn't save the logs for that part). Before this, the last connection using this particular role was at around 19:15, so if Supavisor had a timeout of 30 minutes, that would align rather well with the observation above.
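For readers unfamiliar with the bracketed list in the "Error auth response" line: each string appears to be a raw PostgreSQL ErrorResponse field, where the first character is the field type and the rest is the value. The sketch below decodes them in Python. The function and variable names are made up for illustration and are not part of Supavisor.

```python
# Field-type characters defined by the PostgreSQL frontend/backend protocol
# for ErrorResponse messages (only the ones that appear in the log line).
FIELD_NAMES = {
    "S": "severity",
    "V": "severity_nonlocalized",
    "C": "sqlstate",
    "M": "message",
    "F": "file",
    "L": "line",
    "R": "routine",
}


def decode_error_fields(fields):
    """Split each 'Xvalue' string into a named field; X is the field type."""
    decoded = {}
    for field in fields:
        code, value = field[:1], field[1:]
        # Unknown field types are kept under their raw one-character code.
        decoded[FIELD_NAMES.get(code, code)] = value
    return decoded


# The list exactly as it appears in the log line.
logged = [
    "SFATAL", "VFATAL", "C28P01",
    'Mpassword authentication failed for user ""',
    "Fauth.c", "L326", "Rauth_failed",
]
error = decode_error_fields(logged)
print(error["sqlstate"], "-", error["message"])
```

SQLSTATE 28P01 is PostgreSQL's `invalid_password` code, so the database rejected the credentials that the pooler presented for this connection.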

To Reproduce

  1. Set up a new Supabase project
  2. Create a new role in the DB
  3. Connect using the new role through Supavisor
  4. Drop the role
  5. Try performing any other operation through the "old" connection (using the dropped role)

Expected behavior

Supavisor should not enter an infinite loop, or there should be a way to stop the retry, or tell Supavisor that a particular role changed (as the fix was implemented in Supavisor).
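As a rough illustration of the first option, here is a minimal Python sketch of a bounded retry that gives up immediately on authentication errors. This is not Supavisor's code (Supavisor is written in Elixir); `ConnectError`, `connect_with_retry`, and the set of non-retryable SQLSTATE codes are assumptions made for the example.

```python
import time

# SQLSTATE codes that retrying cannot fix (an assumed list for illustration):
# 28P01 = invalid_password, 28000 = invalid_authorization_specification.
NON_RETRYABLE = {"28P01", "28000"}


class ConnectError(Exception):
    """Hypothetical error carrying the SQLSTATE returned by the database."""

    def __init__(self, sqlstate):
        super().__init__(f"connect failed with SQLSTATE {sqlstate}")
        self.sqlstate = sqlstate


def connect_with_retry(connect, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call connect() until it succeeds.

    Re-raises immediately on authentication errors and after max_attempts
    failures, so the loop is always bounded.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectError as err:
            if err.sqlstate in NON_RETRYABLE or attempt == max_attempts:
                raise
            # Exponential backoff between attempts: 0.1s, 0.2s, 0.4s, ...
            sleep(base_delay * 2 ** (attempt - 1))


def dropped_role_connect():
    # Stand-in for a connection attempt with a role whose password no longer works.
    raise ConnectError("28P01")


try:
    connect_with_retry(dropped_role_connect)
except ConnectError as err:
    print("gave up:", err)
```

With this shape, a dropped or altered role fails once and surfaces the error to the client instead of being retried, while transient failures are still retried a limited number of times.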

saltcod commented 4 months ago

Thanks @kdaniel21! Going to transfer this one to the supavisor repo.

kdaniel21 commented 4 months ago

@saltcod I'm sorry for opening it in the wrong repo, then! My thinking was that the bug has (seemingly) already been addressed/fixed in Supavisor, and there is simply no way to interact with the provided API through the Supabase dashboard. That's why I assumed that the Supabase repo was the most appropriate.

kdaniel21 commented 3 months ago

Hi, are there any known workarounds to at least recover from this state? This has been affecting us ever since, and I'm still concerned that it will eventually reach our production environment, leaving us with no option but to wait for it to recover at some point.

d-e-h-i-o commented 3 months ago

This also happened after this statement:

```
create user mapping for example_role
server app_db
options (user 'example', password 'wrong_password')
```

The 'example' user does exist, but the password is wrong.

This bug is quite concerning to us, since it has the potential to bring down production with no clear path to recovery.