pow-auth / pow

Robust, modular, and extendable user authentication system
https://powauth.com
MIT License
1.58k stars 152 forks source link

Persistence of stale session ids in MnesiaCache #713

Closed gVirtu closed 9 months ago

gVirtu commented 9 months ago

Hello all! :)

I have recently noticed a peculiar scenario in an API using Mnesia-backed Pow in production which I'm interested in reproducing to get to the bottom of it, but I've had no success thus far.

Due to reasons yet unknown some sessions are seemingly persisting in the cache even after their TTL has passed.

We have set up an authentication plug similar to the one in "How to use Pow in an API" tutorial and we are using Pow.Store.Backend.MnesiaCache as our cache_store_backend. Under normal circumstances, creating/renewing/deleting sessions via the plug works just as expected, as well as having them expire automatically. Even though the application is containerized, mnesia data is kept between versions in an attached persistent volume. I'm currently using this snippet to check active sessions for an user:

iex> config = [pow_config: [otp_app: :app], backend: Pow.Store.Backend.MnesiaCache]
iex> {:ok, user} = get_user(...)
iex> sessions = Pow.Store.CredentialsCache.sessions(config, user)
[...] # session_ids

Testing with a newly-created user, this list behaves normally, having a session id inserted after signing in, replaced after renewal, and deleted after signing out or after 30 minutes. However, for some of the earlier users, this list has been steadily growing over the months, even reaching hundreds of session ids. Luckily, these sessions cannot be used for authentication, since Pow.Store.CredentialsCache.get/1 returns :not_found.

Reproducing these steps in CredentialsCache, the first call for Base.get returns the expected {user_key, metadata}, but the second one returns :not_found. Manually assembling the session_key ([struct, :user, id, :session, session_id]) I was also able to verify that the timestamps are also still present in the cache.

Next, I tried to inspect the current state of MnesiaCache as well using the code below:

iex> Process.whereis(Pow.Store.Backend.MnesiaCache) |> :sys.get_state()
%{
  invalidators: %{
    ...
  }
}

And found that all of the stale sessions still had invalidators in the state, and the call to Process.read_timer/1 to each of their References returned false which suggests they should be expired. Then I went to check the deletion logic in MnesiaCache and tried to reproduce its steps as well. Right away it begins by fetching the key so I figured I'd get the same result by calling get/2:

iex> Pow.Store.Backend.MnesiaCache.get(config, hd(sessions))
:not_found

Which according to table_get/2 means that the fetch has returned nil. This seemed off since the same call for Pow.Store.Base.get/3 yielded the {user, metadata} earlier, and it was at this point that I decided to reach out for help. :sweat_smile:

What could be going off here? I suspect it has to do with sessions that persist between deploys, but I couldn't reproduce it consistently yet.

Any help will be much appreciated, thanks!

danschultzer commented 9 months ago

Very peculiar. This will take some effort to figure out, but could very well be a bug!

First, lets verify that when you fetch the key calling Pow.Store.Backend.MnesiaCache.get(config, hd(sessions)), the config also matches what credentials cache would set. You'll need to set a namespace and with the session cache it would be: [namespace: "credentials"]

Next, let's check what format the existing keys have and what's the expire timestamp value. Maybe something have set a much larger expire timestamp. If the config setup above worked it should return them.

If not you should be able to query them directly from Mnesia with :mnesia.dirty_select([{{Pow.Store.Backend.MnesiaCache, :_, :_}, [], [:"$_"]}]). If you also have other caches in Mnesia then this may produce too many results to go through, and could be scoped with: :mnesia.dirty_select([{{Pow.Store.Backend.MnesiaCache, ["credentials", :_], :_}, [], [:"$_"]}])

The Process.read_timer/1 returning false is interesting. It can happen when the timer has expired, when the ref doesn't exist, or if it has been cancelled. I suspect it may have been cancelled. This happens in multiple places, among others when the key is deleted or the key is upserted. But if you see it consistently being false then there may be something different going on.

First we'll check if you can fetch the key and values so we can further debug if there's anything different about the old sessions.

gVirtu commented 9 months ago

Thank you for the reply @danschultzer !

Indeed once I added the missing namespace, the session was returned in one piece. My bad! Could that somehow be misconfigured on the app's end as well? In a way that might be preventing MnesiaCache from being able to fetch its keys from inside delete_or_reschedule/3, which would leave the invalidators and table keys untouched. Here's our Pow config:

config :app, :pow,
  user: App.Auth.User,
  users_context: App.Auth,
  extensions: [PowResetPassword, PowEmailConfirmation, PowPersistentSession, PowInvitation],
  repo: App.Repo,
  cache_store_backend: Pow.Store.Backend.MnesiaCache,
  persistent_session_cookie_opts: [
    secure: true
  ]

Querying the expire timestamps, it does look like there's a significant chunk that's already past their expiration date.

iex> alias Pow.Store.Backend.MnesiaCache
iex> :mnesia.dirty_select(MnesiaCache, [{{MnesiaCache, ["credentials", :_], :_}, [], [:"$_"]}]) |> 
Map.new(fn {_cache, key, {_, expire}} -> {key, DateTime.from_unix(expire, :millisecond) end)                                                            
%{                                                                                                                                                       
  ["credentials", "bb8a0a9b-d5dd-4d6e-8913-f4de38bafbd5"] => {:ok,                                                                                       
   ~U[2023-08-28 21:16:42.873Z]},                                                                                                                        
  ["credentials", "65981219-a750-4001-8361-586ad5f07a5c"] => {:ok,                                                                                       
   ~U[2023-08-10 20:28:33.378Z]},                                                                                                                        
  ["credentials", "db7ffa49-143a-46cb-90e8-c4efba633849"] => {:ok,                                                                                       
   ~U[2023-08-10 19:02:25.777Z]},                                                                                                                        
  ["credentials", "b7709629-17e8-42f8-ad87-8d5ab3137a0e"] => {:ok,                                                                                       
   ~U[2023-07-13 20:15:53.189Z]},                                                                                                                        
  ["credentials", "bebe6449-2c8e-49fb-9da5-146a4212b5cf"] => {:ok,                                                                                       
   ~U[2023-06-13 23:52:52.089Z]},                                                                                                                        
  ["credentials", "dc395bc3-a9b5-4d2d-b78c-9a95ef6beb7e"] => {:ok,                                                                                       
   ~U[2023-07-14 19:53:13.103Z]},                                                                                                                        
  ["credentials", "784fb420-a415-41c6-9c5a-56fd27cd1f7b"] => {:ok,                                                                                       
   ~U[2023-06-27 22:24:41.075Z]},                                                                                                                        
  ["credentials", "78d0b931-6258-4e53-a7e9-6d9ea0501a73"] => {:ok,                                                                                       
   ~U[2023-08-29 20:58:04.250Z]},                                                                                                                        
  ["credentials", "96e426c6-cfcc-44f5-a664-351af9a52c6d"] => {:ok,                                                                                       
   ~U[2023-05-24 22:32:45.565Z]},
...
danschultzer commented 9 months ago

Could that somehow be misconfigured on the app's end as well?

I don't believe so as you can fetch the credentials setting the cache. The only way you can override this namespace is to make a custom cache module.

Back to what could go wrong here. Following the codepath we know that these are added to the invalidators when starting the mnesia cache (as you already wrote):

https://github.com/pow-auth/pow/blob/7c5775818b799d590d441d52510ae6deeaf6f074/lib/pow/store/backend/mnesia_cache.ex#L455-L460

The TTL being 0 is what catches my eye here. We are calling Process.send_after/2 with the TTL, and this will trigger invalidate:

https://github.com/pow-auth/pow/blob/7c5775818b799d590d441d52510ae6deeaf6f074/lib/pow/store/backend/mnesia_cache.ex#L229-L234

https://github.com/pow-auth/pow/blob/7c5775818b799d590d441d52510ae6deeaf6f074/lib/pow/store/backend/mnesia_cache.ex#L236-L238

https://github.com/pow-auth/pow/blob/7c5775818b799d590d441d52510ae6deeaf6f074/lib/pow/store/backend/mnesia_cache.ex#L202-L206

https://github.com/pow-auth/pow/blob/7c5775818b799d590d441d52510ae6deeaf6f074/lib/pow/store/backend/mnesia_cache.ex#L208-L213

From reading the docs where it says that the messages may be received before the message send by Process.send_after/4, makes me wonder if the reverse is happening here. What if the state that is ultimately set by init_invalidators only gets set after the message was sent and handled in the GenServer? It matches up with the ref being false. I'll set up a test to replicate this exact scenario, to see if that's the cause.

danschultzer commented 9 months ago

No, I can't reproduce the error by only setting an expired TTL and restarting:

# test/pow/store/backend/mnesia_cache_test.exs
    test "expired records are purged immediately on startup" do
      :mnesia.dirty_write({MnesiaCache, ["pow:test", "key"], {"value", :os.system_time(:millisecond) - 1}})
      assert MnesiaCache.get(@default_config, "key") == "value"
      restart(@default_config)
      assert_receive {:mnesia_table_event, {:delete, _, _}} # Wait delete to happen
      assert MnesiaCache.get(@default_config, "key") == :not_found
    end

This all works as expected. I'll try a few other things, and see how I can reproduce something where an expired session isn't being removed.

gVirtu commented 9 months ago

I've inspected the codepath locally with a copy of our staging environment and one thing that caught my attention is the config that is being passed around is an empty list. I guess that's to be expected as I'm not passing any extra options in application.ex, though.

With the inspects still in place, I then tried to sign in with an user and lo and behold, this time the config was there: :exploding_head:

[info] POST /api/v1/signin
trigger_ttl("4981f2d6-0452-477a-9416-c7008d860be4", 1800000, [ttl: 1800000, namespace: "credentials"]): #Reference<0.1183618603.1427636226.146442>
trigger_ttl([App.Auth.User, :user, 1], 1800000, [ttl: 1800000, namespace: "credentials"]): #Reference<0.1183618603.1427636226.146450>
trigger_ttl([App.Auth.User, :user, 1, :session, "4981f2d6-0452-477a-9416-c7008d860be4"], 1800000, [ttl: 1800000, namespace: "credentials"]): #Reference<0.1183618603.1427636226.146453>
trigger_ttl("e08589b5-5eb8-48e5-880a-d7812b16538d", 2592000000, [ttl: 2592000000, namespace: "persistent_session"]): #Reference<0.1183618603.1427636226.146456>
[info] Sent 200 in 475ms

The MnesiaCache usage docs do mention namespace as a configuration option, but it looks like it defaults to "cache". I tried to pass it as "credentials" and now it looks like those entries were promptly cleared!

iex> :mnesia.dirty_select(MnesiaCache, [{{MnesiaCache, ["credentials", :_], :_}, [], [:"$_"]}])
[]

There are still invalidators in there, but only those with UUID keys, and they still have mostly active timers:

iex> Process.whereis(Pow.Store.Backend.MnesiaCache) |> :sys.get_state() |> Map.get(:invalidators) |> Map.values() |> Enum.map(& Process.read_timer(&1))
[2488410885, false, 1903208148, false, 2488410111, false, false, 1903853122,
 1903156850, 1903687689, 2493595062, 1903897831, false, 1903561874, false,
 2488371072, 1903554220, 1903442037, 1903928609, 2488409282, 2488411728,
 1903377475, 1903772893, 2493729373, 678361955, false, 1903613699, false, 
 1903366255, 1903647865, 2493596154, false, false, false, 1903357776,
 2484443349, 1903472001, false, 1903993289, 1903989452, 2505517026, 2505697206,  
 1903883038, false, 2493597098]

Here's what I'd logged (before making any modifications to the namespace):

iex.log

Would it be fine to explicitly pass the namespace to MnesiaCache like this or might there be unintended side-effects? Thanks again.

danschultzer commented 9 months ago

Pow.Store.CredentialsCache would force the namespace. It's to prevent conflicts with other caches in pow, like the persistent session cache, etc, so there should be no need to set namespace anywhere. At glance I don't see anything off from any of the data, but I'll be going through more throughly soon, and try figure out the code path. Work came in the way.

We know that expired sessions gets picked up when mnesiacache inits, but for some reason the trigger to actually expires it either gets cancelled, or never runs in the first place. Alternatively something rewrite the session though I have no idea why. It smells of some race condition.

Are the expired sessions consistent (it's the same key)? Do you use Mnesia clustering or is this just a single node?

gVirtu commented 9 months ago

Pow.Store.CredentialsCache would force the namespace. It's to prevent conflicts with other caches in pow, like the persistent session cache, etc, so there should be no need to set namespace anywhere.

I see, that makes perfect sense.

I've worked on a more contrived example to reproduce this from an empty mnesia state. I got to the problematic state after doing these steps: (for these, using a short TTL will be convenient)

1. Spin up the phoenix server #### Console output: ```elixir init([]): {:ok, %{invalidators: %{}}} ```
2. Authenticate an user Via the custom authorization plug from the API tutorial, this calls `put/3` on both CredentialsCache and PersistentSessionCache in `Conn.register_before_send/1` #### Console output: ```elixir trigger_ttl("7c56f349-a262-49c6-a3d7-013bed0d1691", 1800000, [ttl: 1800000, namespace: "credentials"]): #Reference<0.176822019.4286054407.197793> trigger_ttl([App.Auth.User, :user, 1], 1800000, [ttl: 1800000, namespace: "credentials"]): #Reference<0.176822019.4286054404.201834> trigger_ttl([App.Auth.User, :user, 1, :session, "7c56f349-a262-49c6-a3d7-013bed0d1691"], 1800000, [ttl: 1800000, namespace: "credentials"]): #Reference<0.176822019.4286054404.201837> trigger_ttl("887a5604-6f68-4e66-9f2d-054604175e29", 2592000000, [ttl: 2592000000, namespace: "persistent_session"]): #Reference<0.176822019.4286054404.201840> ```
3. Stop the server, wait until after the TTL passes, then start it up again #### Console output: ```elixir trigger_ttl([App.Auth.User, :user, 1], 0, []): #Reference<0.3574711379.2956460033.218191> append_invalidator(["credentials", App.Auth.User, :user, 1], %{}, 0, []): %{[App.Auth.User, :user, 1] => #Reference<0.3574711379.2956460033.218191>} trigger_ttl([App.Auth.User, :user, 1, :session, "7c56f349-a262-49c6-a3d7-013bed0d1691"], 0, []): #Reference<0.3574711379.2956460033.218205> append_invalidator(["credentials", App.Auth.User, :user, 1, :session, "7c56f349-a262-49c6-a3d7-013bed0d1691"], %{[App.Auth.User, :user, 1] => #Reference<0.3574711379.2956460033.218191>}, 0, []): %{ [App.Auth.User, :user, 1] => #Reference<0.3574711379.2956460033.218191>, [App.Auth.User, :user, 1, :session, "7c56f349-a262-49c6-a3d7-013bed0d1691"] => #Reference<0.3574711379.2956460033.218205> } trigger_ttl("7c56f349-a262-49c6-a3d7-013bed0d1691", 0, []): #Reference<0.3574711379.2956460033.218211> append_invalidator(["credentials", "7c56f349-a262-49c6-a3d7-013bed0d1691"], %{[App.Auth.User, :user, 1] => #Reference<0.3574711379.2956460033.218191>, [App.Auth.User, :user, 1, :session, "7c56f349-a262-49c6-a3d7-013bed0d1691"] => #Reference<0.3574711379.2956460033.218205>}, 0, []): %{ [App.Auth.User, :user, 1] => #Reference<0.3574711379.2956460033.218191>, [App.Auth.User, :user, 1, :session, "7c56f349-a262-49c6-a3d7-013bed0d1691"] => #Reference<0.3574711379.2956460033.218205>, "7c56f349-a262-49c6-a3d7-013bed0d1691" => #Reference<0.3574711379.2956460033.218211> } trigger_ttl("887a5604-6f68-4e66-9f2d-054604175e29", 2587044572, []): #Reference<0.3574711379.2956460033.218217> append_invalidator(["persistent_session", "887a5604-6f68-4e66-9f2d-054604175e29"], %{[App.Auth.User, :user, 1] => #Reference<0.3574711379.2956460033.218191>, [App.Auth.User, :user, 1, :session, "7c56f349-a262-49c6-a3d7-013bed0d1691"] => #Reference<0.3574711379.2956460033.218205>, "7c56f349-a262-49c6-a3d7-013bed0d1691" => #Reference<0.3574711379.2956460033.218211>}, 2587044572, []): %{ [App.Auth.User, :user, 1] => #Reference<0.3574711379.2956460033.218191>, [App.Auth.User, :user, 1, :session, "7c56f349-a262-49c6-a3d7-013bed0d1691"] => #Reference<0.3574711379.2956460033.218205>, "7c56f349-a262-49c6-a3d7-013bed0d1691" => #Reference<0.3574711379.2956460033.218211>, "887a5604-6f68-4e66-9f2d-054604175e29" => #Reference<0.3574711379.2956460033.218217> } init([]): {:ok, %{ invalidators: %{ [App.Auth.User, :user, 1] => #Reference<0.3574711379.2956460033.218191>, [App.Auth.User, :user, 1, :session, "7c56f349-a262-49c6-a3d7-013bed0d1691"] => #Reference<0.3574711379.2956460033.218205>, "7c56f349-a262-49c6-a3d7-013bed0d1691" => #Reference<0.3574711379.2956460033.218211>, "887a5604-6f68-4e66-9f2d-054604175e29" => #Reference<0.3574711379.2956460033.218217> } }} fetch([], [App.Auth.User, :user, 1]): nil handle_info({:invalidate, [], [App.Auth.User, :user, 1]}, ...): {:noreply, %{ invalidators: %{ [App.Auth.User, :user, 1] => #Reference<0.3574711379.2956460033.218191>, [App.Auth.User, :user, 1, :session, "7c56f349-a262-49c6-a3d7-013bed0d1691"] => #Reference<0.3574711379.2956460033.218205>, "7c56f349-a262-49c6-a3d7-013bed0d1691" => #Reference<0.3574711379.2956460033.218211>, "887a5604-6f68-4e66-9f2d-054604175e29" => #Reference<0.3574711379.2956460033.218217> } }} fetch([], [App.Auth.User, :user, 1, :session, "7c56f349-a262-49c6-a3d7-013bed0d1691"]): nil handle_info({:invalidate, [], [App.Auth.User, :user, 1, :session, "7c56f349-a262-49c6-a3d7-013bed0d1691"]}, ...): {:noreply, %{ invalidators: %{ [App.Auth.User, :user, 1] => #Reference<0.3574711379.2956460033.218191>, [App.Auth.User, :user, 1, :session, "7c56f349-a262-49c6-a3d7-013bed0d1691"] => #Reference<0.3574711379.2956460033.218205>, "7c56f349-a262-49c6-a3d7-013bed0d1691" => #Reference<0.3574711379.2956460033.218211>, "887a5604-6f68-4e66-9f2d-054604175e29" => #Reference<0.3574711379.2956460033.218217> } }} fetch([], "7c56f349-a262-49c6-a3d7-013bed0d1691"): nil handle_info({:invalidate, [], "7c56f349-a262-49c6-a3d7-013bed0d1691"}, ...): {:noreply, %{ invalidators: %{ [App.Auth.User, :user, 1] => #Reference<0.3574711379.2956460033.218191>, [App.Auth.User, :user, 1, :session, "7c56f349-a262-49c6-a3d7-013bed0d1691"] => #Reference<0.3574711379.2956460033.218205>, "7c56f349-a262-49c6-a3d7-013bed0d1691" => #Reference<0.3574711379.2956460033.218211>, "887a5604-6f68-4e66-9f2d-054604175e29" => #Reference<0.3574711379.2956460033.218217> } }} ```

The expired sessions and invalidators remained untouched, with the same key as before, locally on a single node.

It does look like some sort of race condition, but by looking at the code/logs I can't tell for sure what might be causing it. I want to build a minimal repo to reproduce this, but I might not be able to get to it for the next few days...

danschultzer commented 9 months ago

I see, reading this I strongly suspect it has something to do with this part:

https://github.com/pow-auth/pow/blob/7c5775818b799d590d441d52510ae6deeaf6f074/lib/pow/store/backend/mnesia_cache.ex#L209-L212

If for some reason fetching for the key returns nil then the invalidators remain be untouched (which I think in the first place is incorrect, it should still clear the invalidator ref). I'm unsure how that could happen without the key actually having been deleted, and I haven't been able to reproduce this in tests. I'll try set up a full demo app in local to see if I can reproduce it that way (since you are able to reproduce it consistently it should be trivial).

danschultzer commented 9 months ago

I've found a way to reproduce it, and indeed it seems to be a race condition. I'll work on a solution.

danschultzer commented 9 months ago

I found the issue, it was not a race condition, but incorrect handling of the keys. I have publish v1.0.34 that resolves this, please let me know if it works correctly now, and all the expired keys are being wiped.

This is a more serious issue than at first glance, so thank you so much for bringing this up! A CVE will be published shortly.