Permissions issues with mnesia backend for Pow

joshsmith commented 1 year ago

I'm running into a problem that is described in more detail in another thread, but I'll try to share as much as I can here as well.

I'm using fly.io and have the following fly.toml which mounts a volume for mnesia:

[mounts]
destination = "/mnesia"
source = "it_mnesia"

And in my config.exs:

config :mnesia, :dir, '/mnesia'

Unfortunately, everything crashes when the Pow backend starts with mnesia:

2023-03-01T16:25:06Z   [info]16:25:06.511 [error] Couldn't initialize mnesia cluster because: {:change_table_copy_type, {:aborted, {:badarg, :schema, :unknown}}}
2023-03-01T16:25:06Z   [info]16:25:06.512 [notice] Application it exited: It.Application.start(:normal, []) returned an error: shutdown: failed to start child: Pow.Store.Backend.MnesiaCache
2023-03-01T16:25:06Z   [info]    ** (EXIT) {:change_table_copy_type, {:aborted, {:badarg, :schema, :unknown}}}
2023-03-01T16:25:08Z   [info]{"Kernel pid terminated",application_controller,"{application_start_failure,it,{{shutdown,{failed_to_start_child,'Elixir.Pow.Store.Backend.MnesiaCache',{change_table_copy_type,{aborted,{badarg,schema,unknown}}}}},{'Elixir.It.Application',start,[normal,[]]}}}"}
2023-03-01T16:25:08Z   [info]Kernel pid terminated (application_controller) ({application_start_failure,it,{{shutdown,{failed_to_start_child,'Elixir.Pow.Store.Backend.MnesiaCache',{change_table_copy_type,{aborted,{badarg,schema,unknown}}}}},{'Elixir.It.Application',start,[normal,[]]}}})

In the thread they seemed to indicate that this may be due to permissions issues from the Elixir process. I thought I had confirmed this by changing the permissions myself, but now I can't remember what I did and the issue has come up again.

I would really love to get Pow working but am just banging my head against the wall with this issue. Any help would be appreciated. Thanks!

joshsmith commented 1 year ago

I'm no longer convinced that this is specifically a permissions issue.

I just tested creating an entirely new shared volume for mnesia and redeploying, which worked fine (tested by successfully creating and authenticating a user).

I then restarted my app altogether and got the same error as above. I'm not sure what this means, but hopefully it helps someone who has more domain knowledge on this.

danschultzer commented 1 year ago

When the instance starts up the second time, it will load from the mnesia disk copy. Something prevents it from fetching the table data. It could be file permissions on the mnesia dir, or it could be the schema settings in the mnesia disk copy. Do you have any mnesia settings in the pow config?
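The file-permission possibility can be checked from a shell on the instance. The following is a minimal sketch, assuming the /mnesia mount point from the original post; check_mnesia_dir is a hypothetical helper name, not part of Pow or fly.io.

```shell
#!/bin/sh
# Hypothetical helper (not part of Pow or fly.io): report whether the
# current user can read, write, and enter DIR.
check_mnesia_dir() {
  dir="$1"
  if [ ! -d "$dir" ]; then
    echo "missing"
  elif [ ! -r "$dir" ] || [ ! -w "$dir" ] || [ ! -x "$dir" ]; then
    echo "not-accessible"
  else
    echo "ok"
  fi
}

# Example, using the mount point from this thread (an assumption):
#   check_mnesia_dir /mnesia
#   ls -la /mnesia
```

If this prints "ok" for the user that runs the app and the crash still happens, permissions on the directory itself are probably not the cause.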

danschultzer commented 1 year ago

I've forked and updated the perm_error repo to make a more realistic test with writes to mnesia. I haven't experienced any issues with Pow MnesiaCache.

First run

2023-03-02T02:13:56.343 app[401b3b1a] dfw [info] Pow.Store.Backend.MnesiaCache.all(config, :_) #=> []
2023-03-02T02:13:56.344 app[401b3b1a] dfw [info] [perm_error.exs:4: (file)]
2023-03-02T02:13:56.344 app[401b3b1a] dfw [info] Pow.Store.Backend.MnesiaCache.put(config, {:key1, 1}) #=> :ok
2023-03-02T02:13:56.344 app[401b3b1a] dfw [info] [perm_error.exs:5: (file)]
2023-03-02T02:13:56.344 app[401b3b1a] dfw [info] Pow.Store.Backend.MnesiaCache.put(config, {:key2, 2}) #=> :ok
2023-03-02T02:13:56.344 app[401b3b1a] dfw [info] [perm_error.exs:6: (file)]
2023-03-02T02:13:56.344 app[401b3b1a] dfw [info] Pow.Store.Backend.MnesiaCache.all(config, :_) #=> [key1: 1, key2: 2]

Subsequent runs

2023-03-02T02:28:09.391 app[dbde1216] dfw [info] Pow.Store.Backend.MnesiaCache.all(config, :_) #=> [key1: 1, key2: 2]
2023-03-02T02:28:09.391 app[dbde1216] dfw [info] [perm_error.exs:5: (file)]
2023-03-02T02:28:09.391 app[dbde1216] dfw [info] Pow.Store.Backend.MnesiaCache.put(config, {:key1, 1}) #=> :ok
2023-03-02T02:28:09.391 app[dbde1216] dfw [info] [perm_error.exs:6: (file)]
2023-03-02T02:28:09.391 app[dbde1216] dfw [info] Pow.Store.Backend.MnesiaCache.put(config, {:key2, 2}) #=> :ok
2023-03-02T02:28:09.391 app[dbde1216] dfw [info] [perm_error.exs:7: (file)]
2023-03-02T02:28:09.391 app[dbde1216] dfw [info] Pow.Store.Backend.MnesiaCache.all(config, :_) #=> [key1: 1, key2: 2]

joshsmith commented 1 year ago

Thanks for putting that together. I forked and made some minor changes just to bring it more in line with my config (e.g. placing :mnesia in extra_applications), but it works exactly the same as your tests do.

I don't really believe that I have anything special in terms of configuration here. My app is on GitHub here.

joshsmith commented 1 year ago

Is there anything else I can do to help diagnose what’s happening here?

danschultzer commented 1 year ago

I was wondering if something goes wrong with rolling deployment here. Having multiple instances accessing the same mnesia dir would corrupt the files. From reading this, that seems not to be the case: https://community.fly.io/t/how-to-have-a-zero-downtime-deploy-rails/9725/2

The old instance is shut down completely, before spinning up the new one. Corrupted files would probably have thrown a different error as well.

Another thing I looked at was the node name. My thinking is that maybe the node name is being changed, so that the schema config is set for a different node:

:mnesia.change_table_copy_type(:schema, node(), copy_type)

I could reproduce the error in the perm_error repo by spinning it up twice, first without the node name flag, second time with the node name flag:

VOL_DIR=./ elixir --sname foo -S mix run perm_error.exs

** (Mix) Could not start application perm_error: PermError.Application.start(:normal, []) returned an error: shutdown: failed to start child: Pow.Store.Backend.MnesiaCache
    ** (EXIT) {:change_table_copy_type, {:aborted, {:badarg, :schema, :unknown}}}

I couldn't find anything in the it GitHub repo that sets the node name, but this seems to be the most likely cause: $RELEASE_NODE or $RELEASE_NAME is being changed for each deployment. Is there anything that sets either of these variables (or a script starting up with the --name or --sname flag)?

joshsmith commented 1 year ago

As far as I know, I’m not doing anything to set the node name. I can at least take this to Fly tomorrow morning and see if they have any ideas as to whether and why that would be happening.

danschultzer commented 1 year ago

I found the issue. By default the node just runs with a short name in the default startup shell script generated by mix release (the one you find in _build/MIX_ENV/rel/NAME/bin/NAME):

RELEASE_NAME="${RELEASE_NAME:-"mnesiatest"}"
export RELEASE_NAME
# ...
RELEASE_NODE="${RELEASE_NODE:-"$RELEASE_NAME"}"
export RELEASE_NODE
#...
RELEASE_DISTRIBUTION="${RELEASE_DISTRIBUTION:-"sname"}"
export RELEASE_DISTRIBUTION

This means that the host part of the node name is automatically set to whatever the hostname is on the instance:

2023-03-09T04:40:25.458 app[4b7b24db] dfw [info] 04:40:25.454 [info] Mnesia cluster initiated on :mnesiatest@4b7b24db

And the hostname changes with every new container on fly.io! So you end up with schema config for a host that no longer exists (notice that the hostname is now b19967fa):

2023-03-09T04:41:54.024 app[b19967fa] dfw [info] 04:41:54.021 [error] Couldn't initialize mnesia cluster because: {:change_table_copy_type, {:aborted, {:badarg, :schema, :unknown}}}
2023-03-09T04:41:54.030 app[b19967fa] dfw [info] 04:41:54.024 [notice] Application perm_error exited: PermError.Application.start(:normal, []) returned an error: shutdown: failed to start child: Pow.Store.Backend.MnesiaCache
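The effect can be approximated in plain shell. This is a rough sketch rather than the actual release script: effective_node is a hypothetical helper that mimics how a node name without an explicit host picks up the machine's hostname, using the two hostnames from the logs above.

```shell
# Hypothetical helper: approximate the node name a release ends up with.
#   $1 = value of RELEASE_NODE (with or without "@host")
#   $2 = hostname of the machine it boots on
effective_node() {
  case "$1" in
    *@*) printf '%s\n' "$1" ;;          # host part pinned explicitly
    *)   printf '%s@%s\n' "$1" "$2" ;;  # host part taken from the machine
  esac
}

effective_node mnesiatest 4b7b24db            # prints mnesiatest@4b7b24db
effective_node mnesiatest b19967fa            # prints mnesiatest@b19967fa
effective_node mnesiatest@localhost b19967fa  # prints mnesiatest@localhost
```

The first two names differ, so the schema written on disk under the first name does not match the node that boots second. Only the third form is independent of the machine's hostname.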

The PoC is just to run the script this way:

RELEASE_NAME="${RELEASE_NAME:-"mnesiatest"}"
RELEASE_NODE="${RELEASE_NODE:-"$RELEASE_NAME"}"
RELEASE_DISTRIBUTION="${RELEASE_DISTRIBUTION:-"sname"}"

VOL_DIR="/data" elixir --$RELEASE_DISTRIBUTION "$RELEASE_NODE" -S mix run perm_error.exs

Now to fix this you need to set both RELEASE_DISTRIBUTION and RELEASE_NODE env vars:

RELEASE_NODE=mnesiatest@localhost
RELEASE_DISTRIBUTION=name

I was able to restart multiple times with no issue:

2023-03-09T04:54:36.385 app[514dcd8f] dfw [info] 04:54:36.381 [info] Mnesia cluster initiated on :mnesiatest@localhost

2023-03-09T04:56:03.627 app[1b619b4e] dfw [info] 04:56:03.623 [info] Mnesia cluster initiated on :mnesiatest@localhost

2023-03-09T04:58:13.468 app[05e570a2] dfw [info] 04:58:13.464 [info] Mnesia cluster initiated on :mnesiatest@localhost

The key here is to pin the node name, so it stays the same on each restart. The huge caveat is that you should only ever run one instance at a time. If you move to a distributed setup, you need to find a way to keep a volume for each instance and keep the node names unique across the cluster.

Let me know if it works for you!

renews commented 1 year ago

@joshsmith the config below worked for me

In fly.toml

[mounts]
  source="mnesia"
  destination="mnesia"

In config/prod.exs

config :app_name, :pow, cache_store_backend: Pow.Store.Backend.MnesiaCache
config :mnesia, :dir, '/mnesia'

joshsmith commented 1 year ago

@danschultzer thanks for all that investigative work!

I'm still struggling to understand, though, how I can make this work with mix release the way Fly seems to recommend running this. In other words, there is a builder image and then a runner image, and the app is ultimately run using the release CMD ["/app/bin/server"].

Is there a way to pin the node name when running the app this way, vs running your init.sh with mix run, or does pinning the node name necessitate running mix run directly?

joshsmith commented 1 year ago

@renews thanks, though I think you'll find that a restart of your node on Fly will reproduce the same results we've discussed above, since the node name will change as you can see in the results of Dan looking into this here.

danschultzer commented 1 year ago

> I'm still struggling to understand, though, how I can make this work with mix release the way Fly seems to recommend running this. In other words, there is a builder image and then a runner image, and the app is ultimately run using the release CMD ["/app/bin/server"].

How did you set that up? I couldn't find the fly.io doc detailing that. The were-it repo had an init.sh script. And this fly.io doc also describes how to name the node: https://fly.io/docs/elixir/the-basics/naming-your-elixir-node/

You would just pin the host there. Again, this only works with a single node. If you deal with a cluster, you would have to find some way to either pin the name per data volume, use distribution and copy data from the cluster before terminating old nodes, or find a way to migrate the data to the new node name (not sure if that's possible).
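One possible reading of "pin the name per data volume" is to store the node name in a file on the volume itself, so that whichever machine mounts the volume boots under the name the data was created with. The sketch below is untested on fly.io. pinned_node_name and the file location are hypothetical, and the sketch does not address whether the stored host stays reachable from other nodes.

```shell
# Hypothetical helper: keep the node name next to the data it belongs to.
#   $1 = file on the data volume that stores the name
#   $2 = candidate name to use if nothing is stored yet
pinned_node_name() {
  file="$1"
  candidate="$2"
  # First boot: nothing stored yet, so remember the candidate name.
  if [ ! -s "$file" ]; then
    printf '%s\n' "$candidate" > "$file"
  fi
  # Every boot: use whatever name is stored with the data.
  cat "$file"
}

# Possible use in a startup script (path and name are assumptions):
#   RELEASE_NODE="$(pinned_node_name /mnesia/node_name "myapp@$(hostname)")"
#   export RELEASE_NODE
```

The candidate only matters on the first boot. Later boots ignore it and reuse the stored name, which is what keeps the name stable across redeploys.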

joshsmith commented 1 year ago

I had just added the init.sh script yesterday in an attempt to get this working. The original line in the Dockerfile was something like CMD ["app/bin/server"].

From the docs you sent, is there even anything else I’d need to do? If the node is named, then the problem should be solved, right? I may be misunderstanding your fix.

danschultzer commented 1 year ago

> If the node is named, then the problem should be solved, right?

Yup that's all you need. It's the host part of the node name that kept changing because it was not set.

The RELEASE_NODE and RELEASE_DISTRIBUTION env vars should be set somewhere before running app/bin/server. Mix release generates an env.sh that you could add them to, e.g. using mix release.init, as fly.io suggests, to generate the env.sh template. You could also set them in the Dockerfile, but a separate startup script is cleaner.
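As a minimal sketch, the env.sh template created by mix release.init (rel/env.sh.eex) could carry the two variables like this. The node name mnesiatest@localhost is just the example value used earlier in this thread, so substitute your own.

```shell
#!/bin/sh
# rel/env.sh.eex (sketch): sourced by the release start script on boot.
# "mnesiatest@localhost" is the example value from this thread, not a
# required name. The point is that the host part is fixed.
export RELEASE_DISTRIBUTION="name"
export RELEASE_NODE="mnesiatest@localhost"
```

With both values fixed here, the node name no longer depends on the hostname of the container. This only suits a single running instance.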

DohanKim commented 2 months ago

FYI, fly.io now automatically generates an env.sh.eex file containing env variables.

#!/bin/sh

# configure node for distributed erlang with IPV6 support
export ERL_AFLAGS="-proto_dist inet6_tcp"
export ECTO_IPV6="true"
export DNS_CLUSTER_QUERY="${FLY_APP_NAME}.internal"
export RELEASE_DISTRIBUTION="name"
export RELEASE_NODE="${FLY_APP_NAME}-${FLY_IMAGE_REF##*-}@${FLY_PRIVATE_IP}"

You should change RELEASE_NODE so that it stays the same across deploys for mnesia clustering, e.g. export RELEASE_NODE="${FLY_APP_NAME}@${FLY_PRIVATE_IP}"
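To see why the generated name is not stable, it helps to look at what ${FLY_IMAGE_REF##*-} expands to. The values below are made up for illustration, since real ones are provided by fly.io at runtime. The sketch assumes, as this comment implies, that the image ref ends in a suffix that changes per deployment.

```shell
# Hypothetical values, for illustration only.
FLY_APP_NAME="myapp"
FLY_PRIVATE_IP="fdaa:0:1:2::3"
FLY_IMAGE_REF="registry.fly.io/myapp:deployment-01HXYZ"

# ${VAR##*-} strips the longest prefix ending in "-", which keeps only
# the last dash-separated segment of the image ref.
generated="${FLY_APP_NAME}-${FLY_IMAGE_REF##*-}@${FLY_PRIVATE_IP}"
pinned="${FLY_APP_NAME}@${FLY_PRIVATE_IP}"

echo "$generated"  # prints myapp-01HXYZ@fdaa:0:1:2::3
echo "$pinned"     # prints myapp@fdaa:0:1:2::3
```

The generated form changes whenever the image ref suffix changes. The pinned form depends only on the app name and the private IP.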

nduitz commented 2 months ago

Can confirm this works with a single node setup. Anyone got this working with more than one machine on fly? Since both machines need their own volume, the mnesia databases are out of sync and if I log in to the application and reload the page and happen to be redirected to the second node I lose my session.

pow-auth / pow

Permissions issues with mnesia backend for Pow #690
