Closed joshsmith closed 9 months ago
I'm no longer convinced that this is specifically a permissions issue.
I just tested creating an entirely new shared volume for mnesia and redeploying, which worked fine (tested by successfully creating and authenticating a user).
I then restarted my app altogether and got the same error as above. I'm not sure what this means, but hopefully it helps someone with more domain knowledge.
When the instance starts up the second time, it loads from the mnesia disk copy. Something prevents it from fetching the table data: it could be file permissions on the mnesia dir, or it could be the schema settings in the mnesia disk copy. Do you have any mnesia settings in the pow config?
I've forked and updated the perm_error repo to make a more realistic test with writes to mnesia. I haven't experienced any issues with Pow MnesiaCache.
2023-03-02T02:13:56.343 app[401b3b1a] dfw [info] Pow.Store.Backend.MnesiaCache.all(config, :_) #=> []
2023-03-02T02:13:56.344 app[401b3b1a] dfw [info] [perm_error.exs:4: (file)]
2023-03-02T02:13:56.344 app[401b3b1a] dfw [info] Pow.Store.Backend.MnesiaCache.put(config, {:key1, 1}) #=> :ok
2023-03-02T02:13:56.344 app[401b3b1a] dfw [info] [perm_error.exs:5: (file)]
2023-03-02T02:13:56.344 app[401b3b1a] dfw [info] Pow.Store.Backend.MnesiaCache.put(config, {:key2, 2}) #=> :ok
2023-03-02T02:13:56.344 app[401b3b1a] dfw [info] [perm_error.exs:6: (file)]
2023-03-02T02:13:56.344 app[401b3b1a] dfw [info] Pow.Store.Backend.MnesiaCache.all(config, :_) #=> [key1: 1, key2: 2]
2023-03-02T02:28:09.391 app[dbde1216] dfw [info] Pow.Store.Backend.MnesiaCache.all(config, :_) #=> [key1: 1, key2: 2]
2023-03-02T02:28:09.391 app[dbde1216] dfw [info] [perm_error.exs:5: (file)]
2023-03-02T02:28:09.391 app[dbde1216] dfw [info] Pow.Store.Backend.MnesiaCache.put(config, {:key1, 1}) #=> :ok
2023-03-02T02:28:09.391 app[dbde1216] dfw [info] [perm_error.exs:6: (file)]
2023-03-02T02:28:09.391 app[dbde1216] dfw [info] Pow.Store.Backend.MnesiaCache.put(config, {:key2, 2}) #=> :ok
2023-03-02T02:28:09.391 app[dbde1216] dfw [info] [perm_error.exs:7: (file)]
2023-03-02T02:28:09.391 app[dbde1216] dfw [info] Pow.Store.Backend.MnesiaCache.all(config, :_) #=> [key1: 1, key2: 2]
Thanks for putting that together. I forked and made some minor changes just to bring it more in line with my config (e.g. placing :mnesia in extra_applications), but it works exactly the same as your tests do.
I don't really believe that I have anything special in terms of configuration here. My app is on GitHub here.
Is there anything else I can do to help diagnose what's happening here?
I was wondering if something goes wrong with rolling deployment here. Having multiple instances accessing the same mnesia dir would corrupt the files. From reading this, that seems not to be the case: https://community.fly.io/t/how-to-have-a-zero-downtime-deploy-rails/9725/2
The old instance is shut down completely, before spinning up the new one. Corrupted files would probably have thrown a different error as well.
Another thing I looked at was the node name. My thinking is that maybe the node name is changing, so that the schema config ends up being set for a different node:
:mnesia.change_table_copy_type(:schema, node(), copy_type)
I could reproduce the error in the perm_error repo by spinning it up twice, first without the node name flag, then with the node name flag:
VOL_DIR=./ elixir --sname foo -S mix run perm_error.exs
** (Mix) Could not start application perm_error: PermError.Application.start(:normal, []) returned an error: shutdown: failed to start child: Pow.Store.Backend.MnesiaCache
** (EXIT) {:change_table_copy_type, {:aborted, {:badarg, :schema, :unknown}}}
I couldn't find anything in the it github repo that sets the node name, but this seems to be the most likely cause. $RELEASE_NODE or $RELEASE_NAME is being changed for each deployment. Is there anything that sets either of these variables (or a script starting up with the --name or --sname flag)?
As far as I know, I’m not doing anything to set the node name. I can at least take this to Fly tomorrow morning and see if they have any ideas as to whether and why that would be happening.
I found the issue. By default the node just runs with a shortname in the default startup bash script generated by mix release (the one you find in _build/rel/NAME/bin/NAME):
RELEASE_NAME="${RELEASE_NAME:-"mnesiatest"}"
export RELEASE_NAME
# ...
RELEASE_NODE="${RELEASE_NODE:-"$RELEASE_NAME"}"
export RELEASE_NODE
#...
RELEASE_DISTRIBUTION="${RELEASE_DISTRIBUTION:-"sname"}"
export RELEASE_DISTRIBUTION
This means that the host part of the node name is automatically set to whatever the hostname is on the instance:
2023-03-09T04:40:25.458 app[4b7b24db] dfw [info] 04:40:25.454 [info] Mnesia cluster initiated on :mnesiatest@4b7b24db
And the hostname is ever changing with the containers on fly.io! So you end up having schema config for a host that no longer exists (notice that the hostname is now b19967fa):
2023-03-09T04:41:54.024 app[b19967fa] dfw [info] 04:41:54.021 [error] Couldn't initialize mnesia cluster because: {:change_table_copy_type, {:aborted, {:badarg, :schema, :unknown}}}
2023-03-09T04:41:54.030 app[b19967fa] dfw [info] 04:41:54.024 [notice] Application perm_error exited: PermError.Application.start(:normal, []) returned an error: shutdown: failed to start child: Pow.Store.Backend.MnesiaCache
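To make the failure mode concrete, here is a minimal sketch of how the node name resolves when RELEASE_NODE carries no host part. The effective_node helper is hypothetical and only models the behaviour described above (the machine's hostname gets appended); it is not code from mix release or Erlang.

```shell
#!/bin/sh
# Hypothetical helper: models how a node name without "@host" ends up
# with the machine's hostname appended, as described in this thread.
effective_node() {
  release_node="$1"   # value of RELEASE_NODE
  host="$2"           # hostname of the machine the release boots on
  case "$release_node" in
    *@*) printf '%s\n' "$release_node" ;;          # host already pinned
    *)   printf '%s@%s\n' "$release_node" "$host" ;; # host comes from the machine
  esac
}

# Two boots on two different Fly machines (hostnames taken from the logs above):
effective_node "mnesiatest" "4b7b24db"
effective_node "mnesiatest" "b19967fa"
# A pinned name is unaffected by the machine's hostname:
effective_node "mnesiatest@localhost" "b19967fa"
```

The first two calls yield different node names for the same volume, which is the mismatch mnesia reports; the third stays the same regardless of the host.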
The PoC is just to run the script this way:
RELEASE_NAME="${RELEASE_NAME:-"mnesiatest"}"
RELEASE_NODE="${RELEASE_NODE:-"$RELEASE_NAME"}"
RELEASE_DISTRIBUTION="${RELEASE_DISTRIBUTION:-"sname"}"
VOL_DIR="/data" elixir --$RELEASE_DISTRIBUTION "$RELEASE_NODE" -S mix run perm_error.exs
Now to fix this you need to set both the RELEASE_DISTRIBUTION and RELEASE_NODE env vars:
RELEASE_NODE=mnesiatest@localhost
RELEASE_DISTRIBUTION=name
I was able to restart multiple times with no issue:
2023-03-09T04:54:36.385 app[514dcd8f] dfw [info] 04:54:36.381 [info] Mnesia cluster initiated on :mnesiatest@localhost
2023-03-09T04:56:03.627 app[1b619b4e] dfw [info] 04:56:03.623 [info] Mnesia cluster initiated on :mnesiatest@localhost
2023-03-09T04:58:13.468 app[05e570a2] dfw [info] 04:58:13.464 [info] Mnesia cluster initiated on :mnesiatest@localhost
The key here is to pin the node name, so it stays the same on each restart. The huge caveat is that you should only ever run one instance at a time. If you go into a distributed setup, you need to find a way to keep a volume for each instance and keep the node names unique across the cluster.
Let me know if it works for you!
@joshsmith this below worked for me
In fly.toml:
[mounts]
source="mnesia"
destination="mnesia"
In config/prod.exs:
config :app_name, :pow, cache_store_backend: Pow.Store.Backend.MnesiaCache
config :mnesia, :dir, '/mnesia'
@danschultzer thanks for all that investigative work!
I'm still struggling to understand, though, how I can make this work with mix release the way Fly seems to recommend running this. In other words, there is a builder image and then a runner image, and the app is ultimately run using the release CMD ["/app/bin/server"].
Is there a way to pin the node name when running the app this way, vs running your init.sh with mix run, or does pinning the node name necessitate running mix run directly?
@renews thanks, though I think you'll find that a restart of your node on Fly will reproduce the same results we've discussed above, since the node name will change as you can see in the results of Dan looking into this here.
I'm still struggling to understand, though, how I can make this work with mix release the way Fly seems to recommend running this. In other words, there is a builder image and then a runner image, and the app is ultimately run using the release CMD ["/app/bin/server"].
How did you set that up? I couldn't find the fly.io doc detailing that. The were-it repo had an init.sh script. And this fly.io doc also describes how to name the node: https://fly.io/docs/elixir/the-basics/naming-your-elixir-node/
You would just pin the host there. Again, this only works with a single node. If you deal with a cluster, you would have to either pin the name per data volume, use distribution and copy data from the cluster before terminating old nodes, or find a way to migrate the data to the new node name (not sure if that's possible).
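One way to read "pin the name per data volume" is to store the node name on the volume the first time it boots and reuse it afterwards. The following is only a sketch under that assumption: the pinned_node_name helper and the node_name file are hypothetical, and it does not address whether other nodes in a cluster can actually reach the pinned name.

```shell
#!/bin/sh
# Hypothetical helper: remember the node name on the volume itself so the
# mnesia schema and the node name always travel together.
pinned_node_name() {
  vol_dir="$1"     # mount point of the mnesia volume
  candidate="$2"   # name to use if the volume has never been booted
  name_file="$vol_dir/node_name"   # hypothetical file name
  if [ ! -s "$name_file" ]; then
    # First boot on this volume: record the candidate name.
    printf '%s\n' "$candidate" > "$name_file"
  fi
  # Every boot: use whatever name was recorded on the volume.
  cat "$name_file"
}

# Possible use in a startup script before the release boots (assumed paths):
#   RELEASE_NODE="$(pinned_node_name /data "mnesiatest@$(hostname)")"
#   export RELEASE_NODE
```

Because each volume carries its own recorded name, two machines with two volumes would also end up with distinct names, provided their first-boot candidates differ.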
I had just added the init.sh script yesterday in an attempt to get this working. The original line in the Dockerfile was something like CMD ["app/bin/server"].
From the docs you sent, is there even anything else I’d need to do? If the node is named, then the problem should be solved, right? I may be misunderstanding your fix.
If the node is named, then the problem should be solved, right?
Yup, that's all you need. It's the host part of the node name that kept changing because it was not set.
The RELEASE_NODE and RELEASE_DISTRIBUTION env vars should be set somewhere before running app/bin/server. Mix release generates an env.sh that you could add them to, e.g. using mix release.init, as fly.io suggests, to generate the env.sh template. You could also set them in the Dockerfile, but a separate startup script is cleaner.
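As a minimal sketch of that suggestion, the env.sh template created by mix release.init (rel/env.sh.eex) could carry the two variables. The node name mnesiatest@localhost is simply the value used earlier in this thread; substitute your own, and treat this as an assumption-laden example for a single-instance setup rather than a recommended configuration.

```shell
#!/bin/sh
# rel/env.sh.eex (sketch): sourced by the release start script before boot.
# Use fully qualified names so the host part can be set explicitly.
export RELEASE_DISTRIBUTION="name"
# Pin the full node name, including the host part, so it no longer follows
# the container hostname. Single-instance setups only.
export RELEASE_NODE="mnesiatest@localhost"
```

With this in place the release can still be started with CMD ["/app/bin/server"]; no separate mix run invocation should be needed.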
FYI, fly.io now automatically generates an env.sh.eex file containing these env variables:
#!/bin/sh
# configure node for distributed erlang with IPV6 support
export ERL_AFLAGS="-proto_dist inet6_tcp"
export ECTO_IPV6="true"
export DNS_CLUSTER_QUERY="${FLY_APP_NAME}.internal"
export RELEASE_DISTRIBUTION="name"
export RELEASE_NODE="${FLY_APP_NAME}-${FLY_IMAGE_REF##*-}@${FLY_PRIVATE_IP}"
You should change RELEASE_NODE so that it stays the same across deploys for mnesia clustering, e.g.:
export RELEASE_NODE="${FLY_APP_NAME}@${FLY_PRIVATE_IP}"
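A small sketch of why the generated default changes on every deploy: ${FLY_IMAGE_REF##*-} strips everything up to the last "-" and keeps the trailing part of the image reference, which differs per deployment. The app name, IP, and image references below are made-up placeholders, and the sketch assumes FLY_PRIVATE_IP itself stays the same between deploys.

```shell
#!/bin/sh
# Placeholder values; on Fly these are provided by the platform.
FLY_APP_NAME="myapp"
FLY_PRIVATE_IP="fdaa:0:1:2::3"

# Node name as built by the generated env.sh.eex shown above.
generated_node() {
  image_ref="$1"   # stands in for FLY_IMAGE_REF of one deploy
  printf '%s\n' "${FLY_APP_NAME}-${image_ref##*-}@${FLY_PRIVATE_IP}"
}

# Node name with the per-deploy suffix removed.
pinned_node() {
  printf '%s\n' "${FLY_APP_NAME}@${FLY_PRIVATE_IP}"
}

# Two deploys with hypothetical image references give two node names:
generated_node "registry.fly.io/myapp:deployment-AAAA"
generated_node "registry.fly.io/myapp:deployment-BBBB"
# The pinned form does not depend on the image reference:
pinned_node
```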
Can confirm this works with a single node setup. Anyone got this working with more than one machine on fly? Since both machines need their own volume, the mnesia databases are out of sync and if I log in to the application and reload the page and happen to be redirected to the second node I lose my session.
I'm running into a problem that has some more detail over here, but I'll try to share as much as I have done here, as well.
I'm using fly.io and have the following fly.toml which mounts a volume for mnesia:
And in my config.exs:
Unfortunately, everything crashes upon starting the Pow backend with mnesia:
In the thread they seemed to indicate that this may be due to permissions issues from the Elixir process. I thought I had confirmed this by changing the permissions myself, but now I can't remember what I did and the issue has come up again.
I would really love to get Pow working but am just banging my head against the wall with this issue. Any help would be appreciated. Thanks!