processone / ejabberd

Robust, Ubiquitous and Massively Scalable Messaging Platform (XMPP, MQTT, SIP Server)
https://www.process-one.net/en/ejabberd/

Bad cookie in table definition muc_online_room #4164

Closed: logicwonder closed this issue 2 months ago

logicwonder commented 4 months ago

Environment

Errors from console during startup

Error: {merge_schema_failed,"Bad cookie in table definition muc_online_room: 'ejabberd@node2.test' = {cstruct,muc_online_room,ordered_set,['ejabberd@node4.test','ejabberd@node3.test','ejabberd@node1.test','ejabberd@node2.test'],[],[],[],0,read_write,false,[],[],false,muc_online_room,[name_host,pid],[],[],[],{{1615260440706495858,-576460752303423071,1},'ejabberd@node1.test'},{{26,0},{'ejabberd@node4.test',{1708,63749,386751}}}}, 'ejabberd@node3.test' = {cstruct,muc_online_room,ordered_set,['ejabberd@node3.test'],[],[],[],0,read_write,false,[],[],false,muc_online_room,[name_host,pid],[],[],[],{{1708647743219259649,-576460752303423482,1},'ejabberd@node3.test'},{{2,0},[]}}\n"}

Bug description

I am running an ejabberd cluster with 4 nodes. The ejabberd instances on two nodes (node2 and node3) suddenly stopped and were removed from the cluster. After clearing Mnesia on node3, that node was started and successfully rejoined the cluster. The same operation was done on node2 and the node started successfully, but joining the cluster fails with the above error. We have deleted all contents in the /usr/local/var/lib/ejabberd path, but are still unable to join this node back to the cluster.

Please help.

badlop commented 4 months ago

The same operation was done on node2, the node was successfully started. But joining the cluster fails with the above error. We have deleted all contents in /usr/local/var/lib/ejabberd path. But still unable to join this node back to the cluster.

Ok, you already did the first thing I was going to suggest: if a node in the cluster cannot join correctly, stop it, delete the mnesia spool directory, start ejabberd (it will create an empty mnesia database) and join the cluster again (it will replicate the database from the other node).
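For reference, a minimal sketch of that recovery sequence on node2; the spool path and node name here are assumptions taken from this thread, so adjust them to your installation:

$ ejabberdctl stop
$ rm -rf /usr/local/var/lib/ejabberd/*
$ ejabberdctl start
$ ejabberdctl join_cluster ejabberd@node1.test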

From your explanation I understand this: when ejabberd node2 starts with an empty mnesia database, it starts correctly, with no error message. Then you execute join_cluster node1 and it crashes with the error message that you mentioned... right?

I don't know what exactly causes the problem or how to solve it, but I have some ideas you can try:


We have deleted all contents in /usr/local/var/lib/ejabberd path.

I imagine that is the path where ejabberd stores the mnesia spool files on your system, right? That path varies between machines and operating systems, so make sure you are really deleting the mnesia database that is problematic on node2.
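You can double-check which directory node2 is actually using before deleting anything. A minimal sketch, assuming ejabberdctl is available on node2 (the directory value shown is only illustrative):

$ ejabberdctl mnesia_info
 {directory,"/usr/local/var/lib/ejabberd"},
...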

If you deleted the mnesia spool files correctly on node2, when you start ejabberd again on that node, it will create a new empty mnesia database, and you will see around 40 lines like these in the ejabberd log file, because all the tables are being created:

2024-03-06 16:43:30.568832+01:00 [info] Creating Mnesia disc table 'muc_registered'
2024-03-06 16:43:30.571403+01:00 [info] Creating Mnesia ram table 'muc_online_room'

When you delete the node2 database and later start ejabberd, do you get those messages in the log?
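If you are not sure, a quick grep over the ejabberd log should show them; the log path here is only an assumption, adjust it to your setup:

$ grep "Creating Mnesia" /usr/local/var/log/ejabberd/ejabberd.log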


When you stop node2 and delete its mnesia database, make sure that node is completely forgotten and removed from the cluster:

Go to node1, check if node2 is still known by mnesia in node1, then tell node1 to delete node2 from the cluster, and verify this was done correctly (in my case the nodenames are ejabberd1@localhost and ejabberd2@localhost):

$ ejabberdctl mnesia_info
 {db_nodes,[ejabberd2@localhost,ejabberd1@localhost]},
...

$ ejabberdctl mnesia_info_ctl
running db nodes   = [ejabberd1@localhost]
stopped db nodes   = [ejabberd2@localhost]
...

$ ejabberdctl leave_cluster ejabberd2@localhost

$ ejabberdctl mnesia_info
 {db_nodes,[ejabberd1@localhost]},
...

$ ejabberdctl mnesia_info_ctl
running db nodes   = [ejabberd1@localhost]
stopped db nodes   = []
...

Just to be sure, you can repeat the process in the other nodes that are correctly connected to the cluster: remove node2 from all of them.
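For example, to quickly verify that node2 no longer appears among the db nodes of any remaining node (the ssh access and hostnames here are assumptions based on your node names):

$ ssh node1.test ejabberdctl mnesia_info_ctl | grep "db nodes"
$ ssh node3.test ejabberdctl mnesia_info_ctl | grep "db nodes"
$ ssh node4.test ejabberdctl mnesia_info_ctl | grep "db nodes"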

Once the old node2 is completely forgotten, start node2 with a new empty mnesia database and tell it to join the cluster. Let's hope this time there is no problem.
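Assuming the node names from your error message, the final steps on node2 would look roughly like this (a sketch; the list_cluster output shown is just the expected healthy result):

$ ejabberdctl start
$ ejabberdctl join_cluster ejabberd@node1.test
$ ejabberdctl list_cluster
ejabberd@node1.test
ejabberd@node2.test
...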

logicwonder commented 2 months ago

@badlop Thanks for taking the time for a detailed explanation. My issue was solved after multiple attempts at clearing Mnesia.