processone / ejabberd

Robust, Ubiquitous and Massively Scalable Messaging Platform (XMPP, MQTT, SIP Server)
https://www.process-one.net/en/ejabberd/
Other
6k stars 1.5k forks source link

occupant-id not stable in some cases #4236

Open lovetox opened 1 week ago

lovetox commented 1 week ago

Environment

Bug description

I have 2 different users in Gajim that reported a problem with occupant-id in 2 different MUCs on different servers.

When we investigated we found that the occupant-id in the presence is different from the occupant-id the server adds to their messages.

<message xmlns="jabber:client" to="en@chat.404.city" type="groupchat" id="8dcee6be-1769-4d7d-acbf-1e545b7a950f">
  <body>test</body>
  <origin-id xmlns="urn:xmpp:sid:0" id="8dcee6be-1769-4d7d-acbf-1e545b7a950f" />
</message>

<message xmlns="jabber:client" xml:lang="en" to="myname@myserver.com/gajim.J39W82BR" from="en@chat.404.city/myname" type="groupchat" id="8dcee6be-1769-4d7d-acbf-1e545b7a950f">
  <archived xmlns="urn:xmpp:mam:tmp" by="en@chat.404.city" id="1719035916432908" />
  <stanza-id xmlns="urn:xmpp:sid:0" by="en@chat.404.city" id="1719035916432908" />
  <origin-id xmlns="urn:xmpp:sid:0" id="8dcee6be-1769-4d7d-acbf-1e545b7a950f" />
  <occupant-id xmlns="urn:xmpp:occupant-id:0" id="g20AAAAgAozSRwPfZaleTLJ0hNM73x5BBMGnB2HfuA12Mm9MpNU=" />
  <body>test</body>
</message>

<presence xmlns="jabber:client" xml:lang="en" to="myname@myserver.com/gajim.J39W82BR" from="en@chat.404.city/myname" id="2aadd6b4-8c2a-4471-9b4f-3a098416a765">
  ...
  <occupant-id xmlns="urn:xmpp:occupant-id:0" id="g20AAAAgSp/6G+pFGi1tumup3r8yinT6Kk4yPZzoPGdFIzLrteI=" />
  <x xmlns="http://jabber.org/protocol/muc#user">
    <item jid="myname@myserver.com/gajim.J39W82BR" role="participant" affiliation="none" />
    <status code="110" />
</x>
  <show>xa</show>
</presence>

Multiple device where joined in the MUC, if that could be a issue. We have no reproducible case for now, but it happens.

Could you please check the code, under which conditions this could happen?

badlop commented 1 week ago

I played trying to reproduce the problem, but did not get it.

Does the problem appear from time to time in your server, even if you are not able to reproduce it voluntarily on demand?

In that case, you could apply a small patch to the modue that generates an ID that we can later decrypt, to determine what is the difference between the two ones

diff --git a/src/mod_muc_occupantid.erl b/src/mod_muc_occupantid.erl
index fabd69e28..42449e563 100644
--- a/src/mod_muc_occupantid.erl
+++ b/src/mod_muc_occupantid.erl
@@ -74,8 +74,8 @@ add_occupantid_packet(Packet, RoomJid) ->
     xmpp:set_subtag(Packet, OccupantElement).

 calculate_occupantid(From, RoomJid) ->
-    Term = {jid:remove_resource(From), get_salt(RoomJid)},
-    misc:term_to_base64(crypto:hash(sha256, io_lib:format("~p", [Term]))).
+    Term = {jid:remove_resource(From), RoomJid, get_salt(RoomJid)},
+    misc:term_to_base64(io_lib:format("~p", [Term])).

 %%%
 %%% Table storing rooms' salt

Then, the ID is a very long string like:

  <occupant-id id='g2wAAAABbAAAAANhe2wAAAAJbAAAAANhe2wAAAANawADamlkYSxsAAAABWE8YTxrAAcidXNlcjIiYT5hPmphLGwAAAAFYTxhPGsACyJsb2NhbGhvc3QiYT5hPmphLGsABDw8Pj5hLGwAAAAFYTxhPGsAByJ1c2VyMiJhPmE+amEsbAAAAAVhPGE8awALImxvY2FsaG9zdCJhPmE+amEsawAEPDw+PmphfWphLGEKawABIGwAAAADYXtsAAAAD2sAA2ppZGEsbAAAAAVhPGE8awAHInNhbGExImE+YT5qYSxsAAAABWE8YTxrABYiY29uZmVyZW5jZS5sb2NhbGhvc3QiYT5hPmphLGsABDw8Pj5hLGwAAAAFYTxhPGsAByJzYWxhMSJhPmE+amEsYQpsAAAAAmwAAAAEYSBrAAIgIGEgYSBqYSBqbAAAAAVhPGE8awAWImNvbmZlcmVuY2UubG9jYWxob3N0ImE+YT5qYSxrAAQ8PD4+amF9amEsYQprAAEgbAAAAAVhPGE8awAWIjE1OTU4NTQzNDU1MjgyNzcwNjEyImE+YT5qamF9amo='
    xmlns='urn:xmpp:occupant-id:0'/>

That can be easily decrypted to know what user, what room and what room salt were used:

misc:base64_to_term("g2wAAAABbAAAAANhe2wAAAAJbAAAAANhe2wAAAANawADamlkYSxsAAAABWE8YTxrAAcidXNlcjIiYT5hPmphLGwAAAAFYTxhPGsACyJsb2NhbGhvc3QiYT5hPmphLGsABDw8Pj5hLGwAAAAFYTxhPGsAByJ1c2VyMiJhPmE+amEsbAAAAAVhPGE8awALImxvY2FsaG9zdCJhPmE+amEsawAEPDw+PmphfWphLGEKawABIGwAAAADYXtsAAAAD2sAA2ppZGEsbAAAAAVhPGE8awAHInNhbGExImE+YT5qYSxsAAAABWE8YTxrABYiY29uZmVyZW5jZS5sb2NhbGhvc3QiYT5hPmphLGsABDw8Pj5hLGwAAAAFYTxhPGsAByJzYWxhMSJhPmE+amEsYQpsAAAAAmwAAAAEYSBrAAIgIGEgYSBqYSBqbAAAAAVhPGE8awAWImNvbmZlcmVuY2UubG9jYWxob3N0ImE+YT5qYSxrAAQ8PD4+amF9amEsYQprAAEgbAAAAAVhPGE8awAWIjE1OTU4NTQzNDU1MjgyNzcwNjEyImE+YT5qamF9amo=").

{term,[[123,
        [[123,
          ["jid",44,
           [60,60,"\"user2\"",62,62],
           44,
           [60,60,"\"localhost\"",62,62],
           44,"<<>>",44,
           [60,60,"\"user2\"",62,62],
           44,
           [60,60,"\"localhost\"",62,62],
           44,"<<>>"],
          125],
         44,10," ",
         [123,
          ["jid",44,
           [60,60,"\"sala1\"",62,62],
           44,
           [60,60,"\"conference.localhost\"",62,62],
           44,"<<>>",44,
           [60,60,"\"sala1\"",62,62],
           44,10,
           [[32,"  ",32,32],32],
           [60,60,"\"conference.localhost\"",62|...],
           44,"<<>>"],
          125],
         44,10," ",
         [60,60,"\"15958543455282770612\"",62,62]],
        125]]}

If you compile ejabberd from source code, simply apply the patch, recompile, install the file and restart the module or restart the whole ejabberd.

If you cannot compile ejabberd yourself, ping me and I'll generate an updated module *.beam file so you can install it.

prefiks commented 1 week ago

It could be that room was recreated between those two stanzas, that would change salt (which i think is expected). Only thing that's worth checking is if room hibernation doesn't trigger salt regeneration - it that case we don't want to do that.

Also Badlop was there any reason why you didn't want to put salt in muc_room state but used external mnesia table?

badlop commented 1 week ago

Also Badlop was there any reason why you didn't want to put salt in muc_room state but used external mnesia table?

Yes, the reasoning was something like this:

It could be that room was recreated between those two stanzas, that would change salt (which i think is expected).

Right, that could be the case here, and it makes sense that the occupantid changes, as the room is different (may be from a different creator, owner, ... both rooms just have the same roomname).

If that's what happened in this case, it will be revealed when there are new logs that have occupantid we can unencrypt to determine their source information.

Only thing that's worth checking is if room hibernation doesn't trigger salt regeneration - it that case we don't want to do that.

Aha, that could be the case here too, and in that case the module should be improved.

Looking at the code, a room salt is only removed when the room is removed (and the remove_room hook is called):

Maybe the room is non-persistent, using mnesia storage, got hibernated and the module deleted its salt when got unhibernated. I'll check and update this comment.

prefiks commented 1 week ago

Another thing related to hibernation even if hibernation doesn't reset salt (and i think we call remove_room hook only when room is truly removed, and not just hibernated), could be restoring room on different node, i don't think that mnesia table is synced between nodes, so if room was restored on different node, which didn't have it, it will make new one.

But i think it most probably room that is not set to persistent, and it just get deleted when everyone leaves and then get recreated.

badlop commented 1 week ago

A temporary (that is, not-persistent) room is deleted when the last occupant leaves and it has no subscribers.

I was able to reproduce the problem:

I wonder what was the requirement to forget_room when restoring a temporary room.

Removing it solves this particular problem:

```diff diff --git a/src/mod_muc.erl b/src/mod_muc.erl index e297d866f..3184fe90a 100644 --- a/src/mod_muc.erl +++ b/src/mod_muc.erl @@ -903,14 +903,7 @@ load_room(RMod, Host, ServerHost, Room, ResetHibernationTime) -> Res2; _ -> ?DEBUG("Restore hibernated non-persistent room: ~ts", [Room]), - Res = start_room(RMod, Host, ServerHost, Room, Opts0), - case erlang:function_exported(Mod, get_subscribed_rooms, 3) of - true -> - ok; - _ -> - forget_room(ServerHost, Host, Room) - end, - Res + start_room(RMod, Host, ServerHost, Room, Opts0) end end. ```
lovetox commented 1 week ago

Hi,

i think its a fair assumption that a salt is regenerated after a room was destroyed.

the room in question here is a group chat with >60 people, and its configured as persistent (at least today), i think it unlikely that it was destroyed.

A bit more info, we could see the problem happening consistently, as in messages always had a different id than the presence. It was not a case of send a message on one day, and check the presence on the next day.

so both occupant-ids where present at the same time, not a case of occupant-id changed because of an event and from that moment on it was different everywhere.

I try to get more logs from the user, maybe we can see what the occupant-id was over a longer period of time.

prefiks commented 1 week ago

@badlop Ah, missed that forget room call. That call was there to delete entry from db, as we will not delete this otherwise, we probably should add function that just delete room from db, and have other that remove from db and calls that hook.

@lovetox Do you know if server where this is deployed use clustering or is that just single computer?

lovetox commented 6 days ago

its 404.city, we try to reach the admin, but until now no success

lovetox commented 6 days ago

EDIT: Sorry got this wrong the first time.

Seems like @prefiks found the reason

we are storing those in mnesia table, not sure if it's set to be stored on disk yes, table that is used by it only keeps that in memory yeah, i will see if we can change that