Where does mod_bosh_r2webusers come from?
I guess r2webusers is a host that they use.
First thing: is your client able to use WebSocket instead of BOSH? If so, please change it to use WebSocket; BOSH is slower and uses significantly more resources.
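For reference, a minimal sketch of what an ejabberd.yml listener serving WebSocket alongside BOSH might look like (the port, paths and TLS settings here are assumptions; adjust them to your deployment):

```yaml
listen:
  -
    port: 5443
    module: ejabberd_http
    tls: true
    request_handlers:
      # existing BOSH endpoint
      /bosh: mod_bosh
      # WebSocket endpoint the client would connect to, e.g. wss://example.com:5443/ws
      /ws: ejabberd_http_ws
```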
To diagnose this you could use the ejabberd debug console. When you start it, you can see the processes that have big queues by executing p1_prof:q(). in it (that dot at the end is significant, don't skip it). This will list the 10 processes with the biggest queues. To see what messages are queued, copy the pid(..., ..., ...) from that command's output (it will have three numbers) and execute erlang:process_info(pid(..., ..., ...), messages). with that pid. A rough sketch of such a session is shown below.
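For illustration only, a session might look roughly like the following; the node name and pid are made up, and the exact output format of p1_prof:q(). may differ depending on your version:

```erlang
%% Attach a shell to the running node first, e.g. with: ejabberdctl debug
%% (assumption: a standard ejabberdctl installation)

%% List the 10 processes with the biggest message queues:
(ejabberd@localhost)1> p1_prof:q().
%% ... output elided; each entry includes a queue length and a pid such as <0.1234.0> ...

%% Inspect what is actually sitting in that process's mailbox,
%% writing <0.1234.0> as pid(0,1234,0):
(ejabberd@localhost)2> erlang:process_info(pid(0,1234,0), messages).
%% => {messages, [ ... likely a long list of queued xmpp stanzas ... ]}
```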
I am guessing those messages will be XMPP packets that need to be delivered to the client, but the BOSH connection is just too slow to deliver them at the rate they arrive.
Environment
Configuration (only if needed): grep -Ev '^$|^\s*#' ejabberd.yml
ps.- Obvious omissions in the config are for security purposes.
Errors from error.log/crash.log
9 hours after the dawn restart, with a relatively small internal user base that mainly uses pubsub as a "logged in or not" status indicator:
Bug description
After upgrading our server to an absurd amount of memory, we've recently been debugging our server's performance and noticed some HUGE spikes in memory usage (30GB+ spikes in resident memory, 60GB+ virtual) coming from beam.smp/ejabberd, whereas its average resident size remains steadily below ~400MB before and after the spikes.
So, after looking at some logs we came upon these:
The only other possibly relevant information from that log is that it was filled with: Closing c2s session for ... Stream reset by peer.
(Could it be that those "stream reset by peer" closures are actually accumulating as undelivered messages? If so, how do we get around that, given it definitely can't be handled on the client side since this is a known issue? Can we forgo those messages and have them purged somehow?)
In order to clean this up a bit, we cleared our logs and restarted the server yesterday at dawn to obtain a cleaner, shorter log. Here's a clean copy of the past 12 hours (as you can see, only a few lines are shown.. it was dawn after all, yet the problem still shows):
FULL LOG (not too big, I promise):
More info:
Please notice 👀 Where do 1+ MILLION messages come from? How can I see a sample of these "messages"? (Could it be that those "stream reset by peer" and failed authentications are actually accumulating as undelivered messages? If so, can we forgo those messages and have them purged somehow?)
SQL shows nothing relevant in "last" or "archive", nor any other relevant numbers.
Webadmin shows:
Any help with debugging what is going on and why this is happening would be greatly appreciated.
ps.- Also, there's no correlation with failed/closed sessions or any burst in network activity.