processone / ejabberd

Robust, Ubiquitous and Massively Scalable Messaging Platform (XMPP, MQTT, SIP Server)
https://www.process-one.net/en/ejabberd/
Other
6.03k stars 1.5k forks source link

Invalid username or password error on SQL DB timeout/failure #4056

Closed logicwonder closed 1 year ago

logicwonder commented 1 year ago

Environment

Errors from error.log

2023-06-29 21:07:31.622113+05:30 [error] <0.29828.3197>@proc_lib:crash_report/4:539 CRASH REPORT:
  crasher:
    initial call: xmpp_stream_in:init/1
    pid: <0.29828.3197>
    registered_name: []
    exception exit: {function_clause,
                        [{base64,decode_binary,
                             [<<>>,<<>>,52],
                             [{file,"base64.erl"},{line,369}]},
                         {xmpp_sasl_scram,mech_step,2,
                             [{file,"src/xmpp_sasl_scram.erl"},{line,131}]},
                         {xmpp_sasl,server_step,2,
                             [{file,"src/xmpp_sasl.erl"},{line,125}]},
                         {xmpp_stream_in,process_sasl_request,2,
                             [{file,"src/xmpp_stream_in.erl"},{line,906}]},
                         {xmpp_stream_in,handle_info,2,
                             [{file,"src/xmpp_stream_in.erl"},{line,411}]},
                         {p1_server,handle_msg,8,
                             [{file,"src/p1_server.erl"},{line,696}]},
                         {proc_lib,init_p_do_apply,3,
                             [{file,"proc_lib.erl"},{line,240}]}]}
      in function  p1_server:terminate/7 (src/p1_server.erl, line 878)
    ancestors: [ejabberd_c2s_sup,ejabberd_sup,<0.10808.0>]
    message_queue_len: 0
    messages: []
    links: [<0.18682.0>]
    dictionary: [{already_terminated,true},
                  {'$internal_queue_len',0},
                  {rand_seed,{#{bits => 58,jump => #Fun<rand.3.34006561>,
                                next => #Fun<rand.0.34006561>,type => exsss,
                                uniform => #Fun<rand.1.34006561>,
                                uniform_n => #Fun<rand.2.34006561>},
                              [165542753435725878|6331229633420500]}}]
    trap_exit: false
    status: running
    heap_size: 6772
    stack_size: 28
    reductions: 65454
  neighbours:

Bug description

I have recently faced a PostgreSQL database outage where the DB server got busy and all eJabberd query requests ( started timing out. During this period, I have observed that a large number of authentication requests were responded with "SCRAM-SHA-1 authentication. Invalid username or password." error. The error message in this case could have been different for e.g. "Authentication service unavailable".

Also I am suspecting that this error could be a result of a sync issue of users table between PostgreSQL and Mnesia.

Please guide me for a better understanding on what could have happened.

licaon-kter commented 1 year ago

Also I am suspecting that this error could be a result of a sync issue of users table between PostgreSQL and Mnesia.

Wait, what sync?

prefiks commented 1 year ago

It looks like data user for authentication is corrupted in database (at least for that user). You talk about converting from mnesia, what kind of procedure did you use for that? You could see what is stored in authentication tables by executing ejabberd_auth:get_password(<<"username">>, <<"my.domain">>). inside ejabberdctl debug console. Does that return something or maybe just empty data?

logicwonder commented 1 year ago

@licaon-kter I thought when an account is created, the user credentials are written SQL DB and Mnesia RAM DB. I think the usage of word "sync" was not correct.

@prefiks I did not do any migration from Mnesia. The users were created using eJabberd HTTP API and PostgreSQL is used as backend SQL DB. The SQL database was was timing out and users started getting the invalid username and password error. The scram error was also seen in the error.log I compared some of the account passwords from SQL DB with the one returned in the Mnesia query. I was unable to find any accounts with a mismatch. So does that mean that when the database is not accessible the above crash is possible and could have resulted in the error?

I am confused about my understanding of auth module.

licaon-kter commented 1 year ago

@logicwonder do get a https://gist.github.com with the (sanitized) config .yml

logicwonder commented 1 year ago

Here is my config https://gist.github.com/logicwonder/bb7923007e16783ebf2ac22ab914be9f

osmangoninahid commented 1 year ago

@logicwonder "DB server got busy and all eJabberd query requests ( started timing out)", Have you examined the database logs during that period? When the DB server becomes busy, does it mean it becomes unresponsive?

If that's the case, the root of the issue likely lies with the database. This situation arises when we're concurrently performing both write and read operations on the same table during peak loads. Due to locking, the database fails to respond and the session count drops to zero.

I encountered a similar problem and resolved it by scaling the database. I separated read and write operations by implementing read and write replicas, and I also introduced a database router by modifying the MySQL driver. This issue isn't specific to ejabberd; it's fundamentally a database performance matter."

logicwonder commented 1 year ago

Thanks for your assessment and suggestion. Yes it's clear that there was a performance issue in scaling the DB. However Invalid username and password is a misleading error response in such cases. Server busy or timeout errors could make more sense.