processone / ejabberd

Robust, Ubiquitous and Massively Scalable Messaging Platform (XMPP, MQTT, SIP Server)
https://www.process-one.net/en/ejabberd/

hourly switch instead of failover with two LDAP servers #3977

Closed cfrepak closed 2 weeks ago

cfrepak commented 1 year ago

Hi,

We're using LDAP authentication with two redundant servers and we're noticing some strange behavior. Instead of failover, the ejabberd host disconnects every hour and switches to the other server. The disconnection does not occur if only one LDAP server is configured. Is that normal?

edit: I should note. If the active LDAP server is shut down, no failover is performed! Each user is simply logged out. So what's the point of having the option for more than one server in the settings?

2022-12-12 02:14:09.003489+01:00 [warning] <0.702.0>@eldap:handle_info/3:695 LDAP server closed the connection: ipa2.example.com:389
In State: active
2022-12-12 03:06:11.229305+01:00 [warning] <0.701.0>@eldap:handle_info/3:695 LDAP server closed the connection: ipa1.example.com:389
In State: active
2022-12-12 03:06:11.229764+01:00 [warning] <0.699.0>@eldap:handle_info/3:695 LDAP server closed the connection: ipa1.example.com:389
In State: active
2022-12-12 03:14:09.001754+01:00 [warning] <0.700.0>@eldap:handle_info/3:695 LDAP server closed the connection: ipa2.example.com:389
In State: active
2022-12-12 03:14:09.002146+01:00 [warning] <0.702.0>@eldap:handle_info/3:695 LDAP server closed the connection: ipa2.example.com:389
In State: active
2022-12-12 04:06:13.008805+01:00 [warning] <0.701.0>@eldap:handle_info/3:695 LDAP server closed the connection: ipa1.example.com:389
In State: active
2022-12-12 04:06:13.009095+01:00 [warning] <0.699.0>@eldap:handle_info/3:695 LDAP server closed the connection: ipa1.example.com:389
In State: active
2022-12-12 04:14:09.003701+01:00 [warning] <0.700.0>@eldap:handle_info/3:695 LDAP server closed the connection: ipa2.example.com:389
In State: active
2022-12-12 04:14:09.004012+01:00 [warning] <0.702.0>@eldap:handle_info/3:695 LDAP server closed the connection: ipa2.example.com:389
In State: active
2022-12-12 05:06:14.002009+01:00 [warning] <0.701.0>@eldap:handle_info/3:695 LDAP server closed the connection: ipa1.example.com:389
In State: active
2022-12-12 05:06:14.002409+01:00 [warning] <0.699.0>@eldap:handle_info/3:695 LDAP server closed the connection: ipa1.example.com:389
In State: active
2022-12-12 05:14:09.003687+01:00 [warning] <0.700.0>@eldap:handle_info/3:695 LDAP server closed the connection: ipa2.example.com:389
In State: active
2022-12-12 05:14:09.003956+01:00 [warning] <0.702.0>@eldap:handle_info/3:695 LDAP server closed the connection: ipa2.example.com:389
In State: active
2022-12-12 06:06:22.002131+01:00 [warning] <0.701.0>@eldap:handle_info/3:695 LDAP server closed the connection: ipa1.example.com:389
In State: active
2022-12-12 06:06:22.002442+01:00 [warning] <0.699.0>@eldap:handle_info/3:695 LDAP server closed the connection: ipa1.example.com:389
In State: active
2022-12-12 06:14:09.003690+01:00 [warning] <0.700.0>@eldap:handle_info/3:695 LDAP server closed the connection: ipa2.example.com:389
In State: active
2022-12-12 06:14:09.004130+01:00 [warning] <0.702.0>@eldap:handle_info/3:695 LDAP server closed the connection: ipa2.example.com:389

configuration

auth_method: ldap
ldap_servers:
  - "ipa1.example.com"
 #- "ipa2.example.com"
ldap_port: 389
ldap_rootdn: "uid=ejabberd,cn=users,cn=accounts,dc=example,dc=com"
ldap_password: "...."
ldap_base: "dc=example,dc=com"
ldap_uids:
  - "uid"
ldap_filter: "(memberOf=cn=jabber_users,cn=groups,cn=accounts,dc=example,dc=com)"

Originally posted by @frashman123 in https://github.com/processone/ejabberd/discussions/3956

Maythemo commented 1 year ago

Same for me: when I add a second LDAP server and one of them disconnects, no one can log in. I don't have the problem with only one.

Neustradamus commented 1 year ago

@frashman123, @Maythemo: Does it work with the current version?

cfrepak commented 1 year ago

The current version in Debian Bullseye (Backports) and Bookworm is 23.01 and the issue is still present.

badlop commented 2 weeks ago

> what's the point of having the option for more than one server in the settings?

The benefit of having several servers listed in ldap_servers is performance.

cfrepak commented 2 weeks ago

Is it fixed in the next version? I can not find a commit regarding the issue.

badlop commented 2 weeks ago

You are right, there is no relevant commit.

I closed this issue because the original question was answered (listing several LDAP servers is not designed for failover, it is for performance), there was almost no movement on this topic in more than a year, it attracted little interest, and at the end of the day, something had to be done with the issue.

If you would like ejabberd to have LDAP server failover support, the issue can be rephrased as a feature request.

cfrepak commented 2 weeks ago

Thank you for your quick reply. But honestly, what do you expect? An issue (not a feature request*) is opened, someone says "same for me" and then nothing happened for over a year. Do you need more people to tell you that something isn't working?

* Why this shouldn't be a feature request: the answer to my rhetorical question in the opening post is not just performance, especially when an LDAP server is designed to process thousands of requests in a short period of time. If one server is down for maintenance, the other has to step in. Two LDAP servers are therefore not just for performance, but also for redundancy. You have two options for this scenario: You can install a load balancer before LDAP (or another service for that matter) or the software must be able to switch over.

In this case, ejabberd is able to switch, but not for failover reasons, nor for performance. As you can see in the log, ejabberd only connects to one server at a time and round-robins to the next one after 60 minutes. This means that neither performance nor failover is addressed.

However, there is an option "ldap_backups":

"A list of IP addresses or DNS names of LDAP backup servers. When no servers listed in ldap_servers option are reachable, ejabberd will try to connect to these backup servers. The default is an empty list, i.e. no backup servers specified. WARNING: ejabberd doesn’t try to reconnect back to the main servers when they become operational again, so the only way to restore these connections is to restart ejabberd. This limitation might be fixed in future releases."

Again: Why is this option available if a failover back to main cannot be performed?

Please don't get me wrong, I really appreciate your work. But redundancy always has two meanings: Performance and reliability.

badlop commented 2 weeks ago

Update: ldap_backups is covered at the bottom of this comment.


The ldap_servers option was first added 20 years ago in 0822a55f05bb327f0d362e0a3de205f5f1ce604a

The actual LDAP code is implemented in eldap.erl, which was started 24 years ago for another project and has been modified over time by several people.

At a quick look, there seems to be code to reconnect when the connection fails, at lines 619, 716 and 1112.

So this indeed looks like a bug, or configuration problem, or unexpected behavior...


Test environment

I've set up two LDAP servers in podman (on ports 1636 and 2636):

```yml
version: '3.7'

services:

  ldap1:
    container_name: ldap1
    hostname: ldap1
    image: docker.io/osixia/openldap:latest
    environment:
      LDAP_ADMIN_PASSWORD: admin
      LDAP_BASE_DN: dc=example
      LDAP_DOMAIN: example
      LDAP_TLS_VERIFY_CLIENT: try
    command: --copy-service --loglevel debug
    ports:
      - 1636:636

  ldap2:
    container_name: ldap2
    hostname: ldap2
    image: docker.io/osixia/openldap:latest
    environment:
      LDAP_ADMIN_PASSWORD: admin
      LDAP_BASE_DN: dc=example
      LDAP_DOMAIN: example
      LDAP_TLS_VERIFY_CLIENT: try
    command: --copy-service --loglevel debug
    ports:
      - 2636:636
```

I had to tweak eldap.erl to use the correct port when connecting to each server

```diff
diff --git a/src/eldap.erl b/src/eldap.erl
index 3676bd09a..3d06b84aa 100644
--- a/src/eldap.erl
+++ b/src/eldap.erl
@@ -1148,6 +1148,10 @@ format_error(SockMod, Reason) ->
 -define(CONNECT_TIMEOUT, timer:seconds(15)).
 -define(DNS_TIMEOUT, timer:seconds(5)).
 
+connect("ldap1" = Host, 636, Mod, Opts) ->
+    connect(Host, 1636, Mod, Opts);
+connect("ldap2" = Host, 636, Mod, Opts) ->
+    connect(Host, 2636, Mod, Opts);
 connect(Host, Port, Mod, Opts) ->
     case lookup(Host) of
         {ok, AddrsFamilies} ->
@@ -1157,6 +1161,7 @@ connect(Host, Port, Mod, Opts) ->
     end.
 
 do_connect([{IP, Family}|AddrsFamilies], Port, Mod, Opts, _Err) ->
+    ?INFO_MSG("Connecting to LDAP server using mod ~p IP ~p port ~p", [Mod, IP, Port]),
     case Mod:connect(IP, Port, [Family|Opts], ?CONNECT_TIMEOUT) of
         {ok, Sock} ->
             {ok, Sock};
```

And finally I can configure ejabberd:

```yaml
hosts:
  - localhost

auth_method: [ldap]
auth_use_cache: false
ldap_servers:
  - "ldap1"
  - "ldap2"
ldap_encrypt: tls
ldap_rootdn: "cn=admin,dc=example"
ldap_password: "admin"
ldap_base: "dc=example"
ldap_uids:
  - "cn"
```

When ejabberd starts, it connects to both of them. Two connections per server for auth, and another connection per server for vcard, that is six connections:

2024-08-27 20:46:56.091537+02:00 [info] Connecting to LDAP server using mod ssl IP {127,0,0,1} port 1636
2024-08-27 20:46:56.091544+02:00 [info] Connecting to LDAP server using mod ssl IP {127,0,0,1} port 2636
2024-08-27 20:46:56.223553+02:00 [info] Connecting to LDAP server using mod ssl IP {127,0,0,1} port 1636
2024-08-27 20:46:56.223739+02:00 [info] Connecting to LDAP server using mod ssl IP {127,0,0,1} port 2636
2024-08-27 20:46:56.223572+02:00 [info] Connecting to LDAP server using mod ssl IP {127,0,0,1} port 1636
2024-08-27 20:46:56.223781+02:00 [info] Connecting to LDAP server using mod ssl IP {127,0,0,1} port 2636

I can also see in the logs of both servers that they receive the connections.

Only connects to one LDAP server?

When ejabberd starts, I can see in the logs of both LDAP servers that they both receive the ejabberd connection.

Looking at the Erlang process list, I can see that ejabberd starts the Erlang processes that establish connections to both LDAP servers for authentication. I can even ask the process group manager for their process IDs:

pg:get_local_members(eldap_pool_ejabberd_auth_ldap_localhost).
[<0.1446.0>,<0.1445.0>]

When a client tries to authenticate, eldap_pool:pg_get_closest_pid(Group) calls get_local_members to determine which process, and consequently which LDAP server, the query will be sent to. This pool is NOT updated when an LDAP server gets disconnected.

When I shut down one LDAP server, the associated process stays alive and retries indefinitely to connect to the remote LDAP server:

2024-08-27 19:17:21.170111+02:00 [error] LDAP connection to ldap1:636 failed: connection refused

The users that are normally routed to that LDAP server will not be able to log in until it is available again.

Users who are already connected stay online; they are not kicked.
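To make the consequence of a static pool concrete, here is a minimal Python sketch. It is a toy model, not ejabberd code: the `StaticPool` class and its methods are hypothetical, and the selection rule (hashing the user name) is purely an assumption for illustration. The thread only establishes that the pool is not updated when a server disconnects.

```python
import zlib


class StaticPool:
    """Toy model of a connection pool whose member list never changes."""

    def __init__(self, servers):
        # One worker per server; the list is fixed at startup.
        self.workers = list(servers)
        self.up = {server: True for server in servers}

    def set_up(self, server, is_up):
        # The worker stays in the pool either way; only its link state changes.
        self.up[server] = is_up

    def pick(self, user):
        # ASSUMPTION: deterministic choice by hashing the user name.
        index = zlib.crc32(user.encode("utf-8")) % len(self.workers)
        return self.workers[index]

    def authenticate(self, user):
        # No fallback: if the chosen worker's server is down, the login fails.
        return self.up[self.pick(user)]


pool = StaticPool(["ldap1", "ldap2"])
users = [f"user{i}" for i in range(20)]

pool.set_up("ldap1", False)
locked_out = [u for u in users if not pool.authenticate(u)]
print(f"{len(locked_out)} of {len(users)} users cannot log in while ldap1 is down")
```

In this model, every user mapped to the stopped server is locked out while users mapped to the other server are unaffected, which is the pattern described above.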

Reconnect every 60 minutes?

I waited 110 minutes, and ejabberd didn't show the warning messages that you received. Nothing was logged in the LDAP servers. In my setup, the LDAP servers didn't close the connection, maybe because they are local connections?

If that 60 minutes were defined somewhere in ejabberd, it should be in eldap.erl. There's no mention of 60 minutes (or 3600000 milliseconds) anywhere in that file.

Maybe the 60 minutes timeout is defined in your LDAP server, or the LDAP server machine operating system?

Review assumptions

In light of the source code and the behaviour observed in a controlled testing environment, let's review the original questions and assumptions again:

> Instead of failover, the ejabberd host disconnects every hour

The log message explicitly says that it is the LDAP server that disconnects. And each LDAP server does so every 60 minutes: ipa1 at minute 06, and ipa2 at minute 14.

> and switches to the other server.

Is that what you interpret from the logs?

Every hour an LDAP server disconnects, and the corresponding eldap process connects again. Both servers are connected all the time.

As you have two servers, and eldap establishes two connections to each one, every hour you have 4 log messages.
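This reading of the log can be checked mechanically. The following Python sketch groups the warning lines by server; the sample lines are copied from the log in the opening post (hours 03 and 04), while the regular expression and the `summarize` helper are only illustrative.

```python
import re
from collections import defaultdict

# Two hours of the log from the opening post, copied verbatim.
LOG = """\
2022-12-12 03:06:11.229305+01:00 [warning] <0.701.0>@eldap:handle_info/3:695 LDAP server closed the connection: ipa1.example.com:389
In State: active
2022-12-12 03:06:11.229764+01:00 [warning] <0.699.0>@eldap:handle_info/3:695 LDAP server closed the connection: ipa1.example.com:389
In State: active
2022-12-12 03:14:09.001754+01:00 [warning] <0.700.0>@eldap:handle_info/3:695 LDAP server closed the connection: ipa2.example.com:389
In State: active
2022-12-12 03:14:09.002146+01:00 [warning] <0.702.0>@eldap:handle_info/3:695 LDAP server closed the connection: ipa2.example.com:389
In State: active
2022-12-12 04:06:13.008805+01:00 [warning] <0.701.0>@eldap:handle_info/3:695 LDAP server closed the connection: ipa1.example.com:389
In State: active
2022-12-12 04:06:13.009095+01:00 [warning] <0.699.0>@eldap:handle_info/3:695 LDAP server closed the connection: ipa1.example.com:389
In State: active
2022-12-12 04:14:09.003701+01:00 [warning] <0.700.0>@eldap:handle_info/3:695 LDAP server closed the connection: ipa2.example.com:389
In State: active
2022-12-12 04:14:09.004012+01:00 [warning] <0.702.0>@eldap:handle_info/3:695 LDAP server closed the connection: ipa2.example.com:389
In State: active
"""

PATTERN = re.compile(
    r"^\S+ (?P<hour>\d{2}):(?P<minute>\d{2}):\S+ \[warning\] "
    r"(?P<pid><[\d.]+>)@\S+ LDAP server closed the connection: "
    r"(?P<host>[^:\s]+):(?P<port>\d+)$"
)


def summarize(log_text):
    """Group 'closed the connection' warnings by LDAP host and by hour."""
    minutes = defaultdict(set)   # host -> minutes of the hour it closes at
    pids = defaultdict(set)      # host -> eldap processes that were affected
    per_hour = defaultdict(int)  # hour -> number of warnings
    for line in log_text.splitlines():
        match = PATTERN.match(line)
        if match is None:
            continue  # skips the "In State: active" continuation lines
        minutes[match["host"]].add(int(match["minute"]))
        pids[match["host"]].add(match["pid"])
        per_hour[match["hour"]] += 1
    return dict(minutes), dict(pids), dict(per_hour)


minutes, pids, per_hour = summarize(LOG)
for host in sorted(minutes):
    print(host, "closes at minute(s)", sorted(minutes[host]),
          "for processes", sorted(pids[host]))
print("warnings per hour:", per_hour)
```

Each host appears with a single minute of the hour and with the same two process IDs every time, so the warnings show both servers closing their own connections on their own schedule and do not show a switch from one server to the other.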

> The disconnection does not occur if only one LDAP server is configured. Is that normal?

It's a good question that you should investigate.

> edit: I should note. If the active LDAP server is shut down, no failover is performed!

Umm, I tried right now, and eldap tries to reconnect quite aggressively, one attempt every second:

2024-08-27 20:30:27.655090+02:00 [info] Connecting to LDAP server using mod ssl IP {127,0,0,1} port 2636
2024-08-27 20:30:27.655077+02:00 [info] Connecting to LDAP server using mod ssl IP {127,0,0,1} port 2636
2024-08-27 20:30:27.655269+02:00 [info] Connecting to LDAP server using mod ssl IP {127,0,0,1} port 2636
2024-08-27 20:30:27.655701+02:00 [error] LDAP connection to ldap2:636 failed: connection refused
2024-08-27 20:30:27.655734+02:00 [error] LDAP connection to ldap2:636 failed: connection refused
2024-08-27 20:30:27.655860+02:00 [error] LDAP connection to ldap2:636 failed: connection refused
2024-08-27 20:30:28.168939+02:00 [info] Connecting to LDAP server using mod ssl IP {127,0,0,1} port 2636
2024-08-27 20:30:28.168939+02:00 [info] Connecting to LDAP server using mod ssl IP {127,0,0,1} port 2636
2024-08-27 20:30:28.169085+02:00 [info] Connecting to LDAP server using mod ssl IP {127,0,0,1} port 2636
2024-08-27 20:30:28.169597+02:00 [error] LDAP connection to ldap2:636 failed: connection refused
2024-08-27 20:30:28.169720+02:00 [error] LDAP connection to ldap2:636 failed: connection refused
2024-08-27 20:30:28.169791+02:00 [error] LDAP connection to ldap2:636 failed: connection refused

> Each user is simply logged out.

In my tests, LDAP is only consulted at authentication time. Once a client has connected and authenticated, it remains connected until it disconnects. The problem is that the client will not be able to log in again.

> So what's the point of having the option for more than one server in the settings?

I already answered that.

> when I add a second LDAP server and one of them disconnects, no one can log in

Right, I was able to reproduce that. In other words: if ejabberd is configured to use two LDAP servers, eldap_pool will send client authentication requests to those servers in a deterministic way. If a server goes down, the clients that would be handled by that LDAP server cannot log in.

Reconnection attempts are done every second, as seen in the ejabberd log file.

> If one server is down for maintenance, the other has to step in. Two LDAP servers are therefore not just for performance, but also for redundancy. You have two options for this scenario: You can install a load balancer before LDAP (or another service for that matter) or the software must be able to switch over.

As we have seen, it seems the design decisions and your expectations don't match.

> As you can see in the log, ejabberd only connects to one server at a time and round-robins to the next one after 60 minutes.

I think that interpretation of the log messages is wrong and does not match reality as I've experienced and described it, which is 100% reproducible with the setup guide I provided.

> However, there is an option "ldap_backups": Again: Why is this option available if a failover back to main cannot be performed?

Good question; I don't know how that option works. I'll have to look at that the next day, as I already spent several hours today investigating this.

Update: How does ldap_backups work?

I've set up another LDAP server and configured ejabberd like this:

ldap_servers:
  - "ldap1"
ldap_backups:
  - "ldap2"
  - "ldap3"

This is the behavior I've seen:

Summary

cfrepak commented 2 weeks ago

Many thanks for the intensive tests. I will try the ldap_backups option. If I find something that does not meet my expectations, I will submit a feature request.

badlop commented 1 day ago

I've briefly extended the documentation of ldap_servers and ldap_backups, and added a paragraph in the LDAP documentation page summarizing how those options interact.
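For readers skimming the thread, the ldap_backups documentation text quoted earlier can be modelled roughly as follows. This Python sketch is a toy model of that quoted text only, not ejabberd code and not a record of the test results above; the `LdapTargets` class, its `candidates` method and the `reachable` argument are all hypothetical names.

```python
class LdapTargets:
    """Toy model of the ldap_servers / ldap_backups text quoted in this thread."""

    def __init__(self, servers, backups=()):
        self.servers = list(servers)
        self.backups = list(backups)
        # Once set, this is only cleared by a "restart" (creating a new object).
        self.on_backups = False

    def candidates(self, reachable):
        """Return the hosts that would be tried, given the set of reachable hosts."""
        if not self.on_backups:
            main = [s for s in self.servers if s in reachable]
            if main:
                return main
            # No main server is reachable: switch to the backup list...
            self.on_backups = True
        # ...and, per the quoted warning, never switch back without a restart.
        return [b for b in self.backups if b in reachable]


targets = LdapTargets(["ldap1"], ["ldap2", "ldap3"])
print(targets.candidates({"ldap1", "ldap2", "ldap3"}))  # main server is used
print(targets.candidates({"ldap2", "ldap3"}))           # ldap1 down: backups
print(targets.candidates({"ldap1", "ldap2", "ldap3"}))  # ldap1 back: still backups
```

The sticky `on_backups` flag is the part that corresponds to the documented warning: in this model only a fresh `LdapTargets` object, standing in for an ejabberd restart, returns to the main servers.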