tiredofit / docker-openldap

Dockerized LDAP server with many customizable options
MIT License

TLS/startTLS encryption not stable #10

Closed jrevillard closed 4 years ago

jrevillard commented 4 years ago

Hi,

Has anybody already faced the same issue as me? I set up this image (actually the fusiondirectory one, but I think the problem comes from this base image) to use Let's Encrypt certificates (externally managed).

When I start the image, everything works as expected, but after a couple of hours the TLS/startTLS encryption seems to stop working. Then, after some more hours, everything is back to normal. The Let's Encrypt certificates are not modified during that time, and the container does not restart.

For instance, I have a process which queries the LDAP server every minute: everything worked last night, and since this morning I have the following error:

ldap_start_tls: Connect error (-11)
    additional info: Public key signature verification has failed.
ldap_result: Can't contact LDAP server (-1)

The server-side error is:

TLS: can't accept: error:1403710B:SSL routines:ACCEPT_SR_KEY_EXCH:wrong version number.
openldap-fusiondirectory_openldap-fusiondirectory.1.c574qn099eaa@xxxxxxxx.gnubila.fr    | 5db7e51b conn=30845 fd=18 closed (TLS negotiation failure)

As I said, even without touching anything, it will work again in a couple of hours. Of course, if I restart the container now, everything works for some hours...

I really don't understand what the problem could be. Do you have any idea, please?
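To pin down exactly when the failure windows open and close, a small probe can log one handshake result per minute. The following is a minimal sketch, not something taken from this image: it assumes the server also exposes LDAPS on port 636 (it does not exercise StartTLS on 389), and the names `probe_tls`, `format_record`, and `watch` as well as the host `ldap.example.org` are hypothetical.

```python
import socket
import ssl
import time
from datetime import datetime, timezone


def probe_tls(host, port=636, timeout=5.0):
    """Attempt one TLS handshake against host:port; return (ok, detail)."""
    # Default context: verifies the certificate chain and the hostname.
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host) as tls:
                return True, tls.version() or "unknown"
    except (ssl.SSLError, OSError) as exc:
        # Handshake, verification, timeout, and connection errors all land here.
        return False, f"{type(exc).__name__}: {exc}"


def format_record(when, ok, detail):
    """Render one log line: ISO timestamp, OK/FAIL, then the detail."""
    status = "OK" if ok else "FAIL"
    return f"{when.isoformat(timespec='seconds')} {status} {detail}"


def watch(host, interval=60.0, rounds=None):
    """Probe every `interval` seconds; run forever when rounds is None."""
    done = 0
    while rounds is None or done < rounds:
        ok, detail = probe_tls(host)
        print(format_record(datetime.now(timezone.utc), ok, detail), flush=True)
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
```

Calling `watch("ldap.example.org")` from a host outside the Swarm cluster and again from inside it would show whether both vantage points fail during the same windows, which helps separate a server-side problem from a network-path one.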

Best, Jerome

jrevillard commented 4 years ago

@tiredofit, have you already faced this issue? It is really hard to maintain in prod with this problem...

Thanks. Jérôme

tiredofit commented 4 years ago

Hi Jérôme, I'm not actually seeing this in our production environment - What I will do is review my internal image vs this public image to make sure they are in sync and push a new copy of tiredofit/openldap and rebuild tiredofit/openldap-fusiondirectory. Sorry to hear you are having this issue.

I just realized I had changed this repo but never pushed to GitHub. What are you using for TLS ciphers?

jrevillard commented 4 years ago

I'm using the image's default TLS ciphers.

Thanks for your help.

tiredofit commented 4 years ago

Images are building now. While we wait, I'm wondering: is your Nginx reverse proxy / Let's Encrypt companion working with a different cipher suite? That may very well be the root cause of this. I originally built this as a sort of workaround, using nginx inside the container for HTTP challenge authentication. I have since switched to DNS-based authentication and use Traefik now, but we ran this image successfully for nearly 2 years with TLS. Maybe go up the chain a bit and investigate your proxy/Let's Encrypt challenge services for some answers.

tiredofit commented 4 years ago

New images for both tiredofit/openldap and tiredofit/openldap-fusiondirectory (:latest) available.

jrevillard commented 4 years ago

Thanks a lot. I updated... let's see tomorrow morning if the error disappears by magic :-)

Concerning the reverse proxy, I also use Traefik but do not use it for LDAP. I explicitly set:

        - "traefik.enable=false"

Another piece of information: I run it inside a Swarm cluster.

Do you use the Traefik network for your LDAP server? If yes, could you give me a sample configuration?

Best, Jerome

tiredofit commented 4 years ago

Sadly I don't have any experience running in swarm, although I'm glad it's for the most part working :) Does that eliminate the requirement of having to use replication?

I'm sure you could run Traefik with OpenLDAP routing through 80/443, but that would cause some headaches for your LDAP clients relying on either 389 or 636 as a destination port. It sounds as if you may already be doing what I'm suggesting. If you wanted to use LE certificates with Traefik, what I would do is set up a separate service that really does nothing (say, a basic nginx install) and have that one collect the certificate for you. Then you could disable nginx inside the image (it was only there to allow nginx-proxy to pick up a certificate) and also keep Traefik disabled on your openldap container. Finally, you would want to explode your acme.json file to separate out the keys for your openldap service.

Hopefully, though, as you mentioned, this new image might just solve what's happening. I believe our organization has been using a very similar configuration since 2.4.43, and we haven't had any reports of certificate issues.
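The "explode your acme.json" step could be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes the Traefik v2 style layout, where each resolver holds a `Certificates` list whose entries carry `domain.main` plus base64-encoded `certificate` and `key` fields (Traefik v1 files use capitalized keys and would need adjusting). The function names, file names, and the domain `ldap.example.org` are hypothetical.

```python
import base64
import json
from pathlib import Path


def find_certificate(acme, domain):
    """Return (cert_pem, key_pem) as bytes for `domain`, or None if absent."""
    # Assumed layout: {"<resolver>": {"Certificates": [{"domain": {"main": ...},
    #                  "certificate": "<base64>", "key": "<base64>"}, ...]}}
    for resolver in acme.values():
        if not isinstance(resolver, dict):
            continue
        for entry in resolver.get("Certificates") or []:
            if entry.get("domain", {}).get("main") == domain:
                return (base64.b64decode(entry["certificate"]),
                        base64.b64decode(entry["key"]))
    return None


def export_certificate(acme_path, domain, out_dir):
    """Write <domain>.crt and <domain>.key into out_dir; return both paths."""
    acme = json.loads(Path(acme_path).read_text())
    found = find_certificate(acme, domain)
    if found is None:
        raise KeyError(f"no certificate for {domain} in {acme_path}")
    cert_pem, key_pem = found
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    cert_file = out / f"{domain}.crt"
    key_file = out / f"{domain}.key"
    cert_file.write_bytes(cert_pem)
    key_file.write_bytes(key_pem)
    key_file.chmod(0o600)  # keep the private key readable by its owner only
    return cert_file, key_file
```

For example, `export_certificate("acme.json", "ldap.example.org", "./certs")` would leave a certificate and key pair in `./certs` that can then be mounted into the openldap container. Re-running it after each renewal is left to whatever scheduler is already in place.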

jrevillard commented 4 years ago

OK, so this morning, same issue. But I just realized that, even though I do not use the Traefik functionality, I was using the Traefik network! I just created a dedicated network for the LDAP server and restarted... fingers crossed.

Concerning Swarm: yes, it works well, with underlying GlusterFS volumes. I still need replication because I installed it on 2 different sites for disaster recovery.

Best, Jerome

jrevillard commented 4 years ago

I think I made some progress. Switching to the other LDAP server for the services which cannot use multiple servers seems to solve the issue, so my problem only occurs on one site. I just saw that the Docker version, OpenSSL version, etc. are not the same on the two sites, so I updated everything... let's see tomorrow morning.

Best, Jerome

jrevillard commented 4 years ago

Ok sorry for the silence...

I updated the 2 Swarm clusters to the same version, and I still have the problem. I wonder if it could be related to some overload issue (I don't think so, but just in case). I will redirect all the traffic to the second LDAP server to see if I can reproduce it. So far, the problem still appears on only 1 LDAP server.

jrevillard commented 4 years ago

Just to complete the picture: it now occurs on both LDAP servers. I'm applying Docker and OS patches regularly to see if that solves the problem at some point.

jrevillard commented 4 years ago

Hi @tiredofit, just to let you know that for a couple of weeks now, after another OS update on our Swarm cluster, everything has been working well, with no more issues.

Best, Jerome