svarshavchik / courier

Courier Mail Server
http://www.courier-mta.org

STARTTLS fails with the “certified” domain, but works with a wrong domain #48

Open andrejpodzimek opened 1 year ago

andrejpodzimek commented 1 year ago

This is copied from my rant on AUR.

Not sure if this is caused by version 1.2 of courier-mta or version 3.0.x of openssl, but STARTTLS in courier-mta currently fails unless you connect to the server using a domain name that does not match the one in the certificate(s), which makes STARTTLS effectively unusable. The bug is tricky, because:

| | TLS | STARTTLS |
|---|---|---|
| when requesting the certified domain name | works | fails |
| requesting a bogus domain name (resolving to the mail server’s address) | “works” | “works” |

This↑ can be reproduced using (1) Thunderbird, (2) R2Mail2 and (3) openssl s_client as a client. It affects both IMAP and SMTP. For s_client in particular, this is how you can test your server:

# This will fail and exit immediately (with or without error, at random):
openssl s_client -starttls imap -crlf -connect domain.in.certificate:143
openssl s_client -starttls smtp -crlf -connect domain.in.certificate:25

# This will “work”; try to enter (e.g.) '1 capability' for IMAP or 'EHLO blah' for SMTP:
openssl s_client -starttls imap -crlf -connect domain.NOT.in.certificate:143
openssl s_client -starttls smtp -crlf -connect domain.NOT.in.certificate:25

The error symptom is either an abrupt connection termination with no further output or, sometimes, this error:

0052A735227F0000:error:0A00010B:SSL routines:ssl3_get_record:wrong version number:../ssl/record/ssl3_record.c:358:

The leading string seems random, the stuff after : is stable.
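Because only the first field changes between runs, error lines can be compared once that field is dropped. A minimal sketch follows; the helper name `strip_err_prefix` is made up here, and the assumption that the prefix is always a hex string followed by `:error:` rests only on the samples in this thread (it appears to be a thread or process identifier).

```shell
#!/bin/sh
# Hypothetical helper: remove the varying leading field from OpenSSL
# error lines read on stdin, so output from two runs can be diffed.
# Assumes the prefix is a run of hex digits followed by ":error:".
strip_err_prefix() {
    sed 's/^[0-9A-Fa-f]*:error:/error:/'
}

# Usage: openssl s_client ... 2>&1 | strip_err_prefix
```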

For easier end-to-end debugging, I’ve used a trivial IMAP client. It establishes a STARTTLS connection to an IMAP server, authenticates using cram-sha256 and reads the mailbox status. As already mentioned, setting server to a domain name listed in the certificate fails and setting it to a bogus domain that resolves to the mail server’s IP address (but is not in its certificate) succeeds.

This looks like a critical bug, because it renders opportunistic STARTTLS security over SMTP’s port 25 inoperable. TLS on 465 works perfectly fine. For IMAP the obvious workaround is to use IMAP over TLS on 993 and give up on STARTTLS entirely.

I’ve tried to rebuild and restart courier-mta, with and without Arch’s openssl-1.1 package installed (and with the default openssl 3.0.7 always installed), but there is no difference; STARTTLS is (kind of) gone.

andrejpodzimek commented 1 year ago

Looking at git diff courier/1.1.10/20220606180754 courier/1.2.0/20221202210553, the idn2 migration stands out, because this is somehow domain-name-related. But I can’t spot anything “suspicious” at first glance, and neither my domain nor my certificate contains any non-ASCII characters or anything highly idn2-relevant.

BTW, idn2 is “stricter” than idn, it seems, but I don’t have dashes or other problematic characters anywhere (neither in any of the domain names, nor in the domain-specific certificate file / symlink names).

andrejpodzimek commented 1 year ago

Built and tested Courier 1.1.10 and also 1.1.11. Still the same problem. So it’s not the idn2, I suppose. It may be OpenSSL-related.

I’d like to rebuild Courier with OpenSSL 1.1.1s to figure out whether the recent upgrade to 3.0.7 (in Arch) could be to blame, but can’t get that to compile. Environment variables added in PKGBUILD:

LDFLAGS+=",-L/usr/lib/courier-authlib,-L/usr/lib/openssl-1.1 -lcourierauth"
CPPFLAGS="-I/usr/include/openssl-1.1 ${CPPFLAGS}"
CFLAGS="-I/usr/include/openssl-1.1 ${CFLAGS}"

Error:

libcouriertls.c: In function 'load_dh_params':
libcouriertls.c:420:17: error: unknown type name 'OSSL_LIB_CTX'
  420 |                 OSSL_LIB_CTX *libctx=OSSL_LIB_CTX_get0_global_default();
      |                 ^~~~~~~~~~~~
libtool: link: ranlib .libs/libspipe.a
libcouriertls.c:420:38: warning: implicit declaration of function 'OSSL_LIB_CTX_get0_global_default' [-Wimplicit-function-declaration]
  420 |                 OSSL_LIB_CTX *libctx=OSSL_LIB_CTX_get0_global_default();
      |                                      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
libcouriertls.c:420:38: warning: initialization of 'int *' from 'int' makes pointer from integer without a cast [-Wint-conversion]
mv -f .deps/tlscachetest.Tpo .deps/tlscachetest.Po
libtool: link: ( cd ".libs" && rm -f "libspipe.la" && ln -s "../libspipe.la" "libspipe.la" )
/bin/sh ./libtool  --tag=CC   --mode=link gcc  -I./.. -I.. -I./../.. -I../.. -Wall -I/usr/include/openssl-1.1 -march=native -O2 -pipe -fno-plt -fexceptions         -Wp,-D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security         -fstack-clash-protection -fcf-protection -static -Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now,-L/usr/lib/courier-authlib,-L/usr/lib/openssl-1.1 -lcourierauth -o tlscachetest tlscachetest.o ../numlib/libnumlib.la ../liblock/liblock.la 
libtool: compile:  gcc -DHAVE_CONFIG_H -I. -I/usr/include/openssl-1.1 -I/usr/include/p11-kit-1 -I./.. -I.. -I./../.. -I../.. -Wall -I/usr/include/openssl-1.1 -march=native -O2 -pipe -fno-plt -fexceptions -Wp,-D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security -fstack-clash-protection -fcf-protection -MT tlsclient.lo -MD -MP -MF .deps/tlsclient.Tpo -c tlsclient.c -o tlsclient.o >/dev/null 2>&1
libcouriertls.c:422:32: warning: implicit declaration of function 'PEM_read_bio_Parameters_ex'; did you mean 'PEM_read_bio_Parameters'? [-Wimplicit-function-declaration]
  422 |                 EVP_PKEY *pkey=PEM_read_bio_Parameters_ex(bio, NULL, libctx,
      |                                ^~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                PEM_read_bio_Parameters
libcouriertls.c:422:32: warning: initialization of 'EVP_PKEY *' {aka 'struct evp_pkey_st *'} from 'int' makes pointer from integer without a cast [-Wint-conversion]
libcouriertls.c:427:29: warning: implicit declaration of function 'EVP_PKEY_is_a'; did you mean 'EVP_PKEY_sign'? [-Wimplicit-function-declaration]
  427 |                         if (EVP_PKEY_is_a(pkey, "DH"))
      |                             ^~~~~~~~~~~~~
      |                             EVP_PKEY_sign
libtool: compile:  gcc -DHAVE_CONFIG_H -I. -I/usr/include/openssl-1.1 -I/usr/include/p11-kit-1 -I./.. -I.. -I./../.. -I../.. -Wall -I/usr/include/openssl-1.1 -march=native -O2 -pipe -fno-plt -fexceptions -Wp,-D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security -fstack-clash-protection -fcf-protection -MT tlscache.lo -MD -MP -MF .deps/tlscache.Tpo -c tlscache.c -o tlscache.o >/dev/null 2>&1
mv -f .deps/starttls.Tpo .deps/starttls.Po
libcouriertls.c:429:37: warning: implicit declaration of function 'SSL_CTX_set0_tmp_dh_pkey'; did you mean 'SSL_CTX_set_tmp_dh'? [-Wimplicit-function-declaration]
  429 |                                 if (SSL_CTX_set0_tmp_dh_pkey(ctx, pkey))
      |                                     ^~~~~~~~~~~~~~~~~~~~~~~~
      |                                     SSL_CTX_set_tmp_dh
In file included from libcouriertls.h:28,
                 from libcouriertls.c:9:
libcouriertls.c: In function 'tls_create_int':
/usr/include/openssl-1.1/openssl/ssl.h:1496:52: warning: statement with no effect [-Wunused-value]
 1496 | #  define SSL_CTX_set_ecdh_auto(dummy, onoff)      ((onoff) != 0)
      |                                                    ^
libcouriertls.c:1082:9: note: in expansion of macro 'SSL_CTX_set_ecdh_auto'
 1082 |         SSL_CTX_set_ecdh_auto(ctx, 1);
      |         ^~~~~~~~~~~~~~~~~~~~~

andrejpodzimek commented 1 year ago

Side note: Tried --with-gnutls to check how that would work, but it seems to be just broken into pieces regardless of domain names or other things:

$ openssl s_client -starttls imap -crlf -connect my.server.domain:143 
CONNECTED(00000003)
write:errno=104
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 631 bytes and written 345 bytes
Verification: OK
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
Early data was not sent
Verify return code: 0 (ok)
---
$ openssl s_client -starttls smtp -crlf -connect my.server.domain:25
CONNECTED(00000003)
00D246DAF67F0000:error:0A000126:SSL routines:ssl3_read_n:unexpected eof while reading:../ssl/record/rec_layer_s3.c:320:
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 206 bytes and written 359 bytes
Verification: OK
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
Early data was not sent
Verify return code: 0 (ok)
---

svarshavchik commented 1 year ago

The configure script is still picking up OpenSSL 3.0 headers and configuring the build for OpenSSL 3, but the code ends up using OpenSSL 1.1 header files to compile; this is the reason for the compilation error.

I am too unfamiliar with Arch's build framework to offer any pointers.

andrejpodzimek commented 1 year ago

Next things I’ve tried, to no avail:

  1. Capture the output from this during a successful STARTTLS (with a nonexistent domain) and during a failed STARTTLS (with one of the “certified” domains):

    pids=($(pidof couriertcpd)); strace -f "${pids[@]/#/-p}"

    No surprises. Extra accesses to /etc/courier/imapd.pem.my.server.domain occur in the latter case. A failure in the latter case (in strace) is not obvious, but there is some log write like this one (also the only “failed” in that output):

    [pid 4175639] write(1, ". NO STARTTLS failed: ip=[2620:0"..., 192 <unfinished ...>

    Boldly assuming it’s still the same descriptor, writes that preceded that ↑ were:

    [pid 4175639] write(1, "* OK [CAPABILITY IMAP4rev1 UIDPL"..., 339) = 339
    [pid 4175639] write(1, "* CAPABILITY IMAP4rev1 UIDPLUS C"..., 255) = 255
    [pid 4175639] write(1, ". OK Begin SSL/TLS negotiation n"..., 37) = 37

    Now a “successful” case (with a bogus domain) still has these↑ lines, but not the ultimate . NO STARTTLS failed: line.

  2. TLS_PROTOCOL=TLSv1.2+ instead of TLS_PROTOCOL=TLSv1.2++: Nothing changed. So the ban on client-initiated re-negotiation is not to blame either.

  3. Splitting key and certificate files (just in case something is wrong with the file accesses): I used to have a TLS_CERTFILE with the private key (also) in it. Now I have TLS_PRIVATE_KEYFILE and a certificate-only TLS_CERTFILE. But again, nothing changed.

In any case, direct TLS without STARTTLS (other ports, but the same Courier instance + config) works perfectly fine. This is weird. No clue what I’m missing.

Ad Courier and OpenSSL in Arch: This is the ./configure command. Is there a ./configure option that could change the SSL include path? OpenSSL 1.1.1s is installed in /usr/{include,lib}/openssl-1.1. Whatever I can hack in the sources is easy to test with makepkg; I’m just clueless as to what to hack.

svarshavchik commented 1 year ago

./configure reads the CFLAGS, CXXFLAGS, et al., environment variables. They can also be passed in, explicitly, as additional parameters: CFLAGS=... CXXFLAGS=... to configure.

andrejpodzimek commented 1 year ago

> ./configure reads the CFLAGS, CXXFLAGS, et al., environment variables.

Well, I exported those (with values listed above) in the PKGBUILD right before the ./configure and they did appear in the compiler commands printed out during make. But then I got that error nonetheless, so there must be something that still looks for OpenSSL in the default system paths or otherwise assumes OpenSSL 3.

svarshavchik commented 1 year ago

I would then double-check the actual parameters that get passed to the compiler. make V=1 builds and shows each command that gets invoked, with all the options. The exact options, -I and all others, can be ascertained from that.

andrejpodzimek commented 1 year ago

I don’t think there is an issue with -I. This looks like a problem during ./configure, not during make. The missing OSSL_LIB_CTX indicates that OpenSSL 1.1.x headers are included (as desired), but the code expects OpenSSL 3.x. To overcome the missing OSSL_LIB_CTX, this hack is needed:

sed -i \
  's/"#define HAVE_PEM_READ_BIO_PARAMETERS_EX 1"/"#define HAVE_PEM_READ_BIO_PARAMETERS_EX 0"/' \
  libs/tcpd/configure

Otherwise this gets compiled in and requires OSSL_LIB_CTX. The function this checks for (in tcpd/configure) is called PEM_read_bio_Parameters_ex. It is available in OpenSSL 3.x, but not in OpenSSL 1.1.x.

So while -I and -L are set correctly during compilation and linking, the testing code snippets in the ./configure stage are likely not getting the OpenSSL version override and are using the system default 3.0.7 instead.

Even after hacking around the OSSL_LIB_CTX requirement in tcpd/configure it won’t link, due to a missing SSL_get_peer_certificate:

libtool: link: gcc -I./.. -I.. -I./../.. -I../.. -Wall -I/usr/include/openssl-1.1 -march=native -O2 -pipe -fno-plt -fexceptions -Wp,-D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security -fstack-clash-protection -fcf-protection -Wl,-O1 -Wl,--sort-common -Wl,--as-needed -Wl,-z -Wl,relro -Wl,-z -Wl,now -Wl,-L/usr/lib/openssl-1.1 -Wl,-L/usr/lib/courier-authlib -o couriertls starttls.o argparse.o  ./.libs/libcouriertls.a -lssl -lcrypto ./.libs/libspipe.a ../rfc1035/librfc1035.a ../md5/.libs/libmd5.a ../random128/.libs/librandom128.a ../numlib/.libs/libnumlib.a ../liblock/.libs/liblock.a -lcourierauth -lidn2 ../soxwrap/libsoxwrap.a
/usr/bin/ld: ./.libs/libcouriertls.a(libcouriertls.o): in function `tls_dump_connection_info':
libcouriertls.c:(.text+0x2bb5): undefined reference to `SSL_get_peer_certificate'

This↑ error is utterly bogus, because that symbol does exist in OpenSSL 1.1.x (and not in OpenSSL 3.x):

$ strings /usr/lib/openssl-1.1/libssl.so | grep SSL_get_peer_certificate  # 1.1.1s
SSL_get_peer_certificate

$ strings /usr/lib/libssl.so | grep SSL_get_peer_certificate  # 3.0.7

The whole build procedure, for the record:

CFLAGS="-I/usr/include/openssl-1.1 ${CFLAGS}"
CXXFLAGS="-I/usr/include/openssl-1.1 ${CXXFLAGS}"
CPPFLAGS="$CXXFLAGS"
LDFLAGS+=',-L/usr/lib/openssl-1.1,-L/usr/lib/courier-authlib -lcourierauth'
export CFLAGS CPPFLAGS CXXFLAGS LDFLAGS

sed -i \
  's/"#define HAVE_PEM_READ_BIO_PARAMETERS_EX 1"/"#define HAVE_PEM_READ_BIO_PARAMETERS_EX 0"/' \
  libs/tcpd/configure

./configure --prefix=/usr \
  --sbindir=/usr/bin \
  --sysconfdir=/etc/courier \
  --libdir=/usr/lib \
  --libexecdir=/usr/lib \
  --localstatedir=/var/spool/courier \
  --enable-unicode \
  --enable-workarounds-for-imap-client-bugs \
  --enable-mimetypes=/etc/mime.types \
  --with-piddir=/run/courier \
  --with-trashquota \
  --with-db=gdbm \
  --with-random=/dev/urandom \
  --without-ispell \
  --with-mailuser=courier \
  --with-mailgroup=courier \
  --with-certdb=/etc/ssl/certs/ \
  --with-notice=unicode \
  "CFLAGS=${CFLAGS}" \
  "CPPFLAGS=${CPPFLAGS}" \
  "CXXFLAGS=${CXXFLAGS}" \
  "LDFLAGS=${LDFLAGS}"

make V=1

svarshavchik commented 1 year ago

This is not a correct interpretation.

This is a standard autoconf-generated test:

AC_CHECK_FUNCS(PEM_read_bio_Parameters_ex)

configure attempts to link with a dummy program that calls this function. If the link succeeds, #define HAVE_PEM_READ_BIO_PARAMETERS_EX 1 gets defined. If the link fails, this is not defined.

The environment which runs the configure script ends up with the linker finding the OpenSSL 3 library. But the compilation environment is pointing to OpenSSL 1.
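To see which library such a link check finds, the test can be approximated by hand. The sketch below is a simplified stand-in for the program AC_CHECK_FUNCS generates, not the literal conftest.c. The helper name `write_conftest` is hypothetical, and `-lssl -lcrypto` is an assumption about what the real build links against.

```shell
#!/bin/sh
# Hypothetical helper: emit a tiny C program that references one function.
# If this program links, the linker found a library exporting the symbol.
write_conftest() {
    printf 'char %s ();\n' "$1"
    printf 'int main (void) { return %s (); }\n' "$1"
}

# Usage (run manually, with the same environment that configure sees):
#   write_conftest PEM_read_bio_Parameters_ex > conftest.c
#   ${CC:-cc} $CFLAGS $CPPFLAGS $LDFLAGS conftest.c -lssl -lcrypto -o conftest \
#     && echo "link succeeded: the library found exports the symbol"
```

If the link succeeds even though the OpenSSL 1.1 paths were supposed to be in effect, the linker is presumably still resolving -lssl to the OpenSSL 3 library.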

andrejpodzimek commented 1 year ago

> The environment which runs the configure script ends up with the linker finding the OpenSSL 3 library. But the compilation environment is pointing to OpenSSL 1.

That doesn’t seem to contradict what I said above: The ./configure script basically ignores my attempts to override -I and -L for OpenSSL when running its decision-making code snippets and keeps using the default OpenSSL there.

OTOH, ./configure does include my explicitly specified -I and -L in the generated Makefiles.

That↑ way the make phase (indeed) has the correct (overridden) -I and -L for OpenSSL 1.1.1s, which mismatches ./configure’s decisions taken based on OpenSSL 3.0.7 (and written into config headers).

What is the best place to forcibly inject my -I and -L (also) into ./configure script’s “dummy programs” (auto-generated snippets) compilation, so that OpenSSL 1.1.x is used there too?

svarshavchik commented 1 year ago

I think it's a matter of using the right environment variables.

AC_CHECK_FUNCS appears to compile the test program as C, using CFLAGS, CPPFLAGS, and LDFLAGS.

This is a matter of strictly enforcing the right variables: -I goes into CPPFLAGS; -l and -L go into LDFLAGS.

Both C and C++ compilations use CPPFLAGS, for preprocessor-related compiler flags, and LDFLAGS for linker flags.

Stuffing everything into CXXFLAGS is just the lazy way out that works most of the time. Except when it doesn't.
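A minimal sketch of that split, using the OpenSSL 1.1 paths mentioned earlier in the thread. Two details are assumptions rather than anything stated here: the -l option is placed in LIBS, following Autoconf's documented convention, and plain -L is used instead of -Wl,-L. Whether this alone makes configure pick OpenSSL 1.1 on Arch is untested.

```shell
#!/bin/sh
# Preprocessor search path: used by C and C++ compiles alike, and by
# configure's own test programs.
CPPFLAGS="-I/usr/include/openssl-1.1 ${CPPFLAGS:-}"
# Library search paths only; no -l options here.
LDFLAGS="-L/usr/lib/openssl-1.1 -L/usr/lib/courier-authlib ${LDFLAGS:-}"
# Libraries to link (assumption: Autoconf's LIBS convention).
LIBS="-lcourierauth ${LIBS:-}"
export CPPFLAGS LDFLAGS LIBS

# CFLAGS and CXXFLAGS stay reserved for optimization and warning options.
# Then run ./configure with the same options as before.
```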

andrejpodzimek commented 1 year ago

> Stuffing everything into CXXFLAGS is just the lazy way out that works most of the time. Except when it doesn't.

What is this↑ referring to? I don’t see that here↓ in my hack; LDFLAGS and CPPFLAGS are separate…

CFLAGS="-I/usr/include/openssl-1.1 ${CFLAGS}"
CXXFLAGS="-I/usr/include/openssl-1.1 ${CXXFLAGS}"
CPPFLAGS="$CXXFLAGS"
LDFLAGS+=',-L/usr/lib/openssl-1.1,-L/usr/lib/courier-authlib -lcourierauth'
export CFLAGS CPPFLAGS CXXFLAGS LDFLAGS

CXXFLAGS is an Arch Linux thing set in makepkg.conf, which also happens to occur in Courier’s sources (not sure if with the same meaning). So I’m passing it through.

When I unset CXXFLAGS and also remove it from configure’s command line (while leaving the rest of the hack around), it makes no difference in terms of errors. The same errors happen with flags minimized like this (i.e. no defaults propagated from /etc/makepkg.conf):

CFLAGS=
CPPFLAGS='-I/usr/include/openssl-1.1'
LDFLAGS='-Wl,-L/usr/lib/openssl-1.1,-L/usr/lib/courier-authlib,-lcourierauth'
unset CXXFLAGS
export CFLAGS CPPFLAGS LDFLAGS

andrejpodzimek commented 1 year ago

strace has just revealed something (with -s 5000): Courier claims that a successfully read certificate file does not exist. This is my setup on the server:

pids=($(pidof couriertcpd))
strace -s 5000 -f "${pids[@]/#/-p}"
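The `"${pids[@]/#/-p}"` expansion is bash-specific: it prefixes every PID in the array with -p. For shells without arrays, a rough equivalent could look like this (the helper name `pid_flags` is made up for illustration):

```shell
#!/bin/sh
# Hypothetical helper: print one "-p<PID>" option per argument.
pid_flags() {
    for pid in "$@"; do
        printf '%s\n' "-p$pid"
    done
}

# Usage (word splitting of the command substitutions is intentional):
#   strace -s 5000 -f $(pid_flags $(pidof couriertcpd))
```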

This runs on the client (OpenSSL 3.0.7 on both sides):

openssl s_client -starttls imap -crlf -connect imap.somedomain.org:143

What I see is a process that successfully reads /etc/courier/imapd.pem.imap.somedomain.org twice in its entirety (4021 bytes), yet claims afterwards that the file does not exist:

[pid 45606] openat(AT_FDCWD, "/etc/courier/imapd.pem.imap.somedomain.org", O_RDONLY) = 6
[pid 45606] newfstatat(6, "", {st_mode=S_IFREG|0440, st_size=4021, ...}, AT_EMPTY_PATH) = 0
[pid 45606] read(6, "-----BEGIN CERTIFICATE-----\n >>> Server’s certificate is read here! <<< \n-----END CERTIFICATE-----\n-----BEGIN CERTIFICATE-----\n >>> CA’s certificate follows here! <<< \n-----END CERTIFICATE-----\n", 4096) = 4021
[pid 45606] read(6, "", 4096)           = 0
[pid 45606] close(6)                    = 0
[pid 45606] openat(AT_FDCWD, "/etc/courier/imapd.pem.imap.somedomain.org", O_RDONLY) = 6
[pid 45606] lseek(6, 0, SEEK_CUR)       = 0
[pid 45606] lseek(6, 0, SEEK_CUR)       = 0
[pid 45606] lseek(6, 0, SEEK_CUR)       = 0
[pid 45606] lseek(6, 0, SEEK_CUR)       = 0
[pid 45606] newfstatat(6, "", {st_mode=S_IFREG|0440, st_size=4021, ...}, AT_EMPTY_PATH) = 0
[pid 45606] lseek(6, 0, SEEK_SET)       = 0
[pid 45606] read(6, "-----BEGIN CERTIFICATE-----\n >>> Server’s certificate is read here! <<< \n-----END CERTIFICATE-----\n-----BEGIN CERTIFICATE-----\n >>> CA’s certificate follows here! <<< \n-----END CERTIFICATE-----\n", 4096) = 4021
[pid 45606] lseek(6, 4021, SEEK_SET)    = 4021
[pid 45606] read(6, "", 4096)           = 0
[pid 45606] newfstatat(7, "", {st_mode=S_IFIFO|0600, st_size=0, ...}, AT_EMPTY_PATH) = 0
[pid 45606] write(7, "ip=[2620:x:x:x:x:x:x:x], couriertls: /etc/courier/imapd.pem.imap.somedomain.org: No such file or directory\n", 121) = 121
[pid 45606] close(6)                    = 0

“No such file or directory” for something that has been successfully opened and read…? Next another process, the previous one’s parent over a pipe, generates an IMAP message with . NO STARTTLS failed: ...:

[pid 45604] write(1, ". NO STARTTLS failed: ip=[2620:x:x:x:x:x:x:x], couriertls: /etc/courier/imapd.pem.imap.somedomain.org: No such file or directory\r\n* NO Error in IMAP command received by server.\r\n", 192 <unfinished ...>

This↑ error does not occur on successful TLS connections, only on failed STARTTLS. Stating the obvious, the file exists, is a regular file, is readable for courier, openssl x509 can parse it and has never caused problems … until now.

$ sudo ls -al /etc/courier/imapd.pem{,.imap.somedomain.org}
lrwxrwxrwx 1 courier courier   27 Nov 24  2020 /etc/courier/imapd.pem -> imapd.pem.imap.somedomain.org
-r--r----- 1 courier courier 4021 Dec 16 03:12 /etc/courier/imapd.pem.imap.somedomain.org

$ { openssl x509 -text; openssl x509 -text; } < /etc/courier/imapd.pem.imap.somedomain.org
Certificate: ... both certificates are read just fine ...

What could be causing the bogus “No such file or directory” error?

(Because I’m unable to rebuild Courier with OpenSSL 1.1.1s (due to the ./configure problem mentioned earlier), this is the only debugging clue I have at the moment.)

svarshavchik commented 1 year ago

Error handling is a long-time design weakness in the OpenSSL API. When an OpenSSL API call fails, no specific error indication gets returned; rather, the application calls ERR_get_error to retrieve the last reported library error, and if no error code gets returned then the call must've failed due to a failed system call, so read errno.

However, if there's a failed API call but no error code gets logged and returned from ERR_get_error, a system error message gets mistakenly logged. Some prior syscall failed with ENOENT. errno never gets cleared automatically, so a misleading error then gets logged.

Additionally what sometimes happens is that the OpenSSL library changes some of its error codes, and applications that rely on specific error codes break.

The first time the certificate file gets read is by OpenSSL itself, when it gets installed into the SSL context. Courier's code also supports loading custom DH parameters from the PEM formatted file. If it's missing, the expected error code is PEM_R_NO_START_LINE. That's this code in libcouriertls.c:

            /*
            ** If the certificate file does not have DH parameters,
            ** swallow the error.
            */

            int err=ERR_peek_last_error();

            if (ERR_GET_LIB(err) == ERR_LIB_PEM
                && ERR_GET_REASON(err) == PEM_R_NO_START_LINE)
            {
                ERR_clear_error();
            }
            else
            {
                sslerror(info, filename, -1);
            }

But if this is where the erroneous error gets logged, then there should be a pending error code; this only peeks at the error and does not remove it from the error queue.

One way to test this hypothesis is to temporarily replace the call with something like:

   sslerror(info, "*** trap ***", -1);

and if this now gets logged instead of the filename then this must be the reason, and OpenSSL changed the error code again. Some extra work will need to be done in order to determine what the error code is, and update the code to check for it.

Another alternative would be to simply add your own DH parameters to the certificate file. There's an internal script in the package, mkdhparams:

TLS_DHPARAMS=/tmp/dhparams.pem mkdhparams

and the output can simply be concatenated to the certificate file.

Of course all of this presumes that the dh parameter load is the problem here.

andrejpodzimek commented 1 year ago

> Of course all of this presumes that the dh parameter load is the problem here.

Looks like it is. It is not obvious from the strace though. So, I have TLS_DHPARAMS=/etc/courier/dhparams.pem. That’s a regular file, owned and readable by courier:courier. It is regenerated monthly using openssl dhparam -out /etc/courier/dhparams.pem 4096. (TIL that -rand is not required any more.) (My keys have 4096 bits and it is recommended to pick an equal length here.) In strace that file is opened and successfully read ~3 times. No domain-specific name suffixes are probed, so I assume they need not exist; D-H parameters are “global”.

Anyhow: This fixes the problem:

cat /etc/courier/dhparams.pem >> /etc/courier/imapd.pem

TLS worked before and works now. STARTTLS was failing before (without the concatenation) and works again now.
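Running that cat command a second time would append a second copy of the parameters. A guarded variant could check for an existing block first. This is only a sketch; the helper name `append_dh_once` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: append DH parameters to a certificate file only
# if the file does not already contain a DH PARAMETERS block.
# $1 = DH parameters file, $2 = certificate file.
append_dh_once() {
    if grep -q -e '-----BEGIN DH PARAMETERS-----' "$2"; then
        return 0
    fi
    cat "$1" >> "$2"
}

# Usage:
#   append_dh_once /etc/courier/dhparams.pem /etc/courier/imapd.pem
```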

Phew. Thanks for the pointer. I would have never thought it could be something with the D-H parameters!

svarshavchik commented 1 year ago

TLS_DHPARAMS in the configuration file can be set to point to a discrete DH parameters file.

But it should, in theory, work without it. Something is not working right in OpenSSL. I'll try to reproduce this myself, and see if I can figure it out.

svarshavchik commented 1 year ago

This has now been fixed.

andrejpodzimek commented 1 year ago

A note on OpenSSL 3.0.8+ for future readers: The workaround must be removed. The trick that helped before (appending the dhparams to the certificate chain) will now cause all STARTTLS connections to fail + reset. Keeping the dhparams as a separate file again (as it should be) restores everything back to normal.
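For undoing the workaround, the appended block can be filtered out of the certificate file rather than edited by hand. A sketch follows, with a made-up helper name `strip_dh_block`; it writes to stdout so the result can be reviewed before replacing the real file.

```shell
#!/bin/sh
# Hypothetical helper: print a PEM file with any DH PARAMETERS block removed.
# $1 = certificate file that had the DH parameters appended.
strip_dh_block() {
    sed '/^-----BEGIN DH PARAMETERS-----/,/^-----END DH PARAMETERS-----/d' "$1"
}

# Usage:
#   strip_dh_block /etc/courier/imapd.pem > /tmp/imapd.pem.new
#   (inspect /tmp/imapd.pem.new, then move it into place)
```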

andrejpodzimek commented 8 months ago

Just upgraded from 1.3.2 to 1.3.4, which coincides with an OpenSSL upgrade from 3.1.1 to 3.1.4. The problem is back. :fearful: The symptoms are almost exactly the same.

In the IMAP case the error on s_client side is (sometimes):

4077B765FD7E0000:error:0A00010B:SSL routines:ssl3_get_record:wrong version number:ssl/record/ssl3_record.c:358:

In the ESMTP case the error in the logs is (always):

courieresmtpd: STARTTLS failed: ip=[::1], couriertls: /etc/courier/esmtpd.pem.smtp.my.domain: error:1E08010C:DECODER routines::unsupported

| Type | Command: openssl s_client -crlf … | Result | Note |
|---|---|---|---|
| IMAP + STARTTLS | -starttls imap -connect foo.my.domain:143 | WORKS | no certificate for subdomain |
| IMAP + TLS | -connect foo.my.domain:993 | WORKS | no certificate for subdomain |
| IMAP + STARTTLS | -starttls imap -connect imap.my.domain:143 | FAILS | certificate exists |
| IMAP + TLS | -connect imap.my.domain:993 | WORKS | certificate exists |
| ESMTP + STARTTLS | -starttls smtp -connect foo.my.domain:25 | WORKS | no certificate for subdomain |
| ESMTP + TLS | -connect foo.my.domain:465 | WORKS | no certificate for subdomain |
| ESMTP + STARTTLS | -starttls smtp -connect smtp.my.domain:25 | FAILS | certificate exists |
| ESMTP + TLS | -connect smtp.my.domain:465 | WORKS | certificate exists |

I’m going to retry the workaround, but it’s also possible that it won’t work any more and the problem is different…
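The hex codes in these messages can be split into the library and reason numbers that ERR_GET_LIB and ERR_GET_REASON would report. The sketch below assumes the OpenSSL 3.x packing (library in bits 23 to 30, reason in the low 23 bits); the helper name `decode_err` is made up.

```shell
#!/bin/sh
# Hypothetical helper: split an OpenSSL 3.x error code (hex digits, no 0x)
# into its library number and reason code.
# Assumes the OpenSSL 3 layout: lib = (code >> 23) & 0xFF,
# reason = code & 0x7FFFFF.
decode_err() {
    code=$((0x$1))
    printf 'lib=%d reason=0x%X\n' $(( (code >> 23) & 0xFF )) $(( code & 0x7FFFFF ))
}

decode_err 0A00010B   # the s_client "wrong version number" error
decode_err 1E08010C   # the "DECODER routines::unsupported" log entry
```

Under that assumption the two codes decode to library 20 and library 60, which should be the SSL and DECODER libraries named in the messages. The check in libcouriertls.c quoted earlier looks for ERR_LIB_PEM with PEM_R_NO_START_LINE. This is only arithmetic on the logged values; it does not show which call produced the error.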

andrejpodzimek commented 8 months ago

The workaround works! :smile:

The “FAILS” entries above are now working again after I appended the block from -----BEGIN DH PARAMETERS----- through -----END DH PARAMETERS----- to my certificate file.

All my configs have basically these files set:

TLS_CERTFILE=/etc/courier/esmtpd.pem
TLS_PRIVATE_KEYFILE=/etc/courier/esmtpd.key
TLS_DHPARAMS=/etc/courier/dhparams.pem

But as part of the workaround, the TLS_CERTFILE=/etc/courier/esmtpd.pem now looks like this:

-----BEGIN CERTIFICATE-----
<<< my certificate >>>
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
<<< LetsEncrypt’s intermediate certificate >>>
-----END CERTIFICATE-----
-----BEGIN DH PARAMETERS-----
<<< The stuff from /etc/courier/dhparams.pem >>>
-----END DH PARAMETERS-----

I’m not quite sure what’s happening here… Would you consider reopening this or should I file a new bug? Is there anything I can do to help with debugging? Can this be an upstream OpenSSL bug like last time?

For bisecting it would be nice to have a minimalistic thing that exhibits the problem but does nothing else. Like a tiny ping server based on couriertls. Can this be done? Can one run it with something like just four mkfifo pipes (plain, encrypted) × (input, output) to check the basics?

svarshavchik commented 8 months ago

I tried to reproduce this with OpenSSL 3.0.9 and I was unable to reproduce this. I'll keep trying.