piraeusdatastore / piraeus-operator

The Piraeus Operator manages LINSTOR clusters in Kubernetes.
https://piraeus.io/
Apache License 2.0

TLS for DRBD Replication doesn't work #614

Open retinio opened 6 months ago

retinio commented 6 months ago

Hi! I am trying to configure TLS for DRBD by following this manual. TLS for internal traffic is enabled:

kind: LinstorSatelliteConfiguration
spec:
  internalTLS:
    tlsHandshakeDaemon: true
    secretName: linstor-satellite-internal-tls

λ kubectl exec -n linstor deploy/linstor-controller -- linstor node list
+------------------------------------------------------------+
| Node      | NodeType  | Addresses                 | State  |
|============================================================|
| worker-01 | SATELLITE | 192.168.160.20:3367 (SSL) | Online |
| worker-02 | SATELLITE | 192.168.160.21:3367 (SSL) | Online |
| worker-03 | SATELLITE | 192.168.160.22:3367 (SSL) | Online |
+------------------------------------------------------------+

But the DRBD resources don't connect to each other:

λ kubectl exec -n linstor deploy/linstor-controller -- linstor r l
+---------------------------------------------------------------------------------------------------------------------+
| ResourceName                             | Node      | Port | Usage  | Conns                           | State      |
|=====================================================================================================================|
| pvc-4973e04e-44cf-49fe-9094-98dfbfda10d5 | worker-01 | 7000 | Unused | StandAlone(worker-03,worker-02) | UpToDate   |
| pvc-4973e04e-44cf-49fe-9094-98dfbfda10d5 | worker-02 | 7000 | Unused | StandAlone(worker-03,worker-01) | TieBreaker |
| pvc-4973e04e-44cf-49fe-9094-98dfbfda10d5 | worker-03 | 7000 | InUse  | StandAlone(worker-01,worker-02) | UpToDate   |
+---------------------------------------------------------------------------------------------------------------------+

The ktls-utils containers log errors:

λ kubectl -n linstor logs -l app.kubernetes.io/component=linstor-satellite -c ktls-utils

tlshd[29]: gnutls: The TLS connection was non-properly terminated. (-110)
tlshd[29]: Handshake with 'worker-02' (192.168.160.21) failed
tlshd[34]: gnutls: Error in the certificate. (-43)
tlshd[34]: Handshake with 'worker-03' (192.168.160.22) failed
tlshd[32]: gnutls: Error in the certificate. (-43)
tlshd[32]: Handshake with 'worker-02' (192.168.160.21) failed
tlshd[33]: gnutls: The TLS connection was non-properly terminated. (-110)
tlshd[33]: Handshake with 'worker-02' (192.168.160.21) failed
tlshd[35]: gnutls: The TLS connection was non-properly terminated. (-110)
tlshd[35]: Handshake with 'worker-03' (192.168.160.22) failed
tlshd[28]: gnutls: The TLS connection was non-properly terminated. (-110)
tlshd[28]: Handshake with 'worker-01' (192.168.160.20) failed
tlshd[32]: gnutls: Error in the certificate. (-43)
tlshd[32]: Handshake with 'worker-03' (192.168.160.22) failed

Piraeus Operator: 2.4.0
Host operating system: AlmaLinux 9, kernel 5.14.0-362.18.1.el9_3.x86_64
DRBD version: 9.2.7 (api:2/proto:86-122)

retinio commented 6 months ago

I enabled debug logging in tlshd.conf:

[debug]
loglevel=1
tls=1
nl=1

and got extended logs:

λ kubectl -n linstor logs linstor-satellite.worker-01-7wgmg -c ktls-utils

tlshd[7]: Built from ktls-utils 0.10 on Oct  4 2023 07:26:06
tlshd[7]: x.509 priority string: SECURE256:+SECURE128:-COMP-ALL:-VERS-ALL:+VERS-TLS1.3:%NO_TICKETS:-CIPHER-ALL:+AES-256-GCM:+CHACHA20-POLY1305:+AES-128-GCM:+AES-128-CCM
tlshd[7]: PSK priority string: SECURE256:+SECURE128:-COMP-ALL:-VERS-ALL:+VERS-TLS1.3:%NO_TICKETS:-CIPHER-ALL:+AES-256-GCM:+CHACHA20-POLY1305:+AES-128-GCM:+AES-128-CCM:+PSK:+DHE-PSK:+ECDHE-PSK
tlshd[9]: Querying the handshake service
tlshd[8]: Querying the handshake service
tlshd[9]: Parsing a valid netlink message
tlshd[9]: No peer identities found
tlshd[9]: No certificates found
tlshd[9]: System config file: /etc/gnutls/config
tlshd[8]: Parsing a valid netlink message
tlshd[9]: Client x.509 truststore is /etc/tlshd.d/ca.crt
tlshd[8]: No peer identities found
tlshd[8]: No certificates found
tlshd[8]: System config file: /etc/gnutls/config
tlshd[8]: Client x.509 truststore is /etc/tlshd.d/ca.crt
tlshd[11]: Querying the handshake service
tlshd[11]: Parsing a valid netlink message
tlshd[8]: System trust: Loaded 1 certificate(s).
tlshd[11]: No peer identities found
tlshd[11]: No certificates found
tlshd[11]: System config file: /etc/gnutls/config
tlshd[11]: Server x.509 truststore is /etc/tlshd.d/ca.crt
tlshd[8]: Retrieved x.509 certificate from /etc/tlshd.d/tls.crt
tlshd[11]: System trust: Loaded 1 certificate(s).
tlshd[8]: Retrieved private key from /etc/tlshd.d/tls.key
tlshd[11]: Retrieved x.509 certificate from /etc/tlshd.d/tls.crt
tlshd[10]: Querying the handshake service
tlshd[10]: Parsing a valid netlink message
tlshd[10]: No peer identities found
tlshd[10]: No certificates found
tlshd[10]: System config file: /etc/gnutls/config
tlshd[10]: Server x.509 truststore is /etc/tlshd.d/ca.crt
tlshd[10]: System trust: Loaded 1 certificate(s).
tlshd[10]: Retrieved x.509 certificate from /etc/tlshd.d/tls.crt
tlshd[10]: Retrieved private key from /etc/tlshd.d/tls.key
tlshd[8]: Server's trusted authorities:
tlshd[9]: System trust: Loaded 1 certificate(s).
tlshd[8]:    [0]: CN=linstor-internal-ca
tlshd[11]: Retrieved private key from /etc/tlshd.d/tls.key
tlshd[8]: The certificate is NOT trusted. The name in the certificate does not match the expected.
tlshd[8]: gnutls: Error in the certificate. (-43)
tlshd[8]: Handshake with 'worker-03' (192.168.160.22) failed
DBG<1>././lib/cache_mngt.c:302  nl_cache_mngt_unregister: Unregistered cache operations genl/family
tlshd[9]: Retrieved x.509 certificate from /etc/tlshd.d/tls.crt
tlshd[9]: Retrieved private key from /etc/tlshd.d/tls.key
tlshd[9]: Server's trusted authorities:
tlshd[9]:    [0]: CN=linstor-internal-ca
tlshd[9]: The certificate is NOT trusted. The name in the certificate does not match the expected.
tlshd[9]: gnutls: Error in the certificate. (-43)
tlshd[9]: Handshake with 'worker-02' (192.168.160.21) failed
DBG<1>././lib/cache_mngt.c:302  nl_cache_mngt_unregister: Unregistered cache operations genl/family
tlshd[10]: gnutls: The TLS connection was non-properly terminated. (-110)
tlshd[11]: gnutls: The TLS connection was non-properly terminated. (-110)
tlshd[10]: Handshake with 'worker-03' (192.168.160.22) failed
tlshd[11]: Handshake with 'worker-02' (192.168.160.21) failed
DBG<1>././lib/cache_mngt.c:302  nl_cache_mngt_unregister: Unregistered cache operations genl/family
DBG<1>././lib/cache_mngt.c:302  nl_cache_mngt_unregister: Unregistered cache operations genl/family

λ kubectl -n linstor get secret linstor-satellite-internal-tls -o jsonpath="{.data['tls.crt']}" | base64 -d > tls.crt
λ kubectl -n linstor get secret linstor-satellite-internal-tls -o jsonpath="{.data['ca.crt']}" | base64 -d > ca.crt
λ openssl verify -CAfile ca.crt tls.crt
tls.crt: OK

WanzenBug commented 6 months ago

Looks like you used the "openssl" method from here to create those certificates?

If so, the issue is that those certificates only set a generic common name:

openssl req -new -sha256 -key satellite.key -subj "/CN=linstor-satellite" -out satellite.csr

So with strict validation, this certificate is only valid for an entity named linstor-satellite. For LINSTOR itself this is fine, as we don't do strict hostname validation there. For tlshd, however, it means that when it sees a DRBD connection for worker-01 but gets a certificate for linstor-satellite, it fails the validation.
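To illustrate the difference, here is a minimal sketch (not taken from the Piraeus docs) that builds a throwaway CA and a CN-only certificate in a temporary directory, then runs openssl verify twice: once checking only the chain, like the check earlier in this thread, and once with -verify_hostname, which roughly approximates the name check tlshd performs. The file names and the 1-day validity are arbitrary assumptions, and since tlshd uses GnuTLS rather than OpenSSL, the exact matching rules may differ.

```shell
#!/bin/sh
# Throwaway demo in a temp dir; file names and validity period are illustrative only.
set -eu
dir=$(mktemp -d)
cd "$dir"

# Self-signed CA, standing in for linstor-internal-ca.
openssl req -x509 -newkey rsa:2048 -nodes -keyout ca.key -out ca.crt \
  -subj "/CN=linstor-internal-ca" -days 1 2>/dev/null

# Satellite certificate with only a generic common name and no subjectAltName.
openssl req -new -newkey rsa:2048 -nodes -keyout satellite.key \
  -subj "/CN=linstor-satellite" -out satellite.csr 2>/dev/null
openssl x509 -req -in satellite.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
  -out satellite.crt -days 1 -sha256 2>/dev/null

# Chain-only check: is the certificate signed by the CA?
if openssl verify -CAfile ca.crt satellite.crt >/dev/null 2>&1; then
  chain=ok
else
  chain=fail
fi

# Chain plus hostname check: is the certificate also valid for the name worker-01?
if openssl verify -CAfile ca.crt -verify_hostname worker-01 satellite.crt >/dev/null 2>&1; then
  host=ok
else
  host=fail
fi

echo "chain=$chain hostname=$host"
```

If the first check succeeds while the second reports a mismatch, the certificate is trusted but not valid for that node name, which is consistent with the tlshd message "The name in the certificate does not match the expected."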

You either need to manually add all the node names to the alternative names in the certificates:

openssl req -new -sha256 -key satellite.key -subj "/CN=linstor-satellite" -out satellite.csr -addext "subjectAltName = DNS:linstor-satellite,DNS:worker-01,DNS:worker-02,DNS:worker-03"
openssl x509 -req -in satellite.csr -CA ca.crt -CAkey ca.key -CAcreateserial -out satellite.crt -days 3650 -sha256 -copy_extensions copy

Or use cert-manager and get all of that automatically :smile:

retinio commented 6 months ago

@WanzenBug Thank you sooo much! Everything worked out.

This might interest you: if I use the newest version of ktls-utils (0.10-6), the connection error still persists.

λ kubectl -n linstor logs -l app.kubernetes.io/component=linstor-satellite -c ktls-utils

tlshd[12]: No peer identities found
tlshd[12]: No certificates found
tlshd[12]: System config file: /etc/gnutls/config
tlshd[12]: Server x.509 truststore is /etc/tlshd.d/ca.crt
tlshd[12]: System trust: Loaded 1 certificate(s).
tlshd[12]: Retrieved x.509 certificate from /etc/tlshd.d/tls.crt
tlshd[12]: Retrieved private key from /etc/tlshd.d/tls.key
tlshd[11]: System trust: Loaded 140 certificate(s).
tlshd[11]: Handshake with 'worker-02' (10.0.4.171) failed
DBG<1>././lib/cache_mngt.c:302  nl_cache_mngt_unregister: Unregistered cache operations genl/family
tlshd[11]: Server x.509 truststore is /etc/tlshd.d/ca.crt
tlshd[11]: System trust: Loaded 1 certificate(s).
tlshd[11]: Retrieved x.509 certificate from /etc/tlshd.d/tls.crt
tlshd[11]: Retrieved private key from /etc/tlshd.d/tls.key
tlshd[9]: System trust: Loaded 140 certificate(s).
tlshd[9]: Handshake with 'worker-01' (10.0.3.154) failed
DBG<1>././lib/cache_mngt.c:302  nl_cache_mngt_unregister: Unregistered cache operations genl/family
tlshd[10]: System trust: Loaded 140 certificate(s).
tlshd[10]: Handshake with 'worker-03' (10.0.5.212) failed
DBG<1>././lib/cache_mngt.c:302  nl_cache_mngt_unregister: Unregistered cache operations genl/family
tlshd[10]: Retrieved x.509 certificate from /etc/tlshd.d/tls.crt
tlshd[10]: Retrieved private key from /etc/tlshd.d/tls.key
tlshd[11]: Querying the handshake service
tlshd[11]: Parsing a valid netlink message
tlshd[11]: No peer identities found
tlshd[11]: No certificates found
tlshd[11]: System config file: /etc/gnutls/config
tlshd[11]: System trust: Loaded 140 certificate(s).
tlshd[11]: Handshake with 'worker-02' (10.0.4.171) failed
DBG<1>././lib/cache_mngt.c:302  nl_cache_mngt_unregister: Unregistered cache operations genl/family

WanzenBug commented 6 months ago

I'm wondering why it would try to load the system trust store:

tlshd[11]: System trust: Loaded 140 certificate(s).

But sometimes it loads the right certificates instead:

tlshd[12]: Server x.509 truststore is /etc/tlshd.d/ca.crt
tlshd[12]: System trust: Loaded 1 certificate(s).