nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

TLS Route handshake failure for OpenShift route #6143

Open swb2-izu-ssp opened 3 days ago

swb2-izu-ssp commented 3 days ago

Observed behavior

Hello,

I am trying to set up the stretch cluster pattern described in https://www.synadia.com/blog/multi-cluster-consistency-models/

To do so, servers spread across several PaaS platforms must be joined into a single cluster.

The solution I have found so far is to use OpenShift routes.

Before using OpenShift routes, I deployed the servers in different namespaces on the same PaaS and used only the headless service for the cluster route URLs.

 routes: 
  - tls://nats.nats-stretch-1.svc.cluster.local:6222
  - tls://nats.nats-stretch-2.svc.cluster.local:6222
  - tls://nats.nats-stretch-3.svc.cluster.local:6222

This works fine.

I then tried to extend this with OpenShift routes (TLS passthrough termination, forwarding to port 6222) for remote connectivity only (locally I still use the pod + headless service addresses), keeping the same setup: different namespaces on the same PaaS.

 routes: 
  - tls://stretch-cluster-0.nats.nats-stretch-3.svc.cluster.local:6222
  - tls://stretch-cluster-1.nats.nats-stretch-3.svc.cluster.local:6222
  - tls://stretch-cluster-2.nats.nats-stretch-3.svc.cluster.local:6222
  - tls://nats-nats-stretch-1.mydomain.com:443
  - tls://nats-nats-stretch-2..mydomain.com:443

I started having TLS issues because the server ends up connecting by IP address rather than by the hostname I provided:

[7] 2024/11/18 16:33:21.883403 [DBG] Attempting reconnect for solicited route "nats-route://IP_A:6222/"
[7] 2024/11/18 16:33:21.887386 [DBG] IP_A:6222 - rid:15540 - Starting TLS route client handshake
[7] 2024/11/18 16:33:21.889737 [ERR] IP_A:6222 - rid:15540 - TLS route handshake error: tls: failed to verify certificate: x509: cannot validate certificate for IP_A because it doesn't contain any IP SANs
[7] 2024/11/18 16:33:21.889766 [INF] IP_A:6222 - rid:15540 - Router connection closed: TLS Handshake Failure

So it seems that, with OpenShift routes, the tlsName is not passed for TLS validation, and of course my certificate does not contain IP addresses. I suspect this piece of code: https://github.com/nats-io/nats-server/blob/main/server/client.go#L5905 But this was only a quick, cursory check on my part.
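The error in the log is the one Go's standard `crypto/x509` package returns when a certificate that carries only DNS SANs is checked against an IP address. The following is a minimal sketch of that behavior in isolation, not the nats-server code path. The helper `newDNSOnlyCert` and the address `192.0.2.10` are hypothetical and exist only for illustration.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// newDNSOnlyCert (hypothetical helper) builds a self-signed certificate
// whose only subject alternative name is the given DNS name. It has no IP SANs.
func newDNSOnlyCert(dnsName string) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: dnsName},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(time.Hour),
		DNSNames:     []string{dnsName},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

func main() {
	cert, err := newDNSOnlyCert("nats-nats-stretch-1.mydomain.com")
	if err != nil {
		panic(err)
	}
	// Checking against the hostname listed in the SANs is expected to succeed.
	fmt.Println("by hostname:", cert.VerifyHostname("nats-nats-stretch-1.mydomain.com"))
	// Checking against an IP address is expected to fail, because the
	// certificate contains no IP SANs (192.0.2.10 is a placeholder address).
	fmt.Println("by IP:", cert.VerifyHostname("192.0.2.10"))
}
```

This only shows that the certificate itself is fine for the hostname and that the failure comes from the name being verified, which matches the reconnect attempts to `IP_A:6222` in the log.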

Expected behavior

I would expect this to work as well, just as it does with the headless service.

Server and client version

Server: 2.10.21

Host environment

Linux

Steps to reproduce

No response

wallyqs commented 2 days ago

This is a side effect of the way cluster discovery works; the error will still show in the logs but should fade out after some time. The important part is to make sure that all the nodes have the same extra routes to avoid partitions. If you change the config maps of both clusters to include the explicit routes and then issue a config reload, the mesh should form.
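One way to check the "same extra routes" condition before reloading is to diff each server's route list against the union of all lists. This is a rough sketch under the assumption that every server should list every externally reachable route; the function `missingRoutes` and the hostnames in `main` are hypothetical and not part of nats-server.

```go
package main

import (
	"fmt"
	"sort"
)

// missingRoutes (hypothetical helper) returns, for each node, the routes that
// appear in at least one other node's list but not in its own.
func missingRoutes(routes map[string][]string) map[string][]string {
	// Collect every route URL seen on any node.
	union := map[string]struct{}{}
	for _, list := range routes {
		for _, r := range list {
			union[r] = struct{}{}
		}
	}
	missing := map[string][]string{}
	for node, list := range routes {
		have := map[string]struct{}{}
		for _, r := range list {
			have[r] = struct{}{}
		}
		for r := range union {
			if _, ok := have[r]; !ok {
				missing[node] = append(missing[node], r)
			}
		}
		// Sort for stable output; map iteration order is random in Go.
		sort.Strings(missing[node])
	}
	return missing
}

func main() {
	// Placeholder hostnames, used only to illustrate the check.
	routes := map[string][]string{
		"part-1": {"tls://part-2.example.com:443", "tls://part-3.example.com:443"},
		"part-2": {"tls://part-3.example.com:443"},
	}
	for node, m := range missingRoutes(routes) {
		fmt.Println(node, "is missing", m)
	}
}
```

A node with an empty result lists every route known to the others. Whether a server should also list the route that points back at its own side is a topology decision this sketch does not make.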

swb2-izu-ssp commented 13 hours ago

Hello @wallyqs Many thanks for your answer.

The config maps of the 3 parts look like this:

PART 1

 routes: 
  - tls://stretch-cluster-0.nats.nats-stretch-1.svc.cluster.local:6222
  - tls://stretch-cluster-1.nats.nats-stretch-1.svc.cluster.local:6222
  - tls://stretch-cluster-2.nats.nats-stretch-1.svc.cluster.local:6222
  - tls://nats-nats-stretch-2.mydomain.com:443
  - tls://nats-nats-stretch-3..mydomain.com:443

PART 2

 routes: 
  - tls://stretch-cluster-0.nats.nats-stretch-2.svc.cluster.local:6222
  - tls://stretch-cluster-1.nats.nats-stretch-2.svc.cluster.local:6222
  - tls://stretch-cluster-2.nats.nats-stretch-2.svc.cluster.local:6222
  - tls://nats-nats-stretch-1.mydomain.com:443
  - tls://nats-nats-stretch-3..mydomain.com:443

PART 3

 routes: 
  - tls://stretch-cluster-0.nats.nats-stretch-3.svc.cluster.local:6222
  - tls://stretch-cluster-1.nats.nats-stretch-3.svc.cluster.local:6222
  - tls://stretch-cluster-2.nats.nats-stretch-3.svc.cluster.local:6222
  - tls://nats-nats-stretch-1.mydomain.com:443
  - tls://nats-nats-stretch-2..mydomain.com:443

As a matter of fact, I made it work by disabling the no_advertise key. So in a configuration where I deploy the 3 namespaces on the same PaaS, this works.

But as soon as I deployed the 3 namespaces on 3 different PaaS platforms, the cluster became unstable.

The output varies each time I run the nats server list command. What am I missing?

Nicolas