nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0
15.68k stars 1.39k forks

Improve TLS handshake error message #1203

Open JensRantil opened 4 years ago

JensRantil commented 4 years ago

Defects

I'm running NATS server 2.1.2. I'm using TLS between my NATS brokers (port 6222) as well as TLS from clients to brokers (port 4222). Additionally, I have added an http: localhost:8222 config stanza to expose metrics.
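
For context, the relevant config stanzas look roughly like this (a minimal sketch; the certificate paths are illustrative, not my actual layout):

# client port with TLS
port: 4222
tls {
  cert_file: "./certs/server-cert.pem"
  key_file:  "./certs/server-key.pem"
}

# cluster (broker-to-broker) port with TLS
cluster {
  port: 6222
  tls {
    cert_file: "./certs/cluster-cert.pem"
    key_file:  "./certs/cluster-key.pem"
  }
}

# HTTP monitoring endpoint for metrics
http: localhost:8222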

My NATS servers are outputting

...
[1] 2019/11/26 14:52:29.877977 [ERR] X.X.X.X:48306 - cid:48260 - TLS handshake error: EOF
[1] 2019/11/26 14:52:30.171968 [ERR] X.X.X.X:39272 - cid:48261 - TLS handshake error: EOF
[1] 2019/11/26 14:52:30.284848 [ERR] X.X.X.X:19473 - cid:48262 - TLS handshake error: EOF
[1] 2019/11/26 14:52:30.443337 [ERR] X.X.X.X:30436 - cid:48263 - TLS handshake error: EOF
[1] 2019/11/26 14:52:30.716426 [ERR] X.X.X.X:6808 - cid:48264 - TLS handshake error: EOF
[1] 2019/11/26 14:52:30.872010 [ERR] X.X.X.X:34951 - cid:48265 - TLS handshake error: EOF
[1] 2019/11/26 14:52:31.293905 [ERR] X.X.X.X:14828 - cid:48266 - TLS handshake error: EOF
[1] 2019/11/26 14:52:31.576071 [ERR] X.X.X.X:48940 - cid:48267 - TLS handshake error: EOF
...

I'm pretty sure I have identified the culprit as a load balancer pinging my NATS instances (on port 4222). However, it would be very helpful if the error logs said which port the TLS handshake failed on, to make errors like this much easier to debug. Consider this a feature request.

JensRantil commented 4 years ago

Also, should a TCP ping (from a load balancer) really be an ERR? In my case, I'd love to suppress the message as it really doesn't require any action from me.

derekcollison commented 4 years ago

The load balancer is most likely a layer 7 router with health checks. These are protocol aware and will not work with the NATS protocol.

If you run the server temporarily with the -DV flag, you will see more information.
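
For example (a sketch; the config file path is illustrative):

# -D enables debug logging, -V enables protocol tracing
nats-server -c /etc/nats/nats.conf -DV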

JnMik commented 4 years ago

Oh, thanks for pointing out that it could be the load balancer; that's exactly my issue as well. I removed the target groups and the errors stopped. I'm not sure how to fix this, or whether there is an alternative way to expose it, since NATS is running as containers in my private Kubernetes cluster.

derekcollison commented 4 years ago

If you can configure the load balancer's health checks, run the nats-server with monitoring turned on and have the health check hit that HTTP endpoint instead.

https://docs.nats.io/nats-server/configuration/monitoring#enabling-monitoring-from-the-command-line

These can be plain HTTP or TLS, up to you.
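
For example (a minimal sketch using the defaults from those docs):

# enable the HTTP monitoring endpoint on port 8222 (same as "http: 8222" in the config file)
nats-server -m 8222

# the load balancer health check can then probe the monitoring port, e.g.
curl http://localhost:8222/varz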

derekcollison commented 4 years ago

But again, in general you want to use host ports and allow direct access to the NATS servers, avoiding ingress proxies etc.

JnMik commented 4 years ago

Thanks for the quick reply.

So you would suggest running a layer 7 load balancer (AWS ALB) instead of layer 4 (AWS NLB)? Because I read somewhere else that NATS had issues with layer 7 and that it would be preferable to run with layer 4 (https://github.com/nats-io/nats-server/issues/291). But as far as I know, layer 4 health checks cannot hit a specific route (what you suggest).

JnMik commented 4 years ago

I feel like NATS has been developed with deployment on auto-scaling groups in mind, and not really targeting private Kubernetes cluster hosting.

I still have time to switch back to that route and learn the BOSH CLI, if you think it would be less burdensome. But if plenty of people have success running it as containers in a private cluster, I might as well just continue that way.

Just wondering what's your thought on this.

(Edit: Sorry if that's a bit far from the original thread subject here, I think somehow that it's kinda related anyway hehe )

derekcollison commented 4 years ago

I would not put any LB in between NATS clients and servers. I would just use DNS with multiple A records or a list of servers in the client. NATS handles all of that for you, and better than the LBs do.

NATS protocol was designed ~10yrs ago, way before k8s was on the scene ;)

derekcollison commented 4 years ago

@wallyqs may have some helpful hints too.

JnMik commented 4 years ago

So deploying NATS on dedicated VMs would be the way to go, as I don't plan to expose my Kubernetes nodes publicly. Just installed the BOSH CLI, we'll see how it goes :)

derekcollison commented 4 years ago

We install NATS servers in k8s all the time; we just allow direct access via host port config and avoid clients going through the ingress controller or any other proxy/LB.

@wallyqs can show you some more details.

wallyqs commented 4 years ago

@JnMik the most reliable way to deploy NATS on K8S right now is to use host ports and then expose the public IPs of the kubelets for external access. The external-dns component can then help by dynamically registering the public IPs in DNS records, based on the headless service that represents the NATS server nodes.
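
A minimal sketch of the host port part (a bare Pod for illustration; in practice this would be a StatefulSet, and the image tag is just an example):

apiVersion: v1
kind: Pod
metadata:
  name: nats-0
  labels:
    app: nats
spec:
  containers:
    - name: nats
      image: nats:2.1.2
      ports:
        - name: client
          containerPort: 4222
          hostPort: 4222   # client port exposed directly on the node's IP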

JnMik commented 4 years ago

Hello @wallyqs

OK, I'll look up external-dns and the kubelet public IP thing and see if I can figure it out. Thanks!

JBHarvey commented 4 years ago

Hi @wallyqs

When you use external-dns in this scenario, do you hook it directly to the NATS server, or do you need some kind of nginx-ingress-controller between the NATS service and external-dns?

(i.e. this example: https://github.com/kubernetes-sigs/external-dns/blob/master/docs/tutorials/public-private-route53.md , with nginx-ingress then pointing to NATS?)

JnMik commented 4 years ago

Considering a Kubernetes cluster where all nodes have private IPs, is there really a way to expose a service with a public IP without using type=LoadBalancer? I tried some things and it feels like the answer is no.

The ingress controller seems like an alternative to the AWS layer 4 LB, so maybe we could specify a different health check in the ingress controller that won't pollute the logs. But the ingress controller will be exposed via a LoadBalancer anyway.

External-DNS could then update the DNS record using the ingress controller's load balancer IP.

That's the only way I can think of.

JnMik commented 4 years ago

I tried exposing the NATS monitoring port like this (an attempt to put a public IP directly on a service in a private Kubernetes cluster), but it's just unreachable.

resource "kubernetes_service" "nats-expose-monitor-public" {
  metadata {
    name = "nats-expose-monitor-public"
    namespace = "default"
    labels = {
      app = "nats"
    }
  }

  spec {

    selector = {
      env = var.env
      app = "nats"
    }

    external_ips = [
      aws_eip.nats-0-ip.public_ip
    ]

    port {
        protocol = "TCP"
        port = <some-port>
        target_port = 8222
    }

  }
}

resource "aws_eip" "nats-0-ip" {
  vpc = true
  tags = {Name = "bla bla bla"}
}

kubectl get service nats-expose-monitor-public
NAME                          TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)
nats-expose-monitor-public    ClusterIP   <private-ip>   <public-ip>   <some-port>/TCP

JnMik commented 4 years ago

If I add some public Kubernetes nodes (public subnets) to my cluster and assign them IPs, my cluster becomes hybrid private/public and I can reach the "". Not sure it's really good security-wise though.

JnMik commented 4 years ago

hey @wallyqs, now that I think of it, when using a Kubernetes NodePort service, the port is exposed on all nodes and requests to NATS are already randomly balanced across the pods of the StatefulSet. So having a load balancer in front of it doesn't really change a thing, right?

(Edit: I refer to this image only to prove my point that the node port is actually balanced between nodes, but the VMs should be inside the Kubernetes cluster; the drawing is weirdly done.)

wallyqs commented 4 years ago

@JBHarvey we need docs covering external-dns and this approach (working on that...), but basically when you create a cluster in AWS with eksctl, for example, it creates nodes that have a public IP available by default:

# Create a 3-node Kubernetes cluster
eksctl create cluster --name nats-k8s-cluster \
  --nodes 3 \
  --node-type=t3.large \
  --region=eu-west-1

# Get the credentials for your cluster
eksctl utils write-kubeconfig --name $YOUR_EKS_NAME --region eu-west-1

After that is done, you get a set of 3 nodes from the example above:

 kubectl get nodes -o wide
NAME                                           STATUS   ROLES    AGE    VERSION   INTERNAL-IP      EXTERNAL-IP     OS-IMAGE         KERNEL-VERSION                  CONTAINER-RUNTIME
ip-192-168-10-213.us-east-2.compute.internal   Ready    <none>   124d   v1.12.7   192.168.10.213   3.17.184.16     Amazon Linux 2   4.14.123-111.109.amzn2.x86_64   docker://18.6.1
ip-192-168-45-209.us-east-2.compute.internal   Ready    <none>   124d   v1.12.7   192.168.45.209   18.218.52.122   Amazon Linux 2   4.14.123-111.109.amzn2.x86_64   docker://18.6.1
ip-192-168-65-15.us-east-2.compute.internal    Ready    <none>   124d   v1.12.7   192.168.65.15    3.15.38.138     Amazon Linux 2   4.14.123-111.109.amzn2.x86_64   docker://18.6.1

Then you can deploy NATS and create a headless service named nats which will represent the NATS Server nodes:

kubectl get svc nats -o wide
NAME   TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                                                 AGE   SELECTOR
nats   ClusterIP   None         <none>        4222/TCP,6222/TCP,8222/TCP,7777/TCP,7422/TCP,7522/TCP   36d   app=nats
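
A minimal sketch of such a headless service (ports taken from the output above; the port names are assumptions):

apiVersion: v1
kind: Service
metadata:
  name: nats
  labels:
    app: nats
spec:
  clusterIP: None   # headless: each NATS pod gets its own DNS entry instead of a virtual IP
  selector:
    app: nats
  ports:
  - name: client
    port: 4222
  - name: cluster
    port: 6222
  - name: monitor
    port: 8222
  - name: metrics
    port: 7777
  - name: leafnodes
    port: 7422
  - name: gateways
    port: 7522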

Once external-dns is deployed, you currently have to use a NodePort service, something like the following, to keep the nodes mapped by external-dns to the ones from the headless service:

apiVersion: v1
kind: Service
metadata:
  name: nats-nodeport
  labels:
    app: nats
  annotations:
    external-dns.alpha.kubernetes.io/hostname: nats.example.com
spec:
  type: NodePort
  selector:
    app: nats
  externalTrafficPolicy: Local
  ports:
  - name: client
    port: 4222
    nodePort: 30222 #  Arbitrary port to represent the external dns service, external-dns issue...
    targetPort: 4222  # NOTE: the NATS pods also use host ports

The external-dns process would be responsible for registering the public IPs from the nodes to be served at nats.example.com.

wallyqs commented 4 years ago

@JnMik that's right, another way to go would be to use a NodePort and have K8S do the load balancing. I think this goes through K8S iptables rules, so just keep that in mind, but it would work around the limitations of using a load balancer ingress for NATS, which basically prevents using TLS connections and affects performance as well.

Besides having to use a high port from the nodeport range, one other inconvenience of using a NodePort is that the client advertisements from NATS would be those of the internal IP addresses from the K8S network, so a client that connects externally through the nodeport would try to reconnect to a private IP. To work around that issue you could disable advertisements (--no_advertise flag) and then let the clients reconnect to the public IP of the nodeport, using the high port from the nodeport range.
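
A minimal sketch of that option in the server config (equivalent to the flag above):

cluster {
  port: 6222
  no_advertise: true   # do not advertise the internal cluster member IPs to clients
}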

JnMik commented 4 years ago

So we tried external-dns this afternoon. I should say first that it was on a hybrid EKS cluster (some nodes in a private subnet, some nodes in a public subnet to host the NATS containers).

When using a headless service, external-dns was creating 3 records in our Route53: nats-1.xxx.com, nats-2.xxx.com, nats-3.xxx.com. However, the A records were pointing to the nodes' private IPs! So it wasn't working.

If I turn the service into a NodePort service, external-dns creates only 1 record, nats.xxx.com, and the 3 public IPs are in the A record.

Do you think nats will have any issues if we use 1 record with 3 IPs in it balanced ?

thanks @wallyqs for your time

JnMik commented 4 years ago

I'm thinking of moving the NATS containers to a separate cluster with only nodes with private IPs, instead of having a hybrid one.

I suppose the headless service would then generate 3 records pointing to public IPs (hopefully?)

Could be an idea.

wallyqs commented 4 years ago

Do you think nats will have any issues if we use 1 record with 3 IPs in it balanced ?

Thanks for sharing @JnMik. I don't see an issue with this setup as long as the ExternalIP metadata is present in the Kubernetes cluster (that is, if both INTERNAL-IP and EXTERNAL-IP are displayed when executing kubectl get nodes -o wide). If the external IP metadata is on the node, then the servers will be able to advertise the other live public IPs that are part of the cluster, use those for reconnecting and failover right away, and avoid the extra DNS lookup. The NATS clients also get a list of the IPs when connecting and pick one randomly, so clients should be distributed evenly as well.

We do something similar with the connect.ngs.global service that Synadia offers; for example, the nodes available at the hostname uswest2.aws.ngs.global right now for me are:

dig uswest2.aws.ngs.global
...
;; ANSWER SECTION:
uswest2.aws.ngs.global. 60  IN  A   54.202.186.240
uswest2.aws.ngs.global. 60  IN  A   35.166.100.73
uswest2.aws.ngs.global. 60  IN  A   44.228.141.181

And if I nc or telnet against the client port I get the rest of the cluster members:

telnet uswest2.aws.ngs.global 4222
INFO {...,"cluster":"aws-uswest2","connect_urls":["35.166.100.73:4222","44.228.141.181:4222","54.202.186.240:4222"]} 

In order to enable these advertisements, we use the following init container, which has some extra Kubernetes policy so it can look up the public IP of the kubelet where it is running: https://github.com/nats-io/k8s/blob/master/nats-server/nats-server-with-auth-and-tls.yml#L132-L153 And we have the server load that file from the config via an emptyDir volume: https://github.com/nats-io/k8s/blob/master/nats-server/nats-server-with-auth-and-tls.yml#L54
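
Roughly, the init container writes an include file containing the node's public IP, which the main server config then picks up (a sketch; the file path and placeholder IP are illustrative):

# written by the init container, e.g. into advertise/client_advertise.conf
client_advertise = "<node-public-ip>:4222"

# and in the main server config:
include "advertise/client_advertise.conf"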

JnMik commented 4 years ago

Hey @wallyqs

I managed to get the "advertise connect_urls" working, with the initContainer and all the stuff regarding the advertise config. Great help! Have a nice day.

atmosx commented 1 year ago

We built a tool called casper-3 at Gather Inc. to handle DNS registration for applications running in hostMode on specific node pools. It supports registration for pods (mostly used with StatefulSets) and nodes (mostly used with Deployments). The tool is tailored to our needs - we're a rather small team - but it is open source and fairly straightforward if you're familiar with Go. It supports CloudFlare and DigitalOcean as DNS providers, but adding Route53 (or whatever you use) shouldn't be too complicated.

dan-connektica commented 1 month ago

I find it quite strange that the recommended way to run NATS in Kubernetes requires public Kubernetes nodes, which goes against cloud providers' recommendations due to the security implications of having public instances. We are using a layer 4 NLB with AWS EKS to expose our NATS cluster, and I haven't yet found a good way to prevent the TLS handshake errors from the TCP health checks.

derekcollison commented 1 month ago

@dan-connektica NATS does not require security perimeter models or load balancers to work properly and securely. Remember that NATS can run anywhere, not just in a cloud provider, and specifically out at the edge.

That being said, setting up health checks should be fairly straightforward and is not really specific to a NATS system. The health check needs to be TLS aware, and if you are requiring client-side certs it would need those as well.

wallyqs commented 1 month ago

@dan-connektica you should be able to customize the probe as follows to avoid those errors, for example:

metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-name: nats-nlb
    service.beta.kubernetes.io/aws-load-balancer-type: external
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: http
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: "8222"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: "/"
spec:
  type: LoadBalancer

dan-connektica commented 1 month ago

@wallyqs Thanks, adding the monitoring port as the healthcheck port did the trick!