nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

TLS handshake errors on nats container in K8 #5124

Open amit12cool opened 9 months ago

amit12cool commented 9 months ago

Observed behavior

I have a publisher publishing messages to a NATS server. The publisher is an Azure Function that publishes messages at regular intervals.

I see these errors on the publisher:

[2024-02-23T18:10:33.016Z] nats: encountered error
Traceback (most recent call last):
  File "C:\Users\abcd\Code\Veli-IoT-System\message-ingestor\.venv\lib\site-packages\nats\aio\client.py", line 714, in _close
    await self._transport.wait_closed()
  File "C:\Users\abcd\Code\Veli-IoT-System\message-ingestor\.venv\lib\site-packages\nats\aio\transport.py", line 180, in wait_closed
    return await self._io_writer.wait_closed()
  File "C:\Users\abcd\AppData\Local\Programs\Python\Python39\lib\asyncio\streams.py", line 359, in wait_closed
    await self._protocol._get_close_waiter(self)
  File "C:\Users\abcd\AppData\Local\Programs\Python\Python39\lib\asyncio\sslproto.py", line 534, in data_received
    ssldata, appdata = self._sslpipe.feed_ssldata(data)
  File "C:\Users\abcd\AppData\Local\Programs\Python\Python39\lib\asyncio\sslproto.py", line 206, in feed_ssldata
    self._sslobj.unwrap()
  File "C:\Users\abcd\AppData\Local\Programs\Python\Python39\lib\ssl.py", line 949, in unwrap
    return self._sslobj.shutdown()
ssl.SSLError: [SSL: APPLICATION_DATA_AFTER_CLOSE_NOTIFY] application data after close notify (_ssl.c:2756)
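For context, a minimal sketch of how such a publisher typically connects over TLS with nats-py; the server URL, certificate paths, subject, and interval below are placeholders for illustration, not values taken from this issue:

```python
import asyncio
import ssl

import nats


async def main():
    # Build a TLS context; paths here are placeholders.
    tls_ctx = ssl.create_default_context(purpose=ssl.Purpose.SERVER_AUTH)
    tls_ctx.load_verify_locations("rootCA-playground.crt")
    # If the server enforces client certificates (verify: true), the client
    # must also present its own cert (placeholder paths):
    tls_ctx.load_cert_chain(certfile="client.crt", keyfile="client.key")

    # nats-py accepts an ssl.SSLContext via the `tls` option.
    nc = await nats.connect(servers=["tls://my-nats.example.com:4222"], tls=tls_ctx)

    # Publish at a regular interval, roughly what the Azure Function does.
    for _ in range(10):
        await nc.publish("devices.telemetry", b"payload")
        await asyncio.sleep(5)

    await nc.drain()


if __name__ == "__main__":
    asyncio.run(main())
```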

I see errors in the NATS server containers as below:

[7] 2024/02/23 18:14:27.845444 [ERR] 10.240.0.10:29783 - cid:206 - TLS handshake error: read tcp 10.244.6.44:4222->10.240.0.10:29783: i/o timeout
[7] 2024/02/23 18:14:28.144152 [ERR] 10.240.0.10:29783 - cid:206 - TLS handshake timeout
[7] 2024/02/23 18:14:28.806477 [ERR] 10.240.0.12:39668 - cid:207 - TLS handshake error: read tcp 10.244.6.44:4222->10.240.0.12:39668: i/o timeout
[7] 2024/02/23 18:14:31.045227 [ERR] 10.240.0.14:17010 - cid:208 - TLS handshake error: read tcp 10.244.6.44:4222->10.240.0.14:17010: i/o timeout

Expected behavior

There should be no TLS errors.

Server and client version

NATS server version: 2.10.4-alpine. Client: nats-py==2.6.0

Host environment

Kubernetes (K8s)

Steps to reproduce

TLS auth

wallyqs commented 9 months ago

I think you may have to bump the default tls { timeout = X } and authorization { timeout = X } for your deployment. Have you tried tuning those?
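For illustration, a minimal sketch of what those two settings look like in a nats-server configuration file; the paths and the 5-second values are placeholders, not recommendations from this thread:

```
# nats-server.conf (sketch; values and paths are placeholders)
tls {
  cert_file: "/etc/nats/server.crt"
  key_file:  "/etc/nats/server.key"
  ca_file:   "/etc/nats/ca.crt"
  # Seconds allowed for clients to complete the TLS handshake (default is 2).
  timeout: 5
}

authorization {
  # Seconds allowed for a client to authenticate after connecting.
  timeout: 5
}
```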

amit12cool commented 8 months ago

> I think you may have to bump the default tls { timeout = X } and authorization { timeout = X } for your deployment. Have you tried tuning those?

@wallyqs where can I bump this in the https://github.com/nats-io/k8s/blob/main/helm/charts/nats/values.yaml Helm chart?

Also, I didn't find a timeout option here: https://docs.nats.io/running-a-nats-service/configuration/securing_nats/authorization
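For what it's worth, with this chart the TLS timeout can be merged into the generated server config from values.yaml roughly like below; this is a sketch that mirrors the merge block appearing later in this thread, and the value is only an example:

```yaml
config:
  nats:
    tls:
      enabled: true
      merge:
        # Merged verbatim into the server's tls {} block (seconds).
        timeout: 5
```

The authorization timeout is a plain server config option, so it can presumably be merged in a similar way through the chart's generic config merge, even though the authorization docs page does not list it.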

amit12cool commented 8 months ago

@wallyqs with timeout = 50 in tls, the timeout error went away. But I still see TLS handshake errors like the ones below:

[7] 2024/03/13 19:50:46.143685 [ERR] Error trying to connect to route (attempt 1): lookup for host "nats-1.nats-headless": lookup nats-1.nats-headless on 10.0.0.10:53: no such host
[7] 2024/03/13 19:50:46.143905 [ERR] Error trying to connect to route (attempt 1): lookup for host "nats-0.nats-headless": lookup nats-0.nats-headless on 10.0.0.10:53: no such host
[7] 2024/03/13 19:50:46.172442 [ERR] Error trying to connect to route (attempt 1): lookup for host "nats-0.nats-headless": lookup nats-0.nats-headless on 10.0.0.10:53: no such host
[7] 2024/03/13 19:50:46.173120 [ERR] Error trying to connect to route (attempt 1): lookup for host "nats-1.nats-headless": lookup nats-1.nats-headless on 10.0.0.10:53: no such host
[7] 2024/03/12 18:49:37.957961 [ERR] Error trying to connect to route (attempt 1): lookup for host "nats-0.nats-headless": lookup nats-0.nats-headless on 10.0.0.10:53: no such host

[7] 2024/03/12 17:40:23.567594 [ERR] 10.240.0.4:59217 - cid:44 - TLS handshake error: read tcp 10.244.2.14:4222->10.240.0.4:59217: read: connection reset by peer

Points to note:

Name: nats-1.nats-headless.nats-playground-3.svc.cluster.local Address: 10.244.3.198


So the name above resolves to the address `10.244.3.198` from outside the pod, but not from inside the pod.

- I have no client connected to the NATS server, i.e. to any of the NATS pods in K8s.
- `10.244.2.14:4222`: I know this is my NATS pod server IP. But I don't know whose IP `10.240.0.4:59217` is. Any idea with whom the TCP connection is being made from the NATS server? (See the command sketch after this list.)
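One way to check both of these from the cluster itself; a sketch assuming kubectl access, with the pod, container, and namespace names taken from the output above:

```sh
# Does the headless-service name resolve from inside a NATS pod?
kubectl exec -n nats-playground-3 nats-0 -c nats -- nslookup nats-1.nats-headless

# Which pod or node owns the unknown source address seen in the handshake errors?
kubectl get pods -A -o wide | grep 10.240.0.4
kubectl get nodes -o wide | grep 10.240.0.4
```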
I have the below nats.conf:

{ "cluster": { "name": "nats", "no_advertise": true, "port": 6222, "routes": [ "nats://nats-0.nats-headless:6222", "nats://nats-1.nats-headless:6222", "nats://nats-2.nats-headless:6222" ] }, "http_port": 8222, "jetstream": { "max_file_store": 10Gi, "max_memory_store": 0, "store_dir": "/data" }, "lame_duck_duration": "30s", "lame_duck_grace_period": "10s", "pid_file": "/var/run/nats/nats.pid", "port": 4222, "server_name": $SERVER_NAME, "tls": { "ca_file": "/mnt/nats-certificate/rootCA-playground.crt", "cert_file": "/mnt/nats-certificate/nats-playground-server.crt", "key_file": "/mnt/nats-certificate/nats-playground-server.key", "timeout": 50, "verify": true } }


Here are my Helm chart values:

config:
  cluster:
    enabled: true
    replicas: 3
    port: 6222
  jetstream:
    enabled: true
    fileStore:
      pvc:
        size: 10Gi
  nats:
    tls:
      enabled: true
      merge:
        verify: true
        cert_file: '/mnt/nats-certificate/nats-playground-server.crt'
        key_file: '/mnt/nats-certificate/nats-playground-server.key'
        ca_file: '/mnt/nats-certificate/rootCA-playground.crt'
        timeout: 50

podTemplate:
  topologySpreadConstraints:
    kubernetes.io/hostname:
      maxSkew: 1
      whenUnsatisfiable: DoNotSchedule
  patch:

service:
  merge:
    spec:
      type: LoadBalancer

promExporter:
  enabled: true
  podMonitor:
    enabled: true

container:
  image:
    repository: nats
    image: 2.10.11-alpine
  patch:

reloader:
  patch:
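For completeness, applying a values file like the one above with the official chart would look roughly like this; the release name, namespace, and file name are assumptions:

```sh
helm repo add nats https://nats-io.github.io/k8s/helm/charts/
helm repo update
helm upgrade --install nats nats/nats -n nats-playground-3 -f values.yaml
```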

amit12cool commented 8 months ago

@neilalexander @wallyqs please check above. This is critical for me.