nats-io / nack

NATS Controllers for Kubernetes (NACK)

Connections to NATS server not closed #151

Closed · AMarti96 closed this issue 11 months ago

AMarti96 commented 11 months ago

Hello team! We have been using NACK for a while against a NATS server running in Kubernetes, but today we started migrating to Synadia Cloud, since it spares us the maintenance of the NATS cluster.

However, when trying to integrate our current NACK CRDs (creating tens of subjects and consumers for 10 different accounts), we started receiving errors from our instance. After a bit of debugging, the problem appears to be in how NACK handles connections to the server.

Any suggestion or workaround other than killing the NACK instance each time to reset the connection count would be appreciated!

What version were you using?

NACK v0.13.0, using the following image: natsio/jetstream-controller:0.13.0

What environment was the server running in?

The NATS server is a Synadia Cloud instance, where connections are limited to a certain amount per account.

Is this defect reproducible?

Yes, it is

Create a new account in Synadia Cloud (the free tier is enough). Then start up a NACK instance connected to that account and try to create one stream and one consumer.

To create the NACK instance, I use the following kustomization:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: am-nack

helmCharts:
- name: nack
  valuesInline:
    jetstream:
      enabled: true
      image:
        repository: natsio/jetstream-controller
        tag: 0.13.0
    namespaced: true
    namespaceOverride: am-nack
  releaseName: nack
  version: 0.24.0
  repo: https://nats-io.github.io/k8s/helm/charts/
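
Note that a kustomization using the helmCharts field has to be rendered with Helm support enabled, for example with something like kustomize build --enable-helm . | kubectl apply -f - (the exact invocation depends on your kustomize/kubectl version).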

Then create the following resources for NACK to process

apiVersion: jetstream.nats.io/v1beta2
kind: Account
metadata:
  name: poc-a
spec:
  name: poc-a
  servers:
  - tls://connect.ngs.global
  creds:
    secret:
      name: nats-poc-a-creds 
    file: poc_a.creds
---
apiVersion: jetstream.nats.io/v1beta2
kind: Stream
metadata:
  name: my-stream
spec:
  name: my-stream
  account: 'poc-a'
  subjects: 
    - "my-subject"
  retention: "limits"
  maxConsumers: -1
  maxMsgsPerSubject: -1
  maxMsgs: 0
  maxBytes: 512
  maxAge: "0"
  maxMsgSize: -1
  storage: file
  discard: old
  replicas: 1
  duplicateWindow: "120000000000ns"
  denyDelete: false
  allowRollup: false
  allowDirect: false
---
apiVersion: jetstream.nats.io/v1beta2
kind: Consumer
metadata:
  name: my-consumer
spec:
  streamName: my-stream
  account: 'poc-a'
  ackPolicy: explicit
  ackWait: "30000000000ns"
  deliverPolicy: all
  deliverSubject: my-subject
  deliverGroup: my-subject
  durableName: my-subject
  filterSubject: my-subject
  maxAckPending: 1000
  maxDeliver: -1
  replayPolicy: instant
  replicas: 0

Once applied, the NACK logs show them being created correctly:

I1109 14:41:19.863759       1 main.go:122] Starting /jetstream-controller v0.13.0...
I1109 14:41:42.450251       1 event.go:298] Event(v1.ObjectReference{Kind:"Stream", Namespace:"am-nack", Name:"my-stream", UID:"a7d72228-2804-4a72-9c6d-a727407f71a4", APIVersion:"jetstream.nats.io/v1beta2", ResourceVersion:"74174200", FieldPath:""}): type: 'Normal' reason: 'Connecting' Connecting to new nats-servers
I1109 14:41:42.522663       1 event.go:298] Event(v1.ObjectReference{Kind:"Consumer", Namespace:"am-nack", Name:"my-consumer", UID:"35d8458b-8d58-4c34-a865-d78ffe495cc2", APIVersion:"jetstream.nats.io/v1beta2", ResourceVersion:"74174197", FieldPath:""}): type: 'Normal' reason: 'Connecting' Connecting to new nats-servers
I1109 14:41:42.557387       1 event.go:298] Event(v1.ObjectReference{Kind:"Stream", Namespace:"am-nack", Name:"my-stream", UID:"a7d72228-2804-4a72-9c6d-a727407f71a4", APIVersion:"jetstream.nats.io/v1beta2", ResourceVersion:"74174200", FieldPath:""}): type: 'Normal' reason: 'Creating' Creating stream "my-stream"
I1109 14:41:42.632145       1 event.go:298] Event(v1.ObjectReference{Kind:"Consumer", Namespace:"am-nack", Name:"my-consumer", UID:"35d8458b-8d58-4c34-a865-d78ffe495cc2", APIVersion:"jetstream.nats.io/v1beta2", ResourceVersion:"74174197", FieldPath:""}): type: 'Normal' reason: 'Creating' Creating consumer "my-consumer" on stream "my-stream"
I1109 14:41:42.637505       1 event.go:298] Event(v1.ObjectReference{Kind:"Stream", Namespace:"am-nack", Name:"my-stream", UID:"a7d72228-2804-4a72-9c6d-a727407f71a4", APIVersion:"jetstream.nats.io/v1beta2", ResourceVersion:"74174200", FieldPath:""}): type: 'Normal' reason: 'Connecting' Connecting to new nats-servers
I1109 14:41:42.739325       1 event.go:298] Event(v1.ObjectReference{Kind:"Consumer", Namespace:"am-nack", Name:"my-consumer", UID:"35d8458b-8d58-4c34-a865-d78ffe495cc2", APIVersion:"jetstream.nats.io/v1beta2", ResourceVersion:"74174197", FieldPath:""}): type: 'Normal' reason: 'Connecting' Connecting to new nats-servers
I1109 14:41:42.835188       1 event.go:298] Event(v1.ObjectReference{Kind:"Stream", Namespace:"am-nack", Name:"my-stream", UID:"a7d72228-2804-4a72-9c6d-a727407f71a4", APIVersion:"jetstream.nats.io/v1beta2", ResourceVersion:"74174200", FieldPath:""}): type: 'Normal' reason: 'Created' Created stream "my-stream"
I1109 14:41:42.936729       1 event.go:298] Event(v1.ObjectReference{Kind:"Consumer", Namespace:"am-nack", Name:"my-consumer", UID:"35d8458b-8d58-4c34-a865-d78ffe495cc2", APIVersion:"jetstream.nats.io/v1beta2", ResourceVersion:"74174197", FieldPath:""}): type: 'Normal' reason: 'Created' Created consumer "my-consumer" on stream "my-stream"

The same goes for the Synadia UI, where I can see them.

However, the connection count stays at 2 and never goes down (I waited for more than an hour and nothing changed).

[screenshot: Synadia Cloud dashboard showing the connection count stuck at 2]

Similarly, if the Stream/Consumer spec has any kind of typo, NACK opens an ever-growing number of connections while retrying the reconcile, which makes the Synadia account stop accepting connections.

Given the capability you are leveraging, describe your expectation?

I would expect NACK to use only one connection to NATS for a set of resources that all point to the same account, and not to create a new connection every time the reconcile loop runs.

Given the expectation, what is the defect you are observing?

NACK creates more connections than necessary, and old connections are never closed.

AMarti96 commented 11 months ago

To provide more detail, I extracted the number of connections on our current NATS server (running inside a Kubernetes cluster, with streams/consumers populated via NACK). Using nats-top, I got the following:

[screenshot: nats-top output listing the open connections]

As you can see, all of them (the 95 connections in this specific screenshot) come from jetstream-controller, i.e. NACK creating the streams/consumers and never disconnecting.

AMarti96 commented 11 months ago

The problem seems to occur only when the connection to NATS is defined in the Account CRD; in the code, that means when crdConnect is set to true.

When a single NATS connection is configured in the overall server settings (crdConnect set to false), it doesn't matter how many objects I create or how many times the connection is retried; only one connection is reported:

[screenshot: a single connection reported]

With that in mind, I think the error may come from this part of the code:

https://github.com/nats-io/nack/blob/b6bb02b52a2ed736b243680037c6c5f10c2456d8/controllers/jetstream/stream.go#L183-L199
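
To illustrate, here is a minimal sketch in Go (using nats.go) of the pattern I suspect; the function name reconcileStream and the hard-coded stream/subject names are hypothetical, not the actual NACK code:

package main

import "github.com/nats-io/nats.go"

// Simplified sketch of the suspected leak when crdConnect is true:
// every reconcile pass dials a brand-new connection from the Account
// CRD credentials and never closes it, so each retry leaves one more
// open client on the account.
func reconcileStream(servers, credsFile string) error {
	nc, err := nats.Connect(servers, nats.UserCredentials(credsFile))
	if err != nil {
		return err
	}
	// Missing: defer nc.Close() (or reuse of a cached connection),
	// so the connection stays open after the reconcile returns.
	js, err := nc.JetStream()
	if err != nil {
		return err
	}
	_, err = js.AddStream(&nats.StreamConfig{
		Name:     "my-stream",
		Subjects: []string{"my-subject"},
	})
	return err
}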

caleblloyd commented 11 months ago

Thanks for reporting! We should be able to put a connection pooler into NACK to prevent this. There is already an implementation in the nats-surveyor repo.

We'll port it over; we should be able to get that done next week.

caleblloyd commented 11 months ago

Connection pooling reference from nats-surveyor: https://github.com/nats-io/nats-surveyor/blob/main/surveyor/conn_pool.go
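
For anyone following along, the rough idea is a cache of connections keyed by server URL and credentials, so every reconcile for the same account reuses one *nats.Conn instead of dialing a new one. A minimal sketch follows; the names (ConnPool, connKey, Get) are illustrative, and the real nats-surveyor/NACK implementation handles more (connection options, cleanup, reference counting):

package pool

import (
	"sync"

	"github.com/nats-io/nats.go"
)

// connKey identifies a shareable connection: resources that use the same
// servers and the same creds file can share one *nats.Conn.
type connKey struct {
	servers string
	creds   string
}

// ConnPool hands out one cached connection per key instead of dialing
// a fresh connection on every reconcile.
type ConnPool struct {
	mu    sync.Mutex
	conns map[connKey]*nats.Conn
}

func NewConnPool() *ConnPool {
	return &ConnPool{conns: make(map[connKey]*nats.Conn)}
}

// Get returns the cached connection for the key, dialing one if needed.
func (p *ConnPool) Get(servers, creds string) (*nats.Conn, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	k := connKey{servers: servers, creds: creds}
	if nc, ok := p.conns[k]; ok && !nc.IsClosed() {
		return nc, nil
	}
	opts := []nats.Option{nats.Name("jetstream-controller")}
	if creds != "" {
		opts = append(opts, nats.UserCredentials(creds))
	}
	nc, err := nats.Connect(servers, opts...)
	if err != nil {
		return nil, err
	}
	p.conns[k] = nc
	return nc, nil
}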

caleblloyd commented 11 months ago

Connection pool added in v0.14.0