nats-io / nats.node

Node.js client for NATS, the cloud native messaging system.
https://nats.io
Apache License 2.0
1.55k stars 163 forks source link

Statuses are not updated #609

Closed ArmorDarks closed 11 months ago

ArmorDarks commented 11 months ago

Observed behavior

Statuses iterator is not updating even if the server is going to LDM or just being shut down.

Expected behavior

The connection status should change

Server and client version

Tested on servers v2.8.3, v2.8.4, v2.9.6 Tested on client v2.4.0, v2.18.0

Host environment

No response

Steps to reproduce

So far, I haven't figured out a reliable reproduction.

What I noticed is that the issue is not happening if the client was started recently.

For instance, if a service client was just started and I kill the NATS server locally, it does trigger status updates, but if the service with a client were up for a few days, it would stop logging any status updates. Killing the NATS server will cause zero logs in that case. I'm not sure, though, if it's indeed related to the NATS client itself or maybe it's something due to a server that has been up locally in my case for days.

That issue happened for me locally both on server v2.8.3 and latest v2.9.6 (alpine image).

It's a similar situation in our production. If the service is running for quite some time, it will stop logging any status updates even if NATS server pads were restarted. We can observe that Rust-based services are properly detecting connection status changes, but Node-based services do not react in any way.

Example of NATS client initialization:


export class NatsClient {
  private readonly log = new Log('NatsClient')

  public init = async (): Promise<void> => {
    this.log.info('Initializing NATS connection')

    try {
      const connection = await connect({
        maxReconnectAttempts: -1,
        waitOnFirstConnect: true,
        reconnectJitter: 10_000,
        reconnectJitterTLS: 10_000,
        ignoreAuthErrorAbort: true,
        name: 'natsClientName',
        pass: 'natsPassword',
        servers: 'natsServer',
        user: 'natsUser',
        inboxPrefix: 'service.inbox',
      })

      connection.closed().then((error) => {
        if (error) {
          this.log.error(`NATS connection closed with error`, { error })

          return
        }

        this.log.info('NATS connection closed')
      })

      this.log.info('NATS connection initialized')

      this.listenToConnectionStatus(connection).catch((error) => {
        this.log.error(`Failed listening to connection status`, { error })
      })
    } catch (error) {
      this.log.error(`Failed to connect to NATS`, { error })

      throw error
    }
  }

  private listenToConnectionStatus = async (
    connection: NatsConnection,
  ): Promise<void> => {
    for await (const status of connection.status()) {
      // Ignore pings
      if (status.type === DebugEvents.PingTimer) return

      this.log.info(`NATS connection status changed to ${status.type}`)
    }
  }
}

There are multiple logs here. None of them were logged or added to metrics between the day it was deployed on prod and until now, even when NATS servers were restarted.

We know, though, that during service updates, NATS connection closed was logged, so that means connection.closed() is still working.

The issue started to happen somewhere from the moment we upgraded the NATS client to v2.4.0 many months ago, but thi also co-incedences with an update where we introduced settings like waitOnFirstConnect: true and ignoreAuthErrorAbort: true, and increased reconnection jitter.

Could it be that something is locked in the NATS client queue that handles statuses iterator?

aricart commented 11 months ago

If your code is exactly as you have it there, then I know why it is happening - you are returning from the status iterator, which means that it won't report anything else - simply continue the loop - you cannot return or break from it. Please feel free to reopen if that is not the case.

ArmorDarks commented 11 months ago

Nice catch, Alberto! That also explains why it stopped working after a while - the ping was sent after some time and that ended iterator. Thank you!

aricart commented 11 months ago

you were haunted by a callback model :)