nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

2.10.20+ Loop detected for leafnode account #6037

Open kukumber opened 4 weeks ago

kukumber commented 4 weeks ago

Observed behavior

NATS 2.10.20+ reports the leafnode error 'Loop detected for leafnode account="$G". Delaying attempt to reconnect for 30s'.

Expected behavior

Everything works as it did with version 2.10.7, without "Loop detected for leafnode account" errors.

Server and client version

2.10.7-alpine works correctly. The issue reproduces with 2.10.20-alpine and 2.10.22-alpine.

Host environment

No response

Steps to reproduce

A three-node JetStream cluster, each node with the following configuration (server_name and advertise differ per node). Each NATS container runs on a different virtual host.

listen: 0.0.0.0:4222
server_name: "server1"

jetstream {
  store_dir: "/data"
  domain: main_test
}

cluster {
  listen: 0.0.0.0:6222
  advertise: server1:6222
  name: test
  routes: [
    nats://server1:6222
    nats://server2:6222
    nats://server3:6222
  ]
  tls: {
    cert_file: "/etc/pki/host_cert.pem"
    key_file: "/etc/pki/host_key.pem"
    ca_file: "/etc/pki/ca.crt"
  }
}

leafnodes {
  remotes: [
    {
      tls: {
        cert_file: "/etc/pki/host_cert.pem"
        key_file: "/etc/pki/host_key.pem"
        ca_file: "/etc/pki/ca.crt"
      }
      urls: [
        "nats-leaf://fe1-t:7444"
      ]
    },
    {
      tls: {
        cert_file: "/etc/pki/host_cert.pem"
        key_file: "/etc/pki/host_key.pem"
        ca_file: "/etc/pki/ca.crt"
      }
      urls: [
        "nats-leaf://fe2-t:7444"
      ]
    }
  ]
}

Two remote servers with the following configuration (with server_name, domain, and advertise being different for each):

listen: 0.0.0.0:4222
server_name: "fe1-t"

jetstream {
  store_dir: "/data"
  domain: fe1-t
}

leafnodes {
  port: 7444
  advertise: fe1-t:7444
  tls: {
    cert_file: "/etc/pki/host_cert.pem"
    key_file: "/etc/pki/host_key.pem"
    verify: true
    ca_file: "/etc/pki/ca.crt"
  }
}
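
As an illustration of the per-node differences, the second remote server would presumably look like the sketch below. This assumes only server_name, domain, and advertise change; the value fe2-t is inferred from the remote URLs and the "local "fe2-t"" domain shown in the logs of this report, not copied from an actual file.

```
listen: 0.0.0.0:4222
server_name: "fe2-t"      # assumed: differs per node

jetstream {
  store_dir: "/data"
  domain: fe2-t           # assumed: matches the domain reported in the fe2-t logs
}

leafnodes {
  port: 7444
  advertise: fe2-t:7444   # assumed: matches the nats-leaf://fe2-t:7444 remote URL
  tls: {
    cert_file: "/etc/pki/host_cert.pem"
    key_file: "/etc/pki/host_key.pem"
    verify: true
    ca_file: "/etc/pki/ca.crt"
  }
}
```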

Nodes fe1-t and fe2-t are not aware of each other.

With NATS version 2.10.7, I set up stream mirrors, and my clients can connect to the fe1-t and fe2-t hosts without issues. However, when I try to upgrade NATS on all the nodes to version 2.10.20+, I start receiving "Loop detected" errors on both the fe*-t and cluster nodes.

fe*-t errors:

[1] 2024/10/24 10:41:03.592127 [INF] 192.168.1.6:35636 - lid:170 - JetStream using domains: local "fe2-t", remote "main_test"
[1] 2024/10/24 10:41:03.620127 [INF] 192.168.1.5:58126 - lid:171 - Leafnode connection created
[1] 2024/10/24 10:41:03.642069 [INF] 192.168.1.7:33452 - lid:172 - Leafnode connection created
[1] 2024/10/24 10:41:03.658231 [INF] 192.168.1.5:58126 - lid:171 - JetStream using domains: local "fe2-t", remote "main_test"
[1] 2024/10/24 10:41:03.666950 [ERR] 192.168.1.6:35636 - lid:170 - Loop detected for leafnode account="$G". Delaying attempt to reconnect for 30s
[1] 2024/10/24 10:41:03.667095 [INF] 192.168.1.6:35636 - lid:170 - Leafnode connection closed: Protocol Violation - Account: $G
[1] 2024/10/24 10:41:03.676195 [INF] 192.168.1.7:33452 - lid:172 - JetStream using domains: local "fe2-t", remote "main_test"
[1] 2024/10/24 10:41:03.685937 [ERR] 192.168.1.5:58126 - lid:171 - Loop detected for leafnode account="$G". Delaying attempt to reconnect for 30s
[1] 2024/10/24 10:41:03.685962 [INF] 192.168.1.5:58126 - lid:171 - Leafnode connection closed: Protocol Violation - Account: $G
[1] 2024/10/24 10:41:34.678225 [INF] 192.168.1.6:56298 - lid:174 - Leafnode connection created
[1] 2024/10/24 10:41:34.688166 [INF] 192.168.1.6:56298 - lid:174 - JetStream using domains: local "fe2-t", remote "main_test"
[1] 2024/10/24 10:41:34.692347 [INF] 192.168.1.5:40116 - lid:175 - Leafnode connection created
[1] 2024/10/24 10:41:34.703173 [INF] 192.168.1.5:40116 - lid:175 - JetStream using domains: local "fe2-t", remote "main_test"
[1] 2024/10/24 10:41:34.709202 [ERR] 192.168.1.7:33452 - lid:172 - Loop detected for leafnode account="$G". Delaying attempt to reconnect for 30s
[1] 2024/10/24 10:41:34.709224 [INF] 192.168.1.7:33452 - lid:172 - Leafnode connection closed: Protocol Violation - Account: $G

Cluster errors:

[1] 2024/10/24 10:41:34.678934 [INF] 192.168.2.6:7444 - lid:549 - Leafnode connection created for account: $G 
[1] 2024/10/24 10:41:34.690719 [INF] 192.168.2.6:7444 - lid:549 - JetStream using domains: local "main_test", remote "fe2-t"
[1] 2024/10/24 10:42:01.362957 [INF] 192.168.2.6:7444 - lid:549 - Leafnode connection closed: Client Closed - Account: $G
[1] 2024/10/24 10:42:03.367913 [ERR] Error trying to connect as leafnode to remote server "fe2-t:7444" (attempt 1): dial tcp 192.168.2.6:7444: i/o timeout
[1] 2024/10/24 10:42:04.681663 [INF] 192.168.2.5:7444 - lid:587 - Leafnode connection created for account: $G 
[1] 2024/10/24 10:42:04.704360 [INF] 192.168.2.5:7444 - lid:587 - JetStream using domains: local "main_test", remote "fe1-t"
[1] 2024/10/24 10:42:04.792195 [INF] 192.168.2.6:7444 - lid:588 - Leafnode connection created for account: $G 
[1] 2024/10/24 10:42:04.932000 [INF] 192.168.2.6:7444 - lid:588 - JetStream using domains: local "main_test", remote "fe2-t"
[1] 2024/10/24 10:42:04.947568 [ERR] 192.168.2.6:7444 - lid:588 - Leafnode Error 'Loop detected for leafnode account="$G". Delaying attempt to reconnect for 30s'
[1] 2024/10/24 10:42:04.947601 [ERR] 192.168.2.6:7444 - lid:588 - Loop detected for leafnode account="$G". Delaying attempt to reconnect for 30s
[1] 2024/10/24 10:42:04.947610 [INF] 192.168.2.6:7444 - lid:588 - Leafnode connection closed: Protocol Violation - Account: $G

neilalexander commented 4 weeks ago

Does this still happen if you set no_advertise in the leafnode configuration of the fe*-t nodes?
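
For reference, a minimal sketch of what that change might look like on fe1-t, based on the leafnodes block from the report. It assumes no_advertise is set as a boolean directly inside the leafnodes block and that the existing advertise line is left in place; the exact configuration that was tested is not shown in this thread.

```
leafnodes {
  port: 7444
  advertise: fe1-t:7444
  # Assumption: placement and value of no_advertise are illustrative
  no_advertise: true
  tls: {
    cert_file: "/etc/pki/host_cert.pem"
    key_file: "/etc/pki/host_key.pem"
    verify: true
    ca_file: "/etc/pki/ca.crt"
  }
}
```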

kukumber commented 4 weeks ago

@neilalexander Unfortunately, no_advertise didn't help.

kukumber commented 4 weeks ago

The issue does not reproduce when all the containers run on the same host.