rajch / weave

Simple, resilient multi-host containers networking and more.
https://rajch.github.io/weave/
Apache License 2.0

How can we reduce the Single Point of Failure in a weave cluster ? #9

Closed megarajan closed 3 months ago

megarajan commented 3 months ago

In our Weave deployments on the older Weave version, when we bring down one of the Weave nodes, the rest of the nodes go into re-discovery mode, fall back to sleeve mode (never returning to fastdp mode), and show high CPU usage.

In the latest version, 2.8.1, the "falling to sleeve and staying in sleeve" behaviour is greatly reduced, and Weave goes back to fastdp mode quickly.

However, we still see rediscovery messages on the rest of the Weave nodes.

How can we reduce the Single Point of Failure in a Weave cluster?

Currently we do not explicitly specify the connlimit, so I believe it is set to 100 by default.

Assuming we have 25 Weave nodes in our current setup, will increasing the connlimit to more than 100 help?

rajch commented 3 months ago

Changes in topology (like a node coming down) have to be communicated to all peers in the cluster. This can take some time. The process is described here.

As that article says here, a fully connected mesh of N peers should take N^2 connections. For 25 nodes, increasing the connection limit should help.
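As a rough sanity check on these numbers, here is a minimal sketch of the arithmetic for a fully connected mesh. It is not taken from the Weave documentation; the function name `mesh_connection_counts` is hypothetical, and how the connection limit is counted against these figures is an assumption that the linked article should settle.

```python
def mesh_connection_counts(n_peers: int) -> dict:
    """Return illustrative connection counts for a fully connected mesh."""
    return {
        # Each peer holds one connection to every other peer.
        "per_peer": n_peers - 1,
        # Each peer-to-peer link counted once across the whole cluster.
        "undirected_links": n_peers * (n_peers - 1) // 2,
        # The N^2 figure quoted above.
        "n_squared": n_peers ** 2,
    }


print(mesh_connection_counts(25))
```

For 25 peers this gives 24 connections per peer, 300 distinct links across the cluster, and N^2 = 625.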

Could you post some logs, to help understand the problem better?

megarajan commented 3 months ago

Hi, attaching logs.

Jul 12 05:51:51 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:51.380666 ->[172.26.240.14:54183|c6:2f:d0:e1:0e:1c(drc02)]: connection shutting down due to error: read tcp 172.26.240.21:6783->172.26.240.14:54183: read: connection timed out Jul 12 05:51:51 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:51.380763 ->[172.26.240.14:54183|c6:2f:d0:e1:0e:1c(drc02)]: connection deleted Jul 12 05:51:51 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:51.380996 ->[172.26.240.14:6783] attempting connection Jul 12 05:51:51 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:51.388082 Unable to find connection to relay peer c6:2f:d0:e1:0e:1c Jul 12 05:51:51 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:51.392320 Unable to find connection to relay peer c6:2f:d0:e1:0e:1c Jul 12 05:51:51 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:51.394436 Unable to find connection to relay peer c6:2f:d0:e1:0e:1c Jul 12 05:51:51 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:51.402847 Unable to find connection to relay peer c6:2f:d0:e1:0e:1c Jul 12 05:51:51 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:51.406750 Unable to find connection to relay peer c6:2f:d0:e1:0e:1c Jul 12 05:51:51 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:51.412904 Unable to find connection to relay peer c6:2f:d0:e1:0e:1c Jul 12 05:51:51 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:51.413114 Unable to find connection to relay peer c6:2f:d0:e1:0e:1c Jul 12 05:51:51 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:51.413584 Unable to find connection to relay peer c6:2f:d0:e1:0e:1c Jul 12 05:51:51 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:51.413644 Unable to find connection to relay peer c6:2f:d0:e1:0e:1c Jul 12 05:51:51 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:51.413737 Unable to find connection to relay peer c6:2f:d0:e1:0e:1c Jul 12 05:51:51 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:51.418345 Unable to find connection to relay peer c6:2f:d0:e1:0e:1c Jul 12 05:51:51 
drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:51.418465 Unable to find connection to relay peer c6:2f:d0:e1:0e:1c Jul 12 05:51:51 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:51.419718 Unable to find connection to relay peer c6:2f:d0:e1:0e:1c Jul 12 05:51:51 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:51.419781 Unable to find connection to relay peer c6:2f:d0:e1:0e:1c Jul 12 05:51:51 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:51.420632 Unable to find connection to relay peer c6:2f:d0:e1:0e:1c Jul 12 05:51:51 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:51.421645 Unable to find connection to relay peer c6:2f:d0:e1:0e:1c Jul 12 05:51:51 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:51.422677 Unable to find connection to relay peer c6:2f:d0:e1:0e:1c Jul 12 05:51:51 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:51.425769 Unable to find connection to relay peer c6:2f:d0:e1:0e:1c Jul 12 05:51:51 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:51.425938 Unable to find connection to relay peer c6:2f:d0:e1:0e:1c Jul 12 05:51:51 drd03 98f12d3f9bd9[21292]: WARN: 2024/07/12 05:51:51.925177 Vetoed installation of hairpin flow FlowSpec{keys: [InPortFlowKey{vport: 2} TunnelFlowKey{id: 0000000000fbe292, ipv4src: 172.26.240.33, ipv4dst: 172.26.240.21}], actions: [SetTunnelAction{id: 0000000000fbe292, ipv4src: 172.26.240.21, ipv4dst: 172.26.240.33, tos: 0, ttl: 64, df: true, csum: false} OutputAction{vport: 2}]} Jul 12 05:51:51 drd03 98f12d3f9bd9[21292]: WARN: 2024/07/12 05:51:51.926238 Vetoed installation of hairpin flow FlowSpec{keys: [TunnelFlowKey{id: 0000000000fbe292, ipv4src: 172.26.240.33, ipv4dst: 172.26.240.21} InPortFlowKey{vport: 2}], actions: [SetTunnelAction{id: 0000000000fbe292, ipv4src: 172.26.240.21, ipv4dst: 172.26.240.33, tos: 0, ttl: 64, df: true, csum: false} OutputAction{vport: 2}]} Jul 12 05:51:52 drd03 98f12d3f9bd9[21292]: WARN: 2024/07/12 05:51:52.001231 Vetoed installation of hairpin flow FlowSpec{keys: [TunnelFlowKey{id: 
0000000000fbe292, ipv4src: 172.26.240.33, ipv4dst: 172.26.240.21} InPortFlowKey{vport: 2}], actions: [SetTunnelAction{id: 0000000000fbe292, ipv4src: 172.26.240.21, ipv4dst: 172.26.240.33, tos: 0, ttl: 64, df: true, csum: false} OutputAction{vport: 2}]} Jul 12 05:51:52 drd03 98f12d3f9bd9[21292]: WARN: 2024/07/12 05:51:52.052348 Vetoed installation of hairpin flow FlowSpec{keys: [TunnelFlowKey{id: 0000000000fbe292, ipv4src: 172.26.240.33, ipv4dst: 172.26.240.21} InPortFlowKey{vport: 2}], actions: [SetTunnelAction{id: 0000000000fbe292, ipv4src: 172.26.240.21, ipv4dst: 172.26.240.33, tos: 0, ttl: 64, df: true, csum: false} OutputAction{vport: 2}]} Jul 12 05:51:52 drd03 98f12d3f9bd9[21292]: WARN: 2024/07/12 05:51:52.265163 Vetoed installation of hairpin flow FlowSpec{keys: [TunnelFlowKey{id: 0000000000fbe173, ipv4src: 172.26.240.33, ipv4dst: 172.26.240.21} InPortFlowKey{vport: 2}], actions: [SetTunnelAction{id: 0000000000fbe173, ipv4src: 172.26.240.21, ipv4dst: 172.26.240.33, tos: 0, ttl: 64, df: true, csum: false} OutputAction{vport: 2}]} Jul 12 05:51:52 drd03 98f12d3f9bd9[21292]: WARN: 2024/07/12 05:51:52.304670 Vetoed installation of hairpin flow FlowSpec{keys: [TunnelFlowKey{id: 0000000000fbe1dd, ipv4src: 172.26.240.33, ipv4dst: 172.26.240.21} InPortFlowKey{vport: 2}], actions: [SetTunnelAction{id: 0000000000fbe1dd, ipv4src: 172.26.240.21, ipv4dst: 172.26.240.33, tos: 0, ttl: 64, df: true, csum: false} OutputAction{vport: 2}]} Jul 12 05:51:53 drd03 98f12d3f9bd9[21292]: WARN: 2024/07/12 05:51:53.392106 Received own packet to peer c6:2f:d0:e1:0e:1c(drc02) from MAC (52:da:a3:d3:80:ea) to (26:69:0a:04:51:1e) Jul 12 05:51:53 drd03 98f12d3f9bd9[21292]: WARN: 2024/07/12 05:51:53.397312 Received own packet to peer c6:2f:d0:e1:0e:1c(drc02) from MAC (ae:11:e2:68:07:3c) to (e2:e3:90:4f:e2:0c) Jul 12 05:51:55 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:55.728659 ->[172.26.240.24:6783|4a:4d:82:27:1a:19(drd06)]: connection shutting down due to error: read tcp 
172.26.240.21:46397->172.26.240.24:6783: read: connection timed out Jul 12 05:51:55 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:55.728818 ->[172.26.240.24:6783|4a:4d:82:27:1a:19(drd06)]: connection deleted Jul 12 05:51:55 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:55.729404 ->[172.26.240.24:6783] attempting connection Jul 12 05:51:55 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:55.730923 Unable to find connection to relay peer 4a:4d:82:27:1a:19 Jul 12 05:51:55 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:55.743521 Unable to find connection to relay peer 4a:4d:82:27:1a:19 Jul 12 05:51:55 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:55.744391 Unable to find connection to relay peer 4a:4d:82:27:1a:19 Jul 12 05:51:55 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:55.748694 Unable to find connection to relay peer 4a:4d:82:27:1a:19 Jul 12 05:51:55 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:55.762459 Unable to find connection to relay peer 4a:4d:82:27:1a:19 Jul 12 05:51:55 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:55.762626 Unable to find connection to relay peer 4a:4d:82:27:1a:19 Jul 12 05:51:55 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:55.762691 Unable to find connection to relay peer 4a:4d:82:27:1a:19 Jul 12 05:51:55 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:55.770928 Unable to find connection to relay peer 4a:4d:82:27:1a:19 Jul 12 05:51:55 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:55.771438 Unable to find connection to relay peer 4a:4d:82:27:1a:19 Jul 12 05:51:55 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:55.771888 Unable to find connection to relay peer 4a:4d:82:27:1a:19 Jul 12 05:51:55 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:55.772303 Unable to find connection to relay peer 4a:4d:82:27:1a:19 Jul 12 05:51:55 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:55.772569 Unable to find connection to relay peer 4a:4d:82:27:1a:19 Jul 12 05:51:55 drd03 98f12d3f9bd9[21292]: INFO: 
2024/07/12 05:51:55.772780 Unable to find connection to relay peer 4a:4d:82:27:1a:19 Jul 12 05:51:55 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:55.773136 Unable to find connection to relay peer 4a:4d:82:27:1a:19 Jul 12 05:51:55 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:55.789760 Unable to find connection to relay peer 4a:4d:82:27:1a:19 Jul 12 05:51:55 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:55.801208 Unable to find connection to relay peer 4a:4d:82:27:1a:19 Jul 12 05:51:55 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:55.801413 Unable to find connection to relay peer 4a:4d:82:27:1a:19 Jul 12 05:51:55 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:55.802917 Unable to find connection to relay peer 4a:4d:82:27:1a:19 Jul 12 05:51:55 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:55.815967 Unable to find connection to relay peer 4a:4d:82:27:1a:19 Jul 12 05:51:55 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:55.820119 Unable to find connection to relay peer 4a:4d:82:27:1a:19 Jul 12 05:51:55 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:55.823763 Unable to find connection to relay peer 4a:4d:82:27:1a:19 Jul 12 05:51:55 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:55.824092 Unable to find connection to relay peer 4a:4d:82:27:1a:19 Jul 12 05:51:56 drd03 98f12d3f9bd9[21292]: WARN: 2024/07/12 05:51:56.577767 Vetoed installation of hairpin flow FlowSpec{keys: [TunnelFlowKey{id: 0000000000fbe47c, ipv4src: 172.26.240.16, ipv4dst: 172.26.240.21} InPortFlowKey{vport: 2}], actions: [SetTunnelAction{id: 0000000000fbe47c, ipv4src: 172.26.240.21, ipv4dst: 172.26.240.16, tos: 0, ttl: 64, df: true, csum: false} OutputAction{vport: 2}]} Jul 12 05:51:56 drd03 98f12d3f9bd9[21292]: WARN: 2024/07/12 05:51:56.597436 Vetoed installation of hairpin flow FlowSpec{keys: [TunnelFlowKey{id: 00000000004f447c, ipv4src: 172.26.240.16, ipv4dst: 172.26.240.21} InPortFlowKey{vport: 2}], actions: [SetTunnelAction{id: 00000000004f447c, ipv4src: 172.26.240.21, 
ipv4dst: 172.26.240.16, tos: 0, ttl: 64, df: true, csum: false} OutputAction{vport: 2}]} Jul 12 05:51:56 drd03 98f12d3f9bd9[21292]: WARN: 2024/07/12 05:51:56.765662 Vetoed installation of hairpin flow FlowSpec{keys: [TunnelFlowKey{id: 0000000000fbe47c, ipv4src: 172.26.240.16, ipv4dst: 172.26.240.21} InPortFlowKey{vport: 2}], actions: [SetTunnelAction{id: 0000000000fbe47c, ipv4src: 172.26.240.21, ipv4dst: 172.26.240.16, tos: 0, ttl: 64, df: true, csum: false} OutputAction{vport: 2}]} Jul 12 05:51:56 drd03 98f12d3f9bd9[21292]: WARN: 2024/07/12 05:51:56.859984 Vetoed installation of hairpin flow FlowSpec{keys: [TunnelFlowKey{id: 00000000004f439d, ipv4src: 172.26.240.16, ipv4dst: 172.26.240.21} InPortFlowKey{vport: 2}], actions: [SetTunnelAction{id: 00000000004f439d, ipv4src: 172.26.240.21, ipv4dst: 172.26.240.16, tos: 0, ttl: 64, df: true, csum: false} OutputAction{vport: 2}]} Jul 12 05:51:56 drd03 98f12d3f9bd9[21292]: WARN: 2024/07/12 05:51:56.920665 Vetoed installation of hairpin flow FlowSpec{keys: [TunnelFlowKey{id: 0000000000fbebb1, ipv4src: 172.26.240.16, ipv4dst: 172.26.240.21} InPortFlowKey{vport: 2}], actions: [SetTunnelAction{id: 0000000000fbebb1, ipv4src: 172.26.240.21, ipv4dst: 172.26.240.16, tos: 0, ttl: 64, df: true, csum: false} OutputAction{vport: 2}]} Jul 12 05:51:56 drd03 98f12d3f9bd9[21292]: WARN: 2024/07/12 05:51:56.921096 Vetoed installation of hairpin flow FlowSpec{keys: [TunnelFlowKey{id: 00000000004f4173, ipv4src: 172.26.240.16, ipv4dst: 172.26.240.21} InPortFlowKey{vport: 2}], actions: [SetTunnelAction{id: 00000000004f4173, ipv4src: 172.26.240.21, ipv4dst: 172.26.240.16, tos: 0, ttl: 64, df: true, csum: false} OutputAction{vport: 2}]} Jul 12 05:51:56 drd03 98f12d3f9bd9[21292]: WARN: 2024/07/12 05:51:56.921420 Vetoed installation of hairpin flow FlowSpec{keys: [TunnelFlowKey{id: 0000000000fbe229, ipv4src: 172.26.240.16, ipv4dst: 172.26.240.21} InPortFlowKey{vport: 2}], actions: [SetTunnelAction{id: 0000000000fbe229, ipv4src: 172.26.240.21, 
ipv4dst: 172.26.240.16, tos: 0, ttl: 64, df: true, csum: false} OutputAction{vport: 2}]} Jul 12 05:51:56 drd03 98f12d3f9bd9[21292]: WARN: 2024/07/12 05:51:56.921628 Vetoed installation of hairpin flow FlowSpec{keys: [TunnelFlowKey{id: 00000000004f439d, ipv4src: 172.26.240.16, ipv4dst: 172.26.240.21} InPortFlowKey{vport: 2}], actions: [SetTunnelAction{id: 00000000004f439d, ipv4src: 172.26.240.21, ipv4dst: 172.26.240.16, tos: 0, ttl: 64, df: true, csum: false} OutputAction{vport: 2}]} Jul 12 05:51:56 drd03 98f12d3f9bd9[21292]: WARN: 2024/07/12 05:51:56.939962 Vetoed installation of hairpin flow FlowSpec{keys: [TunnelFlowKey{id: 00000000004f4bd1, ipv4src: 172.26.240.16, ipv4dst: 172.26.240.21} InPortFlowKey{vport: 2}], actions: [SetTunnelAction{id: 00000000004f4bd1, ipv4src: 172.26.240.21, ipv4dst: 172.26.240.16, tos: 0, ttl: 64, df: true, csum: false} OutputAction{vport: 2}]} Jul 12 05:51:56 drd03 98f12d3f9bd9[21292]: WARN: 2024/07/12 05:51:56.940048 Vetoed installation of hairpin flow FlowSpec{keys: [TunnelFlowKey{id: 0000000000fbebb1, ipv4src: 172.26.240.16, ipv4dst: 172.26.240.21} InPortFlowKey{vport: 2}], actions: [SetTunnelAction{id: 0000000000fbebb1, ipv4src: 172.26.240.21, ipv4dst: 172.26.240.16, tos: 0, ttl: 64, df: true, csum: false} OutputAction{vport: 2}]} Jul 12 05:51:56 drd03 98f12d3f9bd9[21292]: WARN: 2024/07/12 05:51:56.989095 Vetoed installation of hairpin flow FlowSpec{keys: [TunnelFlowKey{id: 00000000004f439d, ipv4src: 172.26.240.16, ipv4dst: 172.26.240.21} InPortFlowKey{vport: 2}], actions: [SetTunnelAction{id: 00000000004f439d, ipv4src: 172.26.240.21, ipv4dst: 172.26.240.16, tos: 0, ttl: 64, df: true, csum: false} OutputAction{vport: 2}]} Jul 12 05:51:56 drd03 98f12d3f9bd9[21292]: WARN: 2024/07/12 05:51:56.989492 Vetoed installation of hairpin flow FlowSpec{keys: [TunnelFlowKey{id: 00000000004f4bb1, ipv4src: 172.26.240.16, ipv4dst: 172.26.240.21} InPortFlowKey{vport: 2}], actions: [SetTunnelAction{id: 00000000004f4bb1, ipv4src: 172.26.240.21, 
ipv4dst: 172.26.240.16, tos: 0, ttl: 64, df: true, csum: false} OutputAction{vport: 2}]} Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:57.050112 Discovered remote MAC 8a:1c:77:29:be:42 at 0a:7a:a8:17:ac:4c(artlva1drm01v) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:57.613057 Received packet for unknown destination: c6:2f:d0:e1:0e:1c(drc02) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:57.614050 Received packet for unknown destination: c6:2f:d0:e1:0e:1c(drc02) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:57.614324 Received packet for unknown destination: 4a:4d:82:27:1a:19(drd06) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:57.619962 Received packet for unknown destination: c6:2f:d0:e1:0e:1c(drc02) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:57.641702 Received packet for unknown destination: c6:2f:d0:e1:0e:1c(drc02) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:57.659426 Received packet for unknown destination: 4a:4d:82:27:1a:19(drd06) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:57.667120 Captured frame from MAC (0a:00:97:c4:5a:99) to (46:4f:8f:00:32:c9) associated with another peer ba:60:6d:17:86:cc(drd07) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:57.696427 Captured frame from MAC (12:cc:22:5e:74:56) to (72:6e:1d:43:85:d7) associated with another peer a2:08:24:39:97:51(drw06) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:57.696525 Captured frame from MAC (8e:64:30:b3:b5:da) to (1a:9a:c7:05:3f:73) associated with another peer a2:08:24:39:97:51(drw06) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:57.696580 Captured frame from MAC (2e:f1:ba:01:0b:0c) to (46:4f:8f:00:32:c9) associated with another peer a2:08:24:39:97:51(drw06) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:57.704779 Received packet for unknown destination: 
c6:2f:d0:e1:0e:1c(drc02) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:57.729013 Captured frame from MAC (4a:7a:bb:95:96:31) to (ae:42:3d:6c:eb:6b) associated with another peer 36:23:d0:53:85:5e(drw01v) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:57.731317 Captured frame from MAC (f6:db:39:c1:ae:9b) to (86:94:d9:aa:0f:58) associated with another peer a2:08:24:39:97:51(drw06) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:57.735991 Captured frame from MAC (96:37:e3:51:ce:27) to (a2:47:a2:a0:09:28) associated with another peer b6:8c:12:9f:d3:cf(artlva1drw02v) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:57.740934 Captured frame from MAC (f6:db:39:c1:ae:9b) to (e2:5f:75:3d:ba:82) associated with another peer a2:08:24:39:97:51(drw06) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:57.758158 Captured frame from MAC (5a:fe:0f:49:cd:3a) to (ba:a6:c4:64:fb:8d) associated with another peer b6:8c:12:9f:d3:cf(artlva1drw02v) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:57.759996 Captured frame from MAC (0a:0e:13:a2:81:87) to (46:4f:8f:00:32:c9) associated with another peer ba:60:6d:17:86:cc(drd07) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:57.760969 Captured frame from MAC (2e:9e:0d:47:d6:70) to (0a:71:e9:81:60:2a) associated with another peer ba:60:6d:17:86:cc(drd07) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:57.762235 Captured frame from MAC (8e:64:30:b3:b5:da) to (a2:47:a2:a0:09:28) associated with another peer a2:08:24:39:97:51(drw06) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:57.772611 Captured frame from MAC (82:d5:71:39:e4:35) to (0a:30:22:75:25:16) associated with another peer 72:b3:cf:99:6a:4f(drw09) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:57.777031 Received packet for unknown destination: 4a:4d:82:27:1a:19(drd06) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: 
ERRO: 2024/07/12 05:51:57.779902 Captured frame from MAC (a2:9f:27:4a:ac:59) to (ba:a6:c4:64:fb:8d) associated with another peer 36:23:d0:53:85:5e(drw01v) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:57.782302 Received packet for unknown destination: 4a:4d:82:27:1a:19(drd06) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:57.782503 Removed unreachable peer 4a:4d:82:27:1a:19(drd06) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:57.782531 Removed unreachable peer c6:2f:d0:e1:0e:1c(drc02) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:57.782610 [nameserver 16:fa:31:50:ca:ee] peer 4a:4d:82:27:1a:19 gone Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: INFO: 2024/07/12 05:51:57.782788 [nameserver 16:fa:31:50:ca:ee] peer c6:2f:d0:e1:0e:1c gone Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:57.801039 Captured frame from MAC (1a:31:26:1e:5f:3c) to (a2:56:60:b7:0a:7d) associated with another peer ba:60:6d:17:86:cc(drd07) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:57.875942 Captured frame from MAC (7e:a6:db:69:5e:55) to (a2:47:a2:a0:09:28) associated with another peer b2:b3:36:b4:26:17(drd02) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:57.878600 Captured frame from MAC (76:8c:85:5c:07:3a) to (0a:30:22:75:25:16) associated with another peer 62:00:4b:57:18:78(drw04) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:57.882100 Captured frame from MAC (6e:bc:7c:5f:3e:71) to (26:92:b3:13:cb:e3) associated with another peer a2:08:24:39:97:51(drw06) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:57.894657 Captured frame from MAC (ee:10:fd:91:00:d0) to (1a:9a:c7:05:3f:73) associated with another peer ba:60:6d:17:86:cc(drd07) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:57.902000 Captured frame from MAC (76:06:63:1c:82:95) to (e2:5f:75:3d:ba:82) associated with another peer 
72:b3:cf:99:6a:4f(drw09) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:57.906528 Captured frame from MAC (36:3f:90:d9:74:d1) to (ba:a6:c4:64:fb:8d) associated with another peer a6:31:78:8d:d1:02(drd08) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:57.941386 Captured frame from MAC (72:1d:8b:3c:4a:91) to (76:e7:a5:4e:17:2a) associated with another peer ba:60:6d:17:86:cc(drd07) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:57.942742 Captured frame from MAC (72:1d:8b:3c:4a:91) to (e2:5f:75:3d:ba:82) associated with another peer ba:60:6d:17:86:cc(drd07) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:57.950593 Captured frame from MAC (6a:fe:60:6f:57:50) to (fe:d6:79:a1:41:f7) associated with another peer ba:60:6d:17:86:cc(drd07) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:57.953845 Captured frame from MAC (f6:db:39:c1:ae:9b) to (16:1a:c1:2e:1c:44) associated with another peer a2:08:24:39:97:51(drw06) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:57.959457 Captured frame from MAC (12:cc:22:5e:74:56) to (6a:64:0c:0e:53:77) associated with another peer a2:08:24:39:97:51(drw06) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:57.992000 Captured frame from MAC (1a:d6:43:2d:21:74) to (0a:30:22:75:25:16) associated with another peer 42:54:8b:e3:d1:c0(drw05) Jul 12 05:51:57 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:57.994428 Captured frame from MAC (72:14:bc:9a:8f:c7) to (a2:56:60:b7:0a:7d) associated with another peer 42:54:8b:e3:d1:c0(drw05) Jul 12 05:51:58 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:58.002077 Captured frame from MAC (1a:d6:43:2d:21:74) to (96:5c:71:48:5b:9a) associated with another peer 42:54:8b:e3:d1:c0(drw05) Jul 12 05:51:58 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:58.018549 Captured frame from MAC (de:da:00:62:58:18) to (72:6e:1d:43:85:d7) associated with another peer ee:d7:04:9f:09:90(drw03) Jul 
12 05:51:58 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:58.021145 Captured frame from MAC (ee:cc:b9:36:f6:43) to (a2:47:a2:a0:09:28) associated with another peer 72:b3:cf:99:6a:4f(drw09) Jul 12 05:51:58 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:58.021795 Captured frame from MAC (e6:21:56:94:23:2c) to (aa:bc:e3:5b:4d:a9) associated with another peer 62:00:4b:57:18:78(drw04) Jul 12 05:51:58 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:58.030644 Captured frame from MAC (5a:fe:0f:4MAC (72:63:27:1e:5a:0b) to (16:1a:c1:2e:1c:44) associated with another peer a6:31:78:8d:d1:02(drd08) Jul 12 05:51:58 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:58.090458 Captured frame from MAC (1a:31:26:1e:5f:3c) to (3a:be:15:b6:f9:f7) associated with another peer ba:60:6d:17:86:cc(drd07) Jul 12 05:51:58 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:58.113999 Captured frame from MAC (d2:b5:99:98:9b:99) to (96:5c:71:48:5b:9a) associated with another peer ee:14:7d:c4:d8:6b(drd01) Jul 12 05:51:58 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:58.114090 Captured frame from MAC (d2:25:db:ab:98:11) to (0a:71:e9:81:60:2a) associated with another peer ee:14:7d:c4:d8:6b(drd01) Jul 12 05:51:58 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:58.114221 Captured frame from MAC (1a:31:26:1e:5f:3c) to (1a:9a:c7:05:3f:73) associated with another peer ba:60:6d:17:86:cc(drd07) Jul 12 05:51:58 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:58.119484 Captured frame from MAC (9e:46:db:78:89:55) to (a2:56:60:b7:0a:7d) associated with another peer a6:31:78:8d:d1:02(drd08) Jul 12 05:51:58 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:58.119553 Captured frame from MAC (9e:46:db:78:89:55) to (ae:17:8a:86:c0:51) associated with another peer a6:31:78:8d:d1:02(drd08) Jul 12 05:51:58 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:58.120161 Captured frame from MAC (4e:1b:de:40:87:ac) to (3a:be:15:b6:f9:f7) associated with another peer e2:e8:5e:7e:3c:40(drd02) Jul 12 05:51:58 
drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:58.131048 Captured frame from MAC (2e:fb:31:5d:7d:f4) to (a2:47:a2:a0:09:28) associated with another peer ee:d7:04:9f:09:90(drw03) Jul 12 05:51:58 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:58.139629 Captured frame from MAC (6e:a0:cc:d2:a7:2c) to (ae:42:3d:6c:eb:6b) associated with another peer ba:60:6d:17:86:cc(drd07) Jul 12 05:51:58 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:58.140913 Captured frame from MAC (86:d9:eb:05:01:85) to (ce:85:df:09:b8:6c) associated with another peer a2:08:24:39:97:51(drw06) Jul 12 05:51:58 drd03 98f12d3f9bd9[21292]: ERRO: 2024/07/12 05:51:58.141398 Captured frame from MAC (76:06:63:1c:82:95) to (e2:e3:90:4f:e2:0c) associated with another peer 72:b3:cf:99:6a:4f(drw09)

rajch commented 3 months ago

This log looks normal. You can see a downed node being removed, from "Unable to find connection to relay peer c6:2f:d0:e1:0e:1c" to "Removed unreachable peer c6:2f:d0:e1:0e:1c(drc02)" to "[nameserver 16:fa:31:50:ca:ee] peer c6:2f:d0:e1:0e:1c gone". It seems that the topology update is not yet complete, but it should be in a while. Are you facing any application-level problems at this time?
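To follow that sequence for one peer without reading the whole log, something like the sketch below may help. It assumes the run-together syslog format pasted above, where each entry starts with a `Jul 12 05:51:51 host id[pid]: ` prefix followed by `INFO:`, `WARN:` or `ERRO:` and a Weave timestamp. The function names `peer_timeline` and `removal_duration` are hypothetical helpers, not part of Weave.

```python
import re
from datetime import datetime

# Syslog prefix such as "Jul 12 05:51:51 drd03 98f12d3f9bd9[21292]: "
SYSLOG_PREFIX = re.compile(r"[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2} \S+ \S+\[\d+\]: ")
# Weave entry such as "INFO: 2024/07/12 05:51:51.380763 <message>"
ENTRY = re.compile(r"(INFO|WARN|ERRO): (\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d+) (.*)", re.S)
TIME_FORMAT = "%Y/%m/%d %H:%M:%S.%f"


def peer_timeline(log_text, peer_mac):
    """Return (timestamp, message) pairs for entries that mention peer_mac."""
    events = []
    for chunk in SYSLOG_PREFIX.split(log_text):
        match = ENTRY.match(chunk.strip())
        if match and peer_mac in match.group(3):
            stamp = datetime.strptime(match.group(2), TIME_FORMAT)
            events.append((stamp, match.group(3).strip()))
    return events


def removal_duration(log_text, peer_mac):
    """Seconds from the first entry mentioning the peer to its removal, or None."""
    events = peer_timeline(log_text, peer_mac)
    for stamp, message in events:
        if message.startswith("Removed unreachable peer"):
            return (stamp - events[0][0]).total_seconds()
    return None
```

On the excerpt above, the first entry for c6:2f:d0:e1:0e:1c is at 05:51:51.380666 and its removal is at 05:51:57.782531, so about 6.4 seconds pass on this node.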

megarajan commented 3 months ago

Yes. While this is happening, we are also having issues in our Consul cluster, and the combination of the two is causing application downtime.

rajch commented 3 months ago

I'll try to reproduce this with consul on a test cluster. Meanwhile, could we try increasing the connlimit to 625?

megarajan commented 3 months ago

Yes, we will try increasing connlimit to 625.

My understanding is that increasing connlimit to 625 will reduce the number of topology updates and will also decrease the convergence time.

Is that correct?

rajch commented 3 months ago

It should make a shorter convergence time more likely, because topology updates are likely to complete faster with more connections available. I'm sorry, but that is all I can promise at the moment.

megarajan commented 3 months ago

Is there a way to measure the convergence time?

Is there any KPI to measure convergence time, or to tell when convergence has completed?

If I increase the connlimit to 625, how can we tell whether convergence time has improved?

rajch commented 3 months ago

No, there isn't. One possibility is to watch the weave_ipam_unreachable* metrics over time, and with different connlimits. See here for how.
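A minimal sketch of such a watcher, assuming the router exposes Prometheus text-format metrics over HTTP. The URL in the usage comment, the polling interval, and the helper names `parse_unreachable_metrics` and `poll` are assumptions for illustration; check the linked page for the actual endpoint. The parser also assumes sample lines carry no trailing timestamp.

```python
import time
import urllib.request

METRIC_PREFIX = "weave_ipam_unreachable"


def parse_unreachable_metrics(metrics_text):
    """Extract samples whose name starts with METRIC_PREFIX from Prometheus text."""
    samples = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Assumes "name{labels} value" with no trailing timestamp.
        name_and_labels, _, value = line.rpartition(" ")
        if name_and_labels.startswith(METRIC_PREFIX):
            samples[name_and_labels] = float(value)
    return samples


def poll(url, interval_s=5.0, rounds=12):
    """Print the matching metrics every interval_s seconds, rounds times."""
    for _ in range(rounds):
        with urllib.request.urlopen(url, timeout=5) as response:
            text = response.read().decode("utf-8")
        print(time.strftime("%H:%M:%S"), parse_unreachable_metrics(text))
        time.sleep(interval_s)


# Hypothetical usage; the address and port are assumptions:
# poll("http://127.0.0.1:6782/metrics")
```

Running this on one node while taking another node down, once per connlimit setting, gives a rough comparison of how long the unreachable values take to settle back.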

megarajan commented 3 months ago

OK, thanks for the info. Will try.