sflow / host-sflow

host-sflow agent
http://sflow.net

The agentIP selection is wrong when multiple devices have the same IPv4/IPv6 address #61

Open shuaishang opened 5 months ago

shuaishang commented 5 months ago

We used hsflowd in SONiC.

Issue: Per the hsflowd design ("EnumIPSelectionPriority"), an IPv4 address should have higher priority than an IPv6 address.

But hsflowd wrongly selected the IPv6 address:

root@MC-54:/# cat /etc/hsflowd.auto
# WARNING: Do not edit this file. It is generated automatically by hsflowd.
rev_start=2
hostname=MC-54
sampling=400
header=128
datagram=1400
polling=20
agentIP=fd00:0:201::5
agent=Loopback0
ds_index=1
collector=26.34.15.106/6343//
rev_end=2

The devices Loopback0, Loopback1001 and Loopback1002 belong to different VRFs, so they can have the same IPv4/IPv6 address:

root@MC-54:~# ip addr show dev Loopback0
35: Loopback0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 8a:5f:78:e1:3b:5d brd ff:ff:ff:ff:ff:ff
    inet 10.145.240.15/32 scope global Loopback0
       valid_lft forever preferred_lft forever
    inet6 fd00:0:201::5/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::885f:78ff:fee1:3b5d/64 scope link
       valid_lft forever preferred_lft forever
root@MC-54:~#
root@MC-54:~# ip addr show dev Loopback1001
212: Loopback1001: <BROADCAST,NOARP,UP,LOWER_UP> mtu 65536 qdisc noqueue master Vrf10002 state UNKNOWN group default qlen 1000
    link/ether 1a:4d:0d:10:8d:35 brd ff:ff:ff:ff:ff:ff
    inet 10.145.240.15/32 scope global Loopback1001
       valid_lft forever preferred_lft forever
    inet6 fd00:0:201::5/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::184d:dff:fe10:8d35/64 scope link
       valid_lft forever preferred_lft forever
root@MC-54:~#
root@MC-54:~# ip addr show dev Loopback1002
213: Loopback1002: <BROADCAST,NOARP,UP,LOWER_UP> mtu 65536 qdisc noqueue master Vrf10006 state UNKNOWN group default qlen 1000
    link/ether 02:1f:9f:c1:6d:25 brd ff:ff:ff:ff:ff:ff
    inet 10.145.240.15/32 scope global Loopback1002
       valid_lft forever preferred_lft forever
    inet6 fd00:0:201::5/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::1f:9fff:fec1:6d25/64 scope link
       valid_lft forever preferred_lft forever
root@MC-54:~#

But in the function "readInterfaces", the hash key of localIP/localIP6 is only the IPv4/IPv6 address, without the dev/ifname.

  // keep v4 and v6 separate to simplify HT logic
  UTHash *newLocalIP = UTHASH_NEW(HSPLocalIP, ipAddr.address.ip_v4, UTHASH_DFLT);
  UTHash *newLocalIP6 = UTHASH_NEW(HSPLocalIP, ipAddr.address.ip_v6, UTHASH_DFLT);

In our example, Loopback0, Loopback1001 and Loopback1002 all have the same IPv4 address "10.145.240.15/32". But after "readInterfaces", "localIP" holds only one "10.145.240.15/32" entry, and it belongs to "Loopback1001". As a result, the agent device "Loopback0" can't select the correct agentIP.
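The collision described above can be reproduced with a minimal sketch. This is not the hsflowd code: LocalIP, LocalIPTable and the three functions are hypothetical stand-ins for HSPLocalIP and the UTHash table, and the "last insert wins" policy is an assumption (the real table may keep a different interface depending on enumeration order).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_ENTRIES 16
#define IFNAME_LEN 32

/* Hypothetical stand-in for HSPLocalIP: one address plus the device that owns it. */
typedef struct {
  uint32_t ip4;         /* IPv4 address, host byte order */
  char dev[IFNAME_LEN]; /* interface name */
} LocalIP;

typedef struct {
  LocalIP entries[MAX_ENTRIES];
  int count;
} LocalIPTable;

/* Keyed by address only: a second interface with the same address
   overwrites the device recorded by the first (assumed policy). */
void add_by_addr(LocalIPTable *t, uint32_t ip4, const char *dev) {
  assert(t != NULL && dev != NULL);
  for (int i = 0; i < t->count; i++) {
    if (t->entries[i].ip4 == ip4) {
      snprintf(t->entries[i].dev, IFNAME_LEN, "%s", dev);
      return;
    }
  }
  if (t->count < MAX_ENTRIES) {
    t->entries[t->count].ip4 = ip4;
    snprintf(t->entries[t->count].dev, IFNAME_LEN, "%s", dev);
    t->count++;
  }
}

/* Keyed by (address, device): the same address on different
   interfaces yields separate entries. */
void add_by_addr_and_dev(LocalIPTable *t, uint32_t ip4, const char *dev) {
  assert(t != NULL && dev != NULL);
  for (int i = 0; i < t->count; i++) {
    if (t->entries[i].ip4 == ip4 && strcmp(t->entries[i].dev, dev) == 0)
      return; /* exact duplicate, nothing to do */
  }
  if (t->count < MAX_ENTRIES) {
    t->entries[t->count].ip4 = ip4;
    snprintf(t->entries[t->count].dev, IFNAME_LEN, "%s", dev);
    t->count++;
  }
}

/* Returns 1 if the table records ip4 as belonging to dev, else 0. */
int has_addr_on_dev(const LocalIPTable *t, uint32_t ip4, const char *dev) {
  assert(t != NULL && dev != NULL);
  for (int i = 0; i < t->count; i++) {
    if (t->entries[i].ip4 == ip4 && strcmp(t->entries[i].dev, dev) == 0)
      return 1;
  }
  return 0;
}
```

Adding 10.145.240.15 for Loopback0, Loopback1001 and Loopback1002 with add_by_addr leaves a single entry that no longer names Loopback0, whereas add_by_addr_and_dev keeps all three, so a lookup for the address on Loopback0 still succeeds.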

sflow commented 5 months ago

At first I thought this sounded like it might be running a version of hsflowd older than 2.0.39-9, when a correction was made to the automatic agent-address selection priorities. But on closer inspection the same thing might happen even with the latest version. Why? Because 10.145.240.15 is an RFC1918 address (which could easily be non-unique across a large network with multiple LANs) while fd00:0:201::5/64 has scope "global" and is therefore preferred as a more-likely-to-be-unique ID for the switch.
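For reference, the RFC1918 test mentioned above amounts to checking three private IPv4 blocks. The sketch below is illustrative only; ip4_from_octets and is_rfc1918 are hypothetical helper names, not functions from hsflowd.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: pack four octets into a host-byte-order IPv4 address. */
uint32_t ip4_from_octets(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
  return ((uint32_t)a << 24) | ((uint32_t)b << 16) | ((uint32_t)c << 8) | (uint32_t)d;
}

/* Returns true if ip4 (host byte order) falls in one of the RFC 1918
   private ranges: 10.0.0.0/8, 172.16.0.0/12 or 192.168.0.0/16. */
bool is_rfc1918(uint32_t ip4) {
  return (ip4 & 0xFF000000u) == 0x0A000000u  /* 10.0.0.0/8     */
      || (ip4 & 0xFFF00000u) == 0xAC100000u  /* 172.16.0.0/12  */
      || (ip4 & 0xFFFF0000u) == 0xC0A80000u; /* 192.168.0.0/16 */
}
```

With these helpers, 10.145.240.15 is classified as private, while the collector address 26.34.15.106 is not.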

In SONiC the fix is to just tell the switch what its agent address should be. It's a CLI option. That overrides the automatic selection. Will that work for you?

Note: another way to do this would be to put a thumb on the scale by adding something like:

agent.cidr=10.245.0.0/16

to the file /etc/hsflowd.conf inside the sflow container, which bumps up the priority of 10.245.0.0/16 addresses. But that seems awkward for SONiC. That approach only really makes sense when the hsflowd.conf config file is easily accessible and is being set by something like Puppet, Kubernetes or DNS-SD.

shuaishang commented 5 months ago

The issue is that automatic agent-address selection doesn't work when the same address is assigned to different interfaces, because the hash key of LocalIP does not include the interface name. (It doesn't matter which IP class has higher priority: IP4_RFC1918, IP6_GLOBAL, or CIDR.)

Configuring agentIP explicitly is OK, however it's a new feature for SONiC...

sflow commented 2 months ago

I see what you mean now. The setting "agent=Loopback0" is supposed to boost the chances of a Loopback0 address being chosen here: https://github.com/sflow/host-sflow/blob/v2.0/src/Linux/hsflowconfig.c#L1117-L1122 but the HSPLocalIP object for "10.145.240.15" has only one dev, and that can end up being "Loopback1001" or "Loopback1002", so the priority boost does not happen.

This example is a little confusing because even if it worked correctly it might still have picked the global fd00:0:201::5 address, but I can see that there might be some scenario where the address is chosen wrongly because the boost is not applied.
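A rough sketch of how a lost device name can change the outcome follows. The Candidate struct, the numeric priorities and AGENT_DEV_BOOST are all hypothetical values chosen for illustration; they are not taken from hsflowconfig.c, and the real boost may be applied differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical candidate agent address. */
typedef struct {
  const char *addr;  /* printable address */
  const char *dev;   /* device recorded for this address */
  int base_priority; /* hypothetical priority of the address class */
} Candidate;

#define AGENT_DEV_BOOST 100 /* hypothetical size of the "agent=<dev>" boost */

/* Base priority, plus the boost when the recorded device matches agent_dev. */
int effective_priority(const Candidate *c, const char *agent_dev) {
  assert(c != NULL);
  int p = c->base_priority;
  if (agent_dev != NULL && c->dev != NULL && strcmp(c->dev, agent_dev) == 0)
    p += AGENT_DEV_BOOST;
  return p;
}

/* Returns the candidate with the highest effective priority
   (the first one wins on a tie), or NULL if n is 0. */
const Candidate *select_agent(const Candidate *cands, size_t n, const char *agent_dev) {
  const Candidate *best = NULL;
  for (size_t i = 0; i < n; i++) {
    if (best == NULL ||
        effective_priority(&cands[i], agent_dev) > effective_priority(best, agent_dev))
      best = &cands[i];
  }
  return best;
}
```

Under these assumed numbers, if both addresses are recorded against Loopback0, both are boosted and the higher base priority still decides. If only one of them keeps the Loopback0 device name, that one wins regardless of base priority, which is the kind of scenario where the selection can go wrong.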

sflow commented 2 months ago

I believe this is now addressed in the master branch. Let me know if you need a release to test.

sflow commented 1 month ago

There is now a release that has this fix: 2.1.03-1.