k8snetworkplumbingwg / sriov-cni

DPDK & SR-IOV CNI plugin
Apache License 2.0

Revert "Add support for allmulticast flag" #272

Closed adrianchiris closed 1 year ago

adrianchiris commented 1 year ago

Reverts k8snetworkplumbingwg/sriov-cni#268

The tuning CNI already provides a way to set/restore allmulticast for a netdev. This configuration is general for netdevs, not just VFs:

https://github.com/containernetworking/cni.dev/blob/main/content/plugins/current/meta/tuning.md#interface-attribute-configuration-reference

My preference is to not bloat the sriov-cni config if there is an alternative.
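
Just for reference, a chained configuration along these lines is what I have in mind (a sketch only; the network name and IPAM are made up, and the exact tuning fields should be checked against the doc above):

{
  "cniVersion": "0.4.0",
  "name": "sriov-allmulti-net",
  "plugins": [
    {
      "type": "sriov",
      "ipam": { "type": "host-local", "subnet": "10.56.0.0/24" }
    },
    {
      "type": "tuning",
      "allmulti": true
    }
  ]
}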

@SchSeba @Eoghan1232 @mlguerrero12

github-actions[bot] commented 1 year ago

Pull Request Test Coverage Report for Build 5330893244


Files with Coverage Reduction               New Missed Lines    %
pkg/utils/mocks/netlink_manager_mock.go     1                   79.72%
Totals Coverage Status
Change from base Build 5326045906: -1.4%
Covered Lines: 434
Relevant Lines: 1128

💛 - Coveralls
mlguerrero12 commented 1 year ago

That's correct @adrianchiris. Thanks for pointing it out.

The reason to have allmulticast here (besides facilitating configuration) was to have additional checks related to SR-IOV/DPDK. Mainly, the setting of the trust parameter. It is possible to enable allmulticast on the VF without trust being on: no error is returned, the device simply won't enter allmulti RX mode. We wanted to restrict this and avoid having users come to us asking why allmulti doesn't work. Also, for a non-netdev VF, we wanted to return a specific error saying that it is not possible to set it for DPDK drivers (although my initial idea was to ignore it) instead of having the tuning plugin return "failed to get ifname".
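
For context, the behavior the check was meant to guard against can be reproduced manually with plain ip commands (interface names here match the PF/VF example further down the thread; this is a sketch of the manual steps, not sriov-cni's code path):

# mark VF 0 as trusted via its PF, then enable allmulticast on the VF netdev
ip link set enp3s0f0np0 vf 0 trust on
ip link set enp3s0f0v0 allmulticast on
# ALLMULTI shows up in the flags either way; on some NICs allmulti RX only works if trust is on
ip link show enp3s0f0v0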

I understand your point. It's valid. We should have discussed this initially.

adrianchiris commented 1 year ago

Mainly, the setting of the trust parameter. It is possible to enable allmulticast on the VF without trust being on: no error is returned, the device simply won't enter allmulti RX mode.

This is not the case for MLNX NICs:

PF:

# ip l show enp3s0f0np0
4: enp3s0f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether b8:ce:f6:09:fa:38 brd ff:ff:ff:ff:ff:ff
    vf 0     link/ether 00:aa:bb:cc:dd:ee brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
    vf 1     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
    vf 2     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
    vf 3     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off

VF 0:

# ip l show enp3s0f0v0
37: enp3s0f0v0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether d6:26:8f:3d:79:82 brd ff:ff:ff:ff:ff:ff permaddr f2:55:9a:20:11:88

Set all multicast:

# ip l set enp3s0f0v0 allmulticast on
# ip l show enp3s0f0v0
37: enp3s0f0v0: <BROADCAST,MULTICAST,ALLMULTI> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether d6:26:8f:3d:79:82 brd ff:ff:ff:ff:ff:ff permaddr f2:55:9a:20:11:88

So it seems to be driver-implementation specific and should not be enforced in sriov-cni.

For a non-netdev VF, we wanted to return a specific error saying that it is not possible to set it for DPDK drivers (although my initial idea was to ignore it) instead of having the tuning plugin return "failed to get ifname".

I don't think this one is critical enough to mandate the logic being in sriov-cni. Users who run DPDK applications are expected to know their desired CNI configuration. There are also other parameters in CNI that are not applicable to DPDK.

I understand your point. It's valid. We should have discussed this initially.

Indeed, please do so next time.

cgoncalves commented 1 year ago

Both sides make good points. I personally am more inclined to have the option in the SR-IOV CNI (and SR-IOV Network Operator) for the same reasons @mlguerrero12 mentioned (better validation and user experience). The mtu and mac parameters exist in both the SR-IOV and Tuning CNIs. Requiring users to now also use the Tuning CNI for SR-IOV VFs, with all the disadvantages Marcelo highlighted, would worsen the user experience.

Also, requiring cluster administrators to support and allow users to use the Tuning CNI can pose a security risk, as it allows any sysctl to be changed (the newly introduced sysctl allow-list feature could reduce the attack surface, but still).

adrianchiris commented 1 year ago

@cgoncalves mac and mtu are related to SR-IOV because there are specific parameters on the PF that should be applied on behalf of the VF; these are then applied on the VF netdev to keep it in sync.

allmulticast apparently is NOT related to SR-IOV, but is general for netdevs. Will we now add every netdev-configurable parameter to sriov-cni?

Also, requiring cluster administrators to support and allow users to use the Tuning CNI can pose a security risk, as it allows any sysctl to be changed (the newly introduced sysctl allow-list feature could reduce the attack surface, but still).

Administrators should define the networks and not let users define them :)
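
E.g. in this model the administrator publishes a NetworkAttachmentDefinition that already carries the chained tuning config, and users only reference it from their pods (a hypothetical sketch; the resourceName annotation depends on the device plugin setup):

apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-allmulti
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/sriov_netdevice
spec:
  config: |
    {
      "cniVersion": "0.4.0",
      "name": "sriov-allmulti",
      "plugins": [
        { "type": "sriov", "ipam": { "type": "host-local", "subnet": "10.56.0.0/24" } },
        { "type": "tuning", "allmulti": true }
      ]
    }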

mlguerrero12 commented 1 year ago

You can configure it, but does it work? This is from the NVIDIA docs:

"VF All-Multi Mode VFs can enter an all-multi mode that enables receiving all the multicast traffic sent from/to the other functions on the same physical port in addition to the traffic originally targeted to the VF. Note: Only privileged/trusted VFs can enter the all-multi RX mode."

https://docs.nvidia.com/networking/pages/viewpage.action?pageId=61869558

mlguerrero12 commented 1 year ago

It didn't work for me, but if it works for you, then yeah, let's revert it. That was my main motivation.

SchSeba commented 1 year ago

From my side, again: if it has a correlation with something that is part of sriov-cni, then it must be in sriov-cni. BUT if it's possible to configure it on any netdevice, VF or macvlan for example, then we should use the implementation from the tuning CNI.

mlguerrero12 commented 1 year ago

Just to be clear, I'm claiming that it is possible to configure it (the allmulti flag is set), but it doesn't really work.

SchSeba commented 1 year ago

Just to be clear, I'm claiming that it is possible to configure it (the allmulti flag is set), but it doesn't really work.

Then it's even worse than getting an error, if you ask me :P

adrianchiris commented 1 year ago

You can configure it, but does it work?

How do I test this one? Do you have some instructions?

If it doesn't and it silently fails, I'd still expect the driver to fix it (and actually fail when turning on allmulti). I'll create an internal ticket for them to address it.

mlguerrero12 commented 1 year ago

Using a Mellanox NIC, I have:

POD A
28: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 22:f9:f4:98:bc:ce brd ff:ff:ff:ff:ff:ff
    vf 3     link/ether 22:f9:f4:98:bc:ce brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state enable, trust on, query_rss off

POD B
25: net1: <BROADCAST,MULTICAST,ALLMULTI,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether b2:8c:90:d2:ae:71 brd ff:ff:ff:ff:ff:ff
    vf 0     link/ether b2:8c:90:d2:ae:71 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state enable, trust off, query_rss off

POD C
26: net1: <BROADCAST,MULTICAST,ALLMULTI,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether ba:8c:f7:c3:1c:1b brd ff:ff:ff:ff:ff:ff
    vf 1     link/ether ba:8c:f7:c3:1c:1b brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state enable, trust on, query_rss off

Pod A pings a multicast address:

sh-4.4# ping 224.10.10.10 -I net1
ping: Warning: source address might be selected on device other than net1.
PING 224.10.10.10 (224.10.10.10) from 10.132.3.231 net1: 56(84) bytes of data.

Pod B doesn't see this (nor other multicast traffic):

sh-4.4# tcpdump -i net1
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on net1, link-type EN10MB (Ethernet), capture size 262144 bytes
08:02:56.598926 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 08:c0:eb:96:32:2c (oui Unknown), length 276
08:02:56.711677 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b4:96:91:b4:73:18 (oui Unknown), length 276
08:02:56.796698 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b4:96:91:b4:71:a1 (oui Unknown), length 276
08:02:57.306474 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b4:96:91:b4:71:a0 (oui Unknown), length 276
08:02:57.577047 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b4:96:91:b4:78:a9 (oui Unknown), length 285
08:02:57.595346 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b4:96:91:b4:73:19 (oui Unknown), length 276
08:02:57.720536 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b4:96:91:b4:78:a8 (oui Unknown), length 285
08:02:58.262794 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b4:96:91:b4:75:f9 (oui Unknown), length 285

Pod C sees this and the IPv6 router solicitations, which is what our customer is asking for:

sh-4.4# tcpdump -i net1
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on net1, link-type EN10MB (Ethernet), capture size 262144 bytes
08:02:55.875767 IP 10.132.3.231 > 224.10.10.10: ICMP echo request, id 6, seq 462, length 64
08:02:55.997663 IP6 fe80::b696:91ff:feb4:71a1 > ff02::2: ICMP6, router solicitation, length 8
08:02:56.034849 IP6 fe80::ac0:ebff:fe96:322c > ff02::2: ICMP6, router solicitation, length 8
08:02:56.328419 IP6 fe80::b696:91ff:feb4:71a0 > ff02::2: ICMP6, router solicitation, length 8
08:02:56.353606 IP6 fe80::8cf4:f52d:7477:b838 > ff02::2: ICMP6, router solicitation, length 8
08:02:56.598889 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 08:c0:eb:96:32:2c (oui Unknown), length 276
08:02:56.711682 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b4:96:91:b4:73:18 (oui Unknown), length 276
08:02:56.796705 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b4:96:91:b4:71:a1 (oui Unknown), length 276
08:02:56.880993 STP 802.1w, Rapid STP, Flags [Learn, Forward], bridge-id 8000.e0:f6:2d:60:0f:d0.820c, length 43
08:02:56.899712 IP 10.132.3.231 > 224.10.10.10: ICMP echo request, id 6, seq 463, length 64
08:02:57.012966 IP6 fe80::f446:e743:f09b:e8e6 > ff02::2: ICMP6, router solicitation, length 8
08:02:57.223575 IP6 fe80::79a5:a9dc:6516:618 > ff02::2: ICMP6, router solicitation, length 8

adrianchiris commented 1 year ago

I see, thanks for checking this. So it behaves the same for MLNX (silently fails).

mlguerrero12 commented 1 year ago

Does this mean that you are ok with this, or that you will push internally for a fix? By the way, I saw the same behavior with an Intel card.

adrianchiris commented 1 year ago
  1. I still think we should not have this config as part of sriov-cni; I'll discuss it with the maintainers in tomorrow's meeting.
  2. Yes, I will open a bug internally and see what I get from the driver team.
  3. allMulti + VF trust can be checked in sriov-network-operator (the check would need to ensure VF trust is enabled if a tuning CNI config is provided as a meta plugin with allmulti enabled); a rough example of that combination is below.
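
To make point 3 concrete, the rendered CNI config the check would look for is roughly the following (a sketch only; names and IPAM are illustrative): sriov-cni turning VF trust on, with tuning chained as a meta plugin enabling allmulti:

{
  "cniVersion": "0.4.0",
  "name": "sriov-allmulti-trusted",
  "plugins": [
    {
      "type": "sriov",
      "trust": "on",
      "ipam": { "type": "host-local", "subnet": "10.56.0.0/24" }
    },
    {
      "type": "tuning",
      "allmulti": true
    }
  ]
}
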
Eoghan1232 commented 1 year ago

So I wanted to check this on my side and review properly as well:

3: ens785f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether b4:96:91:b4:44:00 brd ff:ff:ff:ff:ff:ff
    vf 0     link/ether fe:fc:cf:f2:9d:93 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 1     link/ether 72:2f:d7:c4:e5:69 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 2     link/ether 6a:f3:a7:2d:77:44 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 3     link/ether d2:bf:70:02:a5:85 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    altname enp75s0f0
~# ip l show ens785f0v0
176: ens785f0v0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether fe:fc:cf:f2:9d:93 brd ff:ff:ff:ff:ff:ff
    altname enp75s0f0v0
~# ip l set enp75s0f0v0 allmulticast off
~# ip l set enp75s0f0v0 allmulticast on
~# ip l show ens785f0v0
176: ens785f0v0: <BROADCAST,MULTICAST,ALLMULTI> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether fe:fc:cf:f2:9d:93 brd ff:ff:ff:ff:ff:ff
    altname enp75s0f0v0
~#

So I am able to set allmulticast on, even with trust off.

@mlguerrero12 you mentioned you saw a similar issue with Intel cards too, right? If so, I would like to know the card/driver you were testing with so I can log an internal ticket to address it. Assuming E800 series and the IAVF driver, I'd also like to run the quick test of sending the multicast traffic.
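
Something like the following is the quick test I have in mind (interface names are hypothetical, both VFs need to be up with an IP assigned; a sketch, not exact steps):

# VF 1 has allmulticast on but trust off; send multicast from VF 0 and watch VF 1
ip link set ens785f0 vf 1 trust off
ip link set ens785f0v1 allmulticast on
ping -I ens785f0v0 224.10.10.10 &     # unjoined multicast group, same as the test above
tcpdump -i ens785f0v1 -n icmp         # no echo requests captured => allmulti silently ignored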

As for adding the allmulticast option to sriov-cni: since it doesn't relate specifically to SR-IOV, but to netdevs in general, I am not sure it should be added here...

This sounds like a driver issue; the driver should log this failure instead of silently ignoring it.

mlguerrero12 commented 1 year ago

@Eoghan1232

17:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-C for SFP (rev 02)
driver: iavf

Indeed, it is possible to configure it, but the interface won't see the multicast traffic. Please test it from your side.

I totally agree. If this can be solved at the driver level, then we don't need this check. The main motivation of the PR was to address this issue.

adrianchiris commented 1 year ago

@SchSeba @zeeke @mlguerrero12

I've rebased the PR to resolve conflicts. LMK if we can move forward with this revert.

mlguerrero12 commented 1 year ago

LGTM

adrianchiris commented 1 year ago

@SchSeba are you on board with reverting this?