Closed: adrianchiris closed this 1 year ago
| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| pkg/utils/mocks/netlink_manager_mock.go | 1 | 79.72% |
| Totals | |
|---|---|
| Change from base Build 5326045906: | -1.4% |
| Covered Lines: | 434 |
| Relevant Lines: | 1128 |
That's correct @adrianchiris. Thanks for pointing it out.
The reason to have allmulticast here (besides facilitating configuration) was to add extra checks related to SR-IOV/DPDK, mainly around the trust parameter. It is possible to enable allmulticast on a VF without trust being on: no error is returned, but the device simply won't enter allmulti rx mode. We wanted to restrict this and avoid users asking us why allmulti doesn't work. Also, for a non-netdev VF we wanted to return a specific error saying that it cannot be set for DPDK drivers (although my initial idea was to ignore it), instead of having the tuning plugin return "failed to get ifname".
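The guard described above could be sketched roughly as follows. This is an illustrative approximation, not the actual sriov-cni code: it checks the `trust` attribute in a PF's `ip link show` vf line before allowing an allmulticast request (the real implementation would query netlink directly rather than parse text).

```shell
# Return success iff the given "vf N ..." line reports "trust on".
vf_trusted() {
  case "$1" in
    *"trust on"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Reject an allmulticast "on" request when the VF is not trusted,
# mirroring the restriction the PR tried to enforce.
check_allmulti() {
  # $1: requested state ("on"/"off"), $2: vf line from `ip link show <pf>`
  if [ "$1" = "on" ] && ! vf_trusted "$2"; then
    echo "error: allmulticast requires trust to be enabled on the VF" >&2
    return 1
  fi
  return 0
}
```

The point of contention below is whether such a check belongs in sriov-cni at all, given that the kernel itself accepts the flag silently.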
I understand your point. It's valid. We should have discussed this initially.
> Mainly, the setting of the trust parameter. It is possible to enable allmulticast on the VF without trust being on. No error is returned, the device simply won't enter the allmulti rx mode.
This is not the case for MLNX NICs:
PF:
```
# ip l show enp3s0f0np0
4: enp3s0f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether b8:ce:f6:09:fa:38 brd ff:ff:ff:ff:ff:ff
    vf 0 link/ether 00:aa:bb:cc:dd:ee brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
    vf 1 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
    vf 2 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
    vf 3 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
```
VF 0:
```
# ip l show enp3s0f0v0
37: enp3s0f0v0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether d6:26:8f:3d:79:82 brd ff:ff:ff:ff:ff:ff permaddr f2:55:9a:20:11:88
```
Set all multicast:
```
# ip l set enp3s0f0v0 allmulticast on
# ip l show enp3s0f0v0
37: enp3s0f0v0: <BROADCAST,MULTICAST,ALLMULTI> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether d6:26:8f:3d:79:82 brd ff:ff:ff:ff:ff:ff permaddr f2:55:9a:20:11:88
```
So it seems to be driver implementation specific, and should not be enforced in sriov-cni.
> for non netdev VF, we wanted to return a specific error saying that it is not possible to set it on for dpdk drivers (although my initial idea was to ignore it) instead of having the tuning plugin return "failed to get ifname".
I don't think this one is critical enough to mandate the logic being in sriov-cni. Users who run DPDK applications are expected to know their desired CNI configuration. There are also other parameters in CNI that are not applicable to DPDK.
> I understand your point. It's valid. We should have discussed this initially.
Indeed, please do so next time.
Both sides make good points. I personally am more inclined to have the option in the SR-IOV CNI (and SR-IOV Network Operator) for the same reasons @mlguerrero12 mentioned (greater validation and a better user experience). Parameters `mtu` and `mac` exist in both the SR-IOV and Tuning CNIs. Requiring users to now also use the Tuning CNI for SR-IOV VFs, with all the disadvantages Marcelo highlighted, would worsen the user experience.
Also, requiring cluster administrators to support and allow users to use the Tuning CNI can pose a security risk, as it allows any sysctl to be changed (the newly introduced sysctl allow-list feature could reduce the attack surface, but still).
@cgoncalves `mac` and `mtu` are related to SR-IOV, as there are specific parameters on the PF that should be applied on behalf of the VF; these are then applied on the VF netdev to keep them in sync.
allmulticast apparently is NOT related to SR-IOV, but is general for netdevs. Will we now add all netdev-configurable parameters to sriov-cni?
> Also, requiring cluster administrators to support and allow users to use the Tuning CNI can pose a security risk, as it allows any sysctl to be changed (the newly introduced sysctl allow-list feature could reduce the attack surface, but still).
Administrators should define the networks and not let users define them :)
You can configure it, but does it work? This is from the NVIDIA docs:
> VF All-Multi Mode: VFs can enter an all-multi mode that enables receiving all the multicast traffic sent from/to the other functions on the same physical port, in addition to the traffic originally targeted to the VF. Note: Only privileged/trusted VFs can enter the all-multi RX mode.
https://docs.nvidia.com/networking/pages/viewpage.action?pageId=61869558
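Per that note, the sequence that should work on MLNX NICs would be the following (a sketch using the device names from the earlier output; substitute your own PF/VF netdevs):

```shell
# 1. Mark VF 0 as trusted on the PF:
ip link set enp3s0f0np0 vf 0 trust on
# 2. Enable all-multicast on the VF netdev:
ip link set enp3s0f0v0 allmulticast on
# 3. Verify the ALLMULTI flag is set; note that the flag appearing
#    here does not by itself prove multicast traffic is received --
#    that needs a traffic test (e.g. tcpdump on the VF):
ip link show enp3s0f0v0
```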
It didn't work for me, but if it works for you, then yeah, let's revert it. That was my main motivation.
From my side again: if it has a correlation with something that is part of sriov-cni, then it must be in sriov-cni. BUT if it's possible to configure it on any netdev VF, or macvlan for example, then we should use the implementation from the tuning-cni.
Just to be clear, I'm claiming that it is possible to configure it (allmulti flag is set) but it doesn't really work
> Just to be clear, I'm claiming that it is possible to configure it (allmulti flag is set) but it doesn't really work
then it's even worse than getting an error if you ask me :P
> you can configure it, but does it work?
How do I test this one? Do you have some instructions?
If it doesn't work and it silently fails, then I'd still expect the driver to fix it (and actually fail when turning on all-multi). I'll create an internal ticket for them to address this.
Using a Mellanox NIC, I have:
POD A
```
28: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 22:f9:f4:98:bc:ce brd ff:ff:ff:ff:ff:ff
    vf 3 link/ether 22:f9:f4:98:bc:ce brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state enable, trust on, query_rss off
```
POD B
```
25: net1: <BROADCAST,MULTICAST,ALLMULTI,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether b2:8c:90:d2:ae:71 brd ff:ff:ff:ff:ff:ff
    vf 0 link/ether b2:8c:90:d2:ae:71 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state enable, trust off, query_rss off
```
POD C
```
26: net1: <BROADCAST,MULTICAST,ALLMULTI,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether ba:8c:f7:c3:1c:1b brd ff:ff:ff:ff:ff:ff
    vf 1 link/ether ba:8c:f7:c3:1c:1b brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state enable, trust on, query_rss off
```
Pod A pings a multicast address:
```
sh-4.4# ping 224.10.10.10 -I net1
ping: Warning: source address might be selected on device other than net1.
PING 224.10.10.10 (224.10.10.10) from 10.132.3.231 net1: 56(84) bytes of data.
```
Pod B doesn't see this (nor other multicast traffic):
```
sh-4.4# tcpdump -i net1
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on net1, link-type EN10MB (Ethernet), capture size 262144 bytes
08:02:56.598926 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 08:c0:eb:96:32:2c (oui Unknown), length 276
08:02:56.711677 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b4:96:91:b4:73:18 (oui Unknown), length 276
08:02:56.796698 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b4:96:91:b4:71:a1 (oui Unknown), length 276
08:02:57.306474 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b4:96:91:b4:71:a0 (oui Unknown), length 276
08:02:57.577047 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b4:96:91:b4:78:a9 (oui Unknown), length 285
08:02:57.595346 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b4:96:91:b4:73:19 (oui Unknown), length 276
08:02:57.720536 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b4:96:91:b4:78:a8 (oui Unknown), length 285
08:02:58.262794 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b4:96:91:b4:75:f9 (oui Unknown), length 285
```
Pod C sees this and the IPv6 router solicitations, which is what our customer is asking for:
```
sh-4.4# tcpdump -i net1
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on net1, link-type EN10MB (Ethernet), capture size 262144 bytes
08:02:55.875767 IP 10.132.3.231 > 224.10.10.10: ICMP echo request, id 6, seq 462, length 64
08:02:55.997663 IP6 fe80::b696:91ff:feb4:71a1 > ff02::2: ICMP6, router solicitation, length 8
08:02:56.034849 IP6 fe80::ac0:ebff:fe96:322c > ff02::2: ICMP6, router solicitation, length 8
08:02:56.328419 IP6 fe80::b696:91ff:feb4:71a0 > ff02::2: ICMP6, router solicitation, length 8
08:02:56.353606 IP6 fe80::8cf4:f52d:7477:b838 > ff02::2: ICMP6, router solicitation, length 8
08:02:56.598889 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 08:c0:eb:96:32:2c (oui Unknown), length 276
08:02:56.711682 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b4:96:91:b4:73:18 (oui Unknown), length 276
08:02:56.796705 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b4:96:91:b4:71:a1 (oui Unknown), length 276
08:02:56.880993 STP 802.1w, Rapid STP, Flags [Learn, Forward], bridge-id 8000.e0:f6:2d:60:0f:d0.820c, length 43
08:02:56.899712 IP 10.132.3.231 > 224.10.10.10: ICMP echo request, id 6, seq 463, length 64
08:02:57.012966 IP6 fe80::f446:e743:f09b:e8e6 > ff02::2: ICMP6, router solicitation, length 8
08:02:57.223575 IP6 fe80::79a5:a9dc:6516:618 > ff02::2: ICMP6, router solicitation, length 8
```
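The test above boils down to the following manual procedure (a sketch; `net1` is the SR-IOV interface attached by the CNI, and the multicast address is the one used above):

```shell
# In the sender pod: generate multicast traffic on the VF interface.
ping 224.10.10.10 -I net1
# In each receiver pod: watch for the multicast packets.
tcpdump -i net1
# Expected: only the VF with trust on (Pod C) sees the ICMP echo
# requests; the untrusted VF (Pod B) does not, despite ALLMULTI being set.
```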
I see, thanks for checking this. So it behaves the same for mlnx (silently fails).
Does this mean that you are OK with this, or that you will push internally for a fix? By the way, I saw the same behavior with an Intel card.
So I wanted to check this on my side and review properly as well:
```
3: ens785f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether b4:96:91:b4:44:00 brd ff:ff:ff:ff:ff:ff
    vf 0 link/ether fe:fc:cf:f2:9d:93 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 1 link/ether 72:2f:d7:c4:e5:69 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 2 link/ether 6a:f3:a7:2d:77:44 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 3 link/ether d2:bf:70:02:a5:85 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    altname enp75s0f0
~# ip l show ens785f0v0
176: ens785f0v0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether fe:fc:cf:f2:9d:93 brd ff:ff:ff:ff:ff:ff
    altname enp75s0f0v0
~# ip l set enp75s0f0v0 allmulticast off
~# ip l set enp75s0f0v0 allmulticast on
~# ip l show ens785f0v0
176: ens785f0v0: <BROADCAST,MULTICAST,ALLMULTI> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether fe:fc:cf:f2:9d:93 brd ff:ff:ff:ff:ff:ff
    altname enp75s0f0v0
~#
```
So I am able to set allmulticast to on, even with trust off.
@mlguerrero12 you mentioned you saw a similar issue with Intel cards too, right? If so, I would like to know the card/driver you were testing with, so I can log an internal ticket to address it. Assuming E800 series and the IAVF driver, I'd also like to run the quick test of sending the multicast traffic.
As for adding the allmulticast option to sriov-cni: since it doesn't relate specifically to SR-IOV, just to netdevs in general, I am not sure it should be added here...
This sounds like a driver issue, which should log this failure instead of failing silently.
@Eoghan1232
```
17:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-C for SFP (rev 02)
driver: iavf
```
Indeed, it is possible to configure it, but the interface won't see the multicast traffic. Please test it from your side.
I totally agree. If this can be solved at the driver level, then we don't need this check. The main motivation of the PR was to address this issue.
@SchSeba @zeeke @mlguerrero12
I've rebased the PR to resolve conflicts. LMK if we can move forward with this revert.
LGTM
@SchSeba, are you on board with reverting this?
Reverts k8snetworkplumbingwg/sriov-cni#268
The tuning CNI already provides a way to set/restore all-multicast for a netdev. This configuration is general for netdevs, not necessarily VFs:
https://github.com/containernetworking/cni.dev/blob/main/content/plugins/current/meta/tuning.md#interface-attribute-configuration-reference
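For reference, the alternative being proposed would look roughly like the chained config below: sriov-cni attaches the VF and the tuning plugin sets the flag. This is a minimal sketch; the network name is made up, the `sriov` entry is abbreviated (a real one needs IPAM etc.), and the `allmulti` field name follows the tuning reference linked above.

```json
{
  "cniVersion": "0.3.1",
  "name": "sriov-allmulti-net",
  "plugins": [
    { "type": "sriov" },
    { "type": "tuning", "allmulti": true }
  ]
}
```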
My preference is to not bloat the sriov-cni config if there is an alternative.
@SchSeba @Eoghan1232 @mlguerrero12