Status: Closed (RefluxMeds closed this 5 months ago)
I see some work was done in https://github.com/k8snetworkplumbingwg/sriov-cni/issues/133, which contained a proposal for configuring sriov-cni. However, I do not need a configuration option; I simply want sriov-cni to stop trying to apply a VLAN to the VF when the VLAN ID is 0.
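In shell terms, the behavior I am asking for would look roughly like this (an illustrative sketch, not the plugin's actual Go code; `configure_vf_vlan` is a hypothetical helper):

```shell
#!/bin/sh
# Hypothetical sketch: treat VLAN ID 0 as "unset" and skip VLAN
# programming entirely, instead of asking the kernel to set vlan 0.
configure_vf_vlan() {
  pf="$1"; vf="$2"; vlan="$3"
  if [ "$vlan" -eq 0 ]; then
    # VLAN 0 means "no VLAN requested": leave the VF untouched so an
    # administrative trunk policy (VGT+) can stay in effect.
    echo "skip vlan for vf $vf"
    return 0
  fi
  ip link set dev "$pf" vf "$vf" vlan "$vlan"
}
```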
Hi @RefluxMeds, thanks for the input! Yes, please, if you can explain more about the use case, that would be great!
Is the trunk with allowed VLANs configuration driver-specific?
Hello @SchSeba!
We're trying to solve a specific security issue in our infrastructure: prohibiting Pod workloads and other host processes from provisioning arbitrary VLAN IDs on assigned virtual functions. The issue stems from our network fabric setup, but I cannot go into details here as it would be too revealing. This feature would also bring our platform into compliance with the company-wide security policy, so I am very interested in pursuing it.
We are using Mellanox ConnectX-6 (Lx/Dx) and ConnectX-5 NICs on all our servers. These NICs provide a really useful feature we would like to use; however, it is not supported by the current implementation of sriov-cni.
To describe the issue, I will post a somewhat lengthy explainer of how we got here. I apologize in advance if it is too long.
Assume we have two physical functions (PFs), with 8 virtual functions (VFs) per PF:
4: ens3f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc prio state UP mode DEFAULT group default qlen 1000
link/ether b8:3f:d2:3c:6c:78 brd ff:ff:ff:ff:ff:ff
vf 0 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
vf 1 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
vf 2 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
vf 3 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
vf 4 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
vf 5 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
vf 6 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
vf 7 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
5: ens3f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc prio state UP mode DEFAULT group default qlen 1000
link/ether b8:3f:d2:3c:6c:79 brd ff:ff:ff:ff:ff:ff
vf 0 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
vf 1 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
vf 2 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
vf 3 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
vf 4 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
vf 5 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
vf 6 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
vf 7 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
These VFs can be consumed by Pod workloads using sriov-cni. To consume a VF, I create a NetworkAttachmentDefinition that has a VLAN ID defined:
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
annotations:
k8s.v1.cni.cncf.io/resourceName: mellanox.com/mellanox_mlx5_dpdk_right
name: mlnx-right
namespace: test1
spec:
config: '{ "cniVersion": "0.3.1", "type": "sriov", "name": "sriov-mlnx-connectx5",
"spoofchk": "off", "trust": "on", "vlan": 1082 }'
In the Deployment spec, I refer to this NetworkAttachmentDefinition to configure the VF and move it into the Pod network namespace:
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: test-router
name: test-router
namespace: test1
spec:
replicas: 1
selector:
matchLabels:
app: test-router
template:
metadata:
annotations:
k8s.v1.cni.cncf.io/networks: '[ { "name": "mlnx-right", "namespace": "test1" } ]'
labels:
app: test-router
spec:
nodeSelector:
kubernetes.io/hostname: my-worker
containers:
- image: docker.io/mylinuximage:latest
imagePullPolicy: IfNotPresent
securityContext:
privileged: true
name: ubuntu
resources:
limits:
cpu: 1000m
mellanox.com/mellanox_mlx5_dpdk_right: "1"
memory: 4Gi
requests:
cpu: 300m
mellanox.com/mellanox_mlx5_dpdk_right: "1"
memory: 1Gi
The Pod gets spawned with the allocated VF; I can assign an IP and ping the gateway for that VLAN:
$ kubectl -n test1 exec -it test-router-69945bd84f-cx48d -- bash
root@test-router-69945bd84f-cx48d:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
3: eth0@if3966: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 56:40:ad:08:a8:e4 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 192.168.199.214/32 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::5440:adff:fe08:a8e4/64 scope link
valid_lft forever preferred_lft forever
26: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether b6:d7:f9:ac:c9:76 brd ff:ff:ff:ff:ff:ff
inet6 fe80::b4d7:f9ff:feac:c976/64 scope link
valid_lft forever preferred_lft forever
root@test-router-69945bd84f-cx48d:~# ip addr add 172.21.0.52/28 brd 172.21.0.63 dev net1
root@test-router-69945bd84f-cx48d:~# ping -I net1 172.21.0.49
PING 172.21.0.49 (172.21.0.49) from 172.21.0.52 net1: 56(84) bytes of data.
64 bytes from 172.21.0.49: icmp_seq=1 ttl=64 time=12.6 ms
64 bytes from 172.21.0.49: icmp_seq=2 ttl=64 time=2.38 ms
64 bytes from 172.21.0.49: icmp_seq=3 ttl=64 time=0.930 ms
64 bytes from 172.21.0.49: icmp_seq=4 ttl=64 time=0.916 ms
^C
--- 172.21.0.49 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3016ms
rtt min/avg/max/mdev = 0.916/4.221/12.660/4.908 ms
From the host side we can see that one VF is indeed assigned and all is well:
5: ens3f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc prio state UP mode DEFAULT group default qlen 1000
link/ether b8:3f:d2:3c:6c:79 brd ff:ff:ff:ff:ff:ff
vf 0 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
vf 1 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
vf 2 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
vf 3 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
vf 4 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
vf 5 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
vf 6 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, vlan 1082, spoof checking off, link-state auto, trust on, query_rss off
vf 7 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
This is the standard way you would consume a virtual function, and it matches what Nvidia calls VST (VLAN Switch Tagging). It does not fit our use case, as the VF VLANs remain configurable both via the NetworkAttachmentDefinition (i.e. sriov-cni) and manually via NETLINK:
root@worker:~# ip link set dev ens3f1 vf 3 vlan 3000
root@worker:~# ip link set dev ens3f1 vf 3 trust on
root@worker:~# ip link | grep "ens3f1:" -A9
5: ens3f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc prio state UP mode DEFAULT group default qlen 1000
link/ether b8:3f:d2:3c:6c:79 brd ff:ff:ff:ff:ff:ff
vf 0 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
vf 1 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
vf 2 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
vf 3 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, vlan 3000, spoof checking off, link-state auto, trust on, query_rss off
vf 4 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
vf 5 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
vf 6 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, vlan 1082, spoof checking off, link-state auto, trust on, query_rss off
vf 7 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
The other way to consume a VF would be to treat it like a real trunk port and delegate VLAN tagging responsibilities into the container itself, while imposing an administrative trunk policy on the assigned VF. This configuration would match what Nvidia calls VGT or VLAN Guest Tagging.
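For reference, the NetworkAttachmentDefinition adapted for this mode would simply omit the "vlan" key (a sketch based on the definition above):

```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  annotations:
    k8s.v1.cni.cncf.io/resourceName: mellanox.com/mellanox_mlx5_dpdk_right
  name: mlnx-right
  namespace: test1
spec:
  config: '{ "cniVersion": "0.3.1", "type": "sriov", "name": "sriov-mlnx-connectx5",
    "spoofchk": "off", "trust": "on" }'
```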
To configure this, we recreate the NetworkAttachmentDefinition without the VLAN ID and restart the Pod. Now we have to create a VLAN-tagged interface on the VF inside the Pod in order to ping the gateway:
$ kubectl -n test1 exec -it test-router-69945bd84f-tb424 -- bash
root@test-router-69945bd84f-tb424:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
3: eth0@if3967: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether a2:f6:89:13:cf:2d brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 192.168.199.234/32 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::a0f6:89ff:fe13:cf2d/64 scope link
valid_lft forever preferred_lft forever
27: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 5e:46:16:56:e3:b9 brd ff:ff:ff:ff:ff:ff
inet6 fe80::5c46:16ff:fe56:e3b9/64 scope link
valid_lft forever preferred_lft forever
root@test-router-69945bd84f-tb424:~# ip link add link net1 name net1.1082 type vlan id 1082
root@test-router-69945bd84f-tb424:~# ip addr add 172.21.0.52/28 brd 172.21.0.63 dev net1.1082
root@test-router-69945bd84f-tb424:~# ip link set dev net1.1082 up
root@test-router-69945bd84f-tb424:~# ping -I net1.1082 172.21.0.49
PING 172.21.0.49 (172.21.0.49) from 172.21.0.52 net1.1082: 56(84) bytes of data.
64 bytes from 172.21.0.49: icmp_seq=1 ttl=64 time=0.820 ms
64 bytes from 172.21.0.49: icmp_seq=2 ttl=64 time=0.907 ms
64 bytes from 172.21.0.49: icmp_seq=3 ttl=64 time=0.973 ms
64 bytes from 172.21.0.49: icmp_seq=4 ttl=64 time=1.21 ms
^C
--- 172.21.0.49 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3044ms
rtt min/avg/max/mdev = 0.820/0.979/1.219/0.153 ms
This is where the VGT+ feature comes in. VGT+ is an advanced mode of VLAN Guest Tagging in which a VF is allowed to tag its own packets but is still subject to an administrative VLAN trunk policy. To configure the VF trunk policy on both PFs we run:
root@worker:~# for VF in $(ls -l /sys/class/net/ens3f{0..1}/device/sriov/{0..7}/trunk | awk '{print $9}'); do echo add 1070 1081 > $VF; done
root@worker:~# cat /sys/class/net/ens3f{0..1}/device/sriov/{0..7}/trunk
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081
Guest-tagged traffic with VLAN ID 1082 will not be sent or received by the VF:
root@test-router-69945bd84f-tb424:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
3: eth0@if3967: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether a2:f6:89:13:cf:2d brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 192.168.199.234/32 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::a0f6:89ff:fe13:cf2d/64 scope link
valid_lft forever preferred_lft forever
27: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 5e:46:16:56:e3:b9 brd ff:ff:ff:ff:ff:ff
inet6 fe80::5c46:16ff:fe56:e3b9/64 scope link
valid_lft forever preferred_lft forever
root@test-router-69945bd84f-tb424:~# ip link add link net1 name net1.1082 type vlan id 1082
root@test-router-69945bd84f-tb424:~# ip addr add 172.21.0.52/28 brd 172.21.0.63 dev net1.1082
root@test-router-69945bd84f-tb424:~# ip link set dev net1.1082 up
root@test-router-69945bd84f-tb424:~# ping -I net1.1082 172.21.0.49
PING 172.21.0.49 (172.21.0.49) from 172.21.0.52 net1.1082: 56(84) bytes of data.
^C
--- 172.21.0.49 ping statistics ---
16 packets transmitted, 0 received, 100% packet loss, time 15357ms
pipe 3
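The policy check that the NIC enforces here can also be mimicked in user space, e.g. to validate a VLAN before creating the tagged interface (a hypothetical helper that parses the sysfs trunk format shown above):

```shell
#!/bin/sh
# Hypothetical helper: check whether a VLAN ID appears in a line read
# from /sys/class/net/<pf>/device/sriov/<vf>/trunk.
vlan_allowed() {
  vlan="$1"
  trunk_line="$2"   # e.g. "Allowed 802.1Q VLANs: 1070 1071 ..."
  for id in ${trunk_line#Allowed 802.1Q VLANs:}; do
    if [ "$id" = "$vlan" ]; then
      return 0
    fi
  done
  return 1
}
```

With the policy above, `vlan_allowed 1082 "$(cat /sys/class/net/ens3f1/device/sriov/6/trunk)"` fails, matching the ping result.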
Next, we add an additional allowed VLAN to the VF trunks and configure the same VF with another, already-allowed VLAN tag, to demonstrate that the policy indeed works:
root@worker:~# for VF in $(ls -l /sys/class/net/ens3f{0..1}/device/sriov/{0..7}/trunk | awk '{print $9}'); do echo add 1082 1082 > $VF; done
root@worker:~# cat /sys/class/net/ens3f{0..1}/device/sriov/{0..7}/trunk
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082
--- --- --- --- --- --- --- --- ---
--- --- --- --- --- --- --- --- ---
root@test-router-69945bd84f-tb424:~# ip link add link net1 name net1.1082 type vlan id 1082
root@test-router-69945bd84f-tb424:~# ip addr add 172.21.0.52/28 dev net1.1082
root@test-router-69945bd84f-tb424:~# ip link set dev net1.1082 up
root@test-router-69945bd84f-tb424:~# ip link add link net1 name net1.1081 type vlan id 1081
root@test-router-69945bd84f-tb424:~# ip addr add 172.21.0.36/28 dev net1.1081
root@test-router-69945bd84f-tb424:~# ip link set dev net1.1081 up
root@test-router-69945bd84f-tb424:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
3: eth0@if3967: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether a2:f6:89:13:cf:2d brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 192.168.199.234/32 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::a0f6:89ff:fe13:cf2d/64 scope link
valid_lft forever preferred_lft forever
8: net1.1082@net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 5e:46:16:56:e3:b9 brd ff:ff:ff:ff:ff:ff
inet 172.21.0.52/28 scope global net1.1082
valid_lft forever preferred_lft forever
inet6 fe80::5c46:16ff:fe56:e3b9/64 scope link
valid_lft forever preferred_lft forever
9: net1.1081@net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 5e:46:16:56:e3:b9 brd ff:ff:ff:ff:ff:ff
inet 172.21.0.36/28 scope global net1.1081
valid_lft forever preferred_lft forever
inet6 fe80::5c46:16ff:fe56:e3b9/64 scope link
valid_lft forever preferred_lft forever
27: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 5e:46:16:56:e3:b9 brd ff:ff:ff:ff:ff:ff
inet6 fe80::5c46:16ff:fe56:e3b9/64 scope link
valid_lft forever preferred_lft forever
root@test-router-69945bd84f-tb424:~# ping -c4 -I net1.1082 172.21.0.49
PING 172.21.0.49 (172.21.0.49) from 172.21.0.52 net1.1082: 56(84) bytes of data.
64 bytes from 172.21.0.49: icmp_seq=1 ttl=64 time=12.4 ms
64 bytes from 172.21.0.49: icmp_seq=2 ttl=64 time=0.931 ms
64 bytes from 172.21.0.49: icmp_seq=3 ttl=64 time=0.871 ms
64 bytes from 172.21.0.49: icmp_seq=4 ttl=64 time=0.761 ms
--- 172.21.0.49 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3054ms
rtt min/avg/max/mdev = 0.761/3.750/12.440/5.017 ms
root@test-router-69945bd84f-tb424:~# ping -c4 -I net1.1081 172.21.0.33
PING 172.21.0.33 (172.21.0.33) from 172.21.0.36 net1.1081: 56(84) bytes of data.
64 bytes from 172.21.0.33: icmp_seq=1 ttl=64 time=23.7 ms
64 bytes from 172.21.0.33: icmp_seq=2 ttl=64 time=0.777 ms
64 bytes from 172.21.0.33: icmp_seq=3 ttl=64 time=0.822 ms
64 bytes from 172.21.0.33: icmp_seq=4 ttl=64 time=31.3 ms
--- 172.21.0.33 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3054ms
rtt min/avg/max/mdev = 0.777/14.177/31.378/13.648 ms
However, this approach is problematic if the VF is already assigned to a Pod: when the 802.1Q trunk policy changes, all tagged interfaces inside the Pod have to be recreated for the new policy to apply. This means we must either stop all traffic on the interface and recreate it, or set the trunk policies in advance (before application deployment).
Setting the trunk policy before provisioning a VF has undesired consequences:
root@worker:~# for VF in $(ls -l /sys/class/net/ens3f{0..1}/device/sriov/{0..7}/trunk | awk '{print $9}'); do echo add 1082 1082 > $VF; done
--- --- ---
$ kubectl create -f /home/deployment.yaml
$ kubectl -n test1 describe pod test-router-69945bd84f-5826r
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 10s default-scheduler Successfully assigned test1/test-router-69945bd84f-5826r to worker
Warning NoNetworkFound 9s multus cannot find a network-attachment-definition (k8s-pod-network) in namespace (kube-system): network-attachment-definitions.k8s.cni.cncf.io "k8s-pod-network" not found
Normal AddedInterface 9s multus Add eth0 [192.168.199.199/32] from k8s-pod-network
Warning FailedCreatePodSandBox 9s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "8be0ea041e5f00ec8cd4fe2756fc85b23ae5d691ceb4c1fd51afc9229ca6566a": plugin type="multus" name="multus-cni-network" failed (add): [test1/test-router-69945bd84f-5826r:sriov-mlnx-connectx5]: error adding container to network "sriov-mlnx-connectx5": SRIOV-CNI failed to configure VF "failed to set vf 5 vlan: operation not permitted"
Warning NoNetworkFound 8s multus cannot find a network-attachment-definition (k8s-pod-network) in namespace (kube-system): network-attachment-definitions.k8s.cni.cncf.io "k8s-pod-network" not found
Normal AddedInterface 8s multus Add eth0 [192.168.199.245/32] from k8s-pod-network
Warning FailedCreatePodSandBox 8s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "99d05c54d8c428663e99c72716b8fc525c134286a90f97bd23d584bf471a4e40": plugin type="multus" name="multus-cni-network" failed (add): [test1/test-router-69945bd84f-5826r:sriov-mlnx-connectx5]: error adding container to network "sriov-mlnx-connectx5": SRIOV-CNI failed to configure VF "failed to set vf 5 vlan: operation not permitted"
Warning NoNetworkFound 7s multus cannot find a network-attachment-definition (k8s-pod-network) in namespace (kube-system): network-attachment-definitions.k8s.cni.cncf.io "k8s-pod-network" not found
Normal AddedInterface 7s multus Add eth0 [192.168.199.250/32] from k8s-pod-network
Warning FailedCreatePodSandBox 7s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "a89824b786e6cb41f6dee80931d5394ad55dc32939e4647a9ad56d418aa934a9": plugin type="multus" name="multus-cni-network" failed (add): [test1/test-router-69945bd84f-5826r:sriov-mlnx-connectx5]: error adding container to network "sriov-mlnx-connectx5": SRIOV-CNI failed to configure VF "failed to set vf 5 vlan: operation not permitted"
Warning NoNetworkFound 6s multus cannot find a network-attachment-definition (k8s-pod-network) in namespace (kube-system): network-attachment-definitions.k8s.cni.cncf.io "k8s-pod-network" not found
Normal AddedInterface 6s multus Add eth0 [192.168.199.197/32] from k8s-pod-network
Warning FailedCreatePodSandBox 6s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "51687216f7597bc2ad82e99c417c5337e847c6d11b9a4505a93d90aacc9344c7": plugin type="multus" name="multus-cni-network" failed (add): [test1/test-router-69945bd84f-5826r:sriov-mlnx-connectx5]: error adding container to network "sriov-mlnx-connectx5": SRIOV-CNI failed to configure VF "failed to set vf 5 vlan: operation not permitted"
Warning NoNetworkFound 5s multus cannot find a network-attachment-definition (k8s-pod-network) in namespace (kube-system): network-attachment-definitions.k8s.cni.cncf.io "k8s-pod-network" not found
Normal AddedInterface 5s multus Add eth0 [192.168.199.203/32] from k8s-pod-network
Warning FailedCreatePodSandBox 5s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "4ea6cbf7c5338d2677c45b09af896be704887eaac82f5006c77ea42916f6247c": plugin type="multus" name="multus-cni-network" failed (add): [test1/test-router-69945bd84f-5826r:sriov-mlnx-connectx5]: error adding container to network "sriov-mlnx-connectx5": SRIOV-CNI failed to configure VF "failed to set vf 5 vlan: operation not permitted"
Warning NoNetworkFound 4s multus cannot find a network-attachment-definition (k8s-pod-network) in namespace (kube-system): network-attachment-definitions.k8s.cni.cncf.io "k8s-pod-network" not found
Normal AddedInterface 4s multus Add eth0 [192.168.199.208/32] from k8s-pod-network
Warning FailedCreatePodSandBox 4s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "0259497093460581de8061b1a5002a8ee8e6241e8365c3c87b75d7bf228327b4": plugin type="multus" name="multus-cni-network" failed (add): [test1/test-router-69945bd84f-5826r:sriov-mlnx-connectx5]: error adding container to network "sriov-mlnx-connectx5": SRIOV-CNI failed to configure VF "failed to set vf 5 vlan: operation not permitted"
Warning NoNetworkFound 3s multus cannot find a network-attachment-definition (k8s-pod-network) in namespace (kube-system): network-attachment-definitions.k8s.cni.cncf.io "k8s-pod-network" not found
Normal AddedInterface 3s multus Add eth0 [192.168.199.229/32] from k8s-pod-network
Warning FailedCreatePodSandBox 3s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "de1c1c1592364e813842b5717853dfc708a37277aac87581fae3e30779937814": plugin type="multus" name="multus-cni-network" failed (add): [test1/test-router-69945bd84f-5826r:sriov-mlnx-connectx5]: error adding container to network "sriov-mlnx-connectx5": SRIOV-CNI failed to configure VF "failed to set vf 5 vlan: operation not permitted"
Warning NoNetworkFound 2s multus cannot find a network-attachment-definition (k8s-pod-network) in namespace (kube-system): network-attachment-definitions.k8s.cni.cncf.io "k8s-pod-network" not found
Normal AddedInterface 2s multus Add eth0 [192.168.199.235/32] from k8s-pod-network
Warning FailedCreatePodSandBox 2s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "f56958ff6acb07533ea2f36a4e809e9198d54fe2eacc71d0ac1a92adc824bb8b": plugin type="multus" name="multus-cni-network" failed (add): [test1/test-router-69945bd84f-5826r:sriov-mlnx-connectx5]: error adding container to network "sriov-mlnx-connectx5": SRIOV-CNI failed to configure VF "failed to set vf 5 vlan: operation not permitted"
Warning NoNetworkFound 1s multus cannot find a network-attachment-definition (k8s-pod-network) in namespace (kube-system): network-attachment-definitions.k8s.cni.cncf.io "k8s-pod-network" not found
Normal AddedInterface 1s multus Add eth0 [192.168.199.253/32] from k8s-pod-network
Warning FailedCreatePodSandBox 0s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "2e41362ccdf0416b6f684299070cd2a8565b848fe63f6d70620d89b053f4c5a5": plugin type="multus" name="multus-cni-network" failed (add): [test1/test-router-69945bd84f-5826r:sriov-mlnx-connectx5]: error adding container to network "sriov-mlnx-connectx5": SRIOV-CNI failed to configure VF "failed to set vf 5 vlan: operation not permitted"
Warning NoNetworkFound 0s multus cannot find a network-attachment-definition (k8s-pod-network) in namespace (kube-system): network-attachment-definitions.k8s.cni.cncf.io "k8s-pod-network" not found
The same is true for NETLINK commands:
root@worker:~# ip link set dev ens3f1 vf 3 vlan 0
RTNETLINK answers: Operation not permitted
Be mindful that we did not specify a VLAN ID in the NetworkAttachmentDefinition. In the SR-IOV configuration reference, the VLAN tag defaults to 0 (disabled). We believe that a value of 0 does not prevent sriov-cni from attempting to set the VLAN: it tries to set the VLAN tag to 0 and fails, instead of skipping VLAN assignment and simply performing the other operations (such as moving the VF into the Pod network namespace).
We tested this scenario manually by imposing an administrative trunk policy on all VFs, then moving the VF interface into the Pod network namespace and provisioning a forbidden VLAN and an allowed VLAN. First, let us configure the administrative trunk policy on the VFs:
root@worker:~# for VF in $(ls -l /sys/class/net/ens3f{0..1}/device/sriov/{0..7}/trunk | awk '{print $9}'); do echo add 1070 1081 > $VF; done
root@worker:~# cat /sys/class/net/ens3f{0..1}/device/sriov/{0..7}/trunk
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081
Allowed 802.1Q VLANs: 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081
Second, we create a Pod, but this time without the SR-IOV NetworkAttachmentDefinition, because we will manually move the virtual function interface into the Pod network namespace:
$ kubectl -n test1 exec -it test-router-875898ffb-k6g25 -- bash
root@test-router-875898ffb-k6g25:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
3: eth0@if5511: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 7e:33:bf:6b:41:a9 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 192.168.199.208/32 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::7c33:bfff:fe6b:41a9/64 scope link
valid_lft forever preferred_lft forever
Let us go back to the host and try to find this network namespace:
root@worker:~# ip netns list
cni-cdd833b3-1de0-8abd-e4d3-d4daa5cb1e62 (id: 13)
cni-f69bf47a-0950-361f-db8b-c1e485d34b6a (id: 3)
cni-4382825b-a464-fd6c-400c-1c5de6b0683b (id: 0)
cni-27f1a091-e6c7-dbe4-341d-dd9cae23770f (id: 12)
cni-4d7f0916-3e9e-e2f9-1f34-41f0a2630373 (id: 9)
cni-c68fc5b8-2f15-6b8f-8ba3-7c8421301974 (id: 11)
cni-2874a9fa-ac9a-2271-2da4-336164e88248 (id: 10)
cni-8cd024a2-6a0e-7761-8199-e64c217d8d17 (id: 8)
cni-4fc915e4-2931-a6a0-12d3-3de92c0f6377 (id: 7)
cni-8da838ce-80a8-9e82-b747-f72df1f472a2 (id: 6)
cni-cbaad5ab-0609-74fe-2288-a1d52e780f3c (id: 5)
cni-d8c780b4-35d2-c331-c2b4-ebe1392a3774 (id: 4)
cni-7e77f723-caa5-ed63-7240-26c6b141b407 (id: 2)
cni-978bb2af-d6d7-82e8-6752-e3aef9a0b021 (id: 1)
root@worker:~# ip netns exec cni-cdd833b3-1de0-8abd-e4d3-d4daa5cb1e62 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
3: eth0@if5511: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 7e:33:bf:6b:41:a9 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 192.168.199.208/32 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::7c33:bfff:fe6b:41a9/64 scope link
valid_lft forever preferred_lft forever
We found the correct network namespace. Now we can move any VF interface into it; the VF interfaces follow the naming pattern ens3f[0-1]v[0-7]. Let's take ens3f1v5 for this example:
root@worker:~# ip link set ens3f1v5 netns cni-cdd833b3-1de0-8abd-e4d3-d4daa5cb1e62
root@worker:~# ip netns exec cni-cdd833b3-1de0-8abd-e4d3-d4daa5cb1e62 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
3: eth0@if5511: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 7e:33:bf:6b:41:a9 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 192.168.199.208/32 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::7c33:bfff:fe6b:41a9/64 scope link
valid_lft forever preferred_lft forever
25: ens3f1v5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 3a:97:8b:03:e1:da brd ff:ff:ff:ff:ff:ff
Perfect! We snatched a VF interface. Let's configure it from inside the Pod, as in the previous examples:
root@test-router-875898ffb-k6g25:~# ip link add link ens3f1v5 name ens3f1v5.1082 type vlan id 1082
root@test-router-875898ffb-k6g25:~# ip addr add 172.21.0.52/28 dev ens3f1v5.1082
root@test-router-875898ffb-k6g25:~# ip link add link ens3f1v5 name ens3f1v5.1081 type vlan id 1081
root@test-router-875898ffb-k6g25:~# ip addr add 172.21.0.36/28 dev ens3f1v5.1081
root@test-router-875898ffb-k6g25:~# ip link set dev ens3f1v5 up
root@test-router-875898ffb-k6g25:~# ip link set dev ens3f1v5.1082 up
root@test-router-875898ffb-k6g25:~# ip link set dev ens3f1v5.1081 up
root@test-router-875898ffb-k6g25:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
3: eth0@if5511: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 7e:33:bf:6b:41:a9 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 192.168.199.208/32 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::7c33:bfff:fe6b:41a9/64 scope link
valid_lft forever preferred_lft forever
4: ens3f1v5.1082@ens3f1v5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 3a:97:8b:03:e1:da brd ff:ff:ff:ff:ff:ff
inet 172.21.0.52/28 scope global ens3f1v5.1082
valid_lft forever preferred_lft forever
inet6 fe80::3897:8bff:fe03:e1da/64 scope link
valid_lft forever preferred_lft forever
5: ens3f1v5.1081@ens3f1v5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 3a:97:8b:03:e1:da brd ff:ff:ff:ff:ff:ff
inet 172.21.0.36/28 scope global ens3f1v5.1081
valid_lft forever preferred_lft forever
inet6 fe80::3897:8bff:fe03:e1da/64 scope link
valid_lft forever preferred_lft forever
25: ens3f1v5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 3a:97:8b:03:e1:da brd ff:ff:ff:ff:ff:ff
inet6 fe80::3897:8bff:fe03:e1da/64 scope link
valid_lft forever preferred_lft forever
The final test is to perform a ping from these interfaces to confirm that our theory is indeed correct:
root@test-router-875898ffb-k6g25:~# ping -c4 -I ens3f1v5.1082 172.21.0.49
PING 172.21.0.49 (172.21.0.49) from 172.21.0.52 ens3f1v5.1082: 56(84) bytes of data.
--- 172.21.0.49 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3049ms
pipe 4
root@test-router-875898ffb-k6g25:~# ping -c4 -I ens3f1v5.1081 172.21.0.33
PING 172.21.0.33 (172.21.0.33) from 172.21.0.36 ens3f1v5.1081: 56(84) bytes of data.
64 bytes from 172.21.0.33: icmp_seq=1 ttl=64 time=14.7 ms
64 bytes from 172.21.0.33: icmp_seq=2 ttl=64 time=9.52 ms
64 bytes from 172.21.0.33: icmp_seq=3 ttl=64 time=0.868 ms
64 bytes from 172.21.0.33: icmp_seq=4 ttl=64 time=0.868 ms
--- 172.21.0.33 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3033ms
rtt min/avg/max/mdev = 0.868/6.494/14.712/5.917 ms
To clean up, we delete the VLAN interfaces and return the VF interface to the root network namespace:
root@worker:~# ip netns exec cni-cdd833b3-1de0-8abd-e4d3-d4daa5cb1e62 ip link set ens3f1v5 netns 1
root@worker:~# ip netns exec cni-cdd833b3-1de0-8abd-e4d3-d4daa5cb1e62 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
3: eth0@if5511: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 7e:33:bf:6b:41:a9 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 192.168.199.208/32 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::7c33:bfff:fe6b:41a9/64 scope link
valid_lft forever preferred_lft forever
My proposal would be to skip calling LinkSetVfVlanQosProto when VLAN ID is 0, i.e. check the Vlan field of the NetConf struct and skip the method when it is 0. Please provide guidance on this matter!
@mlguerrero12 can you take a look on this one and let me know :)
@SchSeba, I think it's a valid request.
We currently assign vlan 0 when the vlan parameter is not set. We could (and should) change this: do not set vlan 0 when the vlan parameter is absent, and set vlan 0 only when the config explicitly sets the vlan parameter to 0.
In both cases, we will keep the restriction of not allowing qos or vlan proto when the vlan parameter is unset or 0.
I'll work on this.
@RefluxMeds, fix is in #296. Try to test it from your side please.
Hi @mlguerrero12 , I'm blocked with activities until next week. I'll dedicate some time next week to test out your changes! Thank you so much!
What issue would you like to bring attention to?
When using a trunk policy at the VF level, sriov-cni fails to allocate the VF to the Pod.
What is the impact of this issue?
Do you have a proposed response or remediation for the issue?
Either skip applying the VLAN config when VLAN == 0 in https://github.com/k8snetworkplumbingwg/sriov-cni/blob/master/pkg/sriov/sriov.go#L226, or add an additional configurable field to sriov-cni which would skip the VLAN config when e.g. VFtrunk=true.
If necessary, I can write a longer post explaining all the details.
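For illustration, the second option might look like this in a CNI config. Note that the `VFtrunk` field does not exist in sriov-cni today; it is only the hypothetical option suggested above, and the rest of the config is a generic example:

```json
{
  "cniVersion": "0.3.1",
  "name": "sriov-trunk",
  "type": "sriov",
  "VFtrunk": true,
  "ipam": {
    "type": "host-local",
    "subnet": "172.21.0.32/28"
  }
}
```

With `VFtrunk` set, the plugin would leave the VF VLAN settings untouched so that a trunk policy configured at the NIC level stays in effect.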