Closed superfix906 closed 4 years ago
To add some more information: I tried the same with DPDK's 'igb_uio' driver and was able to make it work. The dummy interface creation was successful, unlike in the 'vfio-pci' case. So this issue is specific to devices bound to the 'vfio-pci' driver. Any help on this will be appreciated, as vfio-pci is the way we want to move ahead. Thanks in advance!
Interesting issue, because I explicitly tested this scenario and it was working for me :) The error you see is actually coming from here: https://github.com/nokia/danm/blob/181255ef463e930aa221aecd9f50f014a4e760b2/pkg/danmep/ep.go#L285
At this point the IP address has not yet been added to the link; we only set its MAC address! So the error 1: comes from the kernel, and 2: must be related to a MAC clash, not an IP one.
I can only think of two reasons why this can happen:
but TBH my money is on the old kernel
@superfix906 I managed to retest this recently with 82599 NICs (which model are you using, BTW?). On CentOS 7.8 with a 4.18 kernel it works fine in all scenarios, with or without a VLAN tag in the network. One thing I noticed, though: when a VLAN is also used in the network, we add the VF MAC address to both the dummy interface and the VLAN interface on top of it. My kernel could tolerate that, but maybe older ones cannot? I made this change to address it: https://github.com/nokia/danm/pull/234 , but as you did not use a VLAN tag in your network, this is probably not the root cause.
In any case, we did encounter an error like the one you describe in our environment, but it only happened when DANM was asked to work with improperly set up VFs (binding to VFIO was not properly done before the Pod was created). Considering the feature can be reliably used in our environment, I strongly think the root cause is environment specific, and possibly related to either your kernel or improper device management in the host layer.
Further debugged the problem. The error possibly appears when the MAC address of the VF is all zeros: the kernel refuses to set it on the dummy interface. This can happen with some Intel drivers. The referenced PR now adds a check for a zero MAC, and only tries to set it on the dummy if it is a valid one, which should solve the problem. It is currently unclear whether the Intel drivers zero out both the admin and effective MACs, or whether in some cases the SR-IOV CNI fails to properly reset the VF after use, because I did observe that VFIO-bound VFs sometimes have MAC addresses and sometimes don't. So it is still kind of a mystery, but whatever happens on the host level, DANM will now behave more resiliently :)
@Levovar Thanks a lot for the detailed research and inputs. Appreciate that!
Unfortunately, we have digressed from this at the moment. We shall update once we are back at this again. Thanks again.
@superfix906 no problem :) Meanwhile we have tested the change in our own environment, and it solves the reported problem, so I will close the ticket.
thanks again for reporting the case!
We found that setting the MAC address a priori, as part of node/device setup, is better than leaving it as a zero MAC. This prevents the creation of a random MAC when DPDK enumerates the VFs for some models.
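One possible way to do this during node setup is sketched below in Go. It only builds a deterministic, locally administered unicast MAC per VF and the matching iproute2 arguments (`ip link set dev <PF> vf <N> mac <MAC>`); it does not run anything. The `02:00:00:00` prefix layout, the helper names `vfMAC` and `setVFMACArgs`, and the PF name `enp3s0f0` are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
	"strings"
)

// vfMAC builds a deterministic MAC for a VF. The first byte 0x02 marks the
// address as locally administered and unicast. The remaining layout
// (PF index, VF index in the last two bytes) is an assumption of this sketch.
func vfMAC(pfIndex, vfIndex uint8) net.HardwareAddr {
	return net.HardwareAddr{0x02, 0x00, 0x00, 0x00, pfIndex, vfIndex}
}

// setVFMACArgs returns the iproute2 arguments that assign mac to the given
// VF of the physical function pf. It only constructs the argument list.
func setVFMACArgs(pf string, vfIndex uint8, mac net.HardwareAddr) []string {
	return []string{
		"link", "set", "dev", pf,
		"vf", strconv.Itoa(int(vfIndex)),
		"mac", mac.String(),
	}
}

func main() {
	const pf = "enp3s0f0" // hypothetical PF name
	for vf := uint8(0); vf < 4; vf++ {
		args := setVFMACArgs(pf, vf, vfMAC(0, vf))
		fmt.Println("ip " + strings.Join(args, " "))
	}
}
```

The printed commands would have to be run with sufficient privileges before the VFs are bound to vfio-pci and handed to Pods.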
Is this a BUG REPORT or FEATURE REQUEST?:
What happened:
What you expected to happen:
How to reproduce it:
Anything else we need to know?:
Environment:
DANM version (use danm -version): v4.2.0, commit: c0a4c157
Kubernetes version (use kubectl version): v1.18.6
DANM configuration (K8s manifests, kubeconfig files, CNI config file):
/etc/cni/net.d/00-danm.conf
/etc/cni/net.d/10-flannel.conf
kubeadm config view
OS (e.g. from /etc/os-release): CentOS Linux release 7.7.1908 (Core)
Kernel (e.g. uname -a): 3.10.0-1062.18.1.rt56.1044.el7.x86_64
Others: