Hi @xhejtman, I tried to reproduce your issue, but my deployment is working.
Can you share more details on what you're doing? In particular, the NetworkAttachmentDefinition manifest and the pod definition would be useful for reproducing the problem.
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
name: ibnet-beegfs
spec:
config: '{
"cniVersion": "0.3.1",
"name": "ibnet",
"type": "ipoib",
"master": "ibp225s0",
"ipam": {
"type": "whereabouts",
"range": "10.16.59.0/24",
"range_start": "10.16.59.2",
"range_end": "10.16.59.2"
}
}'
This is an InfiniBand device with a pool containing a single address (emulating a service for a StatefulSet).
And the StatefulSet:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: beegfs-mgmtd
spec:
  replicas: 1
  serviceName: beegfs
  selector:
    matchLabels:
      app: beegfs-mgmtd
  template:
    metadata:
      labels:
        app: beegfs-mgmtd
      annotations:
        k8s.v1.cni.cncf.io/networks: "beegfs/ibnet-beegfs"
    spec:
      containers:
        - name: beegfs-mgmtd
          image: cerit.io/os/beegfs:7.3.3
          command:
            - /opt/beegfs/sbin/beegfs-mgmtd
            - cfgFile=/etc/beegfs/beegfs-mgmtd.conf
            - runDaemonized=false
          imagePullPolicy: IfNotPresent
          securityContext:
            runAsUser: 999
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
          resources:
            limits:
              rdma/hca: 1
          volumeMounts:
            - mountPath: /mnt
              name: mgmt-dir
            - name: connauth
              mountPath: /etc/beegfs-auth
            - name: config
              mountPath: /etc/beegfs
      securityContext:
        fsGroup: 999
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
      volumes:
        - name: mgmt-dir
          persistentVolumeClaim:
            claimName: pvc-beegfs-mgmt
        - name: connauth
          secret:
            secretName: connauthfile
        - name: config
          configMap:
            name: mgmtd-conf
```
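The serviceName: beegfs above refers to a headless Service that isn't included here; a minimal sketch of what it would look like (the port number is an assumption, the actual manifest wasn't shared):

```yaml
# Headless Service referenced by serviceName above (sketch only).
apiVersion: v1
kind: Service
metadata:
  name: beegfs
spec:
  clusterIP: None        # headless: stable per-pod DNS names, no virtual IP
  selector:
    app: beegfs-mgmtd
  ports:
    - name: mgmtd
      port: 8008         # default beegfs-mgmtd port (assumption)
```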
@xhejtman Thanks for the info. I still couldn't reproduce the problem, as I don't have access to InfiniBand hardware.
I had no issue using multus 4 with SR-IOV + DPDK or in a VM without special hardware.
It might be worth opening an issue upstream, as I don't think we can help more without a way to reproduce the bug.
Hi @thomasferrandiz,
maybe I can help - I observed the same problem without using any special hardware, using Longhorn with a Multus network attachment.
After the node is rebooted, the Longhorn instance-manager pods don't start and the log messages state that the adapter lhnet1 already exists.
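For reference, a minimal sketch of the kind of macvlan NetworkAttachmentDefinition the Longhorn storage network points at (the name, parent interface and IPAM range below are illustrative assumptions, not the exact manifest from this setup):

```yaml
# Sketch of a NetworkAttachmentDefinition for a Longhorn storage network.
# The parent interface (master) and the IPAM range are assumptions.
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: lhnet
  namespace: kube-system
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "macvlan",
    "master": "eth0",
    "mode": "bridge",
    "ipam": {
      "type": "whereabouts",
      "range": "192.168.100.0/24"
    }
  }'
```

Longhorn requests this attachment on its instance-manager pods under the interface name lhnet1, which is the adapter the error says already exists after the reboot.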
This setup works fine using rke2 v1.26.6 with multus 3.9.3 or kubeadm 1.27.4 with multus 4.0.2.
I can confirm I did something similar: draining the node and rebooting before the upgrade to 1.26.7, so maybe draining while the address/interface is in use is causing this.
@work-smalin Thanks for the details. I will check that.
I think I found a fix for the issue and submitted a PR upstream: https://github.com/k8snetworkplumbingwg/multus-cni/pull/1137
There is also an updated Rancher multus image available with the fix: rancher/hardened-multus-cni:v4.0.2-build20230811
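To pick up the fixed image before it ships in an RKE2 release, a HelmChartConfig placed in the server's manifests directory should work. This is only a sketch; the values keys (image.repository / image.tag) are an assumption about the rke2-multus chart's values layout:

```yaml
# /var/lib/rancher/rke2/server/manifests/rke2-multus-config.yaml
# RKE2 applies files in this directory automatically.
# Sketch: the image.* keys are assumed to match the rke2-multus chart values.
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-multus
  namespace: kube-system
spec:
  valuesContent: |-
    image:
      repository: rancher/hardened-multus-cni
      tag: v4.0.2-build20230811
```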
Issue should be closed by QA once tested
Hello, although the issue was closed, it's still present in multus-cni. rancher/hardened-multus-cni:v4.0.2-build20230811 fixes it for me.
Edit: sorry, thought this issue was on the multus repo.
Environmental Info: RKE2 Version: v1.26.7+rke2r1
Node(s) CPU architecture, OS, and Version: Linux kub-b14.priv.cerit-sc.cz 5.15.0-78-generic #85-Ubuntu SMP Fri Jul 7 15:25:09 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration: cni: calico,multus
Describe the bug: If using an additional network attachment, the pod does not start.
It worked in version v1.26.6+rke2r1, which ships multus 3.x. Maybe related to https://github.com/k8snetworkplumbingwg/multus-cni/issues/1130?
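For completeness, the cluster configuration above corresponds to an RKE2 config.yaml along these lines (a sketch; as far as I know RKE2 expects multus listed before the primary CNI):

```yaml
# /etc/rancher/rke2/config.yaml (sketch)
cni:
  - multus
  - calico
```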