hustcat / sriov-cni

SR-IOV CNI plugin
Apache License 2.0

IP address is not in a configured pool #33

Closed hiyijian closed 6 years ago

hiyijian commented 6 years ago

Hi, I came across the following error when starting the network. K8s and Calico are already running on my cluster. Any ideas? Appreciated!

$ CNI_ARGS="IgnoreUnknown=1;IP=10.55.206.1" ./priv-net-run.sh ifconfig
...
...
2018-03-05 14:35:55.421 [INFO][10438] client.go 202: Loading config from environment
Calico CNI IPAM request IP: 10.55.206.1
2018-03-05 14:35:55.422 [INFO][10438] calico-ipam.go 125: Assigning provided IP assignArgs=client.AssignIPArgs{IP:net.IP{IP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xa, 0x37, 0xce, 0x1}}, HandleID:(*string)(0xc42033c0a0), Attrs:map[string]string(nil), Hostname:"root0-PR4768GW-238"} handleID="k8s-pod-network.e34191347f065da" workloadID="e34191347f065da"
2018-03-05 14:35:55.422 [INFO][10438] ipam.go 296: Assigning IP 10.55.206.1 to host: root0-PR4768GW-238
2018-03-05 14:35:55.435 [INFO][10406] calico.go 171: Got result from IPAM plugin IPAM result=<nil> Workload="e34191347f065da"
k8s-pod-network : error executing ADD: The provided IP address is not in a configured pool

hustcat commented 6 years ago

It seems that the IPAM plugin failed to allocate the IP address.
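
For reference, the byte slice in the `AssignIPArgs` log line is the IPv4-mapped IPv6 form of the requested address, and the failure is reported by Calico's IPAM, which only hands out addresses from its own pools. The following is a minimal Python sketch, not Calico's implementation: the helper names are hypothetical, the bytes are copied from the log above, and the pool `192.168.0.0/16` is only an assumed example, since the pools actually configured on this cluster are not shown in the thread.

```python
import ipaddress

# 16 bytes copied from the AssignIPArgs log line (IPv4-mapped IPv6 layout).
LOGGED = bytes([0x0] * 10 + [0xFF, 0xFF, 0x0A, 0x37, 0xCE, 0x01])


def decode_logged_ip(raw: bytes) -> ipaddress.IPv4Address:
    """Turn a 16-byte IPv4-mapped IPv6 address into its IPv4 form."""
    mapped = ipaddress.IPv6Address(raw).ipv4_mapped
    if mapped is None:
        raise ValueError("not an IPv4-mapped IPv6 address")
    return mapped


def in_any_pool(ip: ipaddress.IPv4Address, pools: list[str]) -> bool:
    """Return True if ip falls inside at least one of the CIDR pools."""
    return any(ip in ipaddress.ip_network(pool) for pool in pools)


ip = decode_logged_ip(LOGGED)
print("requested address:", ip)
# Assumed example pool, NOT taken from the thread.
print("in assumed Calico pool:", in_any_pool(ip, ["192.168.0.0/16"]))
# Subnet that appears later in the thread in the reporter's sriov config.
print("in 10.55.206.0/26:", in_any_pool(ip, ["10.55.206.0/26"]))
```

If the requested address is outside every pool known to the IPAM plugin that actually runs, an error like the one above would be expected regardless of what any other config file says.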

hiyijian commented 6 years ago

Yes. It is weird, since I set ipam to "fixipam", but it seems that the sriov CNI uses "calico-ipam" instead. Is there any relation between them?

root@root0-PR4768GW-238:/etc/cni/net.d# ls
10-calico.conf  10-rdmanet.conf  bak  calico-kubeconfig  calico-tls
root@root0-PR4768GW-238:/etc/cni/net.d# cat 10-rdmanet.conf 
{
    "name": "rdmanet",
    "type": "sriov",
    "master": "ib0",
    "pfOnly": false,
    "ipam": {
        "type": "fixipam",
        "subnet": "10.55.206.0/26",
        "routes": [
            { "dst": "0.0.0.0/0" }
        ],
        "gateway": "10.55.206.1"
    }
}

hustcat commented 6 years ago

Maybe you should move 10-calico.conf out of /etc/cni/net.d. @hiyijian
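
A likely reason this helps, assuming the CNI scripts process the `*.conf` files in `/etc/cni/net.d` in lexical filename order: `10-calico.conf` sorts ahead of `10-rdmanet.conf`, so the Calico network (and its `calico-ipam`) is invoked first. The sketch below lists the configs in that order together with their plugin and IPAM types. The function name `list_cni_configs` is hypothetical and the ordering rule is an assumption about the runtime, not something confirmed in this thread.

```python
import json
from pathlib import Path


def list_cni_configs(conf_dir):
    """Return (filename, network name, plugin type, ipam type) tuples,
    sorted by filename, for every *.conf file in conf_dir."""
    rows = []
    for path in sorted(Path(conf_dir).glob("*.conf"), key=lambda p: p.name):
        with path.open() as fh:
            conf = json.load(fh)
        rows.append(
            (
                path.name,
                conf.get("name"),
                conf.get("type"),
                conf.get("ipam", {}).get("type"),
            )
        )
    return rows
```

Calling `list_cni_configs("/etc/cni/net.d")` on the host from this thread should list `10-calico.conf` before `10-rdmanet.conf`, which would be consistent with the error being reported for `k8s-pod-network` rather than `rdmanet`.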

hiyijian commented 6 years ago

When I stop the calico service and remove 10-calico.conf, I get a different error from the sriov CNI:

root@root0-PR4768GW-238:/usr/local/go/src/github.com/hustcat/sriov-cni/scripts# CNI_PATH=$CNI_PATH CNI_ARGS="IgnoreUnknown=1;IP=10.55.206.46;VF=1;MAC=66:d8:02:77:aa:aa" ./priv-net-run.sh ifconfig
contid=523d2e0f6b4f34a4
netnspath=/var/run/netns/523d2e0f6b4f34a4
rdmanet : error executing ADD: failed to open the virtfn1 dir of the device "ib0": lstat /sys/class/net/ib0/device/virtfn1/net: no such file or directory

Below is the content of /sys/class/net/ib0/device/virtfn1:

jianyi@root0-PR4768GW-238:~$ ls /sys/class/net/ib0/device/virtfn1
broken_parity_status      d3cold_allowed   enable         local_cpus  physfn    resource2     subsystem_device
class                     device           firmware_node  modalias    power     resource2_wc  subsystem_vendor
config                    dma_mask_bits    irq            msi_bus     reset     revision      uevent
consistent_dma_mask_bits  driver_override  local_cpulist  numa_node   resource  subsystem     vendor
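
Note that the listing contains neither a `net` entry (the path the plugin tried to `lstat`) nor a `driver` link, which suggests that no driver is bound to VF 1 on the host and so no network interface exists for it. A rough Python sketch for checking every VF of a device the same way follows; the function name `vf_status` is hypothetical, and the directory layout it assumes is the one visible in the paths above.

```python
import os
import re


def vf_status(device_dir):
    """Inspect the virtfnN entries under a PF's sysfs device directory.

    Returns {vf_index: {"driver_bound": bool, "netdevs": [names]}}.
    """
    status = {}
    for entry in os.listdir(device_dir):
        match = re.fullmatch(r"virtfn(\d+)", entry)
        if match is None:
            continue
        vf_dir = os.path.join(device_dir, entry)
        net_dir = os.path.join(vf_dir, "net")
        # A VF without a bound driver has no "net" subdirectory at all.
        netdevs = sorted(os.listdir(net_dir)) if os.path.isdir(net_dir) else []
        status[int(match.group(1))] = {
            "driver_bound": os.path.exists(os.path.join(vf_dir, "driver")),
            "netdevs": netdevs,
        }
    return dict(sorted(status.items()))
```

For the host in this thread, `vf_status("/sys/class/net/ib0/device")` would be expected to report VF 1 with no bound driver and an empty interface list, matching the `ls` output above.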

@hustcat

hiyijian commented 6 years ago

According to the kernel messages, it seems that the driver failed to enable the virtual function.

[    7.052904] mlx4_core: device is working in RoCE mode: Roce V1
[    7.052904] mlx4_core: UD QP Gid type is: V1
[    8.727397] mlx4_core 0000:01:00.0: DMFS high rate steer mode is: default performance
[    8.727603] mlx4_core 0000:01:00.0: Enabling SR-IOV with 63 VFs
[    8.834138] pci 0000:01:00.1: [15b3:1004] type 00 class 0x028000
[    8.840643] pci 0000:01:00.1: Max Payload Size set to 256 (was 128, max 512)
[    8.843061] mlx4_core: Initializing 0000:01:00.1
[    8.843112] mlx4_core 0000:01:00.1: enabling device (0000 -> 0002)
[    8.843948] mlx4_core 0000:01:00.1: Skipping virtual function:1
[    8.844517] pci 0000:01:00.2: [15b3:1004] type 00 class 0x028000
[    8.851059] pci 0000:01:00.2: Max Payload Size set to 256 (was 128, max 512)

Can you please help?

hustcat commented 6 years ago

@hiyijian Can you provide a more detailed log, starting from the point where the kernel loads the mlx4_core module?

hiyijian commented 6 years ago

OK, thanks. See dmesg.txt. Here is a more detailed document on the issue: sriov VF enable failed.docx

hustcat commented 6 years ago

@hiyijian Can you show me the module config file, such as /etc/modprobe.d/mlx4_core.conf?

hiyijian commented 6 years ago

$ cat /etc/modprobe.d/mlx4_core.conf
options mlx4_core num_vfs=63 port_type_array=1,1 probe_vf=1

hustcat commented 6 years ago

Please replace it with options mlx4_core port_type_array=2,2 num_vfs=0,4,0 probe_vf=0,4,0 and try again. @hiyijian
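
To make the difference between the two option lines easier to see, here is a small Python sketch that parses a modprobe `options` line into a dictionary. The function name `parse_options_line` is hypothetical. The parameter meanings in the comments are my understanding of the mlx4_core driver and should be treated as assumptions to verify against the Mellanox documentation: `port_type_array` uses 1 for InfiniBand and 2 for Ethernet, while `num_vfs` and `probe_vf` take either a single value or a triplet (port 1, port 2, both ports).

```python
def parse_options_line(line):
    """Parse 'options <module> key=v1,v2 ...' into (module, {key: [ints]})."""
    tokens = line.split()
    if len(tokens) < 2 or tokens[0] != "options":
        raise ValueError("not a modprobe 'options' line: %r" % line)
    params = {}
    for token in tokens[2:]:
        key, _, value = token.partition("=")
        params[key] = [int(v) for v in value.split(",")]
    return tokens[1], params


# Both lines are copied from the thread.
current = "options mlx4_core num_vfs=63 port_type_array=1,1 probe_vf=1"
suggested = "options mlx4_core port_type_array=2,2 num_vfs=0,4,0 probe_vf=0,4,0"

for label, line in (("current", current), ("suggested", suggested)):
    module, params = parse_options_line(line)
    # Assumption: probe_vf is how many VFs the host driver itself binds to.
    print(label, module, params)
```

Under those assumptions, the current setting creates 63 VFs but probes only one of them on the host, which would be consistent with the `Skipping virtual function:1` message in the kernel log, whereas the suggested line creates and probes four VFs on port 2 in Ethernet mode.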

hiyijian commented 6 years ago

Thanks @hustcat. I realized the problem is a bit more complex. A Mellanox engineer is already working on our problem. I will let you know when it is resolved.