intel / afxdp-plugins-for-kubernetes

Apache License 2.0
44 stars 16 forks source link

[Listen] Listener timed out: accept unixpacket #55

Closed Alt-Shivam closed 1 year ago

Alt-Shivam commented 1 year ago

Hey, I'm trying to deploy a pod with 2 af-xdp interaces: and facing this issue:

INFO[2023-04-20 18:09:16] Reading config file: /afxdp/config/config.json 
INFO[2023-04-20 18:09:16] Unmarshalling config data                    
INFO[2023-04-20 18:09:16] Config Data:
{
  "Pools": [
    {
      "Name": "access",
      "Mode": "primary",
      "Drivers": [
        {
          "Name": "ice",
          "Primary": 0,
          "Secondary": 0,
          "ExcludeDevices": [
            {
              "Name": "ens4f1",
              "Pci": "",
              "Mac": "",
              "Secondary": 0
            }
          ],
          "ExcludeAddressed": false
        }
      ],
      "Devices": null,
      "Nodes": null,
      "UdsServerDisable": false,
      "UdsTimeout": 0,
      "UdsFuzz": false,
      "RequiresUnprivilegedBpf": false,
      "uid": 0,
      "ethtoolCmds": null
    },
    {
      "Name": "core",
      "Mode": "primary",
      "Drivers": [
        {
          "Name": "ice",
          "Primary": 0,
          "Secondary": 0,
          "ExcludeDevices": [
            {
              "Name": "ens4f0",
              "Pci": "",
              "Mac": "",
              "Secondary": 0
            }
          ],
          "ExcludeAddressed": false
        }
      ],
      "Devices": null,
      "Nodes": null,
      "UdsServerDisable": false,
      "UdsTimeout": 0,
      "UdsFuzz": false,
      "RequiresUnprivilegedBpf": false,
      "uid": 0,
      "ethtoolCmds": null
    }
  ],
  "LogFile": "afxdp-dp.log",
  "LogLevel": "debug"
} 
INFO[2023-04-20 18:09:16] Validating config data                       
INFO[2023-04-20 18:09:16] Setting log directory: /var/log/afxdp-k8s-plugins/ 
INFO[2023-04-20 18:09:16] Setting log file: afxdp-dp.log               
INFO[2023-04-20 18:09:16] Setting log level: debug                     
INFO[2023-04-20 18:09:16] Switching to debug log format                
INFO[2023-04-20 18:09:16] [main.go:75] [main] Starting AF_XDP Device Plugin                
INFO[2023-04-20 18:09:16] [main.go:78] [main] Checking if host meets requirements          
DEBU[2023-04-20 18:09:16] [main.go:171] [checkHost] Checking kernel version                      
DEBU[2023-04-20 18:09:16] [main.go:197] [checkHost] Kernel version: 5.13.0-1009-oem meets minimum requirements 
DEBU[2023-04-20 18:09:16] [main.go:200] [checkHost] Checking host for Libbpf                     
DEBU[2023-04-20 18:09:16] [host.go:85] [HasLibbpf] Directory /usr/lib64/ does not exist         
DEBU[2023-04-20 18:09:16] [main.go:207] [checkHost] Libbpf found on host:                        
DEBU[2023-04-20 18:09:16] [main.go:209] [checkHost]     /usr/lib/libbpf.so.0                        
DEBU[2023-04-20 18:09:16] [main.go:209] [checkHost]     /usr/lib/libbpf.so.0.5.0                    
INFO[2023-04-20 18:09:16] [main.go:88] [main] Host meets requirements                      
INFO[2023-04-20 18:09:16] [main.go:91] [main] Getting device pools                         
DEBU[2023-04-20 18:09:16] [config.go:111] [GetPoolConfigs] Unprivileged BPF is allowed on this host     
DEBU[2023-04-20 18:09:16] [config.go:130] [GetPoolConfigs] docker0 is not a physical device, removing from list of host devices 
DEBU[2023-04-20 18:09:16] [config.go:130] [GetPoolConfigs] calibfc702d80c1 is not a physical device, removing from list of host devices 
DEBU[2023-04-20 18:09:16] [config.go:135] [GetPoolConfigs] eno1 a globally prohibited device, removing from list of host devices 
DEBU[2023-04-20 18:09:16] [config.go:130] [GetPoolConfigs] calibd8ae8df0e5 is not a physical device, removing from list of host devices 
DEBU[2023-04-20 18:09:16] [config.go:130] [GetPoolConfigs] data is not a physical device, removing from list of host devices 
DEBU[2023-04-20 18:09:16] [config.go:130] [GetPoolConfigs] cali09d96a0e9e0 is not a physical device, removing from list of host devices 
DEBU[2023-04-20 18:09:16] [config.go:130] [GetPoolConfigs] lo is not a physical device, removing from list of host devices 
DEBU[2023-04-20 18:09:16] [config.go:135] [GetPoolConfigs] eno2 a globally prohibited device, removing from list of host devices 
DEBU[2023-04-20 18:09:16] [config.go:130] [GetPoolConfigs] cali537aa6a676c is not a physical device, removing from list of host devices 
DEBU[2023-04-20 18:09:16] [config.go:130] [GetPoolConfigs] calie47cf898f7c is not a physical device, removing from list of host devices 
DEBU[2023-04-20 18:09:16] [config.go:130] [GetPoolConfigs] worker is not a physical device, removing from list of host devices 
DEBU[2023-04-20 18:09:16] [config.go:130] [GetPoolConfigs] vxlan.calico is not a physical device, removing from list of host devices 
DEBU[2023-04-20 18:09:16] [config.go:130] [GetPoolConfigs] cali561dddf72ed is not a physical device, removing from list of host devices 
DEBU[2023-04-20 18:09:16] [config.go:145] [GetPoolConfigs] Host devices:
{
  "ens4f0": {},
  "ens4f1": {}
} 
INFO[2023-04-20 18:09:16] [config.go:149] [GetPoolConfigs] Processing Pool: access                      
DEBU[2023-04-20 18:09:16] [config.go:163] [GetPoolConfigs] Using default UDS timeout: 30 seconds        
INFO[2023-04-20 18:09:16] [config.go:264] [getDeviceListOfDriverType] ens4f0 added to pool                         
DEBU[2023-04-20 18:09:16] [config.go:332] [validateDevice] ens4f1 is an excluded device for ice driver  
DEBU[2023-04-20 18:09:16] [config.go:273] [getDeviceListOfDriverType] Exit discovery.                              
INFO[2023-04-20 18:09:16] [config.go:149] [GetPoolConfigs] Processing Pool: core                        
DEBU[2023-04-20 18:09:16] [config.go:163] [GetPoolConfigs] Using default UDS timeout: 30 seconds        
DEBU[2023-04-20 18:09:16] [config.go:316] [validateDevice] Device ens4f0 is fully assigned              
INFO[2023-04-20 18:09:16] [config.go:264] [getDeviceListOfDriverType] ens4f1 added to pool                         
DEBU[2023-04-20 18:09:16] [config.go:273] [getDeviceListOfDriverType] Exit discovery.                              
DEBU[2023-04-20 18:09:16] [poolManager.go:327] [startGRPC] afxdp/access started serving on /var/lib/kubelet/device-plugins/afxdp-access.sock 
INFO[2023-04-20 18:09:16] [poolManager.go:88] [Init] Pool afxdp/access started serving            
INFO[2023-04-20 18:09:16] [poolManager.go:93] [Init] Pool afxdp/access registered with Kubelet    
DEBU[2023-04-20 18:09:16] [poolManager.go:123] [ListAndWatch] Pool afxdp/access ListAndWatch started       
DEBU[2023-04-20 18:09:16] [poolManager.go:327] [startGRPC] afxdp/core started serving on /var/lib/kubelet/device-plugins/afxdp-core.sock 
INFO[2023-04-20 18:09:16] [poolManager.go:88] [Init] Pool afxdp/core started serving              
INFO[2023-04-20 18:09:16] [poolManager.go:93] [Init] Pool afxdp/core registered with Kubelet      
DEBU[2023-04-20 18:09:16] [poolManager.go:123] [ListAndWatch] Pool afxdp/core ListAndWatch started         
DEBU[2023-04-20 18:21:06] [poolManager.go:152] [Allocate] New allocate request on pool access          
INFO[2023-04-20 18:21:06] [poolManager.go:155] [Allocate] Creating new UDS server                      
DEBU[2023-04-20 18:21:06] [poolManager.go:180] [Allocate] Device: {
  "Name": "ens4f0",
  "Mode": "primary",
  "Driver": "ice",
  "Pci": "0000:d8:00.0",
  "MacAddress": "3c:ec:ef:d9:62:f2",
  "FullyAssigned": true,
  "EthtoolFilters": null,
  "Primary": {
    "Name": "ens4f0",
    "Mode": "primary",
    "Driver": "ice",
    "Pci": "0000:d8:00.0",
    "MacAddress": "3c:ec:ef:d9:62:f2",
    "FullyAssigned": true,
    "EthtoolFilters": null,
    "Primary": null
  }
} 
DEBU[2023-04-20 18:21:06] [poolManager.go:190] [Allocate] Primary mode                                 
DEBU[2023-04-20 18:21:06] [poolManager.go:202] [Allocate] Cycling state of device ens4f0               
INFO[2023-04-20 18:21:06] [poolManager.go:209] [Allocate] Loading BPF program on device: ens4f0        
INFO[2023-04-20 18:21:06] [bpfWrapper.go:104] [Infof] Load_bpf_send_xsk_map: disovering if_index for interface ens4f0 
INFO[2023-04-20 18:21:06] [bpfWrapper.go:104] [Infof] Load_bpf_send_xsk_map: if_index for interface ens4f0 is 4 
INFO[2023-04-20 18:21:06] [bpfWrapper.go:104] [Infof] Load_bpf_send_xsk_map: starting setup of xdp program on interface ens4f0 (4) 
INFO[2023-04-20 18:21:06] [bpfWrapper.go:104] [Infof] Load_bpf_send_xsk_map: loaded xdp program on interface ens4f0 (4), file descriptor 11 
INFO[2023-04-20 18:21:06] [poolManager.go:215] [Allocate] BPF program loaded on: ens4f0 File descriptor: 11 
DEBU[2023-04-20 18:21:06] [poolManager.go:233] [Allocate] Container environment variables: {
  "AFXDP_DEVICES": "ens4f0"
} 
DEBU[2023-04-20 18:21:06] [udsserver.go:166] [start] Initialising Unix domain socket: /tmp/afxdp_dp/afxdp_access/13ab88a3-4751-40fb-a9b5-2b3118180c4e.sock 
INFO[2023-04-20 18:21:06] [udsserver.go:174] [start] Unix domain socket initialised. Listening for new connection. 
DEBU[2023-04-20 18:21:06] [poolManager.go:152] [Allocate] New allocate request on pool core            
INFO[2023-04-20 18:21:06] [poolManager.go:155] [Allocate] Creating new UDS server                      
DEBU[2023-04-20 18:21:06] [poolManager.go:180] [Allocate] Device: {
  "Name": "ens4f1",
  "Mode": "primary",
  "Driver": "ice",
  "Pci": "0000:d8:00.1",
  "MacAddress": "3c:ec:ef:d9:62:f3",
  "FullyAssigned": true,
  "EthtoolFilters": null,
  "Primary": {
    "Name": "ens4f1",
    "Mode": "primary",
    "Driver": "ice",
    "Pci": "0000:d8:00.1",
    "MacAddress": "3c:ec:ef:d9:62:f3",
    "FullyAssigned": true,
    "EthtoolFilters": null,
    "Primary": null
  }
} 
DEBU[2023-04-20 18:21:06] [poolManager.go:190] [Allocate] Primary mode                                 
DEBU[2023-04-20 18:21:06] [poolManager.go:202] [Allocate] Cycling state of device ens4f1               
INFO[2023-04-20 18:21:06] [poolManager.go:209] [Allocate] Loading BPF program on device: ens4f1        
INFO[2023-04-20 18:21:06] [bpfWrapper.go:104] [Infof] Load_bpf_send_xsk_map: disovering if_index for interface ens4f1 
INFO[2023-04-20 18:21:06] [bpfWrapper.go:104] [Infof] Load_bpf_send_xsk_map: if_index for interface ens4f1 is 5 
INFO[2023-04-20 18:21:06] [bpfWrapper.go:104] [Infof] Load_bpf_send_xsk_map: starting setup of xdp program on interface ens4f1 (5) 
INFO[2023-04-20 18:21:06] [bpfWrapper.go:104] [Infof] Load_bpf_send_xsk_map: loaded xdp program on interface ens4f1 (5), file descriptor 15 
INFO[2023-04-20 18:21:06] [poolManager.go:215] [Allocate] BPF program loaded on: ens4f1 File descriptor: 15 
DEBU[2023-04-20 18:21:06] [poolManager.go:233] [Allocate] Container environment variables: {
  "AFXDP_DEVICES": "ens4f1"
} 
DEBU[2023-04-20 18:21:06] [udsserver.go:166] [start] Initialising Unix domain socket: /tmp/afxdp_dp/afxdp_core/36add699-c5fb-4edb-9609-68c4d6f7a0ee.sock 
INFO[2023-04-20 18:21:06] [udsserver.go:174] [start] Unix domain socket initialised. Listening for new connection. 
ERRO[2023-04-20 18:21:36] [uds.go:134] [Listen] Listener timed out: accept unixpacket /tmp/afxdp_dp/afxdp_access/13ab88a3-4751-40fb-a9b5-2b3118180c4e.sock: i/o timeout 
ERRO[2023-04-20 18:21:36] [udsserver.go:179] [start] Listener timed out: accept unixpacket /tmp/afxdp_dp/afxdp_access/13ab88a3-4751-40fb-a9b5-2b3118180c4e.sock: i/o timeout 
DEBU[2023-04-20 18:21:36] [uds.go:298] [cleanup] Closing Unix listener                        
DEBU[2023-04-20 18:21:36] [uds.go:304] [cleanup] Closing socket file                          
DEBU[2023-04-20 18:21:36] [uds.go:306] [cleanup] Removing socket file                         
ERRO[2023-04-20 18:21:36] [uds.go:134] [Listen] Listener timed out: accept unixpacket /tmp/afxdp_dp/afxdp_core/36add699-c5fb-4edb-9609-68c4d6f7a0ee.sock: i/o timeout 
ERRO[2023-04-20 18:21:36] [udsserver.go:179] [start] Listener timed out: accept unixpacket /tmp/afxdp_dp/afxdp_core/36add699-c5fb-4edb-9609-68c4d6f7a0ee.sock: i/o timeout 
DEBU[2023-04-20 18:21:36] [uds.go:298] [cleanup] Closing Unix listener                        
DEBU[2023-04-20 18:21:36] [uds.go:304] [cleanup] Closing socket file                          
DEBU[2023-04-20 18:21:36] [uds.go:306] [cleanup] Removing socket file

daemonset file:

apiVersion: v1
kind: ConfigMap
metadata:
  name: afxdp-dp-config
  namespace: kube-system
data:
  config.json: |
    {
        "logLevel":"debug",
        "logFile":"afxdp-dp.log",
        "pools":[
          {
             "name":"access",
             "mode":"primary",
             "drivers":[
                {
                   "name":"ice",
                   "excludeDevices":[
                    {
                        "name":"ens4f1"
                    }
                    ]
                }
             ]
          },
          {
             "name":"core",
             "mode":"primary",
             "drivers":[
                {
                   "name":"ice",
                   "excludeDevices":[
                    {
                        "name":"ens4f0"
                    }
                    ]
                }
             ]
          }
       ]
    }

dpdk-devbind.py -s:

Network devices using kernel driver
===================================
0000:60:00.0 'Ethernet Connection X722 for 1GbE 37d1' if=eno1 drv=i40e unused=vfio-pci *Active*
0000:60:00.1 'Ethernet Connection X722 for 1GbE 37d1' if=eno2 drv=i40e unused=vfio-pci 
0000:d8:00.0 'Device 159b' if= drv=ice unused=vfio-pci 
0000:d8:00.1 'Device 159b' if= drv=ice unused=vfio-pci 

cc @garyloug

aeliusrs commented 1 year ago

Got the same error with ubuntu 22.04 and a simpler configuration in a vagrant machine

the pod is running with securityContext privileged: True.

garyloug commented 1 year ago

Hi @Alt-Shivam , @aeliusrs ,

INFO[2023-04-20 18:21:06] [udsserver.go:174] [start] Unix domain socket initialised. Listening for new connection. 
ERRO[2023-04-20 18:21:36] [uds.go:134] [Listen] Listener timed out: accept unixpacket /tmp/afxdp_dp/afxdp_access/13ab88a3-4751-40fb-a9b5-2b3118180c4e.sock: i/o timeout

You'll see the UDS server starts listening on the socket and then exactly 30 seconds later times out. This is by design.

Each container requesting AF_XDP network results in spinning up of a little UDS server (Go routine) to perform the XSK handshake. It's due to the way device plugins works. At the Allocate() stage, the device plugin has no idea which pod it is allocating the device(s) to, so the UDS server initially has no idea which pod it is serving. It cannot watch for a particular pod.

In the case of failed pod, we don't want these UDS servers to stay listening forever, creating more and more Go routines. So we rely on activity on the UDS to tell us if the pod started successfully. If no message comes over the UDS by a certain timeout, we assume failed pod and the UDS server shuts down. By default, say for production environments, this timeout is 30 seconds, but it is configurable up to 5min. "udsTimeout":300,

Also, we don't really advertise this in the documentation, or recommend it for production, but for debugging or developing you can disable this timeout altogether with a -1 value. Just be aware of the servers mentioned above. They will stay running forever until the handshake occurs and you finish the handshake with a /fin.

KR, Gary

Alt-Shivam commented 1 year ago

Thanks @garyloug @aeliusrs for the comments. Got the idea.