Open 316953425 opened 7 months ago
@ljkiraly Could you please take a look if you have a chance?
Hi @316953425 ,
Is that a new configuration? Has that worked before in this or another environment?
Could you try without specifying the service domain?
Please attach the forwarder logs to the issue (kubectl logs forwarder-ovs-dp8qz -n nsm-system).
hi @glazychev-art @ljkiraly
Is that a new configuration? Has that worked before in this or another environment?
I am using the configuration from version 1.11.1 (https://github.com/networkservicemesh/deployments-k8s/tree/release/v1.11.1/examples/ovs) without any changes. This is my first installation; I have never installed it before.
Could you try without specifying the service domain?
My install steps (using version v1.11.1), without specifying the service domain:
1. Install spire: kubectl apply -k https://github.com/networkservicemesh/deployments-k8s/examples/spire/single_cluster?ref=v1.11.1
2. Apply the ClusterSPIFFEID template: kubectl apply -f https://raw.githubusercontent.com/networkservicemesh/deployments-k8s/v1.11.1/examples/spire/single_cluster/clusterspiffeid-template.yaml
3. Install ovs: kubectl apply -k https://github.com/networkservicemesh/deployments-k8s/examples/ovs?ref=v1.11.1
cat /var/lib/networkservicemesh/smartnic.config
physicalFunctions:
0000:3d:00.0:
pfKernelDriver: i40e
vfKernelDriver: i40evf
capabilities:
- intel
- 1G
Please attach the forwarder logs to the issue. (kubectl logs forwarder-ovs-dp8qz -n nsm-system)
[root@CNCP-MS-01 tmp]# kubectl logs forwarder-ovs-vlbt9 -n nsm-system
I1220 01:28:03.231538 15220 ovs.go:98] Maximum command line arguments set to: 191102
Dec 20 01:28:03.232 [INFO] Setting env variable DLV_LISTEN_FORWARDER to a valid dlv '--listen' value will cause the dlv debugger to execute this binary and listen as directed.
2023/12/20 01:28:03 [INFO] there are 5 phases which will be executed followed by a success message:
2023/12/20 01:28:03 [INFO] the phases include:
2023/12/20 01:28:03 [INFO] 1: get config from environment
2023/12/20 01:28:03 [INFO] 2: ensure ovs is running
2023/12/20 01:28:03 [INFO] 3: retrieve spiffe svid
2023/12/20 01:28:03 [INFO] 4: create ovs forwarder network service endpoint
2023/12/20 01:28:03 [INFO] 5: create grpc server and register ovsxconnect
2023/12/20 01:28:03 [INFO] 6: register ovs forwarder network service with the registry
2023/12/20 01:28:03 [INFO] a final success message with start time duration
2023/12/20 01:28:03 [INFO] executing phase 1: get config from environment (time since start: 65.658µs)
This application is configured via the environment. The following environment
variables can be used:
KEY TYPE DEFAULT REQUIRED DESCRIPTION
NSM_NAME String forwarder Name of Endpoint
NSM_LABELS Comma-separated list of String:String pairs p2p:true Labels related to this forwarder-vpp instance
NSM_NSNAME String forwarder Name of Network Service to Register with Registry
NSM_BRIDGENAME String br-nsm Name of the OvS bridge
NSM_TUNNEL_IP String IP or CIDR to use for tunnels
NSM_CONNECT_TO URL unix:///connect.to.socket url to connect to
NSM_DIAL_TIMEOUT Duration 50ms Timeout for the dial the next endpoint
NSM_MAX_TOKEN_LIFETIME Duration 24h maximum lifetime of tokens
NSM_REGISTRY_CLIENT_POLICIES Comma-separated list of String etc/nsm/opa/common/.*.rego,etc/nsm/opa/registry/.*.rego,etc/nsm/opa/client/.*.rego paths to files and directories that contain registry client policies
NSM_RESOURCE_POLL_TIMEOUT Duration 30s device plugin polling timeout
NSM_DEVICE_PLUGIN_PATH String /var/lib/kubelet/device-plugins/ path to the device plugin directory
NSM_POD_RESOURCES_PATH String /var/lib/kubelet/pod-resources/ path to the pod resources directory
NSM_SRIOV_CONFIG_FILE String pci.config PCI resources config path
NSM_L2_RESOURCE_SELECTOR_FILE String config file for resource to label matching
NSM_PCI_DEVICES_PATH String /sys/bus/pci/devices path to the PCI devices directory
NSM_PCI_DRIVERS_PATH String /sys/bus/pci/drivers path to the PCI drivers directory
NSM_CGROUP_PATH String /host/sys/fs/cgroup/devices path to the host cgroup directory
NSM_VFIO_PATH String /host/dev/vfio path to the host VFIO directory
NSM_LOG_LEVEL String INFO Log level
NSM_OPENTELEMETRYENDPOINT String otel-collector.observability.svc.cluster.local:4317 OpenTelemetry Collector Endpoint
NSM_METRICS_EXPORT_INTERVAL Duration 10s interval between mertics exports
2023/12/20 01:28:03 [INFO] Config: &main.Config{Name:"forwarder-ovs-vlbt9", Labels:map[string]string{"p2p":"true"}, NSName:"forwarder", BridgeName:"br-nsm", TunnelIP:"172.16.102.11", ConnectTo:url.URL{Scheme:"unix", Opaque:"", User:(*url.Userinfo)(nil), Host:"", Path:"/var/lib/networkservicemesh/nsm.io.sock", RawPath:"", OmitHost:false, ForceQuery:false, RawQuery:"", Fragment:"", RawFragment:""}, DialTimeout:50000000, MaxTokenLifetime:86400000000000, RegistryClientPolicies:[]string{"etc/nsm/opa/common/.*.rego", "etc/nsm/opa/registry/.*.rego", "etc/nsm/opa/client/.*.rego"}, ResourcePollTimeout:30000000000, DevicePluginPath:"/var/lib/kubelet/device-plugins/", PodResourcesPath:"/var/lib/kubelet/pod-resources/", SRIOVConfigFile:"/var/lib/networkservicemesh/smartnic.config", L2ResourceSelectorFile:"", PCIDevicesPath:"/sys/bus/pci/devices", PCIDriversPath:"/sys/bus/pci/drivers", CgroupPath:"/host/sys/fs/cgroup/devices", VFIOPath:"/host/dev/vfio", LogLevel:"INFO", OpenTelemetryEndpoint:"otel-collector.observability.svc.cluster.local:4317", MetricsExportInterval:10000000000}
2023/12/20 01:28:03 [INFO] [duration:2.665071ms] completed phase 1: get config from environment
2023/12/20 01:28:03 [INFO] executing phase 2: ensure ovs is running (time since start: 2.758614ms)
2023/12/20 01:28:04 [INFO] local ovs is being used
2023/12/20 01:28:04 [INFO] [duration:1.262604272s] completed phase 2: ensure ovs is running
2023/12/20 01:28:04 [INFO] executing phase 3: retrieving svid, check spire agent logs if this is the last line you see (time since start: 1.265393546s)
Dec 20 01:28:04.541 [INFO] SVID: "spiffe://k8s.nsm/ns/nsm-system/pod/forwarder-ovs-vlbt9"
2023/12/20 01:28:04 [INFO] [duration:43.303477ms] completed phase 3: retrieving svid
2023/12/20 01:28:04 [INFO] executing phase 4: create ovsxconnect network service endpoint (time since start: 1.308742579s)
Dec 20 01:28:04.542 [FATA] error configuring forwarder endpoint: 0000:3d:00.0 has no ServiceDomains set; github.com/networkservicemesh/sdk-sriov/pkg/sriov/config.ReadConfig; /go/pkg/mod/github.com/networkservicemesh/sdk-sriov@v1.11.1/pkg/sriov/config/config.go:117; main.createSriovInterposeEndpoint; /build/main.go:354; main.createInterposeEndpoint; /build/main.go:319; main.main; /build/main.go:197; runtime.main; /usr/local/go/src/runtime/proc.go:250; runtime.goexit; /usr/local/go/src/runtime/asm_amd64.s:1598;
thanks
With the service domain specified:
physicalFunctions:
0000:3d:00.0:
pfKernelDriver: i40e
vfKernelDriver: i40evf
capabilities:
- intel
- 1G
serviceDomains:
- worker.domain
[root@CNCP-MS-01 deployments-k8s-release-v1.11.1]# kubectl logs forwarder-ovs-b4jc2 -n nsm-system
I1220 01:58:07.267487 33840 ovs.go:98] Maximum command line arguments set to: 191102
Dec 20 01:58:07.268 [INFO] Setting env variable DLV_LISTEN_FORWARDER to a valid dlv '--listen' value will cause the dlv debugger to execute this binary and listen as directed.
2023/12/20 01:58:07 [INFO] there are 5 phases which will be executed followed by a success message:
2023/12/20 01:58:07 [INFO] the phases include:
2023/12/20 01:58:07 [INFO] 1: get config from environment
2023/12/20 01:58:07 [INFO] 2: ensure ovs is running
2023/12/20 01:58:07 [INFO] 3: retrieve spiffe svid
2023/12/20 01:58:07 [INFO] 4: create ovs forwarder network service endpoint
2023/12/20 01:58:07 [INFO] 5: create grpc server and register ovsxconnect
2023/12/20 01:58:07 [INFO] 6: register ovs forwarder network service with the registry
2023/12/20 01:58:07 [INFO] a final success message with start time duration
2023/12/20 01:58:07 [INFO] executing phase 1: get config from environment (time since start: 67.346µs)
This application is configured via the environment. The following environment
variables can be used:
KEY TYPE DEFAULT REQUIRED DESCRIPTION
NSM_NAME String forwarder Name of Endpoint
NSM_LABELS Comma-separated list of String:String pairs p2p:true Labels related to this forwarder-vpp instance
NSM_NSNAME String forwarder Name of Network Service to Register with Registry
NSM_BRIDGENAME String br-nsm Name of the OvS bridge
NSM_TUNNEL_IP String IP or CIDR to use for tunnels
NSM_CONNECT_TO URL unix:///connect.to.socket url to connect to
NSM_DIAL_TIMEOUT Duration 50ms Timeout for the dial the next endpoint
NSM_MAX_TOKEN_LIFETIME Duration 24h maximum lifetime of tokens
NSM_REGISTRY_CLIENT_POLICIES Comma-separated list of String etc/nsm/opa/common/.*.rego,etc/nsm/opa/registry/.*.rego,etc/nsm/opa/client/.*.rego paths to files and directories that contain registry client policies
NSM_RESOURCE_POLL_TIMEOUT Duration 30s device plugin polling timeout
NSM_DEVICE_PLUGIN_PATH String /var/lib/kubelet/device-plugins/ path to the device plugin directory
NSM_POD_RESOURCES_PATH String /var/lib/kubelet/pod-resources/ path to the pod resources directory
NSM_SRIOV_CONFIG_FILE String pci.config PCI resources config path
NSM_L2_RESOURCE_SELECTOR_FILE String config file for resource to label matching
NSM_PCI_DEVICES_PATH String /sys/bus/pci/devices path to the PCI devices directory
NSM_PCI_DRIVERS_PATH String /sys/bus/pci/drivers path to the PCI drivers directory
NSM_CGROUP_PATH String /host/sys/fs/cgroup/devices path to the host cgroup directory
NSM_VFIO_PATH String /host/dev/vfio path to the host VFIO directory
NSM_LOG_LEVEL String INFO Log level
NSM_OPENTELEMETRYENDPOINT String otel-collector.observability.svc.cluster.local:4317 OpenTelemetry Collector Endpoint
NSM_METRICS_EXPORT_INTERVAL Duration 10s interval between mertics exports
2023/12/20 01:58:07 [INFO] Config: &main.Config{Name:"forwarder-ovs-b4jc2", Labels:map[string]string{"p2p":"true"}, NSName:"forwarder", BridgeName:"br-nsm", TunnelIP:"172.16.102.11", ConnectTo:url.URL{Scheme:"unix", Opaque:"", User:(*url.Userinfo)(nil), Host:"", Path:"/var/lib/networkservicemesh/nsm.io.sock", RawPath:"", OmitHost:false, ForceQuery:false, RawQuery:"", Fragment:"", RawFragment:""}, DialTimeout:50000000, MaxTokenLifetime:86400000000000, RegistryClientPolicies:[]string{"etc/nsm/opa/common/.*.rego", "etc/nsm/opa/registry/.*.rego", "etc/nsm/opa/client/.*.rego"}, ResourcePollTimeout:30000000000, DevicePluginPath:"/var/lib/kubelet/device-plugins/", PodResourcesPath:"/var/lib/kubelet/pod-resources/", SRIOVConfigFile:"/var/lib/networkservicemesh/smartnic.config", L2ResourceSelectorFile:"", PCIDevicesPath:"/sys/bus/pci/devices", PCIDriversPath:"/sys/bus/pci/drivers", CgroupPath:"/host/sys/fs/cgroup/devices", VFIOPath:"/host/dev/vfio", LogLevel:"INFO", OpenTelemetryEndpoint:"otel-collector.observability.svc.cluster.local:4317", MetricsExportInterval:10000000000}
2023/12/20 01:58:07 [INFO] [duration:2.641675ms] completed phase 1: get config from environment
2023/12/20 01:58:07 [INFO] executing phase 2: ensure ovs is running (time since start: 2.745275ms)
2023/12/20 01:58:08 [INFO] local ovs is being used
2023/12/20 01:58:08 [INFO] [duration:1.270789462s] completed phase 2: ensure ovs is running
2023/12/20 01:58:08 [INFO] executing phase 3: retrieving svid, check spire agent logs if this is the last line you see (time since start: 1.273562642s)
Dec 20 01:58:08.585 [INFO] SVID: "spiffe://k8s.nsm/ns/nsm-system/pod/forwarder-ovs-b4jc2"
2023/12/20 01:58:08 [INFO] [duration:43.877516ms] completed phase 3: retrieving svid
2023/12/20 01:58:08 [INFO] executing phase 4: create ovsxconnect network service endpoint (time since start: 1.317484424s)
Dec 20 01:58:08.586 [INFO] [Config:ReadConfig] unmarshalled Config: &{PhysicalFunctions:map[0000:3d:00.0:&{PFKernelDriver:i40e VFKernelDriver:i40evf Capabilities:[intel 1G] ServiceDomains:[worker.domain] VirtualFunctions:[]}]}
Dec 20 01:58:08.593 [FATA] error configuring forwarder endpoint: lstat /sys/bus/pci/devices/0000:3d:02.0/iommu_group: no such file or directory; error getting info about specified file: /sys/bus/pci/devices/0000:3d:02.0/iommu_group; github.com/networkservicemesh/sdk-sriov/pkg/sriov/pcifunction.evalSymlinkAndGetBaseName; /go/pkg/mod/github.com/networkservicemesh/sdk-sriov@v1.11.1/pkg/sriov/pcifunction/tools.go:50; github.com/networkservicemesh/sdk-sriov/pkg/sriov/pcifunction.(*Function).GetIOMMUGroup; /go/pkg/mod/github.com/networkservicemesh/sdk-sriov@v1.11.1/pkg/sriov/pcifunction/function.go:75; github.com/networkservicemesh/sdk-sriov/pkg/sriov/pci.UpdateConfig; /go/pkg/mod/github.com/networkservicemesh/sdk-sriov@v1.11.1/pkg/sriov/pci/update_config.go:33; main.createSriovInterposeEndpoint; /build/main.go:359; main.createInterposeEndpoint; /build/main.go:319; main.main; /build/main.go:197; runtime.main; /usr/local/go/src/runtime/proc.go:250; runtime.goexit; /usr/local/go/src/runtime/asm_amd64.s:1598;
[root@CNCP-MS-01 deployments-k8s-release-v1.11.1]# dmesg | grep -E "DMAR|IOMMU"
[ 0.016059] ACPI: DMAR 0x0000000068D7C1C8 000248 (v01 ALASKA A M I 00000001 INTL 20091013)
[ 1.961752] DMAR: Host address width 46
[ 1.961753] DMAR: DRHD base: 0x000000d37fc000 flags: 0x0
[ 1.961760] DMAR: dmar0: reg_base_addr d37fc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 1.961761] DMAR: DRHD base: 0x000000e0ffc000 flags: 0x0
[ 1.961765] DMAR: dmar1: reg_base_addr e0ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 1.961766] DMAR: DRHD base: 0x000000ee7fc000 flags: 0x0
[ 1.961770] DMAR: dmar2: reg_base_addr ee7fc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 1.961771] DMAR: DRHD base: 0x000000fbffc000 flags: 0x0
[ 1.961774] DMAR: dmar3: reg_base_addr fbffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 1.961775] DMAR: DRHD base: 0x000000aaffc000 flags: 0x0
[ 1.961779] DMAR: dmar4: reg_base_addr aaffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 1.961780] DMAR: DRHD base: 0x000000b87fc000 flags: 0x0
[ 1.961783] DMAR: dmar5: reg_base_addr b87fc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 1.961784] DMAR: DRHD base: 0x000000c5ffc000 flags: 0x0
[ 1.961788] DMAR: dmar6: reg_base_addr c5ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 1.961789] DMAR: DRHD base: 0x0000009d7fc000 flags: 0x1
[ 1.961792] DMAR: dmar7: reg_base_addr 9d7fc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 1.961793] DMAR: RMRR base: 0x0000006b624000 end: 0x0000006b635fff
[ 1.961796] DMAR: ATSR flags: 0x0
[ 1.961797] DMAR: RHSA base: 0x0000009d7fc000 proximity domain: 0x0
[ 1.961798] DMAR: RHSA base: 0x000000aaffc000 proximity domain: 0x0
[ 1.961798] DMAR: RHSA base: 0x000000b87fc000 proximity domain: 0x0
[ 1.961799] DMAR: RHSA base: 0x000000c5ffc000 proximity domain: 0x0
[ 1.961800] DMAR: RHSA base: 0x000000d37fc000 proximity domain: 0x1
[ 1.961800] DMAR: RHSA base: 0x000000e0ffc000 proximity domain: 0x1
[ 1.961801] DMAR: RHSA base: 0x000000ee7fc000 proximity domain: 0x1
[ 1.961801] DMAR: RHSA base: 0x000000fbffc000 proximity domain: 0x1
[ 1.961804] DMAR-IR: IOAPIC id 12 under DRHD base 0xc5ffc000 IOMMU 6
[ 1.961805] DMAR-IR: IOAPIC id 11 under DRHD base 0xb87fc000 IOMMU 5
[ 1.961806] DMAR-IR: IOAPIC id 10 under DRHD base 0xaaffc000 IOMMU 4
[ 1.961807] DMAR-IR: IOAPIC id 18 under DRHD base 0xfbffc000 IOMMU 3
[ 1.961808] DMAR-IR: IOAPIC id 17 under DRHD base 0xee7fc000 IOMMU 2
[ 1.961809] DMAR-IR: IOAPIC id 16 under DRHD base 0xe0ffc000 IOMMU 1
[ 1.961809] DMAR-IR: IOAPIC id 15 under DRHD base 0xd37fc000 IOMMU 0
[ 1.961811] DMAR-IR: IOAPIC id 8 under DRHD base 0x9d7fc000 IOMMU 7
[ 1.961812] DMAR-IR: IOAPIC id 9 under DRHD base 0x9d7fc000 IOMMU 7
[ 1.961812] DMAR-IR: HPET id 0 under DRHD base 0x9d7fc000
[ 1.961814] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[ 1.963751] DMAR-IR: Enabled IRQ remapping in x2apic mode
[root@CNCP-MS-01 deployments-k8s-release-v1.11.1]# lsmod | grep vfio_pci
vfio_pci 53248 0
vfio_virqfd 16384 1 vfio_pci
vfio 32768 2 vfio_iommu_type1,vfio_pci
irqbypass 16384 2 vfio_pci,kvm
Perhaps it is caused by "[FATA] error configuring forwarder endpoint: lstat /sys/bus/pci/devices/0000:3d:02.0/iommu_group: no such file or directory".
Hi @316953425 ,
Perhaps it is caused by "[FATA] error configuring forwarder endpoint: lstat /sys/bus/pci/devices/0000:3d:02.0/iommu_group: no such file or directory"
Yes, it's definitely related to that log printout. It's strange that ethtool shows bus info: 0000:3d:00.0, but the forwarder is looking for 0000:3d:02.0 (a different device number: 02).
Could you check where the sysfs link is pointing?
ls -l /sys/class/net/ens5/device
In theory, one of the conditions described here might also cause such a printout: https://www.kernel.org/doc/html/latest/driver-api/vfio.html#vfio-usage-example
Could you check the output of this command on one of the master nodes?
readlink /sys/bus/pci/devices/0000:3d:02.0/iommu_group
I'm not sure, but if the IOMMU is enabled, the dmesg output should contain the "DMAR: IOMMU enabled" line. Could you also check whether the grub config of the master nodes contains the "intel_iommu=on" option?
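To survey all devices in one pass rather than one readlink at a time, something like the following could be used. This is only a sketch; report_iommu is an ad-hoc name, and the sysfs root is a parameter only so the function can be dry-run against a fake tree.

```shell
# report_iommu [SYSFS_ROOT]: for every PCI device directory, print either the
# iommu_group it belongs to or a MISSING marker. On a real node, call it with
# no argument to scan /sys/bus/pci/devices.
report_iommu() {
  root="${1:-/sys/bus/pci/devices}"
  for dev in "$root"/*; do
    [ -d "$dev" ] || continue
    if [ -e "$dev/iommu_group" ]; then
      printf '%s -> %s\n' "$(basename "$dev")" "$(readlink "$dev/iommu_group")"
    else
      printf '%s -> MISSING iommu_group\n' "$(basename "$dev")"
    fi
  done
}
```

If every device reports MISSING, the IOMMU is most likely not active at all; if only some do, the problem is per-device.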
hi @glazychev-art @ljkiraly
Could you check where the sysfs link is pointing? ls -l /sys/class/net/ens5/device
[root@CNCP-MS-01 deployments-k8s-release-v1.11.1]# ls -l /sys/class/net/ens5/device
lrwxrwxrwx 1 root root 0 Dec 21 08:39 /sys/class/net/ens5/device -> ../../../0000:3d:00.0
readlink /sys/bus/pci/devices/0000:3d:02.0/iommu_group
[root@CNCP-MS-01 deployments-k8s-release-v1.11.1]# readlink /sys/bus/pci/devices/0000:3d:02.0/iommu_group
[root@CNCP-MS-01 deployments-k8s-release-v1.11.1]#
[root@CNCP-MS-01 deployments-k8s-release-v1.11.1]# lspci | grep net
19:00.0 Ethernet controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
3d:00.0 Ethernet controller: Intel Corporation Ethernet Connection X722 for 1GbE (rev 09)
3d:00.1 Ethernet controller: Intel Corporation Ethernet Connection X722 for 1GbE (rev 09)
3d:00.2 Ethernet controller: Intel Corporation Ethernet Connection X722 for 1GbE (rev 09)
3d:00.3 Ethernet controller: Intel Corporation Ethernet Connection X722 for 1GbE (rev 09)
3d:02.0 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 09)
3d:02.1 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 09)
3d:02.2 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 09)
3d:02.3 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 09)
3d:02.4 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 09)
3d:02.5 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 09)
3d:02.6 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 09)
3d:02.7 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 09)
3d:03.0 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 09)
3d:03.1 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 09)
3d:03.2 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 09)
3d:03.3 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 09)
3d:03.4 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 09)
3d:03.5 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 09)
3d:03.6 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 09)
3d:03.7 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 09)
3d:04.0 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 09)
3d:04.1 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 09)
3d:04.2 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 09)
3d:04.3 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 09)
3d:04.4 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 09)
3d:04.5 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 09)
3d:04.6 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 09)
3d:04.7 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 09)
3d:05.0 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 09)
3d:05.1 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 09)
3d:05.2 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 09)
3d:05.3 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 09)
3d:05.4 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 09)
3d:05.5 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 09)
3d:05.6 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 09)
3d:05.7 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 09)
5e:00.0 Ethernet controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
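The lspci output above shows 0000:3d:02.0 as an Intel "Ethernet Virtual Function", so it is presumably a VF created from the X722 PF at 0000:3d:00.0. The PF-to-VF mapping can be confirmed through the virtfn* symlinks the kernel creates in sysfs; a minimal sketch (list_vfs is a hypothetical helper name, and the directory argument exists only so the function can also be tried against a fake tree):

```shell
# list_vfs PF_DIR: print the PCI addresses of a PF's virtual functions by
# following the virtfn* symlinks under the PF's sysfs directory.
list_vfs() {
  for vf in "$1"/virtfn*; do
    [ -L "$vf" ] || continue
    basename "$(readlink -f "$vf")"
  done
}
# On a node: list_vfs /sys/bus/pci/devices/0000:3d:00.0
```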
I'm not sure but if the IOMMU is enabled then the output should contain the "DMAR: IOMMU enabled" line. Could you also check the grub config of master nodes if contains the "intel_iommu=on" option?
yes
[root@CNCP-MS-01 deployments-k8s-release-v1.11.1]# cat /etc/default/grub
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="crashkernel=auto spectre_v2=retpoline rd.lvm.lv=centos/root rd.lvm.lv=centos/swap net.ifnames=0 biosdevname=0 rhgb quiet intel_iommu=on iommu=pt"
GRUB_DISABLE_RECOVERY="true"
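Note that the flag being present in /etc/default/grub does not guarantee it reached the running kernel: on CentOS, grub.cfg must be regenerated (grub2-mkconfig) and the node rebooted. The more direct check is /proc/cmdline. A small hedged helper (has_iommu_flag is an ad-hoc name):

```shell
# has_iommu_flag CMDLINE: succeed (exit 0) when the given kernel command line
# contains the intel_iommu=on option.
has_iommu_flag() {
  case " $1 " in
    *" intel_iommu=on "*) return 0 ;;
    *) return 1 ;;
  esac
}
# On a node: has_iommu_flag "$(cat /proc/cmdline)" && echo "IOMMU flag active"
```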
It's strange that ethtool shows bus info: 0000:3d:00.0, but the forwarder is looking for 0000:3d:02.0 (a different device number: 02)
This also confuses me, but the file /sys/bus/pci/devices/0000:3d:00.0/iommu_group does not exist either.
[root@CNCP-MS-01 use-cases]# ls /sys/bus/pci/devices/0000:3d:00.0/ | grep iommu_group
[root@CNCP-MS-01 use-cases]#
[root@CNCP-MS-01 use-cases]#
The network card in my configuration (PCI address 0000:3d:00.0) is not the one currently used by the k8s CNI. I am not sure whether this has any impact.
Also, I don't quite understand the meaning of serviceDomains. I filled in worker.domain according to the example in the documentation. Can you tell me what serviceDomains means?
Finally, something I don't quite understand in the document https://github.com/networkservicemesh/deployments-k8s/tree/release/v1.11.1/examples/sriov: why is the serviceDomains value of the master node worker.domain, while the serviceDomains value of the worker node is master.domain? Could you tell me?
thanks
[root@CNCP-MS-02 ~]# cat /var/lib/networkservicemesh/smartnic.config
physicalFunctions:
0000:3d:00.0:
pfKernelDriver: i40e
vfKernelDriver: i40evf
capabilities:
- intel
- 1G
serviceDomains:
- worker.domain
Hi @316953425 , @glazychev-art ,
Also, I don’t quite understand the meaning of serviceDomains. I filled in worker.domain according to the example in the document. Can you tell me the meaning of serviceDomains?
That part of your configuration is correct; I didn't want to confuse you. It gives you the possibility to refer to a physical resource by serviceDomain/capability (for example, sriovToken=worker.domain/1G).
We should focus on the network card configuration on the nodes. Did you restart the node after changing the grub config (or was this configuration already present from the start)? Do you have any node where the forwarder starts properly?
hi @glazychev-art @ljkiraly
Did you restarted the node after the grub config (or this configuration was already present from the start)?
Yes, I restarted the node after changing the grub config.
Do you have any node where the forwarder starts properly?
No, the forwarder fails to start on all nodes, and the error messages are the same.
hi @glazychev-art Has this example (https://github.com/networkservicemesh/deployments-k8s/tree/release/v1.11.1/examples/ovs) been successfully deployed before? thanks
hi @316953425 We've discussed this issue a bit - we haven't actually run these examples for a while. Perhaps we will consider this issue in the next release.
hi @glazychev-art I failed to deploy https://github.com/networkservicemesh/deployments-k8s/tree/release/v1.11.1/examples/ovs. My k8s environment has three master nodes. I suspect my configuration file /var/lib/networkservicemesh/smartnic.config has some problems. It is the same on each master node, and each master node has the same network card configuration.
The content is as follows:
pod error:
Could you tell me how to fix it, thanks