i40e_netmap_rxsync race condition

mihadarktrace commented 3 years ago

Hi,

We believe we found a possible underlying cause for https://github.com/luigirizzo/netmap/issues/414. We've been having a host of issues in which the rings are inaccessible to netmap with a lot of i40e_netmap_rxsync [r] ring eth5 RX0 is missing (rxr=00000000ab224491) messages in the kernel logs. These only seem to happen in certain, unclear, situations, making them hard to reproduce.

In one specific issue however, we managed to see through dynamic debugging that the ring missing messages are preceded by the following:

i40e_dcbnl_del_app: i40e_netmap 0000:3d:00.2: Deleting app for VSI seid=398 err=-2 sel=1 proto=0x8906 prio=3
i40e_dcb_need_reconfig: i40e_netmap 0000:3d:00.3: ETS TSA Table changed.
i40e_dcb_need_reconfig: i40e_netmap 0000:3d:00.3: PFC config change detected.
i40e_dcb_need_reconfig: i40e_netmap 0000:3d:00.3: APP Table change detected.
i40e_dcb_need_reconfig: i40e_netmap 0000:3d:00.3: dcb need_reconfig=1
i40e_netmap_rxsync   [r] ring eth5 RX0 is missing (rxr=0000000059ad3b29)

A look at the driver code suggested that this happened in the i40e_handle_lldp function (in i40e_main.c) which gets called when the driver is handling message passed by the firmware through the admin queue. Looking further into the above function shows that it calls i40e_pf_quiesce_all_vsi(pf) (supposing all previous checks are passed, which the above logging suggests they are). This quiesce function deletes all rings and ring descriptors and sets them to NULL - which would explain why the netmap ...sync functions find them in such a state. Disabling the i40e firmware LLDP agent in this specific case has solved the problem and the messages about the ring missing have dissappeared. For reference, the i40e LLDP firmware agent can be disabled as described here https://advantech-ncg.zendesk.com/hc/en-us/articles/360020364512-How-to-Disable-LLDP-agent-on-XL710-in-Linux.

Before calling i40e_handle_lldp the driver does use a lock/semaphore through rtnl_lock() in i40e_clean_adminq_subtask, but from what I've been able to find that is meant to protect access to the netdev kernel structure as opposed to the rings themselves. A solution to the above issues might be to simply try to lock it at the beginning of i40e_netmap_rxsync/txsync but I do not have enough knowledge about kernel programming to say that this would be optimal or even correct and would not impact performance.

There are other places in which the above i40e_pf_quiesce_all_vsi(pf) function is called (most of them DCB related) so another possible solution would be to compile with CONFIG_DCB undefined (where that is possible) and removing them entirely.

Another place where this seems to happen is in when the card resets, via the set_bit(...) and i40e_service_event_schedule functions, but I have not looked in detail as to how/if netmap handles this specific situation.

Please correct my reasoning on any of the above points if it's incorrect, my familiarity with driver code is very limited and I just recently started diving into it in more detail.

giuseppelettieri commented 3 years ago

Hi @mihadarktrace, thank you for the detailed exploration of this issue! I agree with your analysis, but calling rtnl_lock() in the data path is out of the question (this is a global lock used used in the slow control path).

We have the infrastructure in place to protect netmap from these ring resets events, but we need to patch in some netmap calls in the right places. I'll have a look.

giuseppelettieri commented 3 years ago

May you please try the following patch?

diff --git a/LINUX/default-config.mak.in_ b/LINUX/default-config.mak.in_
index 514ec80b4..9a4ce3094 100644
--- a/LINUX/default-config.mak.in_
+++ b/LINUX/default-config.mak.in_
@@ -82,7 +82,7 @@ igb@prepare := $(if $(filter $(igb@v),5.3.5.61 5.3.6 5.4.6 5.5.2),@SRCDIR@/intel
 e1000e@prepare := $(if $(filter $(e1000e@v),3.8.4),@SRCDIR@/intel-fix.sh e1000e,)
 ixgbevf@prepare := $(if $(filter $(ixgbevf@v),4.7.1 4.8.1 4.9.3 4.10.2 4.11.1),@SRCDIR@/intel-fix.sh ixgbevf,)
 ixgbe@prepare := $(if $(filter $(ixgbe@v),5.8.1 5.9.4 5.10.2 5.11.3),@SRCDIR@/intel-fix.sh ixgbe,)
-i40e@prepare := $(if $(filter $(i40e@v),2.12.6 2.14.13 2.15.9),@SRCDIR@/intel-fix.sh i40e,)
+i40e@prepare := @SRCDIR@/i40e-redirect.sh$(if $(filter $(i40e@v),2.12.6 2.14.13 2.15.9),; @SRCDIR@/intel-fix.sh i40e,)

 # some additional, driver-specific configuration
 stmmac@conf := CONFIG_STMMAC_ETH
diff --git a/LINUX/i40e-redirect.sh b/LINUX/i40e-redirect.sh
new file mode 100755
index 000000000..28299a7e5
--- /dev/null
+++ b/LINUX/i40e-redirect.sh
@@ -0,0 +1,5 @@
+#!/bin/sh
+
+grep -q DEV_NETMAP i40e/i40e_main.c || exit 0
+sed -i -e 's/^void i40e_down(/static void __i40e_down(/
+      s/^static int i40e_up_complete(/static int __i40e_up_complete(/' i40e/i40e_main.c
diff --git a/LINUX/i40e_netmap_linux.h b/LINUX/i40e_netmap_linux.h
index 0f6018888..a6e2b46ca 100644
--- a/LINUX/i40e_netmap_linux.h
+++ b/LINUX/i40e_netmap_linux.h
@@ -292,6 +292,24 @@ i40e_netmap_attach(struct i40e_vsi *vsi)
    netmap_attach(&na);
 }

+static void __i40e_down(struct i40e_vsi *vsi);
+void i40e_down(struct i40e_vsi *vsi)
+{
+   if (vsi->netdev)
+       netmap_disable_all_rings(vsi->netdev);
+   __i40e_down(vsi);
+}
+
+static int __i40e_up_complete(struct i40e_vsi *vsi);
+static int i40e_up_complete(struct i40e_vsi *vsi)
+{
+   int rv = __i40e_up_complete(vsi);
+   if (vsi->netdev)
+       netmap_enable_all_rings(vsi->netdev);
+   return rv;
+}
+
+
 #else /* NETMAP_I40E_MAIN */

luigirizzo / netmap

i40e_netmap_rxsync race condition #804