intel / ethernet-linux-iavf

GNU General Public License v2.0
1 stars 3 forks source link

update unsupported branch with 4.5.3 hotfixes #2

Open jbrandeb opened 3 months ago

jbrandeb commented 3 months ago

Both of these updates were minimal changes for a bug fix

4.5.3.4: Intel ID: 22018540857

Description: 22018540857: Getting resource busy while SRIOV VF resources are requested

Root Cause: A race between iavf_watchdog_task() and iavf_adminq_task(), where in the iavf_adminq_task() saves event data from the VIRTCHNL_EVENT_RESET_IMPENDING, but then gets blocked waiting for the critical section lock while the iavf_watchdog_task() handles the device reset. Once iavf_watchdog_task() finally completes the reset, the iavf_adminq_task() continues executing and processes the now-stale VIRTHCHNL_EVENT_RESET_IMPENDING message. This causes the driver to set the reset state flag, and forces the watchdog into waiting for the hardware reset to start. Since 3.7.81, the driver will stay waiting for reset indefinitely until the next time that hardware reset occurs. This causes most userspace interactions to report -EBUSY, and causes the driver to print "Never saw reset" once every 5 seconds.

Resolution Notes: Fix the locking in iavf_adminq_task() so that it acquires the critical section lock before reading from the receive queue. With this change, the thread will either execute before iavf_watchdog_task() or after it. If it does execute after iavf_watchdog_task(), then it will not see the stale VIRTCHNL_EVENT_RESET_IMPENDING. This is because the message will be cleared when the hardware resets and the driver re-initializes the receive queue. Thus, the iavf_adminq_task() will no longer report a stale reset event.

Testing Hints: 1) reduce the number of CPUs available by using isolcpus when booting to limit to only 1 or 2 cores available for kernel tasks. (This matches our customer setup) 2) create 32 VFs 3) spam changing port VLAN on all 32 VFs, turning it on, then off. 4) also spam reading statistics from ethtool on all 32 VFs

The ethtool stats spam is done in order to make sure the watchdog task is executing rapidly, as otherwise it only operates once every 2 seconds when no background tasks are executing. The race only manifests if the watchdog task is already executing when the adminq task tries to acquire the critical section lock.

Able to reproduce this easily (within 5-10 minutes) on 32 VFs using the above setup. With the fix, no issue occurs.

4.5.3.2: Intel ID: 22018832934

Description: VF resets result in "Failed to init adminq: -53" errors, triggering hangs

Root Cause: Do not call iavf_close in iavf_handle_hw_reset error handling path as it can lead to double call of napi_disable, which leads to the deadlock in kernel.

Resolution Notes: In this driver the issue was that iavf_close was waiting for __IAVF_IN_CRITICAL_TASK bit while the iavf_watchdog_task which called all this code already owns this bit. This lead to the deadlock reported by iavf_remove (when triggered).

Fix this by calling iavf_disable_vf without running the clean up logic.

Testing Hints: Loop( ip link set vf_intf up ip l set dev pf_intf vf 0 trust on; ip l set dev pf_intf vf 0 vlan 333; ip l set dev pf_intf vf 0 trust off; ip l set dev pf_intf vf 0 vlan 310; ip l set vf_intf down sleep 0.1 )