Hi Team,
One of our customers runs our solution, a custom image based on Linux kernel 4.14.51 and CentOS 7.8, and is seeing random traffic loss in production on netvsc interfaces (non-accelerated). P.S. The deployment size is ~600 VM instances.
(1) On problematic instances, 'rx_comp_busy' is always non-zero (= 1) and 'tx_send_full' is high, as shown below:

    -bash-4.2# ethtool -S eth2
    NIC statistics:
         tx_scattered: 0
         tx_no_memory: 0
         tx_no_space: 0
         tx_too_big: 0
         tx_busy: 0
         tx_send_full: 75776 <<<<<<<<<<<<<
         rx_comp_busy: 1 <<<<<<<<<<<<<<<<
         vf_rx_packets: 0
         vf_rx_bytes: 0
         vf_tx_packets: 0
         vf_tx_bytes: 0
         vf_tx_dropped: 0
         tx_queue_0_packets: 48323650
         tx_queue_0_bytes: 9856533412
         rx_queue_0_packets: 70704892
         rx_queue_0_bytes: 6523868834
         tx_queue_1_packets: 44242587
         tx_queue_1_bytes: 9561505139
         rx_queue_1_packets: 67683390
         rx_queue_1_bytes: 6248204528
         tx_queue_2_packets: 45780035
         tx_queue_2_bytes: 10119440310
         rx_queue_2_packets: 69738233
         rx_queue_2_bytes: 6443619208
         tx_queue_3_packets: 44413637
         tx_queue_3_bytes: 9640385380
         rx_queue_3_packets: 69258427
         rx_queue_3_bytes: 6396199857
         tx_queue_4_packets: 96161043
         tx_queue_4_bytes: 43152567515
         rx_queue_4_packets: 68506662
         rx_queue_4_bytes: 6329763902
         tx_queue_5_packets: 42685859
         tx_queue_5_bytes: 9232930840
         rx_queue_5_packets: 68869195
         rx_queue_5_bytes: 6360734718
         tx_queue_6_packets: 44105935
         tx_queue_6_bytes: 9641517238
         rx_queue_6_packets: 71297219
         rx_queue_6_bytes: 6568436535
         tx_queue_7_packets: 44680296
         tx_queue_7_bytes: 9764630663
         rx_queue_7_packets: 70747471
         rx_queue_7_bytes: 6525418289

(2) We have rebuilt the kernel with the two patches below, because this symptom (NAPI gets disabled when the ring is temporarily busy) is similar to the issue described in https://github.com/microsoft/azure-linux-kernel/issues/36:

    - hv_netvsc: Fix napi reschedule while receive completion is busy
    - hv_netvsc: fix race that may miss tx queue wakeup

(3) With these patches there is some improvement, in the sense that fewer instances hit this problem, but the issue still persists (~5 out of ~200 instances). On these bad instances the ethtool stats show very high 'rx_comp_busy' and 'tx_send_full', as shown below. I think a very high 'rx_comp_busy' is expected after these patches.

    -bash-4.2# ethtool -S eth2
    NIC statistics:
         tx_scattered: 0
         tx_no_memory: 0
         tx_no_space: 0
         tx_too_big: 0
         tx_busy: 0
         tx_send_full: 417979 <<<<<<<<<<<<<<<<<<<
         rx_comp_busy: 36978379935 <<<<<<<<<<<<<< increments rapidly
         vf_rx_packets: 0
         vf_rx_bytes: 0
         vf_tx_packets: 0
         vf_tx_bytes: 0
         vf_tx_dropped: 0
         tx_queue_0_packets: 22487545
         tx_queue_0_bytes: 4594218563
         rx_queue_0_packets: 33816104
         rx_queue_0_bytes: 3148800004
         tx_queue_1_packets: 23095847
         tx_queue_1_bytes: 4629433827
         rx_queue_1_packets: 34169457
         rx_queue_1_bytes: 3198473995
         tx_queue_2_packets: 22235899
         tx_queue_2_bytes: 4554101089
         rx_queue_2_packets: 35447873
         rx_queue_2_bytes: 3306351633
         tx_queue_3_packets: 22655564
         tx_queue_3_bytes: 4658776077
         rx_queue_3_packets: 34320559
         rx_queue_3_bytes: 3200636386
         tx_queue_4_packets: 43152346
         tx_queue_4_bytes: 17461777045
         rx_queue_4_packets: 34941411
         rx_queue_4_bytes: 3240195702
         tx_queue_5_packets: 22992696
         tx_queue_5_bytes: 4613837166
         rx_queue_5_packets: 32975505
         rx_queue_5_bytes: 3079512739
         tx_queue_6_packets: 22535083
         tx_queue_6_bytes: 4672503110
         rx_queue_6_packets: 33796904
         rx_queue_6_bytes: 3159691807
         tx_queue_7_packets: 22452840
         tx_queue_7_bytes: 4584966389
         rx_queue_7_packets: 33860772
         rx_queue_7_bytes: 3155304090
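To put a number on how fast 'tx_send_full' and 'rx_comp_busy' move on a bad instance, something like the sketch below could sample `ethtool -S` twice and report the per-second increase. This is only an illustration, not a tool we currently run: the function names (`parse_ethtool_stats`, `counter_rates`, `sample_rates`), the 10-second interval, and the default interface name are my own assumptions, and it assumes `ethtool` is on the PATH.

```python
import re
import subprocess
import time

def parse_ethtool_stats(text):
    """Parse `ethtool -S` output into a dict of counter name -> int.

    Lines without a `name: number` shape (e.g. "NIC statistics:") are skipped,
    and trailing annotations after the number are ignored.
    """
    stats = {}
    for line in text.splitlines():
        m = re.match(r"\s*([A-Za-z0-9_]+):\s*(\d+)", line)
        if m:
            stats[m.group(1)] = int(m.group(2))
    return stats

def counter_rates(before, after, interval_s,
                  names=("tx_send_full", "rx_comp_busy")):
    """Return the per-second increase of the selected counters."""
    return {
        n: (after[n] - before[n]) / float(interval_s)
        for n in names
        if n in before and n in after
    }

def sample_rates(iface="eth2", interval_s=10):
    """Run `ethtool -S` twice, interval_s apart, and compute counter rates.

    Hypothetical helper: the interface name and interval are assumptions.
    """
    def read():
        out = subprocess.run(["ethtool", "-S", iface],
                             stdout=subprocess.PIPE,
                             universal_newlines=True,
                             check=True).stdout
        return parse_ethtool_stats(out)

    before = read()
    time.sleep(interval_s)
    after = read()
    return counter_rates(before, after, interval_s)
```

Calling `sample_rates("eth2")` on a good and a bad instance would give comparable rates rather than absolute counter values, which depend on uptime.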
I would like to ask the Azure team for a list of patches that we can try with the 4.14.51 kernel, as the LIS option is not applicable to us. Please let me know if I can provide any additional details.