microsoft / azure-linux-kernel

Patches for building an Azure-tuned Linux kernel.

Q: what is tx_send_full? #36

Closed woosley closed 4 years ago

woosley commented 4 years ago

Hi there, I am tracing a production network issue in the Azure cloud. We see random outbound timeouts on some of the VMs; on those VMs, tcpdump shows a lot of SYN retransmissions to an external host.

When we look at the output of "ethtool -S eth0", the only abnormal stat is tx_send_full:

NIC statistics:
     tx_scattered: 0
     tx_no_memory: 0
     tx_no_space: 0
     tx_too_big: 0
     tx_busy: 0
     tx_send_full: 17464

I traced the kernel code, and it seems this is where the counter comes from. So, can you explain what this metric means and why it increases?

dcui commented 4 years ago

Hi woosley, the high tx_send_full number may be normal. Haiyang should be able to explain it in detail.

Can you please share more info: Which VM size are you using on Azure? Which region is the VM in? Is Accelerated Networking (i.e. NIC SR-IOV) enabled? How many NICs does your VM have? Are they all enabled with Accelerated Networking?

Can you share the VM's full serial console log (we can get this from the Azure portal if the VM has Boot Diagnostics enabled), the 'dmesg' output, the syslog file, and the full output of "ethtool -S eth0"? I assume that when the network issue happens, you're still able to log in to the VM via the Azure Serial Console?

Which Linux distro(s) and kernel version(s) are you using (please share "uname -r")?

What's the symptom of the network issue -- are you unable to ssh to the VM due to a connection timeout? Are you still able to ping the VM from another healthy VM in the same VNET?

How easily can you reproduce the issue? What workload reproduces it? Do you think it would be possible for us to easily reproduce the issue with the same workload?

When the network issue happens, and if you're still able to log in to the VM via the Azure serial console: do you notice any other symptoms, e.g. slow disk reads/writes or high CPU utilization? If your kernel creates the channels directory (find /sys -name channels), can you find the channels directory of the affected NIC (e.g. eth0) and dump the state of the ring buffers with:

cd /sys/class/net/eth0/device/channels/
set -x; for i in *; do cat $i/*; done; set +x    # first snapshot of the ring-buffer state
sleep 30
set -x; for i in *; do cat $i/*; done; set +x    # second snapshot, 30 seconds later

dcui commented 4 years ago

@haiyangz

haiyangz commented 4 years ago

tx_send_full only means the netvsc send buffer was temporarily full; the outgoing packets can still go out via a GPA (guest physical address) list.

woosley commented 4 years ago

Hi @dcui. The issue is that sometimes we cannot connect to our external API gateway from our K8s slave nodes. We see those timeout logs in our pods, and if I run curl https://public_url in a for loop 100 times on the host, I can reliably see timeouts; tcpdump shows SYN retransmissions for those timeouts:

13:56:36.399858 IP 172.30.89.219.52770 > 40.73.105.167.https: Flags [S], seq 2810817884, win 28200, options [mss 1410,sackOK,TS val 2243237388 ecr 0,nop,wscale 7], length 0
13:56:40.403863 IP 172.30.89.219.52770 > 40.73.105.167.https: Flags [S], seq 2810817884, win 28200, options [mss 1410,sackOK,TS val 2243241392 ecr 0,nop,wscale 7], length 0
13:56:48.419845 IP 172.30.89.219.52770 > 40.73.105.167.https: Flags [S], seq 2810817884, win 28200, options [mss 1410,sackOK,TS val 2243249408 ecr 0,nop,wscale 7], length 0
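A minimal sketch of that curl loop, for reference (public_url is a placeholder and the curl options shown are only illustrative):

    # run 100 sequential requests and report only the failures
    for i in $(seq 1 100); do
        curl -sS -o /dev/null --connect-timeout 10 https://public_url \
            || echo "request $i failed"
    done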

This issue is not happening on other VMs, including the k8s master nodes.

I compared the sysctl settings, the "ethtool -S" output, etc.; the only difference I can see so far is that on those slave nodes tx_send_full is high, while on the other VMs this value is always 0.

dcui commented 4 years ago

FYI: Here is some info about Accelerated Networking (AN). Your syslog and "ethtool -S eth0" output show that your VM does not have AN enabled, so it looks like you're seeing a bug in the Hyper-V network device driver, hv_netvsc.

Your kernel version is 3.10.0-862.11.6.el7.x86_64 (i.e. CentOS 7.5). There is a known issue with this kernel version, and the fix is "hv_netvsc: Fix napi reschedule while receive completion is busy". We also know CentOS 7.5 has other related bugs that can cause similar symptoms when the VM is under heavy workloads.

I suggest you upgrade to the latest CentOS 7.7 (or at least upgrade the kernel to the 7.7 kernel, i.e. 3.10.0-1062.x), because 7.7 has all the related fixes.
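A quick sketch of checking the running kernel and pulling in the 7.7 kernel with yum (standard CentOS commands; the exact version installed depends on the configured repositories):

    uname -r                  # current kernel, e.g. 3.10.0-862.11.6.el7.x86_64
    sudo yum clean all
    sudo yum update kernel    # should bring in 3.10.0-1062.x from the CentOS 7.7 repos
    sudo reboot
    uname -r                  # confirm the new kernel after the reboot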

If you have to stay on CentOS 7.5, please consider replacing the built-in Hyper-V drivers (i.e. the LIS drivers) with the Microsoft-maintained version, LIS 4.3.4, which also includes all the related fixes.

Please let us know if either solution can work for you.

woosley commented 4 years ago

Thanks @dcui, we will try upgrading to CentOS 7.7 and validate again.

dcui commented 4 years ago

I just realized another bug may cause a similar symptom; the fix is the 2019-05-03 commit "hv_netvsc: fix race that may miss tx queue wakeup". This fix is not present in RHEL 7.7 and earlier; it has been incorporated into the RHEL 7.8 beta. I expect the final RHEL 7.8 to be released in a few months, and I suggest you upgrade to RHEL 7.8 then.

The Microsoft-maintained version of the Linux drivers (i.e. LIS 4.3.4) also includes this fix, so you may consider installing LIS 4.3.4 on your RHEL/CentOS 7.5/7.6/7.7 or older (if you're using RHEL rather than CentOS, please consult your RHEL technical support before doing so).
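For reference, the LIS install is roughly: download the LIS 4.3.4 archive from the Microsoft Download Center, extract it, and run the bundled installer. The archive and directory names below are assumptions, so follow the README that ships with the package:

    # archive name is an assumption; download the real file from the
    # Microsoft Download Center
    tar xzf lis-rpms-4.3.4.tar.gz
    cd LISISO                  # extracted directory name may differ per release
    sudo ./install.sh          # installer script bundled with the LIS package
    sudo reboot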

woosley commented 4 years ago

@dcui, can you clarify the relationship between tx_send_full and the timeout issue? Is it possible to see timeouts even when tx_send_full stays at 0? We are still seeing some timeouts after upgrading to CentOS 7.7.

dcui commented 4 years ago

The guest has two ways to send a packet to the host, which then puts the packet on the network on behalf of the guest:

1) For a packet of <= 6 KB, the guest copies the packet data into a host/guest-shared send buffer and tells the host the offset of the data.

2) If the packet is larger than 6 KB, or if 1) fails because the shared buffer is full (i.e. the host is not fetching data from the buffer fast enough; this is when tx_send_full is incremented), the guest instead passes the physical address and length of the packet to the host, and the host sends the packet from there.

Path 2) avoids an extra data copy, but the host has to map/unmap the guest physical addresses, so it has its own cost, which is why we only use it for big packets (> 6 KB).

So a non-zero tx_send_full by itself is typically not an error, but if tx_send_full grows very large, it might mean: 1) a guest bug (the guest is not managing the shared buffer correctly); 2) the host is very busy and cannot fetch data from the buffer fast enough; or 3) the host is not managing the buffer correctly. So far we don't know which is more likely based on the existing info. I suggest you dump the NIC's state in your CentOS/RHEL 7.7 VM (please refer to my first reply on Feb 5) by checking the files in the channels directory of the affected NIC, and please also share the full output of "ethtool -S eth0" (assuming eth0 is the affected NIC in your VM). With that info we should have a better understanding of any possible guest bug. Please also try LIS 4.3.4, which integrates the fixes for all the issues we know of.
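As a quick way to see whether tx_send_full keeps climbing while the timeouts are being reproduced, a loop like the following can watch the relevant counters (eth0 assumed to be the affected NIC; counter names taken from the ethtool output above):

    # print the send-path counters every 10 seconds; a rapidly growing
    # tx_send_full means the shared send buffer keeps filling up
    while true; do
        date
        ethtool -S eth0 | grep -E 'tx_send_full|tx_busy|stop_queue|wake_queue'
        sleep 10
    done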

Having said all that, I don't know why you sometimes hit the timeouts and see the SYN retransmissions. IMO that may be caused by a bug in the guest NIC driver, or by some issue in the local/remote applications, the bridge drivers, or the intermediate network link.

woosley commented 4 years ago

This time I did not see tx_send_full increase when the timeout happened; below is the ethtool -S eth0 output:

NIC statistics:
     tx_scattered: 0
     tx_no_memory: 0
     tx_no_space: 0
     tx_too_big: 0
     tx_busy: 0
     tx_send_full: 0
     rx_comp_busy: 0
     rx_no_memory: 0
     stop_queue: 0
     wake_queue: 0
     vf_rx_packets: 0
     vf_rx_bytes: 0
     vf_tx_packets: 0
     vf_tx_bytes: 0
     vf_tx_dropped: 0
     tx_queue_0_packets: 11628216
     tx_queue_0_bytes: 4814380063
     rx_queue_0_packets: 10637675
     rx_queue_0_bytes: 5201262156
     tx_queue_1_packets: 11330465
     tx_queue_1_bytes: 3776043306
     rx_queue_1_packets: 11473832
     rx_queue_1_bytes: 4577874125
     tx_queue_2_packets: 15933712
     tx_queue_2_bytes: 6490843736
     rx_queue_2_packets: 9994695
     rx_queue_2_bytes: 5961073450
     tx_queue_3_packets: 12533031
     tx_queue_3_bytes: 4375413318
     rx_queue_3_packets: 9344414
     rx_queue_3_bytes: 4948804006
     tx_queue_4_packets: 16718318
     tx_queue_4_bytes: 7486636383
     rx_queue_4_packets: 12447042
     rx_queue_4_bytes: 5779920486
     tx_queue_5_packets: 12516778
     tx_queue_5_bytes: 5237786709
     rx_queue_5_packets: 9029245
     rx_queue_5_bytes: 4530007405
     tx_queue_6_packets: 12697256
     tx_queue_6_bytes: 4511212824
     rx_queue_6_packets: 9792604
     rx_queue_6_bytes: 4066527064
     tx_queue_7_packets: 13411273
     tx_queue_7_bytes: 4218531240
     rx_queue_7_packets: 17727778

Below are the logs of the eth0 channel status: https://paste.ubuntu.com/p/6yVR8b3MMc/ https://paste.ubuntu.com/p/gVFyG6pWPc/

woosley commented 4 years ago

Also, I am a bit confused about the SYN retransmission symptom. IMO, if tcpdump shows the SYN packet leaving eth0, that should rule out any issue on the guest VM, since the packet has already left the NIC. If there is a retransmission, it basically means the packet was lost somewhere else, and that has nothing to do with the guest.

dcui commented 4 years ago

The output of "ethtool -S" is normal. The info of the 'channels/' is also normal: it shows both the guest-to-host ring and the host-to-guest ring are empty. So I don't see an issue in the hv_netvsc NIC driver.

About the SYN retransmission symptom, I suspect some packets are lost somewhere outside of the guest.

dcui commented 4 years ago

FYI: I'm not sure if Azure drops some SYN packets due to SNAT port exhaustion: https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-outbound-connections#snatexhaust (the page says "If your application initiates many outbound flows and you experience SNAT port exhaustion, consider assigning a Public IP address to mitigate SNAT constraints. Review Managing SNAT exhaustion in its entirety.")
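As a rough way to gauge how many concurrent outbound flows the VM is holding (each one consumes a SNAT port when it leaves through the load balancer or firewall), something like the following can help; ss is part of iproute2 and the exact filters are just an example:

    # total established TCP connections
    ss -tn state established | tail -n +2 | wc -l
    # the same connections grouped by remote IPv4 address
    ss -tn state established | tail -n +2 | awk '{print $4}' | cut -d: -f1 | sort | uniq -c | sort -rn | head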

xuzhao1988 commented 4 years ago

@dcui Hi, actually in our China East 2 region setup, outbound traffic is routed through an Azure Firewall with 4 public IPs to avoid SNAT port exhaustion. Also, since we are just beginning to deploy apps in the East 2 region, there is not much outbound traffic.

dcui commented 4 years ago

Does the symptom go away if you install the LIS 4.3.4 package? If that doesn't work either, then it is even less likely to be a guest issue, and I suspect we cannot debug it further from within the guest. You probably need to open a support ticket so Azure support can get involved and check whether any packets are being dropped outside the guest.

BTW, the two bugs I mentioned previously only happen under very heavy network workloads. I really don't think you would hit them just by running curl https://public_url in a for loop 100 times.

woosley commented 4 years ago

I have installed LIS 4.3.4 and the alerts seem to have dropped a lot, so the issue might be fixed. But on those VMs tx_send_full can still be high.

woosley commented 4 years ago

Closing this issue since we did not see it again.