microsoft / ntttcp-for-linux

A Linux network throughput multiple-thread benchmark tool.
https://github.com/Microsoft/ntttcp-for-linux
MIT License
361 stars 87 forks source link

Ntttcp UDP receiver periodically reports error (cannot read data from sender socket) and 0 Mbps real-time Tput amid sender keeps sending UDP traffic. #75

Closed wilsonNg007 closed 2 years ago

wilsonNg007 commented 2 years ago

I'm using nttttcp to send UDP traffic from one VM to another (remote host) VM. I'm keeping the transfer size to 1024B which is below 1 MTU size (1280B) to avoid any fragmentation. Below are the cmd lines setup. However, after a while, the receiver periodically reports "error: cannot read data from socket: 3", and also reports 0 Mbps real-time tput. But the sender still keeps sending UDP traffic (see attached screenshot).

There are no other applications running inside both VM except ntttcp processes. The UDP packets are transferred between port 8011 on both ends.

Questions: So what are port 60558 (socket 3) on sender and port 8010 (socket 6) on receiver used for in terms of what data it's trying to communicate?
Is the "0 Mbps" real-time tput reported on the receiver side bogus? It appears the receiver host Ethernet (Task manager on the right side of attached screenshot) is still continuously receiving udp packets, which will get steered into the VM. How can I debug this to find out what's wrong with ntttcp?

Thanks.

Ntttcp Receiver

windows_x86_64:/data/local/tmp/test $ ./ntttcp -r -m 1,*,10.231.111.24 -V -p 8011 -t 86400 -u -b 1024 NTTTCP for Linux 1.4.0

** receiver role ports: 1 cpu affinity: server address: 10.231.111.24 domain: IPv4 protocol: UDP server port starting at: 8011 receiver socket buffer (bytes): 1024 test warm-up (sec): no test duration (sec): 86400 test cool-down (sec): no show system tcp retransmit: no quiet mode: disabled verbose mode: enabled

23:49:48 DBG : user limits for maximum number of open files: soft: 32768; hard: 32768 23:49:48 DBG : Interface:[lo] Address: 127.0.0.1 23:49:48 DBG : Interface:[wlan0] Address: 10.231.111.24 23:49:48 INFO: 2 threads created 23:49:48 DBG : ntttcp server is listening on 10.231.111.24:8010 23:49:54 DBG : Sync connection: 10.231.111.166:60558 --> local:8010 [socket 6] 23:49:54 INFO: Network activity progressing...

Ntttcp Sender

1|windows_x86_64:/data/local/tmp/test $ ./ntttcp -s -m 1,*,10.231.111.24 -V -p 8011 -t 86400 -u -n 1 -b 1024 NTTTCP for Linux 1.4.0

** sender role connections: 1 X 1 X 1 cpu affinity: server address: 10.231.111.24 domain: IPv4 protocol: UDP server port starting at: 8011 sender socket buffer (bytes): 1024 test warm-up (sec): no test duration (sec): 86400 test cool-down (sec): no show system tcp retransmit: no quiet mode: disabled verbose mode: enabled

23:49:52 DBG : user limits for maximum number of open files: soft: 32768; hard: 32768 23:49:52 DBG : Sync connection: local:60558 [socket:3] --> 10.231.111.24:8010 23:49:52 INFO: 1 threads created 23:49:52 DBG : Running UDP stream: local:0 [socket:4] --> 10.231.111.24:8011 23:49:52 INFO: 1 connections created in 2086 microseconds 23:49:52 INFO: Network activity progressing...

ntttcp_udp_uniDirectional_receiverShowsZeroBitsPerSecTPut_Screenshot 2022-05-12 171957

wilsonNg007 commented 2 years ago

@simonxiaoss , @santoshx - any idea?

wilsonNg007 commented 2 years ago

I tried epoll option (-e) and did not specify particular port (no -p, so just use the default), and still seeing the same error/symptom.

wilsonNg007 commented 2 years ago

As another experiment, I tried no-sync option (-N), which effectively eliminates that sync connection socket from the setup. However, the receiver side does not show any real-time throughput; only the sender shows the real-time throughput. When I exit ntttcp, the receiver does not provide any summary output at the end; only the sender side provides summary output at the end. Using console log (-o) and xml file (-x) options did not log anything to the files either. So that means we cannot tell for sure if the receiver did receive the packets as well as other important metrics like throughput rate, cycles/bytes, etc.

wilsonNg007 commented 2 years ago

@LiliDeng , could you pls help take a look at my issue? I want to understand what the errors means and the impact, and what causes the error (if it's due to a potential tool's defect or if it's due to the VM network), and if there is any temporary workaround to avoid. Thanks.

simonxiaoss commented 2 years ago

You mentioned: ""error: cannot read data from socket: 3", and also reports 0 Mbps real-time tput. But the sender still keeps sending UDP traffic" Simon: It is a simple syscall "recvfrom" returned 0 bytes; so ntttcp was surprised that it did not receive any data from the socket then report this error message. Yes you also saw it from "0 Mbps".

You mentioned: "It appears the receiver host Ethernet (Task manager on the right side of attached screenshot) is still continuously receiving udp packets," Simon: Windows Task manager should get the packet numbers from the network driver. There are other layers between socket and driver. You might need to chase down to see where does the packets are dropped.

wilsonNg007 commented 2 years ago

Thanks @simonxiaoss

UDP is an unreliable protocol, so I'm wondering if this is an expected behavior or not?

I noticed on the receiver size, the real-time tput rate fluctuates more frequently on a wide range over time (as opposed to the sender).

I did another experiment which reduced the transfer pkt size to 512B (from previous 1KB), and still the receiver periodically reports the same errors.

ntttcp_udp_uniDirectional_Reduced512B_receiverErrorButPeriodicallyShowsZeroBitsPerSecTPut_Screenshot 2022-05-17 171957_capture4

wilsonNg007 commented 2 years ago

I queried the device file on the receiver (attached output below) and it doesn't have any dropped or error packets (same for the sender side). So it appears VM kernel didn't drop UDP packets. Looks like the Host is the suspect.

windows_x86_64:/data/local/tmp/test $ cat /proc/net/dev Inter-|Receive |Transmit face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed lo: 5813 55 0 0 0 0 0 0 5813 55 0 0 0 0 0 0 bond0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 dummy0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 tunl0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 gre0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 gretap0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 erspan0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ip_vti0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ip6_vti0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 sit0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ip6tnl0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ip6gre0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 eth0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 eth1: 164351009144 232460617 0 0 0 0 0 20794 1132368 2537 0 0 0 0 0 0 bond_wifi: 161110334364 232460953 0 0 0 0 0 20974 1132368 2537 0 0 0 0 0 0 wlan0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 eth2: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 eth3: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 eth4: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 eth5: 31132 336 0 0 0 0 0 180 0 0 0 0 0 0 0 0 eth6: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

wilsonNg007 commented 2 years ago

This is root-caused to an issue on the host/vm side.