Open lowjoel opened 5 months ago
I suggest first running iperf as a benchmark in the guest without WSL1. If there are issues, we will dig in. But if it is a WSL tap issue, the best we can do is give some advice on where to look.
Best regards, Yan.
Another important comment: please run the test with one stream. WSL1 definitely does not support multi-queue.
Wait, there are Windows binaries for iperf? Haha. I'll try that.
iperf is actually just to make this more reproducible. What started this was my Samba copies from host to guest being slow.
OK, I had to flip server and client (run the iperf client on the host, server in the guest), but the results are the same. I used this binary, without WSL: https://iperf.fr/iperf-download.php
MTU=1500
$ iperf3 --time 30 -c GUEST
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 2.43 GBytes 695 Mbits/sec 0 sender
[ 5] 0.00-30.00 sec 2.43 GBytes 695 Mbits/sec receiver
$ iperf3 --time 30 --parallel 4 -c GUEST
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 2.26 GBytes 647 Mbits/sec 0 sender
[ 5] 0.00-30.00 sec 2.26 GBytes 646 Mbits/sec receiver
[ 7] 0.00-30.00 sec 2.18 GBytes 624 Mbits/sec 0 sender
[ 7] 0.00-30.00 sec 2.18 GBytes 623 Mbits/sec receiver
[ 9] 0.00-30.00 sec 2.45 GBytes 700 Mbits/sec 0 sender
[ 9] 0.00-30.00 sec 2.44 GBytes 699 Mbits/sec receiver
[ 11] 0.00-30.00 sec 2.51 GBytes 719 Mbits/sec 0 sender
[ 11] 0.00-30.00 sec 2.51 GBytes 718 Mbits/sec receiver
[SUM] 0.00-30.00 sec 9.40 GBytes 2.69 Gbits/sec 0 sender
[SUM] 0.00-30.00 sec 9.38 GBytes 2.69 Gbits/sec receiver
$ iperf3 --time 30 --parallel 8 -c GUEST
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 1.10 GBytes 315 Mbits/sec 0 sender
[ 5] 0.00-30.00 sec 1.10 GBytes 315 Mbits/sec receiver
[ 7] 0.00-30.00 sec 1.80 GBytes 517 Mbits/sec 0 sender
[ 7] 0.00-30.00 sec 1.80 GBytes 516 Mbits/sec receiver
[ 9] 0.00-30.00 sec 2.07 GBytes 594 Mbits/sec 0 sender
[ 9] 0.00-30.00 sec 2.07 GBytes 592 Mbits/sec receiver
[ 11] 0.00-30.00 sec 2.06 GBytes 591 Mbits/sec 0 sender
[ 11] 0.00-30.00 sec 2.06 GBytes 590 Mbits/sec receiver
[ 13] 0.00-30.00 sec 1.08 GBytes 310 Mbits/sec 0 sender
[ 13] 0.00-30.00 sec 1.08 GBytes 309 Mbits/sec receiver
[ 15] 0.00-30.00 sec 1.08 GBytes 309 Mbits/sec 1 sender
[ 15] 0.00-30.00 sec 1.07 GBytes 308 Mbits/sec receiver
[ 17] 0.00-30.00 sec 1.10 GBytes 314 Mbits/sec 1 sender
[ 17] 0.00-30.00 sec 1.09 GBytes 313 Mbits/sec receiver
[ 19] 0.00-30.00 sec 2.13 GBytes 610 Mbits/sec 0 sender
[ 19] 0.00-30.00 sec 2.13 GBytes 609 Mbits/sec receiver
[SUM] 0.00-30.00 sec 12.4 GBytes 3.56 Gbits/sec 2 sender
[SUM] 0.00-30.00 sec 12.4 GBytes 3.55 Gbits/sec receiver
I had a kernel panic the first time I ran with parallelism=8; the second run was OK. There's still nonlinear scaling.
MTU=9000
$ iperf3 --time 30 -c GUEST
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 17.2 GBytes 4.94 Gbits/sec 0 sender
[ 5] 0.00-30.00 sec 17.2 GBytes 4.94 Gbits/sec receiver
$ iperf3 --time 30 --parallel 4 -c GUEST
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 10.1 GBytes 2.88 Gbits/sec 1 sender
[ 5] 0.00-30.00 sec 10.1 GBytes 2.88 Gbits/sec receiver
[ 7] 0.00-30.00 sec 10.8 GBytes 3.10 Gbits/sec 0 sender
[ 7] 0.00-30.00 sec 10.8 GBytes 3.10 Gbits/sec receiver
[ 9] 0.00-30.00 sec 10.2 GBytes 2.92 Gbits/sec 2 sender
[ 9] 0.00-30.00 sec 10.2 GBytes 2.92 Gbits/sec receiver
[ 11] 0.00-30.00 sec 10.5 GBytes 3.00 Gbits/sec 2 sender
[ 11] 0.00-30.00 sec 10.5 GBytes 3.00 Gbits/sec receiver
[SUM] 0.00-30.00 sec 41.6 GBytes 11.9 Gbits/sec 5 sender
[SUM] 0.00-30.00 sec 41.6 GBytes 11.9 Gbits/sec receiver
$ iperf3 --time 30 --parallel 8 -c GUEST
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 4.81 GBytes 1.38 Gbits/sec 1 sender
[ 5] 0.00-30.00 sec 4.81 GBytes 1.38 Gbits/sec receiver
[ 7] 0.00-30.00 sec 5.47 GBytes 1.57 Gbits/sec 1 sender
[ 7] 0.00-30.00 sec 5.47 GBytes 1.57 Gbits/sec receiver
[ 9] 0.00-30.00 sec 5.40 GBytes 1.55 Gbits/sec 1 sender
[ 9] 0.00-30.00 sec 5.40 GBytes 1.55 Gbits/sec receiver
[ 11] 0.00-30.00 sec 5.11 GBytes 1.46 Gbits/sec 1 sender
[ 11] 0.00-30.00 sec 5.11 GBytes 1.46 Gbits/sec receiver
[ 13] 0.00-30.00 sec 5.33 GBytes 1.53 Gbits/sec 0 sender
[ 13] 0.00-30.00 sec 5.33 GBytes 1.53 Gbits/sec receiver
[ 15] 0.00-30.00 sec 5.07 GBytes 1.45 Gbits/sec 2 sender
[ 15] 0.00-30.00 sec 5.07 GBytes 1.45 Gbits/sec receiver
[ 17] 0.00-30.00 sec 5.40 GBytes 1.55 Gbits/sec 0 sender
[ 17] 0.00-30.00 sec 5.40 GBytes 1.54 Gbits/sec receiver
[ 19] 0.00-30.00 sec 4.81 GBytes 1.38 Gbits/sec 2 sender
[ 19] 0.00-30.00 sec 4.81 GBytes 1.38 Gbits/sec receiver
[SUM] 0.00-30.00 sec 41.4 GBytes 11.9 Gbits/sec 8 sender
[SUM] 0.00-30.00 sec 41.4 GBytes 11.9 Gbits/sec receiver
The CPU load with parallel iperf3 is the same: parallelism 4 = 4 CPUs at 100%, parallelism 8 = 8 CPUs at 100%, etc.
Can you please share the crash dump?
Unfortunately I didn't manage to get it :( not even a stack trace.
large-receive-offload being off is suspicious. RX checksumming also looks off. Can you record a session with Wireshark, just to check the TCP packet sizes on receive? With functional RSC (receive-side coalescing, which should boost RX performance) we should see 64K packets (or at least packets larger than the MTU).
@ybendito - ideas?
Confirmed that RSC is not working. Wireshark inside the guest shows 1500/9000-sized packets depending on the host MTU configuration.
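For checking receive packet sizes without the Wireshark GUI, a hedged tshark one-liner can be run in the guest instead (the interface name Ethernet0 is an assumption; 5201 is iperf3's default port):

```shell
# Capture 10 s of iperf3 traffic and print the largest observed frame sizes.
# With RSC working, coalesced frames well above the MTU (up to ~64K) appear;
# with RSC broken, only 1500/9000-byte frames show up.
tshark -i Ethernet0 -a duration:10 -f "tcp port 5201" -T fields -e frame.len |
    sort -n | tail -5
```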
on host:
$ sudo ethtool -K vnet4 tso on
Actual changes:
tx-tcp-segmentation: off [requested on]
tx-tcp-ecn-segmentation: off [requested on]
tx-tcp-mangleid-segmentation: off [requested on]
tx-tcp6-segmentation: off [requested on]
Could not change any device features
$ ethtool -c vnet4
Coalesce parameters for vnet4:
Adaptive RX: n/a TX: n/a
stats-block-usecs: n/a
sample-interval: n/a
pkt-rate-low: n/a
pkt-rate-high: n/a
rx-usecs: n/a
rx-frames: 0
rx-usecs-irq: n/a
rx-frames-irq: n/a
tx-usecs: n/a
tx-frames: n/a
tx-usecs-irq: n/a
tx-frames-irq: n/a
rx-usecs-low: n/a
rx-frame-low: n/a
tx-usecs-low: n/a
tx-frame-low: n/a
rx-usecs-high: n/a
rx-frame-high: n/a
tx-usecs-high: n/a
tx-frame-high: n/a
CQE mode RX: n/a TX: n/a
Is that what you meant?
tcp-segmentation-offload should be "on" for the tap device.
Yeah, I've tried setting that using ethtool but it isn't turning on. Where do you suggest I look next?
Do you see anything in "dmesg"?
Nothing printed there, either.
Yep. I think something's wrong with TSO on this kernel/configuration. But in your case, even with TSO off, you're still going above 10 Gbit/s; I'm barely hitting 1 Gbit/s.
I'm not sure where to look to figure out why TSO isn't turning on though.
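For anyone else stuck at this stage, a hedged host-side diagnostic sketch (the tap name vnet4 comes from this thread; substitute your own, and note it degrades gracefully when run elsewhere):

```shell
tap=vnet4
if command -v ethtool >/dev/null 2>&1 && ip link show "$tap" >/dev/null 2>&1; then
    # Current offload state of the tap:
    ethtool -k "$tap" | grep -E 'tcp-segmentation|rx-checksumming|large-receive'
    # Forcing TSO from the host fails while the guest keeps RSC/TSO disabled:
    ethtool -K "$tap" tso on 2>/dev/null || echo "host cannot enable TSO on $tap"
else
    echo "device $tap not present; run this on the KVM host"
fi
```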
I've been digging through the kernel code and the qemu code, and I can confirm that tap devices can turn on TSO, just not the ones currently in use by the VMs / created by libvirt/qemu. @ybendito could you share your domain and network libvirt XML please? It looks like libvirt and qemu both have a role to play here in setting up the tap device correctly. My bridge is created manually, but I tried a different domain with a network created by libvirt, and TSO is still off there.
@lowjoel My results are from plain command-line qemu, no libvirt, just -netdev tap,vhost=on,id=..,script= on the command line. Fedora 28, qemu ~6.1, kernel 5.12.
Could you paste that here so I can try it as a minimal reproducer, please? Including how the tap is created, just in case I'm missing something.
@lowjoel Enjoy ) sudo /home/yurib/src/qemu/build/qemu-system-x86_64 -machine q35,accel=kvm --snapshot --trace events=/home/yurib/qemu-events-en -cpu SandyBridge,+kvm_pv_unhalt,hv_spinlocks=0x1fff,hv_relaxed,hv_vapic,hv_time -m 8192 -smp 4 -uuid 1534fa42-4818-4493-9f67-eee5ba758385 -no-user-config -nodefaults -no-hpet -monitor stdio -device ioh3420,bus=pcie.0,id=root0,chassis=1,addr=0xa.0 -device ioh3420,bus=pcie.0,id=root1,chassis=2,addr=0xb.0 -device ioh3420,bus=pcie.0,id=root2,chassis=3,addr=0xc.0 -device ioh3420,bus=pcie.0,id=root3,chassis=4,addr=0xd.0 -device ioh3420,bus=pcie.0,id=root4,chassis=5,addr=0xe.0 -device ioh3420,bus=pcie.0,id=root5,chassis=6,addr=0xf.0 -global ICH9-LPC.disable_s3=0 -global ICH9-LPC.disable_s4=1 -device ahci,id=ahci -device virtio-serial-pci,bus=root1,id=virtio-serial0,max_ports=4,iommu_platform=on,ats=on -chardev spicevmc,name=vdagent,id=vdagent -device virtserialport,nr=2,bus=virtio-serial0.0,chardev=vdagent,name=com.redhat.spice.0 -chardev socket,id=serialp2,host=0.0.0.0,port=50000,server=on,wait=no -device virtserialport,nr=1,bus=virtio-serial0.0,chardev=serialp2,name=test.0 -netdev tap,id=hostnet10sb,script=/home/yurib/br0-ifup,ifname=nw10sb,vhost=on -device virtio-net-pci,netdev=hostnet10sb,mac=04:54:13:05:10:38,bus=root0,id=poc2,rss=on -device virtio-balloon-pci,bus=root4,iommu_platform=on,ats=on -drive file=/images/vms/2019-q35-usb.qcow2,if=none,id=drive-ide-3,media=disk,format=qcow2,cache=unsafe -device ide-hd,drive=drive-ide-3,id=ide3,bus=ahci.0,bootindex=0 -drive file=/images/iso/ubuntu-18.04.1-desktop-amd64.iso,if=none,id=drive-cd,media=cdrom,format=raw -device qemu-xhci,p2=8,p3=8 -device usb-tablet -device usb-storage,drive=drive-cd,id=xx3,bootindex=1 -vga std -vnc :1 -chardev spicevmc,name=usbredir,id=usbredirchardev1 -device usb-redir,filter=-1:-1:-1:-1:1,chardev=usbredirchardev1,id=usbredirdev1 -chardev spicevmc,name=usbredir,id=usbredirchardev2 -device usb-redir,filter=-1:-1:-1:-1:1,chardev=usbredirchardev2,id=usbredirdev2 -chardev spicevmc,name=usbredir,id=usbredirchardev3 -device usb-redir,filter=-1:-1:-1:-1:1,chardev=usbredirchardev3,id=usbredirdev3 -boot menu=on
@ybendito and how was the tap created? On my fresh Ubuntu install, both ip tuntap add mode tap pi vnet_hdr and tunctl create tap devices on which I can't enable TSO. I'm guessing you aren't seeing that same behaviour on your Fedora machine?
@lowjoel qemu creates the tap (it runs as admin). When the tap is created, qemu runs the script defined by script=/home/yurib/br0-ifup; the script is:
switch=virbr0
ifconfig $1 promisc 0.0.0.0
brctl addif ${switch} $1
virbr0 is the libvirt bridge (so the device is behind local NAT)
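As an aside, ifconfig and brctl are deprecated on recent distros; a hedged iproute2 equivalent of that ifup script (assuming QEMU passes the tap name as $1 and the bridge is virbr0, as above) would be:

```shell
#!/bin/sh
# iproute2 version of br0-ifup: bring the tap up in promiscuous mode
# and attach it to the bridge. $1 is the tap interface name from QEMU.
switch=virbr0
ip link set dev "$1" up promisc on
ip link set dev "$1" master "$switch"
```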
Let's see what happens under libvirt:
The RSC works.
What does PowerShell say in the guest?
Bingo. It wasn't the host side, it's the guest side.
Initially:
> get-netadapterrsc | format-list
Name : Ethernet
InterfaceDescription : Red Hat VirtIO Ethernet Adapter
IPv4Enabled : True
IPv6Enabled : True
IPv4Supported : True
IPv6Supported : True
IPv4OperationalState : False
IPv6OperationalState : False
IPv4FailureReason : WFPCompatibility
IPv6FailureReason : WFPCompatibility
WFP is the Windows Filtering Platform. I guess it's the firewall. After disabling the firewall:
> get-netadapterrsc | format-list
Name : Ethernet
InterfaceDescription : Red Hat VirtIO Ethernet Adapter
IPv4Enabled : True
IPv6Enabled : True
IPv4Supported : True
IPv6Supported : True
IPv4OperationalState : True
IPv6OperationalState : True
IPv4FailureReason : NoFailure
IPv6FailureReason : NoFailure
Wireshark now shows ~62k sized packets.
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 23.6 GBytes 6.77 Gbits/sec 4 sender
[ 5] 0.00-30.00 sec 23.6 GBytes 6.77 Gbits/sec receiver
Receive isn't 1:1 with send (it's about 30% lower), but I'll take it.
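For reference, the RSC state can also be inspected and toggled per adapter from an elevated PowerShell prompt; a hedged sketch using the standard NetAdapter cmdlets (the adapter name "Ethernet" is taken from the output above):

```powershell
Get-NetAdapterRsc -Name "Ethernet" | Format-List
Enable-NetAdapterRsc -Name "Ethernet" -IPv4 -IPv6
# When FailureReason is WFPCompatibility, a Windows Filtering Platform callout
# (firewall/antivirus) is vetoing RSC; dump the registered WFP state with:
netsh wfp show state
```

Note that Enable-NetAdapterRsc only sets the Enabled flags; a WFP veto still forces OperationalState to False (exactly as in the first output above) until the offending filter is removed.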
Incidentally, after disabling the firewall:
Features for vnet4:
rx-checksumming: off [fixed]
tx-checksumming: on
tx-checksum-ipv4: off [fixed]
tx-checksum-ip-generic: on
tx-checksum-ipv6: off [fixed]
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: on
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: off
tx-tcp-mangleid-segmentation: on
tx-tcp6-segmentation: on
I didn't expect that the guest could affect the host in this way. Can I help update the wiki/docs as a way of expressing my thanks? 😄 I don't have permissions though. I will also reach out to the firewall vendor to ask.
I didn't expect that the guest can affect the host in this way
The guest is the one that requests enabling/disabling these options on the host tap. If the driver started with RSC enabled, it can turn it on/off dynamically (qemu configures the tap accordingly), so if the guest turned RSC on we can later turn it off and on again in the tap. But if the guest started the device with RSC disabled, that means the OS is not ready to receive coalesced packets (packets larger than the MTU), and in that case you can't turn it on in the tap. Fortunately.
Incidentally, for those who are seeing this:
> get-netadapterrsc | format-list
Name : Ethernet
InterfaceDescription : Red Hat VirtIO Ethernet Adapter
IPv4Enabled : True
IPv6Enabled : True
IPv4Supported : False
IPv6Supported : False
IPv4OperationalState : False
IPv6OperationalState : False
IPv4FailureReason : Capability
IPv6FailureReason : Capability
It means that your libvirt/qemu command line is not enabling any of the offloads. Try the following under the <interface> element of your domain XML:
<driver name="vhost" txmode="iothread" ioeventfd="on" event_idx="on" queues="4" rx_queue_size="1024" tx_queue_size="1024">
<host csum="on" gso="on" tso4="on" tso6="on" ecn="on" ufo="on" mrg_rxbuf="on"/>
<guest csum="on" tso4="on" tso6="on" ecn="on" ufo="on"/>
</driver>
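If you apply the XML above, a hedged way to verify it took effect (the domain name mydomain and tap name vnet0 are placeholders for your own):

```shell
# Confirm the offload settings survived in the live domain definition...
virsh dumpxml mydomain | grep -A2 '<driver name='
# ...and that the host-side tap now advertises TSO while the guest is running:
ethtool -k vnet0 | grep tcp-segmentation-offload
```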
As promised @YanVugenfirer @ybendito I've updated the wiki with the knowledge from this thread: https://github.com/lowjoel/kvm-guest-drivers-windows-wiki/compare/netkvm-rsc-docs
Please feel free to integrate the updated docs into the wiki. And also feel free to close this issue since the problem is not with the netkvm driver. Thank you all once again for helping me!
@lowjoel Thanks for the Wiki update!
Just for the statistics - can you tell us why you are testing performance and how you are using the Virtio drivers?
No problem. I have a workstation/server all-in-one setup at home. I use a Windows guest since I'm mostly familiar with it, but on the server side at $DAYJOB I'm more familiar with the Linux stack. The server's just a file server, and I have shares across the host/guest which is why I ran into this specific problem.
I was testing performance because that specific share had my photos on it and transferring them for editing/publishing was unbearably slow 😅
Thanks!
Describe the bug
iperf3 can send ~10 Gbit/s from the guest to the host on a single connection:
But it achieves less than 10% of that performance when receiving from the host:
Copying a file over the bridge between two Windows VMs gives me ~1.6 Gbit/s and doesn't exhibit the same issue.
To Reproduce
Steps to reproduce the behaviour:
My Windows iperf3 is on WSL1, so there's no Hyper-V layer in between (but I get to run iperf3). See https://github.com/virtio-win/kvm-guest-drivers-windows/issues/1026#issuecomment-1892173927 for iperf3 using cygwin (no WSL).
I have a few workarounds: use --parallel for iperf, or use SMB multichannel. Guest CPU is highly loaded during the iperf run, depending on parallelism (parallelism 4 = 4 loaded CPUs, parallelism 8 = 8 loaded CPUs). Notice how adding more parallelism has diminishing returns.

Expected behavior
Send and receive performance should be similar. Maybe not 1:1, but less than 10% of the send performance shows something else is wrong here.
Screenshots
Host:
(8 queues, 16-core machine).
VM:
Additional context
There is a bridge interface on the host, and a tap interface for the Windows guest.
I saw this doc: https://github.com/virtio-win/kvm-guest-drivers-windows/wiki/netkvm-RSC-(receive-segment-coalescing)-feature — notice that tcp-segmentation-offload: off for the vnet device. Not sure if that's related.