Open lowjoel opened 5 months ago
I suggest first running iperf as a benchmark in the guest without WSL1. If there are issues, we will dig in. But if it is a WSL tap issue, the best we can do is give some advice on where to look.
Best regards, Yan.
Another important comment: please run the test with one stream. WSL1 definitely does not support multi-queue.
Wait, there are Windows binaries for iperf? Haha. I'll try that.
iperf is actually just to make this more reproducible. What started this was my Samba copies from host to guest being slow.
OK, I had to flip server and client (run the iperf client on the host, server in the guest), but the results are the same. I used this binary, without WSL: https://iperf.fr/iperf-download.php
MTU=1500
$ iperf3 --time 30 -c GUEST
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 2.43 GBytes 695 Mbits/sec 0 sender
[ 5] 0.00-30.00 sec 2.43 GBytes 695 Mbits/sec receiver
$ iperf3 --time 30 --parallel 4 -c GUEST
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 2.26 GBytes 647 Mbits/sec 0 sender
[ 5] 0.00-30.00 sec 2.26 GBytes 646 Mbits/sec receiver
[ 7] 0.00-30.00 sec 2.18 GBytes 624 Mbits/sec 0 sender
[ 7] 0.00-30.00 sec 2.18 GBytes 623 Mbits/sec receiver
[ 9] 0.00-30.00 sec 2.45 GBytes 700 Mbits/sec 0 sender
[ 9] 0.00-30.00 sec 2.44 GBytes 699 Mbits/sec receiver
[ 11] 0.00-30.00 sec 2.51 GBytes 719 Mbits/sec 0 sender
[ 11] 0.00-30.00 sec 2.51 GBytes 718 Mbits/sec receiver
[SUM] 0.00-30.00 sec 9.40 GBytes 2.69 Gbits/sec 0 sender
[SUM] 0.00-30.00 sec 9.38 GBytes 2.69 Gbits/sec receiver
$ iperf3 --time 30 --parallel 8 -c GUEST
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 1.10 GBytes 315 Mbits/sec 0 sender
[ 5] 0.00-30.00 sec 1.10 GBytes 315 Mbits/sec receiver
[ 7] 0.00-30.00 sec 1.80 GBytes 517 Mbits/sec 0 sender
[ 7] 0.00-30.00 sec 1.80 GBytes 516 Mbits/sec receiver
[ 9] 0.00-30.00 sec 2.07 GBytes 594 Mbits/sec 0 sender
[ 9] 0.00-30.00 sec 2.07 GBytes 592 Mbits/sec receiver
[ 11] 0.00-30.00 sec 2.06 GBytes 591 Mbits/sec 0 sender
[ 11] 0.00-30.00 sec 2.06 GBytes 590 Mbits/sec receiver
[ 13] 0.00-30.00 sec 1.08 GBytes 310 Mbits/sec 0 sender
[ 13] 0.00-30.00 sec 1.08 GBytes 309 Mbits/sec receiver
[ 15] 0.00-30.00 sec 1.08 GBytes 309 Mbits/sec 1 sender
[ 15] 0.00-30.00 sec 1.07 GBytes 308 Mbits/sec receiver
[ 17] 0.00-30.00 sec 1.10 GBytes 314 Mbits/sec 1 sender
[ 17] 0.00-30.00 sec 1.09 GBytes 313 Mbits/sec receiver
[ 19] 0.00-30.00 sec 2.13 GBytes 610 Mbits/sec 0 sender
[ 19] 0.00-30.00 sec 2.13 GBytes 609 Mbits/sec receiver
[SUM] 0.00-30.00 sec 12.4 GBytes 3.56 Gbits/sec 2 sender
[SUM] 0.00-30.00 sec 12.4 GBytes 3.55 Gbits/sec receiver
I had a kernel panic the first time I ran with parallelism=8; the second run was OK. There's still nonlinear scaling.
MTU=9000
$ iperf3 --time 30 -c GUEST
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 17.2 GBytes 4.94 Gbits/sec 0 sender
[ 5] 0.00-30.00 sec 17.2 GBytes 4.94 Gbits/sec receiver
$ iperf3 --time 30 --parallel 4 -c GUEST
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 10.1 GBytes 2.88 Gbits/sec 1 sender
[ 5] 0.00-30.00 sec 10.1 GBytes 2.88 Gbits/sec receiver
[ 7] 0.00-30.00 sec 10.8 GBytes 3.10 Gbits/sec 0 sender
[ 7] 0.00-30.00 sec 10.8 GBytes 3.10 Gbits/sec receiver
[ 9] 0.00-30.00 sec 10.2 GBytes 2.92 Gbits/sec 2 sender
[ 9] 0.00-30.00 sec 10.2 GBytes 2.92 Gbits/sec receiver
[ 11] 0.00-30.00 sec 10.5 GBytes 3.00 Gbits/sec 2 sender
[ 11] 0.00-30.00 sec 10.5 GBytes 3.00 Gbits/sec receiver
[SUM] 0.00-30.00 sec 41.6 GBytes 11.9 Gbits/sec 5 sender
[SUM] 0.00-30.00 sec 41.6 GBytes 11.9 Gbits/sec receiver
$ iperf3 --time 30 --parallel 8 -c GUEST
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 4.81 GBytes 1.38 Gbits/sec 1 sender
[ 5] 0.00-30.00 sec 4.81 GBytes 1.38 Gbits/sec receiver
[ 7] 0.00-30.00 sec 5.47 GBytes 1.57 Gbits/sec 1 sender
[ 7] 0.00-30.00 sec 5.47 GBytes 1.57 Gbits/sec receiver
[ 9] 0.00-30.00 sec 5.40 GBytes 1.55 Gbits/sec 1 sender
[ 9] 0.00-30.00 sec 5.40 GBytes 1.55 Gbits/sec receiver
[ 11] 0.00-30.00 sec 5.11 GBytes 1.46 Gbits/sec 1 sender
[ 11] 0.00-30.00 sec 5.11 GBytes 1.46 Gbits/sec receiver
[ 13] 0.00-30.00 sec 5.33 GBytes 1.53 Gbits/sec 0 sender
[ 13] 0.00-30.00 sec 5.33 GBytes 1.53 Gbits/sec receiver
[ 15] 0.00-30.00 sec 5.07 GBytes 1.45 Gbits/sec 2 sender
[ 15] 0.00-30.00 sec 5.07 GBytes 1.45 Gbits/sec receiver
[ 17] 0.00-30.00 sec 5.40 GBytes 1.55 Gbits/sec 0 sender
[ 17] 0.00-30.00 sec 5.40 GBytes 1.54 Gbits/sec receiver
[ 19] 0.00-30.00 sec 4.81 GBytes 1.38 Gbits/sec 2 sender
[ 19] 0.00-30.00 sec 4.81 GBytes 1.38 Gbits/sec receiver
[SUM] 0.00-30.00 sec 41.4 GBytes 11.9 Gbits/sec 8 sender
[SUM] 0.00-30.00 sec 41.4 GBytes 11.9 Gbits/sec receiver
The CPU load with parallel iperf3 is the same: parallelism 4 = 4 CPUs at 100%, parallelism 8 = 8 CPUs at 100%, etc.
Can you please share the crash dump?
Unfortunately I didn't manage to get it :( not even a stack trace.
large-receive-offload being off is suspicious. RX checksumming also looks off. Can you record a session with Wireshark, just to check the TCP packet sizes on receive? With functional RSC (receive-side coalescing, which should boost RX performance) we should see 64K packets (or at least packets larger than the MTU).
@ybendito - ideas?
Confirmed that RSC is not working. Wireshark inside the guest shows 1500/9000-sized packets depending on the host MTU configuration.
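For checking receive packet sizes without the Wireshark GUI, a hedged tshark one-liner can be run in the guest instead (the interface name Ethernet0 is an assumption; 5201 is iperf3's default port):

```shell
# Capture 10 s of iperf3 traffic and print the largest observed frame sizes.
# With RSC working, coalesced frames well above the MTU (up to ~64K) appear;
# with RSC broken, only 1500/9000-byte frames show up.
tshark -i Ethernet0 -a duration:10 -f "tcp port 5201" -T fields -e frame.len |
    sort -n | tail -5
```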
on host:
$ sudo ethtool -K vnet4 tso on
Actual changes:
tx-tcp-segmentation: off [requested on]
tx-tcp-ecn-segmentation: off [requested on]
tx-tcp-mangleid-segmentation: off [requested on]
tx-tcp6-segmentation: off [requested on]
Could not change any device features
$ ethtool -c vnet4
Coalesce parameters for vnet4:
Adaptive RX: n/a TX: n/a
stats-block-usecs: n/a
sample-interval: n/a
pkt-rate-low: n/a
pkt-rate-high: n/a
rx-usecs: n/a
rx-frames: 0
rx-usecs-irq: n/a
rx-frames-irq: n/a
tx-usecs: n/a
tx-frames: n/a
tx-usecs-irq: n/a
tx-frames-irq: n/a
rx-usecs-low: n/a
rx-frame-low: n/a
tx-usecs-low: n/a
tx-frame-low: n/a
rx-usecs-high: n/a
rx-frame-high: n/a
tx-usecs-high: n/a
tx-frame-high: n/a
CQE mode RX: n/a TX: n/a
Is that what you meant?
tcp-segmentation-offload should be "on" for the tap device.
Yeah, I've tried setting that using ethtool but it isn't turning on. Where do you suggest I look next?
Do you see anything in "dmesg"?
Nothing printed there, either.
Yep. I think something's wrong with TSO on this kernel/configuration. But in your case, even with TSO off, you're still going above 10 Gbit/s; I'm barely hitting 1 Gbit/s.
I'm not sure where to look to figure out why TSO isn't turning on though.
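For anyone else stuck at this stage, a hedged host-side diagnostic sketch (the tap name vnet4 comes from this thread; substitute your own, and note it degrades gracefully when run elsewhere):

```shell
tap=vnet4
if command -v ethtool >/dev/null 2>&1 && ip link show "$tap" >/dev/null 2>&1; then
    # Current offload state of the tap:
    ethtool -k "$tap" | grep -E 'tcp-segmentation|rx-checksumming|large-receive'
    # Forcing TSO from the host fails while the guest keeps RSC/TSO disabled:
    ethtool -K "$tap" tso on 2>/dev/null || echo "host cannot enable TSO on $tap"
else
    echo "device $tap not present; run this on the KVM host"
fi
```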
I've been digging through the kernel code and the qemu code, and I can confirm that tap devices can turn on TSO, just not the ones currently in use by the VMs / created by libvirt/qemu. @ybendito could you share your domain and network libvirt XML please? It looks like libvirt and qemu both have a role to play here in setting up the tap device correctly. My bridge is created manually, but I tried a different domain with a network created by libvirt, and TSO is still off there.
@lowjoel My results are from plain command-line qemu, no libvirt, just -netdev tap,vhost=on,id=..,script= on the command line. Fedora 28, qemu ~6.1, kernel 5.12.
Could you paste that here so I can try it as a minimal reproducer, please? Including how the tap is created, just in case I'm missing something.
@lowjoel Enjoy ) sudo /home/yurib/src/qemu/build/qemu-system-x86_64 -machine q35,accel=kvm --snapshot --trace events=/home/yurib/qemu-events-en -cpu SandyBridge,+kvm_pv_unhalt,hv_spinlocks=0x1fff,hv_relaxed,hv_vapic,hv_time -m 8192 -smp 4 -uuid 1534fa42-4818-4493-9f67-eee5ba758385 -no-user-config -nodefaults -no-hpet -monitor stdio -device ioh3420,bus=pcie.0,id=root0,chassis=1,addr=0xa.0 -device ioh3420,bus=pcie.0,id=root1,chassis=2,addr=0xb.0 -device ioh3420,bus=pcie.0,id=root2,chassis=3,addr=0xc.0 -device ioh3420,bus=pcie.0,id=root3,chassis=4,addr=0xd.0 -device ioh3420,bus=pcie.0,id=root4,chassis=5,addr=0xe.0 -device ioh3420,bus=pcie.0,id=root5,chassis=6,addr=0xf.0 -global ICH9-LPC.disable_s3=0 -global ICH9-LPC.disable_s4=1 -device ahci,id=ahci -device virtio-serial-pci,bus=root1,id=virtio-serial0,max_ports=4,iommu_platform=on,ats=on -chardev spicevmc,name=vdagent,id=vdagent -device virtserialport,nr=2,bus=virtio-serial0.0,chardev=vdagent,name=com.redhat.spice.0 -chardev socket,id=serialp2,host=0.0.0.0,port=50000,server=on,wait=no -device virtserialport,nr=1,bus=virtio-serial0.0,chardev=serialp2,name=test.0 -netdev tap,id=hostnet10sb,script=/home/yurib/br0-ifup,ifname=nw10sb,vhost=on -device virtio-net-pci,netdev=hostnet10sb,mac=04:54:13:05:10:38,bus=root0,id=poc2,rss=on -device virtio-balloon-pci,bus=root4,iommu_platform=on,ats=on -drive file=/images/vms/2019-q35-usb.qcow2,if=none,id=drive-ide-3,media=disk,format=qcow2,cache=unsafe -device ide-hd,drive=drive-ide-3,id=ide3,bus=ahci.0,bootindex=0 -drive file=/images/iso/ubuntu-18.04.1-desktop-amd64.iso,if=none,id=drive-cd,media=cdrom,format=raw -device qemu-xhci,p2=8,p3=8 -device usb-tablet -device usb-storage,drive=drive-cd,id=xx3,bootindex=1 -vga std -vnc :1 -chardev spicevmc,name=usbredir,id=usbredirchardev1 -device usb-redir,filter=-1:-1:-1:-1:1,chardev=usbredirchardev1,id=usbredirdev1 -chardev spicevmc,name=usbredir,id=usbredirchardev2 -device usb-redir,filter=-1:-1:-1:-1:1,chardev=usbredirchardev2,id=usbredirdev2 -chardev spicevmc,name=usbredir,id=usbredirchardev3 -device usb-redir,filter=-1:-1:-1:-1:1,chardev=usbredirchardev3,id=usbredirdev3 -boot menu=on
@ybendito and how was the tap created? On my fresh Ubuntu install, both ip tuntap add mode tap pi vnet_hdr and tunctl create tap devices on which I can't enable TSO. I'm guessing you aren't seeing that same behaviour on your Fedora machine?
@lowjoel qemu creates the tap (it runs as admin). When the tap is created, qemu runs the script defined by script=/home/yurib/br0-ifup; the script is:
switch=virbr0
ifconfig $1 promisc 0.0.0.0
brctl addif ${switch} $1
virbr0 is the libvirt bridge (so the device is behind local NAT)
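As an aside, ifconfig and brctl are deprecated on recent distros; a hedged iproute2 equivalent of that ifup script (assuming QEMU passes the tap name as $1 and the bridge is virbr0, as above) would be:

```shell
#!/bin/sh
# iproute2 version of br0-ifup: bring the tap up in promiscuous mode
# and attach it to the bridge. $1 is the tap interface name from QEMU.
switch=virbr0
ip link set dev "$1" up promisc on
ip link set dev "$1" master "$switch"
```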
Let's see what happens under libvirt:
The RSC works.
What does PowerShell say in the guest?
Bingo. It wasn't the host side, it's the guest side.
Initially:
> get-netadapterrsc | format-list
Name : Ethernet
InterfaceDescription : Red Hat VirtIO Ethernet Adapter
IPv4Enabled : True
IPv6Enabled : True
IPv4Supported : True
IPv6Supported : True
IPv4OperationalState : False
IPv6OperationalState : False
IPv4FailureReason : WFPCompatibility
IPv6FailureReason : WFPCompatibility
WFP is the Windows Filtering Platform. I guess it's the firewall. After disabling the firewall:
> get-netadapterrsc | format-list
Name : Ethernet
InterfaceDescription : Red Hat VirtIO Ethernet Adapter
IPv4Enabled : True
IPv6Enabled : True
IPv4Supported : True
IPv6Supported : True
IPv4OperationalState : True
IPv6OperationalState : True
IPv4FailureReason : NoFailure
IPv6FailureReason : NoFailure
Wireshark now shows ~62k sized packets.
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 23.6 GBytes 6.77 Gbits/sec 4 sender
[ 5] 0.00-30.00 sec 23.6 GBytes 6.77 Gbits/sec receiver
Receive isn't 1:1 with send (it's about 30% lower), but I'll take it.
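For reference, the RSC state can also be inspected and toggled per adapter from an elevated PowerShell prompt; a hedged sketch using the standard NetAdapter cmdlets (the adapter name "Ethernet" is taken from the output above):

```powershell
Get-NetAdapterRsc -Name "Ethernet" | Format-List
Enable-NetAdapterRsc -Name "Ethernet" -IPv4 -IPv6
# When FailureReason is WFPCompatibility, a Windows Filtering Platform callout
# (firewall/antivirus) is vetoing RSC; dump the registered WFP state with:
netsh wfp show state
```

Note that Enable-NetAdapterRsc only sets the Enabled flags; a WFP veto still forces OperationalState to False (exactly as in the first output above) until the offending filter is removed.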
Incidentally, after disabling the firewall:
Features for vnet4:
rx-checksumming: off [fixed]
tx-checksumming: on
tx-checksum-ipv4: off [fixed]
tx-checksum-ip-generic: on
tx-checksum-ipv6: off [fixed]
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: on
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: off
tx-tcp-mangleid-segmentation: on
tx-tcp6-segmentation: on
I didn't expect that the guest could affect the host in this way. Can I help update the wiki/docs as a way of expressing my thanks? 😄 I don't have permissions though. I will also reach out to the firewall vendor to ask.
I didn't expect that the guest can affect the host in this way
The guest is the one that requests enabling/disabling these options on the host tap. If the driver started with RSC enabled, it can turn it on/off dynamically (qemu configures the tap accordingly), so if the guest turned RSC on we can later turn it off and on again in the tap. But if the guest started the device with RSC disabled, that means the OS is not ready to receive coalesced packets (packets larger than the MTU), and in that case you can't turn it on in the tap. Fortunately.
Incidentally, for those who are seeing this:
> get-netadapterrsc | format-list
Name : Ethernet
InterfaceDescription : Red Hat VirtIO Ethernet Adapter
IPv4Enabled : True
IPv6Enabled : True
IPv4Supported : False
IPv6Supported : False
IPv4OperationalState : False
IPv6OperationalState : False
IPv4FailureReason : Capability
IPv6FailureReason : Capability
It means that your libvirt/qemu command line is not enabling any of the offloads. Try the following under the <interface> element of your domain XML:
<driver name="vhost" txmode="iothread" ioeventfd="on" event_idx="on" queues="4" rx_queue_size="1024" tx_queue_size="1024">
<host csum="on" gso="on" tso4="on" tso6="on" ecn="on" ufo="on" mrg_rxbuf="on"/>
<guest csum="on" tso4="on" tso6="on" ecn="on" ufo="on"/>
</driver>
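If you apply the XML above, a hedged way to verify it took effect (the domain name mydomain and tap name vnet0 are placeholders for your own):

```shell
# Confirm the offload settings survived in the live domain definition...
virsh dumpxml mydomain | grep -A2 '<driver name='
# ...and that the host-side tap now advertises TSO while the guest is running:
ethtool -k vnet0 | grep tcp-segmentation-offload
```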
As promised @YanVugenfirer @ybendito I've updated the wiki with the knowledge from this thread: https://github.com/lowjoel/kvm-guest-drivers-windows-wiki/compare/netkvm-rsc-docs
Please feel free to integrate the updated docs into the wiki. And also feel free to close this issue since the problem is not with the netkvm driver. Thank you all once again for helping me!
@lowjoel Thanks for the Wiki update!
Just for the statistics - can you tell us why you are testing performance and how you are using the Virtio drivers?
No problem. I have a workstation/server all-in-one setup at home. I use a Windows guest since I'm mostly familiar with it, but on the server side at $DAYJOB I'm more familiar with the Linux stack. The server's just a file server, and I have shares across the host/guest which is why I ran into this specific problem.
I was testing performance because that specific share had my photos on it and transferring them for editing/publishing was unbearably slow 😅
Thanks!
Describe the bug
iperf3 can send ~10 Gbit/s from the guest to the host on a single connection:
But it achieves less than 10% of that performance when receiving from the host:
Copying a file over the bridge between two Windows VMs gives me ~1.6 Gbit/s and doesn't exhibit the same issue.
To Reproduce
Steps to reproduce the behaviour:
My Windows iperf3 is on WSL1, so there's no Hyper-V layer in between (but I get to run iperf3). See https://github.com/virtio-win/kvm-guest-drivers-windows/issues/1026#issuecomment-1892173927 for iperf3 using cygwin (no WSL).
I have a few workarounds: use --parallel for iperf, or use SMB multichannel. Guest CPU is highly loaded during the iperf run, depending on parallelism (parallelism 4 = 4 loaded CPUs, parallelism 8 = 8 loaded CPUs). Notice how adding more parallelism has diminishing returns.

Expected behavior
Send and receive performance should be similar. Maybe not 1:1, but less than 10% of the send performance shows something else is wrong here.
Screenshots
Host:
(8 queues, 16-core machine).
VM:
Additional context
There is a bridge interface on the host, and a tap interface for the Windows guest.
I saw this doc: https://github.com/virtio-win/kvm-guest-drivers-windows/wiki/netkvm-RSC-(receive-segment-coalescing)-feature — notice that tcp-segmentation-offload: off for the vnet device. Not sure if that's related.