Open bvital1976 opened 6 years ago
Have you tried to disable IOMMU?
Centos 7.3 system becomes unusable with disabled IOMMU.
The same crash occurs on Centos 7.3 with kernel 3.10.0-693.11.1.el7.x86_64 running on VMWare Workstation.
The last commit which does not crash the OS is:
Commit: b2e99d984f890f74fa1a9cbf57ff43121f575a19 [b2e99d9]
Parents: 191aab0174
Author: Giuseppe Lettieri g.lettieri@iet.unipi.it
Date: 15 March 2017 18:45:15
Committer: Giuseppe Lettieri
Commit Date: 5 October 2017 17:12:49
All subsequent commits crash Centos 7.3 on VMWare when using e1000.
So you are saying this commit is the cause:
commit 546c4e63c9a8c32fd64f6799ebeff5e993d8c6db
Author: Giuseppe Lettieri <giuseppe.lettieri@unipi.it>
Date: Fri Oct 6 13:50:16 2017 +0200
linux: fix broken dma-mapping
?
Maybe @giuseppelettieri has a clue on this; it should be related to netmap_load_map.
What pkt-gen command line are you using exactly?
Yes, this is the first commit that crashes the OS.
I used the following command: "pkt-gen -i eth2 -f tx"
The scheme we are using involves a pre-allocation of all the DMA-mappings, which is infeasible for SW-IOMMU. The crash is probably on the error path after the allocation failure. The bug needs to be fixed, of course, but you will still be unable to use the patched e1000 driver in your setup. Anyway, using the unpatched driver with the generic netmap-layer should make no difference for e1000.
The patched e1000 driver does not work in ESXi, VMWare Workstation and VirtualBox VMs. Using the unpatched driver with pkt-gen gives me ~70Kpps, while the patched driver gives ~500Kpps in my environment. I think this is a big difference in performance.
It makes a big difference because you are running inside a VM, so the e1000 NIC is emulated in software and I/O register access is very expensive. With the generic driver you pay one register access per packet, which is very costly; with the patched driver you pay one register access per batch.
@giuseppelettieri meant that on a real e1000 card you wouldn't see a significant difference, because I/O register access is much cheaper on a real PCI bus.
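To illustrate the per-packet versus per-batch argument above, here is a rough cost model. It is only a sketch: the microsecond costs and the batch size of 64 are made-up values chosen for illustration, not measurements from this thread, and the function name is hypothetical.

```python
def packets_per_second(reg_access_us: float, per_packet_us: float, batch_size: int) -> float:
    """Estimate throughput when one register access is paid per batch.

    batch_size=1 models the generic driver (one register access per packet);
    a larger batch_size models the patched driver (one access per batch).
    """
    time_per_batch_us = reg_access_us + per_packet_us * batch_size
    return batch_size / time_per_batch_us * 1_000_000


# Hypothetical costs: an emulated NIC where a register access is expensive.
vm_generic = packets_per_second(reg_access_us=12.0, per_packet_us=1.5, batch_size=1)
vm_native = packets_per_second(reg_access_us=12.0, per_packet_us=1.5, batch_size=64)

# Hypothetical costs: real hardware where a register access is cheap.
hw_generic = packets_per_second(reg_access_us=0.1, per_packet_us=1.5, batch_size=1)
hw_native = packets_per_second(reg_access_us=0.1, per_packet_us=1.5, batch_size=64)

print(f"VM:       generic {vm_generic:,.0f} pps, batched {vm_native:,.0f} pps")
print(f"Hardware: generic {hw_generic:,.0f} pps, batched {hw_native:,.0f} pps")
```

With an expensive register access, batching dominates the result; with a cheap one, the two modes end up close to each other, which is the point being made about real e1000 cards.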
It seems that e1000 isn't a good solution for you. Have you tried to use virtio-net, or maybe vmxnet? Vmware is not open source, so we don't have good solutions for that (while we have good solutions for qemu and bhyve).
virtio-net is not available for VMWare. As for vmxnet, do you mean I have to patch the vmxnet sources to allow netmap to use it? The non-patched vmxnet gives me below 40Kpps.
I'm surprised that VMWare doesn't support virtio (which is the de facto standard for VM networking). Yes, I think the only solution for your use-case would be to patch vmxnet sources to allow netmap to use it. I don't see other solutions because your virtualization environment is very constrained. Using netmap on the unpatched vmxnet vNIC has the very same problems as e1000 (costly per-packet I/O register access), so I'm not surprised you get 40Kpps.
It crashes in every virtual environment I tried. E.g. it crashes on VirtualBox on Centos 7.4.1708, with kernel 3.10.0-693.11.1.el7.x86_64 and the latest netmap master, with the following stack trace:
[ 113.925253] [
Yes, as @giuseppelettieri is saying, the crash is probably on the error path, and it's happening because you are using sw iommu. We need to fix it, but that won't let you use the system. At least virtio-net is supported by virtualbox or qemu: have you tried to use that instead of e1000?
No, because my main target is ESXi. So the only choice for me is patching vmxnet3.
Hi @bvital1976, here is another option for your use case.
Be aware that SW-IOMMU means that the kernel is allocating a bounce buffer for each one of the pre-allocated netmap buffers. The allocation fails and you get the error.
However, you may not need that many extra buffers in your case. Try reducing their number with something like
echo 5000 > /sys/module/netmap/parameters/buf_num
(try to find a value that makes your application start successfully.)
Please note that bounce buffers do not work currently, since we are not doing the necessary copies. They will work after we merge #411 and then apply the change also to the e1000 driver.
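The search for a workable buf_num described above could be sketched as follows. Everything numeric here is an assumption rather than something confirmed in this thread: the sketch assumes a 64 MiB bounce-buffer pool and 2048-byte netmap buffers, and `starts_ok` is a hypothetical callback that would set buf_num and report whether the application started. It also assumes that if a value works, every smaller value works too.

```python
from typing import Callable


def max_buffers_for_pool(pool_bytes: int, buf_size: int) -> int:
    """Upper bound on buffers that fit if each one needs its own bounce buffer."""
    return pool_bytes // buf_size


def largest_working_buf_num(low: int, high: int, starts_ok: Callable[[int], bool]) -> int:
    """Binary search for the largest value in [low, high] accepted by starts_ok.

    Returns 0 if even `low` is rejected. Assumes starts_ok is monotonic:
    once a value fails, all larger values fail as well.
    """
    if not starts_ok(low):
        return 0
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if starts_ok(mid):
            low = mid
        else:
            high = mid - 1
    return low


# Assumed values, for illustration only.
ASSUMED_POOL_BYTES = 64 * 1024 * 1024
ASSUMED_BUF_SIZE = 2048

upper_bound = max_buffers_for_pool(ASSUMED_POOL_BYTES, ASSUMED_BUF_SIZE)


def fake_probe(buf_num: int) -> bool:
    # Stand-in for a real probe, which would write buf_num and start the app.
    return buf_num <= 24576


best = largest_working_buf_num(1000, upper_bound, fake_probe)
print(upper_bound, best)
```

The bound is optimistic, since other DMA mappings in the system would share the same pool, and a real probe may not behave monotonically.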
Your suggestion helps. Now the system does not crash, pkt-gen works but my application does not work most of the time. Host rings and TX ring usually work, RX rings rarely work. Is this because bounce buffers are not complete?
It depends on the kind of error. Are you receiving all-null packets? If yes, that is because of the missing copy from the bounce buffers.
Yes, I receive all-null packets, 2048 bytes in size.
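A quick way to confirm the all-null symptom in an application is to check whether every byte of a received frame is zero. This is a minimal sketch; the function names are hypothetical and the frames are assumed to be available as `bytes` objects.

```python
from typing import Iterable


def is_all_null(frame: bytes) -> bool:
    """True if the frame is non-empty and every byte in it is zero."""
    return len(frame) > 0 and not any(frame)


def count_all_null(frames: Iterable[bytes]) -> int:
    """Count how many of the received frames consist only of zero bytes."""
    return sum(1 for frame in frames if is_all_null(frame))


# Example: one zero-filled 2048-byte frame and one frame with real content.
received = [bytes(2048), b"\x00\x11\x22\x33"]
null_frames = count_all_null(received)
print(f"{null_frames} of {len(received)} frames are all-null")
```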
Could you please try the current master?
It works with "echo 24576 > /sys/module/netmap/parameters/buf_num" on VMWare Workstation. It freezes Centos with different buf_num values on ESXi using e1000. e1000e works fine on ESXi.
We have made several fixes in this area, so the bug should be gone. Could you please give it a try to confirm that?
The bug is still there. I tried Centos on ESXi. Sometimes pkt-gen works, sometimes it hangs the system, and sometimes it does not work because the program cannot allocate memory. I tried different buf_num values.
It is ok that pkt-gen returns ENOMEM in case there are not enough bounce buffers. That's not a bug. The hang however is worrisome... does it happen with specific values of buf_num? Is it deterministic or not?
It hangs with different buf_num values. It does not always hang. Sometimes it works correctly, sometimes it tries to send packets but cannot (it reports 0 sent packets), and sometimes it hangs. When it hangs, the VM uses 100% of the CPUs.
OK, and you don't have an updated kernel stack trace of the hang, if any? If you share the exact steps to reproduce the issue, we could try to reproduce it.
No, I do not have a kernel stack trace. I had to reset the ESXi host since it did not respond to commands. I tried on two different ESXi hosts.
To reproduce the issue:
Hi @bvital1976, could you please try again with the latest master?
It still hangs. It hangs on the first attempt with "-l 1500". It does not hang without the "-l" switch.
I am using a Centos 7.3 VM running inside ESXi, which is itself running in VMWare Workstation. Centos 7.3 with kernel 3.10.0-693.5.2.el7.x86_64 crashes when starting pkt-gen with the e1000 virtual network card. If I use the e1000e virtual network card instead, pkt-gen works fine. The e1000 source code was correctly fixed as described in https://github.com/luigirizzo/netmap#348. Before the crash, Linux displays the message "DMA: Out of SW-IOMMU space for 4096 bytes at device 0000:02:00.0". The unpatched e1000 driver works fine. The crash message is: