Closed: simonhf closed this issue 10 years ago.
Unfortunately, I have only run things in a virtual machine (I do not own any suitable hardware), so I cannot provide any useful numbers.
There are two main aspects to performance: 1) how well a rump kernel runs the NetBSD TCP/IP stack outside of the kernel, and 2) how good the NetBSD TCP/IP stack itself is. For "1", there are a bunch of todo points for optimizing performance on the wiki. I believe that overhead can be optimized away entirely. The reason it hasn't been yet is that for years the work was driven by generality instead of performance, so it's essentially a question of removing some code ;-) https://github.com/anttikantee/dpdk-rumptcpip/wiki/Optimizing-performance
Overall, performance is a combination of how well you do packet processing and how well you do packet shoveling (and how well those two play with each other). The approach in [1] proposes a completely different approach to packet shoveling from that of DPDK. After that, it's just a matter of which packet processing method you attach to the shovel ... and the portable, full-featured open source options seem to be in short supply.
hi anttikantee, My company has some HP DL380 servers with Intel 82599 NICs that have run DPDK applications. We are interested in DPDK + rump performance testing. Which tests would you like us to run? We can run them and send the results to you.
Hi lxu4net, thanks for the offer. I think any numbers characterizing a networking stack would be interesting, e.g. throughput, latency, connections per second, etc. However, before publishing any numbers, I really would like to experiment with the simple optimization suggestions cited above, as I think they could bring great performance improvements with relatively little effort.
Hi, anttikantee, I have read "Optimizing-performance" several times and read some of the source code of the NetBSD TCP/IP stack. I found that there is a giant kernel lock in the NetBSD TCP/IP stack. So I guess each rump instance should be pinned to a single physical core, and multiple rump instances are required to achieve high performance. The multiple queues of the NIC can easily be mapped to multiple rump instances. But what about multi-process applications that talk to multiple rump instances through the socket API? For example, in nginx multiple processes simultaneously accept client connections on the same socket. Does that require a lot of modification to the socket layer?
Your observations are correct.
Running a packet processing stack per core has an advantage that no multicore processing model can beat: there is no need for atomic locks once a thread is running in the rump kernel (see sys/rump/librump/rumpkern/locks_up.c). Additionally, if we use a non-preemptive thread scheduler, there is no need for locking even when entering the rump kernel. This probably does not scale to dozens of cores (management/configuration becomes too difficult), but it should work very nicely for e.g. 8 cores.
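To make the "no atomics" point concrete, here is a minimal sketch of the uniprocessor-lock idea behind locks_up.c: with one rump kernel per core and cooperative scheduling, a mutex degenerates into a plain flag with no atomic instructions. All names here are illustrative, not the actual NetBSD API:

```c
/* Sketch of a "uniprocessor" mutex: only one CPU ever touches it,
 * and cooperative scheduling guarantees we are not preempted between
 * checking and setting the flag, so no atomic operations are needed.
 * Names are illustrative; see sys/rump/librump/rumpkern/locks_up.c
 * for the real thing. */
#include <assert.h>
#include <stdbool.h>

struct up_mutex {
	bool held;	/* plain load/store, never an atomic RMW */
};

static void up_mutex_enter(struct up_mutex *m)
{
	/* A real implementation would yield to the (cooperative)
	 * scheduler while another thread holds the lock; in this
	 * single-threaded sketch the lock must simply be free. */
	assert(!m->held);
	m->held = true;
}

static void up_mutex_exit(struct up_mutex *m)
{
	m->held = false;
}
```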
Yes, I think some type of modification to the application interface and the applications is required for high performance. If we assume modifications that remove the explicit data copy between the application and the socket buffer, they should work the same regardless of whether there is one client process or many -- the same memory will be mapped into the process hosting the client and the process hosting the TCP/IP stack, and then it's just a matter of being able to do a wakeup.
Oh, and removing the giant networking lock from NetBSD is in progress (rmind-smpnet in the NetBSD repo).
But regardless, my theory is that stack-per-core will give the best possible performance.
There's now a mailing list, let's continue this discussion there if necessary. https://lists.sourceforge.net/lists/listinfo/rumpkernel-users
What does the performance look like? And how does it compare to e.g. the performance reported in "Why protocol stacks should be in user-space?" [1].
[1] http://www.ietf.org/proceedings/87/slides/slides-87-mptcp-1.pdf