mtcp-stack / mtcp

mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems

Bottleneck about Nginx/mTCP in multi-process mode #33

Closed (bandari closed this issue 8 years ago)

bandari commented 8 years ago

Hi,

I am working on a research project that combines Nginx + mTCP + DPDK, using the latest version of the mtcp/dpdk code. In order to run Nginx in multi-process mode, I fixed some bugs in DPDK and modified some code in mTCP and Nginx, so fork() is now supported and Nginx+mTCP+DPDK works normally. But I've run into a difficult problem: the performance of Nginx does not improve in the multi-core case.

So I wrote a test program on top of mTCP/DPDK that runs on the server side. It uses fork() to create several child processes; each child process runs on a separate core and receives and sends data on its own RSS tx/rx queue. It simply counts the number of connection requests completed per second. I wrote another test program that runs on the client side; it just opens TCP connections and then closes them.
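The client side is essentially a connect-and-close loop; a rough sketch of that pattern using the mTCP API (simplified for illustration, not my exact test code, and handling only one connection at a time) looks like this:

#include <mtcp_api.h>
#include <mtcp_epoll.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <string.h>
#include <errno.h>

/* simplified connect-and-close loop for one client core (sketch only) */
static void client_loop(mctx_t mctx, in_addr_t daddr, in_port_t dport)
{
    struct mtcp_epoll_event ev, events[64];
    int ep = mtcp_epoll_create(mctx, 64);
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = daddr;   /* already in network byte order */
    addr.sin_port = dport;          /* already in network byte order */

    while (1) {
        int sock = mtcp_socket(mctx, AF_INET, SOCK_STREAM, 0);
        mtcp_setsock_nonblock(mctx, sock);

        /* non-blocking connect; completion is reported via MTCP_EPOLLOUT */
        if (mtcp_connect(mctx, sock, (struct sockaddr *)&addr,
                         sizeof(addr)) < 0 && errno != EINPROGRESS) {
            mtcp_close(mctx, sock);
            continue;
        }
        ev.events = MTCP_EPOLLOUT;
        ev.data.sockid = sock;
        mtcp_epoll_ctl(mctx, ep, MTCP_EPOLL_CTL_ADD, sock, &ev);

        /* wait until the connection is established, then tear it down */
        mtcp_epoll_wait(mctx, ep, events, 64, -1);
        mtcp_epoll_ctl(mctx, ep, MTCP_EPOLL_CTL_DEL, sock, &ev);
        mtcp_close(mctx, sock);
    }
}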

I found an interesting phenomenon. For example, if 2 child processes were created, each child process could accept about 132,000 connections per second. If 4 child processes were created, each could accept about 66,000 connections per second. If 8 child processes were created, each could accept about 33,000 connections per second.

So it seems that no matter how many child processes are created, the total number of connections per second stays constant. It feels like there is an invisible bottleneck that limits multi-core scaling.

I ran all the tests on a server with two Intel Xeon E5 CPUs (24 cores in total), 200+ GB of RAM (4 memory channels), and a dual-port Intel 82599 10 GbE NIC. The OS is RHEL 7.1.

By the way, I didn't use the NIC's FDIR (Flow Director) feature, only RSS. Does that have any effect on multi-core performance?

Can someone give me some advice ?

jagsnn commented 8 years ago

Looking at the description of your problem, it looks like there is some counter that is restricting the number of connections that can be processed to 132,000. So no matter how many threads you run, that counter will limit you to that much only. It could be some thread count, a buffer allocation limit, or something else. You may have to really dig into the code and see what is restricting the count.

If you were hitting a resource bottleneck, you would not see such exact numbers; you would typically see the performance numbers vary by at least +/- 5%. The percentage I mention is only approximate and could vary depending on the resource and the bottleneck it creates when you try to scale up.

This is what I think could be happening!

jagsnn commented 8 years ago

Looking at the number, could it be the hashtable's maximum number of entries that you are hitting? Check CreateHashtable():

#define NUM_BINS (131072) /* 132 K entries per thread */

I suppose that relates to the number of TCP connections, but even if the hashtable was created before the fork, how can the child processes be limited by this 132K, since each of them has its own address space? Are you by any chance using shared memory between the child processes?
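Just to illustrate the address-space point: anything allocated before fork() is copy-on-write, so each child ends up with its own private copy rather than sharing one table (a stand-alone example, nothing mTCP-specific):

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* a table allocated before fork(), like a flow hashtable sized by NUM_BINS */
    int *table = malloc(131072 * sizeof(int));
    table[0] = 1;

    if (fork() == 0) {
        table[0] = 100;   /* the child writes to its own copy-on-write copy */
        printf("child  sees table[0] = %d\n", table[0]);   /* prints 100 */
        exit(0);
    }
    wait(NULL);
    printf("parent sees table[0] = %d\n", table[0]);        /* still prints 1 */
    free(table);
    return 0;
}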

This is a bit strange!

bandari commented 8 years ago

Hi jagsnn,

Thank you very much for your attention. I followed your advice and multiplied NUM_BINS by eight, but there was no change in the test results. In the test program I do not use any shared memory among the child processes, but the child processes do share the hugepage memory in the DPDK library. The call trace is rte_eal_init -> rte_eal_memory_init; the master process calls rte_eal_hugepage_init inside rte_eal_memory_init, while the child processes call rte_eal_hugepage_attach.
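This follows the standard DPDK primary/secondary model; in a plain (non-fork) DPDK application the distinction would look roughly like this (a minimal sketch, not the modified mtcp code):

#include <rte_eal.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* rte_eal_init() calls rte_eal_memory_init() internally: the primary
     * process maps and initializes the hugepages, while a secondary
     * process attaches to the memory the primary set up */
    if (rte_eal_init(argc, argv) < 0) {
        fprintf(stderr, "EAL init failed\n");
        return -1;
    }

    if (rte_eal_process_type() == RTE_PROC_PRIMARY)
        printf("primary: hugepage memory initialized\n");
    else
        printf("secondary: attached to the primary's hugepage memory\n");

    return 0;
}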

In fact, the test program is a modified version of apps/example/epserver.c in mTCP. I ran the test in both multi-process mode and multi-thread mode, and the results were almost the same.

In the issue I forgot to mention something important. If one child process was created, it could accept about 187,000 connections per second. If two child processes were created, each could accept about 132,000 connections per second (not a precise value; the number fluctuated between 131,000 and 133,900). From there, each time the number of child processes doubled, the number of connections per second per child was cut in half. It looks as if the total number of connections has an upper limit.
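Put differently, the aggregate stays roughly constant once there is more than one child: 2 x 132,000, 4 x 66,000 and 8 x 33,000 all come to roughly 264,000 connections per second in total, compared with about 187,000 for a single process.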

Let me describe the mtcp initialization process in multi-process mode. The master process's DPDK type is PRIMARY, while all child processes' type is SECONDARY.

The master process calls mtcp_init() and mtcp_core_affinitize(), then calls fork() to create some child processes.

Each child process calls mtcp_init(), mtcp_core_affinitize(), mtcp_create_context(), mtcp_epoll_create(), mtcp_socket(), mtcp_setsock_nonblock(), mtcp_bind() and mtcp_listen().

Then each child process blocks in mtcp_epoll_wait(), waiting for new connections. All the child processes and their corresponding mtcp threads run on different cores, so each child process has its own data structures in mtcp (for example g_mtcp), and each mtcp thread receives and sends data on an independent queue.
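In code terms, each child does roughly the following (a trimmed-down sketch in the style of epserver.c rather than my exact test program; the config file name and port are placeholders):

#include <mtcp_api.h>
#include <mtcp_epoll.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <string.h>
#include <stdint.h>

#define MAX_EVENTS 1024

/* per-child setup and accept loop, one core per child (sketch only) */
static void child_main(int core, uint16_t port)
{
    struct mtcp_epoll_event ev, events[MAX_EVENTS];
    struct sockaddr_in addr;
    mctx_t mctx;
    int ep, listener;
    long accepted = 0;

    mtcp_init("mtcp.conf");              /* per-process init; config name is a placeholder */
    mtcp_core_affinitize(core);          /* pin this child to its own core */
    mctx = mtcp_create_context(core);    /* spawns the per-core mtcp thread */

    ep = mtcp_epoll_create(mctx, MAX_EVENTS);

    listener = mtcp_socket(mctx, AF_INET, SOCK_STREAM, 0);
    mtcp_setsock_nonblock(mctx, listener);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = INADDR_ANY;
    addr.sin_port = htons(port);
    mtcp_bind(mctx, listener, (struct sockaddr *)&addr, sizeof(addr));
    mtcp_listen(mctx, listener, 4096);

    ev.events = MTCP_EPOLLIN;
    ev.data.sockid = listener;
    mtcp_epoll_ctl(mctx, ep, MTCP_EPOLL_CTL_ADD, listener, &ev);

    while (1) {
        int i, n = mtcp_epoll_wait(mctx, ep, events, MAX_EVENTS, -1);
        for (i = 0; i < n; i++) {
            if (events[i].data.sockid == listener) {
                /* drain the accept queue; just count and close the connections
                 * (the real program reports `accepted` once per second) */
                int c;
                while ((c = mtcp_accept(mctx, listener, NULL, NULL)) >= 0) {
                    accepted++;
                    mtcp_close(mctx, c);
                }
            }
        }
    }
}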

So this is how mtcp is initialized and run in multi-process mode. I am not sure whether this setup itself is the problem, but the performance of a program running in mtcp multi-process mode is indeed not very good.

jagsnn commented 8 years ago

Since DPDK is initialized by the primary process alone, and the children share the hugepages it initialized, I wonder if there is some limit imposed by the DPDK init values, because that is the one thing shared between the processes. But that is just a wild guess.

In case you have not tried it before, it would be better to turn on the mtcp trace functions (whatever is defined for debugging), enabling the error traces and any further debug output as required with the -D option in gcc. I think this should give a clue about which function is limiting further connections. If that doesn't help, try turning on the DPDK debug options defined in config/common_linuxapp as required, rebuild the DPDK target as well as your app, and give it a try.

When you say performance, is it the number of TCP connections per second that you are measuring?

-Jags

vincentmli commented 8 years ago

Hi bandari

there is a backlog limit in mtcp/src/include/mtcp.h that limits the connection queue:

#define BACKLOG_SIZE (10*1024)

and in mtcp/src/core.c, InitializeMTCPManager():

mtcp->connectq = CreateStreamQueue(BACKLOG_SIZE);

I recall hitting an mtcp client connection bottleneck with that backlog setting. I'm not sure if it is related to your mtcp server connection bottleneck, but you could try increasing the value. You can also enable mtcp debugging in mtcp/src/Makefile by uncommenting DBG_OPT; it may give you some clue. I used to use the debug options below; you can add or remove options as needed if it gets too noisy:

DBG_OPT = -DDBGMSG -DPKTDUMP -DDBGFUNC -DSTREAM -DSTATE -DTSTAT -DAPP -DEPOLL -DAPI

Vincent

bandari commented 8 years ago

Hi, I am really sorry for the late reply on this issue. As a matter of fact, the performance of mtcp is scalable. The client-side load was simply not high enough, so the scalability did not show up. When I use the LoadRunner tool instead of the test program I wrote myself, the scalability shows up immediately.

Thank you very much for all your advice.