mtcp-stack / mtcp

mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems

Significant slowdown with server-side processing #202

Open andreaskipf opened 6 years ago

andreaskipf commented 6 years ago

Hi!

We've integrated mTCP into a simple application and are seeing a significant drop in throughput with server-side processing compared to without it.

The slowdown is much more severe in the case of mTCP than with regular TCP:

mTCP: Without any processing (just a count++ to avoid compiler optimizations) we're able to ingest ~5M events/s (10Gbit Ethernet link) using 256 TCP clients. With processing (we sleep for 100ns after receiving an event) we measured ~25k events/s.

TCP: Without any processing: ~700k events/s. With processing: ~65k events/s.

As you can see from these numbers, mTCP experiences a much more significant slowdown than TCP. When profiling the mTCP version, we see several kernel calls (entry_SYSCALL_64_fastpath, entry_SYSCALL_64, __schedule, etc.). This suggests that there's some scheduling issue. Did you experience something similar when integrating mTCP into other applications?
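For reference, the 100ns per-event "processing" amounts to something like the following (a minimal sketch; process_event is just an illustrative name, not our actual code):

#include <time.h>

/* Simulated per-event work: sleep for ~100 ns after each received event.
 * nanosleep() is a system call, so the actual pause (and the scheduling
 * it triggers) can be far longer than the requested 100 ns. */
static void process_event(void)
{
    struct timespec ts = { .tv_sec = 0, .tv_nsec = 100 };
    nanosleep(&ts, NULL);
}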

Thanks, Andreas

ajamshed commented 6 years ago

@andreaskipf,

I am not sure whether you are simulating the workload fairly. With the 'TCP' version, when you sleep, you are still letting the OS handle network connections and general I/O operations (e.g. read/write, epoll I/O operations). On the other hand, if you pause (via sleep()) in the app layer in the case of 'mTCP', you are stealing cycles that the mTCP layer needs: parts of its work are executed in the application thread underneath those calls. The __schedule function is most likely appearing in your profile because the excessive sleep calls trigger frequent scheduling of threads on your system.

andreaskipf commented 6 years ago

Thanks for your fast response.

I've tried replacing sleep() with 100 rand() calls (summing up and printing the results to avoid compiler optimizations) and the behavior is still the same: mTCP suffers significantly from the processing, while TCP's performance degrades only gradually.

From what I understand, TCP can still receive incoming packets (in the OS) while mTCP is blocked for a certain amount of time before it will receive packets again. Is that the case or is there a separate mTCP thread in user space that receives packets from the NIC?

Here's the relevant part of my server code:

while (true) {
  // Wait until at least one event is available (spin if the call returns none).
  while ((n_fds = mtcp_epoll_wait(mctx, epollfd, events, MAX_EVENTS, -1)) == 0);
  for (int curr_event = 0; curr_event < n_fds; ++curr_event) {
    if (events[curr_event].data.sockid == socketFd) {
      // Event on the listening socket: handle new client ...
    } else {
      n = mtcp_read(mctx,
                    events[curr_event].data.sockid,
                    (char*) buff,
                    BUFF_SIZE - 1);
      if (n < 0) {
        cerr << "Error reading from socket!" << endl;
        perror(NULL);
        std::cout << sum << std::endl;   // print the checksum before exiting
        cleanupAndExit(mctx, socketFd);  // does not return
      }
      // Do some work: randn(n) calls rand() n times and returns the sum.
      sum += randn(100);
      // Send response ...
    }
  }
}

mTCP profile (with rand() instead of sleep(), there are at least no more __schedule calls):

[mTCP profile screenshot]

TCP profile:

[TCP profile screenshot]
andreaskipf commented 6 years ago

It seems like, in the case of mTCP, one thread yields most of the time. Here are some more performance metrics:

                 mTCP            TCP
time (sec)       6.390435        2.460383
cycles           55389.761520    78737.757570
instructions     55133.634730    72879.545480
L1-misses        42.853380       90.321080
LLC-misses       0.392260        77.285960
branch-misses    46.777290       139.320890
task-clock       19476.180640    24603.870950
scale            100000          100000
IPC              0.995376        0.925598
CPUs             0.304771        1.000002
GHz              2.843975        3.200218

mTCP has a CPU utilization of 30% and TCP of 100%.

ajamshed commented 6 years ago

There is a separate mTCP thread in user space that handles rx/tx communication with the NIC.

It looks like you are making the application extremely top-heavy, meaning the bottleneck is not the networking stack but the application logic (excessive rand() calls). But this still does not explain the performance degradation of the mTCP version.

mTCP-dpdk uses a poll-mode driver to receive packets. This means that the mTCP application schedules the mTCP thread to read packets from the NIC much more frequently than the 'TCP' version, which uses a NAPI-based Ethernet driver. Can you try one of the following two tests:

1- Run the mTCP-netmap version and see if it improves the results. See README and README.netmap for details. The netmap driver uses a NAPI-based Ethernet driver, which should alleviate the scheduling issue that you have highlighted.

OR

2- Run mTCP-dpdk with NAPI-based emulation. Enable the RX_IDLE_ENABLE macro in mtcp/src/dpdk_module.c, re-compile mTCP, and then re-compile your application. You can also try varying RX_IDLE_THRESH (in mtcp/src/dpdk_module.c) to 16/32/64 to see if it improves results. A rough sketch of this edit is shown below.
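As a rough sketch, the edit for option 2 would look something like this near the top of mtcp/src/dpdk_module.c (the exact form and default values of the macros in your checkout may differ):

/* mtcp/src/dpdk_module.c (sketch): enable NAPI-like rx-idle emulation */
#define RX_IDLE_ENABLE           /* turn on the idle-backoff path */
#define RX_IDLE_THRESH   16      /* idle polls before the I/O thread backs off; try 16/32/64 */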

I would actually prefer you test out option (1).

andreaskipf commented 6 years ago

Thanks for your thorough reply, ajamshed.

I've tried both options. Here are the results (rand = number of rand() calls per event):

[results table screenshot]

Enabling RX_IDLE_ENABLE with a threshold of 8 lowered the overall time with 1000 rand() calls from 31s to 16s. netmap made things worse.

Here's the profile of netmap:

[netmap profile screenshot]

I have to admit that I only see ~3 Gbit/s in netmap's pkt-gen application when running both the client and the server on the same machine (which doesn't match the scenario above, where the server runs single-threaded mTCP and the clients are placed on a different machine and use TCP). I already tried disabling flow control and LRO (large receive offload).

ajamshed commented 6 years ago

@andreaskipf:

Apologies for the delayed response. I see that you are running a single-core application, and your concurrent connections value is also only ~256. Can you please try increasing your concurrency to around 1024? I suspect that you are not stressing the mTCP version of the application enough (batching of requests helps when the incoming request rate is high enough). Also, please try increasing the parallelism in your tests. mTCP works best when you are running an n-core application. As a general rule of thumb, please try to maintain at least 1024 concurrent connections per core when dealing with the mTCP stack.

andreaskipf commented 6 years ago

Yes, mtcp_epoll_wait() is currently executed from a single thread that also does the processing (rand()) and there are up to 256 TCP clients (connections). MAX_EVENTS is set to 1024.

By 1024 concurrent connections per core, do you mean that we should add more clients AND more processing threads (which call mtcp_epoll_wait() and do the rand() processing)?

Thanks!

ajamshed commented 6 years ago

You can write your client program so that it creates more connections at any given time; you don't necessarily need to spawn that many TCP client processes. See epwget.c as a reference implementation. Although it is written against the mTCP API, it gives you the general idea of how to write such a client application with the BSD API.
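As a rough illustration (not taken from epwget.c itself), a single BSD-socket/epoll process can hold many concurrent connections along these lines; SERVER_IP, SERVER_PORT and NUM_CONNS are placeholders for your setup:

#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>

#define SERVER_IP   "10.0.0.1"
#define SERVER_PORT 8000
#define NUM_CONNS   1024

int main(void)
{
    int epfd = epoll_create1(0);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(SERVER_PORT);
    inet_pton(AF_INET, SERVER_IP, &addr.sin_addr);

    /* Open NUM_CONNS non-blocking connections from a single process. */
    for (int i = 0; i < NUM_CONNS; i++) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        fcntl(fd, F_SETFL, O_NONBLOCK);
        connect(fd, (struct sockaddr *)&addr, sizeof(addr)); /* EINPROGRESS is fine */
        struct epoll_event ev = { .events = EPOLLIN | EPOLLOUT, .data.fd = fd };
        epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
    }

    /* Event loop: send a request whenever a socket is writable,
     * read the response whenever it is readable. */
    struct epoll_event events[NUM_CONNS];
    char buf[4096];
    for (;;) {
        int n = epoll_wait(epfd, events, NUM_CONNS, -1);
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (events[i].events & EPOLLOUT)
                send(fd, "event\n", 6, 0);
            if (events[i].events & EPOLLIN)
                recv(fd, buf, sizeof(buf), 0);
        }
    }
    return 0;
}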

Yes. You need to design your multi-threaded program in such a way that it follows a run-to-completion model. Again, please see epserver.c as your reference.
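For illustration only, a simplified sketch of the per-core, run-to-completion layout that epserver.c follows (NUM_CORES, the config file name, and the worker body are placeholders):

#include "mtcp_api.h"
#include "mtcp_epoll.h"
#include <pthread.h>

#define NUM_CORES  4
#define MAX_EVENTS 1024

static void *worker(void *arg)
{
    int core = (int)(long)arg;

    /* Pin this thread and its mTCP context to one core. */
    mtcp_core_affinitize(core);
    mctx_t mctx = mtcp_create_context(core);
    int ep = mtcp_epoll_create(mctx, MAX_EVENTS);

    /* ... create/bind/listen a socket, register it with mtcp_epoll_ctl(),
     * then run the same accept/read/process/write loop as above,
     * entirely on this core (run-to-completion). */
    (void)ep;
    return NULL;
}

int main(void)
{
    mtcp_init("server.conf");   /* mTCP config file (placeholder name) */

    pthread_t tid[NUM_CORES];
    for (int i = 0; i < NUM_CORES; i++)
        pthread_create(&tid[i], NULL, worker, (void *)(long)i);
    for (int i = 0; i < NUM_CORES; i++)
        pthread_join(tid[i], NULL);

    mtcp_destroy();
    return 0;
}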