nanomsg / nng

nanomsg-next-generation -- light-weight brokerless messaging
https://nng.nanomsg.org
MIT License

Performance followup #1737

Open Alexander-Shukaev opened 11 months ago

Alexander-Shukaev commented 11 months ago

Hi @gdamore,

Happy holidays and many thanks for addressing #1663, really appreciated; we're definitely on the right track, as I can confirm that the throughput roughly doubled. Now, however, I would like to follow up with some more analysis that bothers me.

I have the following 4 test case runs (each of them is run in a separate process and not in parallel so that there is no side load on the system):

By "pair" I mean that each 2 participants talk privately to each other exclusively (so there is no case where one REP would have 2 or more REQs, i.e. 1-to-1 only pairs). But pairs themselves execute in parallel within one test case (process). Now, just from definitions of cases my expectations would be:

Somehow, I observe results that contradict every expectation. Let's start with inproc:

INFO: inproc://transfixi_nng.g.Scaling.Test_Scaling:2
INFO: inproc://transfixi_nng.g.Scaling.Test_Scaling:1
INFO: Exchanged 200000 messages (451045 KiB) in 11.408 seconds: 17531.000 m/s (39537.000 KiB/s)
INFO: Exchanged 200000 messages (451416 KiB) in 11.592 seconds: 17253.000 m/s (38940.000 KiB/s)
INFO: Exchanged 400000 messages (902462 KiB) in 2 pairs: 34784 m/s (78477 KiB/s)
INFO: inproc://transfixi_nng.g.Scaling.Test_Scaling:1
INFO: inproc://transfixi_nng.g.Scaling.Test_Scaling:3
INFO: inproc://transfixi_nng.g.Scaling.Test_Scaling:2
INFO: inproc://transfixi_nng.g.Scaling.Test_Scaling:4
INFO: Exchanged 200000 messages (452012 KiB) in 21.176 seconds: 9444.000 m/s (21345.000 KiB/s)
INFO: Exchanged 200000 messages (451840 KiB) in 21.141 seconds: 9460.000 m/s (21373.000 KiB/s)
INFO: Exchanged 200000 messages (451219 KiB) in 21.252 seconds: 9411.000 m/s (21232.000 KiB/s)
INFO: Exchanged 200000 messages (451414 KiB) in 21.271 seconds: 9402.000 m/s (21222.000 KiB/s)
INFO: Exchanged 800000 messages (1806487 KiB) in 4 pairs: 37717 m/s (85171 KiB/s)
[Flame graphs: inproc 2 pairs, inproc 4 pairs]

Observations:

Now let's see how ipc is doing:

INFO: ipc:///bb/data/tmp/transfixi_nng.g/Scaling.Test_Scaling/82991:1
INFO: ipc:///bb/data/tmp/transfixi_nng.g/Scaling.Test_Scaling/82991:2
INFO: Exchanged 200000 messages (452363 KiB) in 17.902 seconds: 11172.000 m/s (25269.000 KiB/s)
INFO: Exchanged 200000 messages (451481 KiB) in 18.008 seconds: 11106.000 m/s (25071.000 KiB/s)
INFO: Exchanged 400000 messages (903845 KiB) in 2 pairs: 22278 m/s (50340 KiB/s)
INFO: ipc:///bb/data/tmp/transfixi_nng.g/Scaling.Test_Scaling/109634:1
INFO: ipc:///bb/data/tmp/transfixi_nng.g/Scaling.Test_Scaling/109634:2
INFO: ipc:///bb/data/tmp/transfixi_nng.g/Scaling.Test_Scaling/109634:3
INFO: ipc:///bb/data/tmp/transfixi_nng.g/Scaling.Test_Scaling/109634:4
INFO: Exchanged 200000 messages (452072 KiB) in 9.987 seconds: 20026.000 m/s (45267.000 KiB/s)
INFO: Exchanged 200000 messages (451981 KiB) in 10.686 seconds: 18715.000 m/s (42295.000 KiB/s)
INFO: Exchanged 200000 messages (451480 KiB) in 10.758 seconds: 18591.000 m/s (41967.000 KiB/s)
INFO: Exchanged 200000 messages (451711 KiB) in 10.805 seconds: 18509.000 m/s (41805.000 KiB/s)
INFO: Exchanged 800000 messages (1807245 KiB) in 4 pairs: 75841 m/s (171333 KiB/s)
[Flame graphs: ipc 2 pairs, ipc 4 pairs]

Observations:

So to summarize:

Each flame graph is uploaded as an interactive SVG (fgs.zip). You can browse each branch in a regular web browser (e.g. Chromium or Firefox). You probably want to look into the nni_taskq_thread branch for further analysis. To me it appears that there is a lot of interlocking happening in NNG; how else can we explain the rapid degradation in the inproc case, where seemingly independent pairs somehow affect each other?

I appreciate your efforts and look forward to improving NNG even further together; I'd be glad to test any improvements or other nuances for you.

Wish you happy holidays 🎅

Environment Details

gdamore commented 11 months ago

Is there any way you can post versions of those flame graphs that are either interactive, or widened so that I can see the symbol names?

Alexander-Shukaev commented 11 months ago

Just updated my post with an archive upload, Sir; it looks like GitHub does not allow them to be interactive.

gdamore commented 11 months ago

Thank you.

So, about the doubling etc.: with REQ/REP there is a full round trip, so the throughput may be latency dominated. Adding more pairs should increase performance until you saturate some other resource (the available network bandwidth, the CPU, or something else like memory/bus speed -- the latter two of these are very dependent on the internal details of NNG as well -- what locks we use, etc.).

There are some hot locks that you are probably hitting, although the flame graphs above don't suggest that they are the limiters. (The lock protecting the hash of socket IDs to socket structs, for example.)

There are shared resources as well -- both REQ and REP share a few things:

  1. The set of threads used for the backend poller (on Linux it's just one right now, but on Windows it's scaled to the number of cores)
  2. The set of threads used for completion callbacks (typically something like 8, or ncpu, depending on your compilation flags)
  3. On a single socket, contexts share the locks of the socket that owns them (almost all the interesting activity happens under that lock).
  4. There is typically just a single TCP connection/IPC connection between the peers. Keeping this full, but not overfull, is critical to obtaining best performance.

I don't think your flame graphs capture the time spent not doing work -- waiting for a response. That's the issue in the 4th item, and it might be dominant.

A few ideas though.

  1. Consider multiple sockets instead of multiple contexts. Contexts are meant to make things easier, but you wind up hitting those common locks. (It's probably easy to open multiple client sockets, and hard to open multiple server sockets, so start with the clients; see the sketch after this list.)
  2. For inproc especially, the friction of opening another socket should be very low indeed.
  3. See if you can measure latency. What would be good to know is the end-to-end latency between two peers. For example, if the latency is 1 msec over the network (which would be high), that will limit you to 1000 msgs/sec for a single peer, but you could get up to 2000 with 2 peers. Do not discount the impact of latency. (For inproc the latency should be close to zero... measured in <1us, probably on the order of 100ns, but I haven't measured it.) You'll want to measure the processing time on the server vs. the total latency observed by the client, and subtract the two to get the actual network latency; a rough timing sketch follows below.
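
A minimal sketch of the first idea (per-pair sockets instead of contexts), with hypothetical names -- socks, urls, npairs -- and error handling elided:

```c
#include <nng/nng.h>
#include <nng/protocol/reqrep0/req.h>

// Give each client pair its own nng_socket (and thus its own socket
// lock) instead of sharing one socket through contexts.
int open_clients(nng_socket socks[], const char *urls[], int npairs) {
    for (int i = 0; i < npairs; i++) {
        int rv;
        if ((rv = nng_req0_open(&socks[i])) != 0) {
            return rv;
        }
        // Each REQ socket dials its own private REP peer (1-to-1),
        // so no two pairs contend on the same per-socket lock.
        if ((rv = nng_dial(socks[i], urls[i], NULL, 0)) != 0) {
            return rv;
        }
    }
    return 0;
}
```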

So the latency and keeping the pipe full may lead to some surprises where multiple clients improve performance.
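
For the latency measurement in the third idea, a rough sketch (placeholder url; assumes a REP peer is already listening and echoing one-byte requests):

```c
#include <time.h>
#include <nng/nng.h>
#include <nng/protocol/reqrep0/req.h>

// Time a single REQ/REP round trip with clock_gettime.
// Error handling elided for brevity.
double rtt_usec(const char *url) {
    nng_socket      sock;
    char            buf[1] = { 0 };
    size_t          sz     = sizeof(buf);
    struct timespec t0, t1;

    nng_req0_open(&sock);
    nng_dial(sock, url, NULL, 0);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    nng_send(sock, buf, sizeof(buf), 0); // request goes out...
    nng_recv(sock, buf, &sz, 0);         // ...blocks for the full round trip
    clock_gettime(CLOCK_MONOTONIC, &t1);

    nng_close(sock);
    return (double)(t1.tv_sec - t0.tv_sec) * 1e6 +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e3;
}
```

In practice you would average over many round trips and subtract the server-side processing time, as described above.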

More info as I think on it.

gdamore commented 10 months ago

Looking at inproc, I think it was designed for safety, and in particular has some guards to prevent recursive lock entry. As a result, we wind up not "completing" jobs inline that we might otherwise be able to, which forces us to go to the taskq to break the lock cycle.

I do believe that there is some opportunity to improve this -- in particular, it seems like some of the locking between the readers and the writers could be broken by completing the aios outside of the lock. I have done this for some other transports but haven't paid much attention to inproc.

Partly, I don't think of people doing much performance-critical work with inproc, so it hasn't gotten the attention that some of the other transports have (particularly for performance). I usually assume that folks who have performance-critical needs and could use inproc would just use function calls. :-)

If you're on UNIX/Linux, you might want to look at the socket:// transport I just integrated. It's easier to use than ipc (no filesystem paths; it uses socketpair()) but should have performance very close to ipc:// because it is backed almost identically. (There are two differences -- we don't write the extra flags byte for socket://, so that saves a tiny bit, and we use write() rather than sendmsg(), which probably won't be noticeable. I could change that if we found a reason to do so.) A rough sketch of wiring it up follows below.
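
A rough sketch, assuming the listener-plus-NNG_OPT_SOCKET_FD flow described in the current nng documentation (names may differ across releases; error handling elided):

```c
#include <sys/socket.h>
#include <nng/nng.h>
#include <nng/protocol/pair0/pair.h>

// Wire two nng sockets together over socket://: one listener per
// side, each fed one half of a socketpair() via NNG_OPT_SOCKET_FD.
void wire_socket_pair(nng_socket *s1, nng_socket *s2) {
    int          fds[2];
    nng_listener l1, l2;

    socketpair(AF_UNIX, SOCK_STREAM, 0, fds); // kernel-backed fd pair

    nng_pair0_open(s1);
    nng_pair0_open(s2);

    // socket:// supports only listeners; no address beyond the scheme.
    nng_listener_create(&l1, *s1, "socket://");
    nng_listener_create(&l2, *s2, "socket://");
    nng_listener_start(l1, 0);
    nng_listener_start(l2, 0);

    // Handing an fd to each listener creates the backing pipe.
    nng_listener_set_int(l1, NNG_OPT_SOCKET_FD, fds[0]);
    nng_listener_set_int(l2, NNG_OPT_SOCKET_FD, fds[1]);
    // *s1 and *s2 can now exchange messages via nng_send()/nng_recv().
}
```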

gdamore commented 2 weeks ago

I just had another look at those flame graphs.

Your code is apparently allocating and freeing nng_aio structures, and that is responsible for a large portion of the work.

You can avoid this in your application code.

Create one or more nng_aio structures ahead of time, then use callback-driven aio. This will reduce latency, reduce context switches, and completely eliminate the overhead of allocation/free.

Use nng_recv_aio and nng_send_aio if you can. You'll need to use nng_msg structures, but that will also avoid some extra data copying.
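
To illustrate, a minimal sketch of that pattern -- a REP-side echo loop with hypothetical names (echo_state, echo_cb, start_echo) and error handling elided. One nng_aio is allocated once and then bounces between nng_recv_aio and nng_send_aio from its own callback, with no per-message allocation:

```c
#include <stdbool.h>
#include <nng/nng.h>
#include <nng/protocol/reqrep0/rep.h>

typedef struct {
    nng_socket sock;
    nng_aio   *aio;
    bool       sending; // false: a recv is outstanding; true: a send is
} echo_state;

static void echo_cb(void *arg) {
    echo_state *st = arg;
    if (nng_aio_result(st->aio) != 0) {
        return; // real code would log the error and recover
    }
    if (!st->sending) {
        // A request arrived; echo the same nng_msg back (no copy).
        nng_aio_set_msg(st->aio, nng_aio_get_msg(st->aio));
        st->sending = true;
        nng_send_aio(st->sock, st->aio);
    } else {
        // The reply went out; post the next receive on the same aio.
        st->sending = false;
        nng_recv_aio(st->sock, st->aio);
    }
}

void start_echo(echo_state *st, const char *url) {
    nng_rep0_open(&st->sock);
    nng_listen(st->sock, url, NULL, 0);
    nng_aio_alloc(&st->aio, echo_cb, st); // allocated once, reused forever
    st->sending = false;
    nng_recv_aio(st->sock, st->aio); // kick off the first receive
}
```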