nanomsg / nng

nanomsg-next-generation -- light-weight brokerless messaging
https://nng.nanomsg.org
MIT License

Performance followup #1737

Open Alexander-Shukaev opened 11 months ago

Alexander-Shukaev commented 11 months ago

Hi @gdamore,

Happy holidays and many thanks for addressing #1663, really appreciated; we're definitely on the right track, as I can confirm that the throughput roughly doubled. Now, however, I would like to follow up with some more analysis that bothers me.

I have the following 4 test case runs (each of them is run in a separate process and not in parallel so that there is no side load on the system):

By "pair" I mean that each 2 participants talk privately to each other exclusively (so there is no case where one REP would have 2 or more REQs, i.e. 1-to-1 only pairs). But pairs themselves execute in parallel within one test case (process). Now, just from definitions of cases my expectations would be:

Somehow, I observe results that contradict every expectation. Let's start with inproc:

INFO: inproc://transfixi_nng.g.Scaling.Test_Scaling:2
INFO: inproc://transfixi_nng.g.Scaling.Test_Scaling:1
INFO: Exchanged 200000 messages (451045 KiB) in 11.408 seconds: 17531.000 m/s (39537.000 KiB/s)
INFO: Exchanged 200000 messages (451416 KiB) in 11.592 seconds: 17253.000 m/s (38940.000 KiB/s)
INFO: Exchanged 400000 messages (902462 KiB) in 2 pairs: 34784 m/s (78477 KiB/s)
INFO: inproc://transfixi_nng.g.Scaling.Test_Scaling:1
INFO: inproc://transfixi_nng.g.Scaling.Test_Scaling:3
INFO: inproc://transfixi_nng.g.Scaling.Test_Scaling:2
INFO: inproc://transfixi_nng.g.Scaling.Test_Scaling:4
INFO: Exchanged 200000 messages (452012 KiB) in 21.176 seconds: 9444.000 m/s (21345.000 KiB/s)
INFO: Exchanged 200000 messages (451840 KiB) in 21.141 seconds: 9460.000 m/s (21373.000 KiB/s)
INFO: Exchanged 200000 messages (451219 KiB) in 21.252 seconds: 9411.000 m/s (21232.000 KiB/s)
INFO: Exchanged 200000 messages (451414 KiB) in 21.271 seconds: 9402.000 m/s (21222.000 KiB/s)
INFO: Exchanged 800000 messages (1806487 KiB) in 4 pairs: 37717 m/s (85171 KiB/s)
[Flame graphs: inproc 2 pairs, inproc 4 pairs]

Observations:

Now let's see how ipc is doing:

INFO: ipc:///bb/data/tmp/transfixi_nng.g/Scaling.Test_Scaling/82991:1
INFO: ipc:///bb/data/tmp/transfixi_nng.g/Scaling.Test_Scaling/82991:2
INFO: Exchanged 200000 messages (452363 KiB) in 17.902 seconds: 11172.000 m/s (25269.000 KiB/s)
INFO: Exchanged 200000 messages (451481 KiB) in 18.008 seconds: 11106.000 m/s (25071.000 KiB/s)
INFO: Exchanged 400000 messages (903845 KiB) in 2 pairs: 22278 m/s (50340 KiB/s)
INFO: ipc:///bb/data/tmp/transfixi_nng.g/Scaling.Test_Scaling/109634:1
INFO: ipc:///bb/data/tmp/transfixi_nng.g/Scaling.Test_Scaling/109634:2
INFO: ipc:///bb/data/tmp/transfixi_nng.g/Scaling.Test_Scaling/109634:3
INFO: ipc:///bb/data/tmp/transfixi_nng.g/Scaling.Test_Scaling/109634:4
INFO: Exchanged 200000 messages (452072 KiB) in 9.987 seconds: 20026.000 m/s (45267.000 KiB/s)
INFO: Exchanged 200000 messages (451981 KiB) in 10.686 seconds: 18715.000 m/s (42295.000 KiB/s)
INFO: Exchanged 200000 messages (451480 KiB) in 10.758 seconds: 18591.000 m/s (41967.000 KiB/s)
INFO: Exchanged 200000 messages (451711 KiB) in 10.805 seconds: 18509.000 m/s (41805.000 KiB/s)
INFO: Exchanged 800000 messages (1807245 KiB) in 4 pairs: 75841 m/s (171333 KiB/s)
[Flame graphs: ipc 2 pairs, ipc 4 pairs]

Observations:

So to summarize:

Each flame graph is uploaded as an interactive SVG (fgs.zip). You can browse each branch in a regular web browser (e.g. Chromium or Firefox). You probably want to look into the nni_taskq_thread branch for further analysis. To me it appears that there is a lot of interlocking happening in NNG; how else can we explain the rapid degradation in the inproc case, where seemingly independent pairs somehow affect each other?

I appreciate your efforts and look forward to improving NNG even further together; I'd be glad to test any improvements or other nuances for you.

Wish you happy holidays 🎅

Environment Details

gdamore commented 11 months ago

Is there any way you can post versions of those flame graphs that are either interactive, or widened so that I can see the symbol names?

Alexander-Shukaev commented 11 months ago

Just updated my post with an archive upload, Sir; it looks like GitHub does not allow them to be interactive.

gdamore commented 11 months ago

Thank you.

So, about the doubling etc.: with REQ/REP there is a full round trip, so the throughput may be latency dominated. Adding more pairs should increase performance until you saturate some other resource (the available network bandwidth, the CPU, or something else like memory/bus speed -- the latter two of these are very dependent on the internal details of NNG as well -- what locks we use, etc.).

There are some hot locks that you are probably hitting, although the flame graphs above don't suggest that they are the limiters. (The lock protecting the hash of socket IDs to socket structs, for example.)

There are shared resources as well -- both REQ and REP share a few things:

  1. The set of threads used for the backend poller (on Linux it's just one right now, but on Windows it's scaled to the number of cores)
  2. The set of threads used for completion callbacks (typically something like 8, or ncpu, depending on your compilation flags)
  3. On a single socket, contexts share the locks of the socket that owns them (almost all the interesting activity happens under that lock).
  4. There is typically just a single TCP connection/IPC connection between the peers. Keeping this full, but not overfull, is critical to obtaining best performance.

I don't think your flame graphs capture the time spent not doing work -- waiting for a response. That's the issue in the 4th item, and it might be dominant.

A few ideas though.

  1. Consider multiple sockets instead of multiple contexts. Contexts are meant to make things easier, but you wind up hitting those common locks. (It's probably easy to open multiple client sockets, and hard to open multiple server sockets, so start with the clients; see the sketch after this list.)
  2. For inproc especially, the friction of opening another socket should be very low indeed.
  3. See if you can measure latency. What would be good to know is the end-to-end latency between two peers. For example, if the latency is 1 msec over the network (which would be high), that will limit you to 1000 msgs/sec for a single peer, but you could get up to 2000 with 2 peers. Do not discount the impact of latency. (For inproc the latency should be close to zero... measured in <1us, probably on the order of 100ns, but I haven't measured it.) You'll want to measure the processing time on the server vs. the total latency observed by the client, and subtract the two to get the actual network latency; a rough timing sketch follows below.
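
A minimal sketch of the first idea (per-pair sockets instead of contexts), with hypothetical names -- socks, urls, npairs -- and error handling elided:

```c
#include <nng/nng.h>
#include <nng/protocol/reqrep0/req.h>

// Give each client pair its own nng_socket (and thus its own socket
// lock) instead of sharing one socket through contexts.
int open_clients(nng_socket socks[], const char *urls[], int npairs) {
    for (int i = 0; i < npairs; i++) {
        int rv;
        if ((rv = nng_req0_open(&socks[i])) != 0) {
            return rv;
        }
        // Each REQ socket dials its own private REP peer (1-to-1),
        // so no two pairs contend on the same per-socket lock.
        if ((rv = nng_dial(socks[i], urls[i], NULL, 0)) != 0) {
            return rv;
        }
    }
    return 0;
}
```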

So the latency and keeping the pipe full may lead to some surprises where multiple clients improve performance.
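
For the latency measurement in the third idea, a rough sketch (placeholder url; assumes a REP peer is already listening and echoing one-byte requests):

```c
#include <time.h>
#include <nng/nng.h>
#include <nng/protocol/reqrep0/req.h>

// Time a single REQ/REP round trip with clock_gettime.
// Error handling elided for brevity.
double rtt_usec(const char *url) {
    nng_socket      sock;
    char            buf[1] = { 0 };
    size_t          sz     = sizeof(buf);
    struct timespec t0, t1;

    nng_req0_open(&sock);
    nng_dial(sock, url, NULL, 0);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    nng_send(sock, buf, sizeof(buf), 0); // request goes out...
    nng_recv(sock, buf, &sz, 0);         // ...blocks for the full round trip
    clock_gettime(CLOCK_MONOTONIC, &t1);

    nng_close(sock);
    return (double)(t1.tv_sec - t0.tv_sec) * 1e6 +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e3;
}
```

In practice you would average over many round trips and subtract the server-side processing time, as described above.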

More info as I think on it.

gdamore commented 10 months ago

Looking at inproc, I think it was designed for safety, and in particular has some guards to prevent recursive lock entry. As a result, we wind up not "completing" jobs inline that we might otherwise be able to, which forces us to go to the taskq to break the lock cycle.

I do believe that there is some opportunity to improve this -- in particular, it seems like some of the locking between the readers and the writers could be broken by completing the aios outside of the lock. I have done this for some other transports but haven't paid much attention to inproc.

Partly, I don't think of people doing much performance-critical work with inproc, so it hasn't gotten the attention that some of the other transports have (particularly for performance). I usually assume that folks who have performance-critical needs and could use inproc would just use function calls. :-)

If you're on UNIX/Linux, you might want to look at the socket:// transport I just integrated. It's easier to use than ipc (no filesystem paths; it uses socketpair()) but should have performance very close to ipc:// because it is backed almost identically. (There are two differences -- we don't write the extra flags byte for socket://, so that saves a tiny bit, and we use write() rather than sendmsg(), which probably won't be noticeable. I could change that if we found a reason to do so.) A rough sketch of wiring it up follows below.
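
A rough sketch, assuming the listener-plus-NNG_OPT_SOCKET_FD flow described in the current nng documentation (names may differ across releases; error handling elided):

```c
#include <sys/socket.h>
#include <nng/nng.h>
#include <nng/protocol/pair0/pair.h>

// Wire two nng sockets together over socket://: one listener per
// side, each fed one half of a socketpair() via NNG_OPT_SOCKET_FD.
void wire_socket_pair(nng_socket *s1, nng_socket *s2) {
    int          fds[2];
    nng_listener l1, l2;

    socketpair(AF_UNIX, SOCK_STREAM, 0, fds); // kernel-backed fd pair

    nng_pair0_open(s1);
    nng_pair0_open(s2);

    // socket:// supports only listeners; no address beyond the scheme.
    nng_listener_create(&l1, *s1, "socket://");
    nng_listener_create(&l2, *s2, "socket://");
    nng_listener_start(l1, 0);
    nng_listener_start(l2, 0);

    // Handing an fd to each listener creates the backing pipe.
    nng_listener_set_int(l1, NNG_OPT_SOCKET_FD, fds[0]);
    nng_listener_set_int(l2, NNG_OPT_SOCKET_FD, fds[1]);
    // *s1 and *s2 can now exchange messages via nng_send()/nng_recv().
}
```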

gdamore commented 2 weeks ago

I just had another look at those flame graphs.

Your code is apparently allocating and freeing nng_aio structures, and that is responsible for a large portion of the work.

You can avoid this in your application code.

Create one or more nng_aio structures ahead of time, then use callback-driven aio. This will reduce latency, reduce context switches, and completely eliminate the overhead of allocation/free.

Use nng_recv_aio and nng_send_aio if you can. You'll need to use nng_msg structures, but that will also avoid some extra data copying.
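
To illustrate, a minimal sketch of that pattern -- a REP-side echo loop with hypothetical names (echo_state, echo_cb, start_echo) and error handling elided. One nng_aio is allocated once and then bounces between nng_recv_aio and nng_send_aio from its own callback, with no per-message allocation:

```c
#include <stdbool.h>
#include <nng/nng.h>
#include <nng/protocol/reqrep0/rep.h>

typedef struct {
    nng_socket sock;
    nng_aio   *aio;
    bool       sending; // false: a recv is outstanding; true: a send is
} echo_state;

static void echo_cb(void *arg) {
    echo_state *st = arg;
    if (nng_aio_result(st->aio) != 0) {
        return; // real code would log the error and recover
    }
    if (!st->sending) {
        // A request arrived; echo the same nng_msg back (no copy).
        nng_aio_set_msg(st->aio, nng_aio_get_msg(st->aio));
        st->sending = true;
        nng_send_aio(st->sock, st->aio);
    } else {
        // The reply went out; post the next receive on the same aio.
        st->sending = false;
        nng_recv_aio(st->sock, st->aio);
    }
}

void start_echo(echo_state *st, const char *url) {
    nng_rep0_open(&st->sock);
    nng_listen(st->sock, url, NULL, 0);
    nng_aio_alloc(&st->aio, echo_cb, st); // allocated once, reused forever
    st->sending = false;
    nng_recv_aio(st->sock, st->aio); // kick off the first receive
}
```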