skupperproject / skupper-router

An application-layer router for Skupper networks
https://skupper.io
Apache License 2.0

[intel vtune] Consider pooling or sharding mutexes for some types; take proactor guarantees into consideration #252

Open jiridanek opened 2 years ago

jiridanek commented 2 years ago

Intel VTune dislikes the mutexes in the router code

[VTune screenshot: sys_mutex]

I am not sure what the profiler is actually showing here, or how I can get the same result (these functions shown at the top) from Linux perf.

Regarding solutions: I think the proactor guarantees do not help here. Proactor ensures the program never processes the same connection (link?) in more than one thread, but messages are, by the nature of the router's function, associated with more than one connection.

@franz1981 once considered pooling the mutexes. That is, if each instance of qd_message_t, etc. needs a mutex, then create a pool of initialized mutexes; upon message creation, take a mutex from the pool for it, and on destruction of the message, leave the mutex initialized and put it back into the pool. https://github.com/apache/qpid-dispatch/pull/508#issuecomment-505461704
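A minimal sketch of the pooling idea (hypothetical names, not the router's actual API): mutexes are initialized once and recycled through a free list, so creating a qd_message_t avoids repeated pthread_mutex_init()/destroy(), and avoids malloc() whenever a pooled entry can be reused.

```c
/* Hypothetical sketch of a mutex pool; error handling omitted. */
#include <pthread.h>
#include <stdlib.h>

typedef struct pooled_mutex_t {
    pthread_mutex_t        mutex;   /* stays initialized for the pool's lifetime */
    struct pooled_mutex_t *next;    /* free-list link, valid only while pooled */
} pooled_mutex_t;

static pooled_mutex_t *pool_head;                              /* free list */
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;  /* guards the list */

pooled_mutex_t *pool_take(void)
{
    pthread_mutex_lock(&pool_lock);
    pooled_mutex_t *m = pool_head;
    if (m)
        pool_head = m->next;
    pthread_mutex_unlock(&pool_lock);

    if (!m) {                                   /* pool empty: create a new entry */
        m = malloc(sizeof(*m));
        pthread_mutex_init(&m->mutex, NULL);
    }
    return m;
}

void pool_give(pooled_mutex_t *m)
{
    /* do NOT destroy the mutex; just push it back for reuse */
    pthread_mutex_lock(&pool_lock);
    m->next = pool_head;
    pool_head = m;
    pthread_mutex_unlock(&pool_lock);
}
```

Note that the pool itself needs protection, so a naive global free list just moves the contention; per-thread free lists would avoid that.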

There is an alternative design: create an array of mutexes, choosing a suitable size N for it, probably based on the number of worker threads plus some additional empirical parameters or constants. Then compute a hash of the qd_message_t pointer mod N; this is the index of the mutex in the array that will protect this message. Multiple unrelated qd_message_t's may now share a mutex, meaning that in some cases two threads which were formerly free to work concurrently now cannot. Some sort of probing, as in hash tables, could be introduced on top to minimize such collisions. Overall, this still has a chance to produce a speedup.
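A sketch of that sharding variant (the shard count and hash are arbitrary choices here, not anything measured): the mutexes are created once at startup, and a message is mapped to one of them by hashing its pointer.

```c
/* Hypothetical sketch of lock sharding: a fixed array of mutexes shared by
 * all messages, indexed by a hash of the message pointer. */
#include <pthread.h>
#include <stdint.h>

#define LOCK_SHARDS 64   /* e.g. a small multiple of the worker thread count */

static pthread_mutex_t shard_locks[LOCK_SHARDS];

void shard_locks_init(void)
{
    for (int i = 0; i < LOCK_SHARDS; i++)
        pthread_mutex_init(&shard_locks[i], NULL);
}

/* Map an object (e.g. a qd_message_t *) to its shard.  The shift discards the
 * low bits, which are identical for all aligned allocations. */
static pthread_mutex_t *shard_for(const void *obj)
{
    uintptr_t h = (uintptr_t) obj >> 6;
    return &shard_locks[h % LOCK_SHARDS];
}

void message_lock(const void *msg)   { pthread_mutex_lock(shard_for(msg)); }
void message_unlock(const void *msg) { pthread_mutex_unlock(shard_for(msg)); }
```

Two unrelated messages that land in the same shard now serialize on the same lock; making LOCK_SHARDS larger (or adding the probing mentioned above) trades memory for fewer such collisions.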

Finally, one possible speedup could come from the mutex field in the structs not being a pointer but the mutex being contained directly in the struct. When I tried this, I did not see the miraculous speedups I hoped for. Probably not worth pursuing.

Edit: plus, of course, using spinlocks as Franz tried, or possibly Linux futexes (which don't need to be initialized/freed), might also help.

kgiusti commented 2 years ago

From my flamegraph analysis the real overhead with sys_mutex() is the memory allocation overhead for the sys_mutex_t wrapper object (look at qd_message() in the flamegraph below):

[flamegraph: qdr_perf_worker_117881_117881]

https://github.com/skupperproject/skupper-router/blob/9a20e6e04c7f4c8a1fa6012e84075f2f732af88b/src/posix/threading.c#L41

The actual setup of the pthread mutex is very cheap. So as a very first step I'd recommend your suggestion of simply in-lining the mutex directly and avoiding the memory alloc completely.
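A sketch of what the in-lining might look like (the actual sys_mutex_t API in src/posix/threading.c may differ in detail): the owning struct embeds the mutex, so creating a message no longer calls malloc() for the lock.

```c
/* Hypothetical sketch; names and layout are illustrative only.  The point:
 * embed the mutex in the owning struct so creating a message does not
 * allocate a separate wrapper object for the lock. */
#include <pthread.h>

typedef struct sys_mutex_t {
    pthread_mutex_t mutex;
} sys_mutex_t;

static inline void sys_mutex_init(sys_mutex_t *m)   { pthread_mutex_init(&m->mutex, NULL); }
static inline void sys_mutex_deinit(sys_mutex_t *m) { pthread_mutex_destroy(&m->mutex); }

typedef struct qd_message_example_t {
    sys_mutex_t content_lock;   /* embedded instead of a pointer obtained from sys_mutex() */
    /* ... other message fields ... */
} qd_message_example_t;
```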

Once you have a pull request for that, re-run your vtune analysis and see what happens.

jiridanek commented 2 years ago

The actual setup of the pthread mutex is very cheap. So as a very first step I'd recommend your suggestion of simply in-lining the mutex directly and avoiding the memory alloc completely.

They do look cheap in perf; I guess that's why Justin did not pull it up yet. I already said I don't know why vtune likes to show them on top.

In your flamegraph, that lock creation shows up in 323 samples; let's call it 4%. That means if my base message rate is as shown below, I can expect a ~2k msgs/second improvement if I could do away with it.

>>> 53938.7/100*4
2157.548
base: some commit randomly taken from main at the time I was running this a few days ago; I can find out which commit, but I did not write it down
Estimated message rate is 53938.7 +- 507 msgs/second at 95% confidence

futex: the jd_2022_03_28_futex branch
Estimated message rate is 56925.3 +- 258 msgs/second at 95% confidence

(data from https://github.com/skupperproject/skupper-router/issues/260#issuecomment-1090390216, which includes an explanation; the futex branch replaces pthread mutexes with a custom implementation using futex() and atomic operations)

In fact, I get an almost 3k improvement:

>>> 56925.3-53938.7
2986.600000000006

If I take the confidence interval of the difference as the sum of the two intervals for the subtracted numbers (which somewhat makes sense, although I am not sure it is theoretically completely sound), I get +- 765.
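Aside: if the two measurements are treated as independent, the textbook way to combine the intervals is in quadrature, which gives a somewhat tighter bound than the plain sum:

>>> round((507**2 + 258**2) ** 0.5, 1)
568.9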

Assuming we can get a 4% perf improvement by doing mutexes differently, is it worth it?

jiridanek commented 2 years ago

Actually, I did not notice that sys_mutex appears twice in the flamegraph:

>>> 53938.7/100*7.23
3899.76801

So, up to a 7.23% improvement is possible in theory, assuming that all the time saved on creating locks goes into transferring more messages.

franz1981 commented 2 years ago

Hi @jiridanek, I wouldn't measure the improvement in message rate like that, because it implies that 100% of the used CPU time is responsible for that rate. That's not possible, given that there are events in kernel space and on the NIC that won't appear on such graphs (off-CPU time due to IRQ handling). Think about a sleep call: it won't appear in an on-CPU flamegraph (because the CPU isn't running), but if removed it will speed up the code a lot :)

It can mean the opposite as well: putting less pressure on memory and causing less fragmentation (the mutexes are aligned too, but still very small, and each aligned allocation is far more costly and becomes more costly over time, something that won't show up in a flamegraph of a run only a few minutes long) will benefit every other operation that accesses memory.

In short: knowing the cost, unless you were running at max IPC (without off-CPU pauses or context switches), won't help determine either the lower or the upper bound of the expected improvement.

jiridanek commented 2 years ago

@franz1981 That makes sense. I think the only reason I went into the numbers is that I actually have that PR which replaces the pthread mutex with a custom futex-based mutex that is very cheap to init (code from http://people.redhat.com/drepper/futex.pdf), and with that I get that "almost 3k" improvement in throughput, which turned out to be +5.53%. That is why I started to feel that the 7.23% may be a reasonable number. But I see that it is just an accident that these turned out to be somewhat comparable in magnitude.
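For reference, a minimal sketch of a futex-based mutex along the lines of Drepper's paper (the "mutex2" variant; the actual branch may differ): initialization is just zeroing an int, so there is nothing to allocate, nothing expensive to initialize, and nothing to destroy.

```c
/* Minimal sketch of a futex-based mutex following Drepper's
 * "Futexes Are Tricky" (mutex2); the branch mentioned above may differ.
 * State: 0 = unlocked, 1 = locked, 2 = locked with waiters. */
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

typedef struct { atomic_int state; } futex_mutex_t;   /* init is just {0} */

static long futex(atomic_int *uaddr, int op, int val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* Compare-and-swap that returns the value previously stored in *p. */
static int cmpxchg(atomic_int *p, int expected, int desired)
{
    atomic_compare_exchange_strong(p, &expected, desired);
    return expected;
}

void futex_mutex_lock(futex_mutex_t *m)
{
    int c = cmpxchg(&m->state, 0, 1);
    if (c != 0) {
        do {
            /* announce that we will wait, then sleep until woken */
            if (c == 2 || cmpxchg(&m->state, 1, 2) != 0)
                futex(&m->state, FUTEX_WAIT_PRIVATE, 2);
        } while ((c = cmpxchg(&m->state, 0, 2)) != 0);
    }
}

void futex_mutex_unlock(futex_mutex_t *m)
{
    if (atomic_fetch_sub(&m->state, 1) != 1) {   /* there were waiters */
        atomic_store(&m->state, 0);
        futex(&m->state, FUTEX_WAKE_PRIVATE, 1);
    }
}
```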