jiridanek opened 2 years ago
From my flamegraph analysis, the real overhead with sys_mutex() is the memory-allocation overhead for the sys_mutex_t wrapper object (look at qd_message() in the flamegraph below):
The actual setup of the pthread mutex is very cheap. So as a very first step I'd recommend your suggestion of simply in-lining the mutex directly and avoiding the memory allocation completely.
Once you have a pull request for that, re-run your vtune analysis and see what happens.
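The in-lining suggestion can be sketched like this (a minimal illustration only; the struct and field names here are hypothetical, not the actual skupper-router types):

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical illustration of the two layouts; these are NOT the
 * actual skupper-router types or field names. */

/* Before: the lock lives behind a pointer, costing one extra
 * allocation (and later a free) per message. */
typedef struct {
    pthread_mutex_t lock;
} sys_mutex_t;

typedef struct {
    sys_mutex_t *content_lock;   /* set via malloc + pthread_mutex_init */
} message_with_pointer_t;

/* After: the mutex is embedded, so no separate allocation is needed;
 * the message's own allocation covers it. */
typedef struct {
    pthread_mutex_t content_lock;
} message_inline_t;

int message_inline_init(message_inline_t *m) {
    /* Only the (cheap) pthread mutex setup remains. */
    return pthread_mutex_init(&m->content_lock, NULL);
}
```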
> The actual setup of the pthread mutex is very cheap. So as a very first step I'd recommend your suggestion of simply in-lining the mutex directly and avoiding the memory allocation completely.
They do look cheap in `perf`; I guess that's why Justin did not pull it up yet. I already said I don't know why vtune likes to show them on top.
In your flamegraph, that lock creation shows up in 323 samples; let's call it 4%. That means, if my base message rate is as shown below, I could expect a ~2k msgs/second improvement if I could do away with it.
>>> 53938.7/100*4
2157.548
base: some commit taken from main at the time I was running this a few days ago (I can find out which commit, but I did not write it down)
Estimated message rate is 53938.7 +- 507 msgs/second at 95% confidence
futex: jd_2022_03_28_futex
Estimated message rate is 56925.3 +- 258 msgs/second at 95% confidence
(data from https://github.com/skupperproject/skupper-router/issues/260#issuecomment-1090390216, which includes an explanation; the futex branch replaces pthread mutexes with a custom implementation using futex() and atomic operations)
In fact, I get almost a 3k improvement:
>>> 56925.3-53938.7
2986.600000000006
If I take the confidence interval of the difference as the sum of the two intervals for the subtracted numbers (which somewhat makes sense, although I am not sure it is theoretically completely sound), I get +- 765. Summing is conservative: for independent measurements the usual approach is to add the half-widths in quadrature, sqrt(507^2 + 258^2) ≈ 569, so the true interval is probably tighter.
Assuming we can get a 4% perf improvement by doing mutexes differently, is it worth it?
Actually, I did not notice that sys_mutex is there twice in the flamegraph:
>>> 53938.7/100*7.23
3899.76801
So, up to 7.23% improvement is possible in theory, assuming that all the time saved on creating locks goes into transferring more messages.
Hi @jiridanek, I wouldn't estimate the improvement in message rate like that, because it implies that 100% of the used CPU time is responsible for that rate. That's not possible, given that there are events in kernel space and on the NIC that won't appear in such graphs (off-CPU time due to IRQ handling). Think about a sleep call: it won't appear in an on-CPU flamegraph (because the CPU isn't running), but removing it will speed up the code a lot :)
It can mean the opposite as well: putting less pressure on the memory allocator and causing less fragmentation (the mutexes are aligned too, but still very small, and each aligned allocation is way more costly and becomes more so over time; something that won't show up in a flamegraph of a run just a few minutes long) will benefit every other operation accessing memory.
In short: knowing the cost, unless you were running at the maximum IPC (without off-CPU pauses or context switches), won't help to determine either the low or the high boundary of the expected improvement.
@franz1981 That makes sense. I think the only reason I went into the numbers is that I actually have a PR that replaces the pthreads mutex with a custom futex-based mutex which is very cheap to init (code from http://people.redhat.com/drepper/futex.pdf), and with that I get the "almost 3k" improvement in throughput, which turned out to be +5.53%. That is why I started to feel that 7.23% may be a reasonable number. But I see that it's just an accident that these turned out to be comparable in magnitude.
Intel V-Tune dislikes the mutexes in the router code
I am not sure what the profiler is actually showing here, or how I can get the same result (these functions shown at the top) from Linux `perf`.
Regarding solutions: I think the proactor guarantees do not help here. Proactor ensures the program never processes the same connection (link?) in more than one thread, but messages are, by the nature of the router's function, associated with more than one connection.
@franz1981 once considered pooling the mutexes. That is, if each instance of qd_message_t, etc. needs a mutex, then create a pool of initialized mutexes; upon message creation, take a mutex out of the pool for it, and on destruction of the message, leave the mutex initialized and put it back into the pool. https://github.com/apache/qpid-dispatch/pull/508#issuecomment-505461704

There is an alternative design, where you create an array of mutexes, choosing a suitable size N for it, probably based on the number of worker threads and some additional empirical parameters or constants. Then you compute a hash of the qd_message_t pointer mod N; this is the index of the mutex in the array that will protect this message. Multiple unrelated qd_message_t's may now share a mutex, meaning that in some cases two threads which were formerly free to work concurrently now cannot. Some sort of probing, as in hash tables, can be introduced on top to minimize such collisions. Overall, this has a chance to produce a speedup, however.
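The hashed-array (lock-striping) idea could look roughly like this; the stripe count and the hash function below are illustrative choices, not anything taken from the router:

```c
#include <pthread.h>
#include <stdint.h>

/* Sketch of lock striping: a fixed array of pre-initialized mutexes
 * shared by all messages, indexed by a hash of the message pointer.
 * LOCK_STRIPES and the mixer are illustrative choices. */
#define LOCK_STRIPES 64   /* power of two, e.g. a multiple of the worker-thread count */

static pthread_mutex_t stripes[LOCK_STRIPES];

void stripes_init(void) {
    for (int i = 0; i < LOCK_STRIPES; i++)
        pthread_mutex_init(&stripes[i], NULL);
}

/* Map a message pointer to "its" mutex. Unrelated messages may
 * collide on the same stripe, serializing threads that could
 * otherwise have run concurrently. */
pthread_mutex_t *stripe_for(const void *msg) {
    uint64_t h = (uint64_t)(uintptr_t) msg;
    h ^= h >> 33;
    h *= 0x9E3779B97F4A7C15ull;   /* Fibonacci-style pointer mixer */
    return &stripes[(h >> 33) & (LOCK_STRIPES - 1)];
}
```

With this layout nothing is allocated or initialized per message; the trade-off is exactly the false sharing of stripes described above.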
Finally, one possible speedup could come from making the mutex field in the structs not a pointer, with the mutex instead contained directly in the struct. When I tried this, I did not see the miraculous speedups I hoped for. It's probably not worth pursuing.
Edit: plus, of course, using spinlocks as Franz tried, or possibly Linux futexes (which don't need to be initialized/freed), might also help.