minxinhao opened this issue 6 months ago
patch/
directory) so that the number of shared_uuars can exceed 12 (i.e., it becomes exactly the input value MLX5_NUM_SHARED_UUARS). The application uses the modified libibverbs by setting the LD_PRELOAD environment variable; you can try it in your own program.
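For reference, preloading the patched library typically looks like the following. The library path here is hypothetical; point it at wherever the patched libibverbs was actually built:

```shell
# Hypothetical library path; substitute the real build output.
# `env | grep` only demonstrates that the variable is visible to the
# child process; in practice you would launch your RDMA application
# instead of `env`.
LD_PRELOAD=/path/to/patched/libibverbs.so.1 env | grep '^LD_PRELOAD='
```

If the path is wrong, the dynamic loader prints a "cannot be preloaded" warning to stderr but still runs the program, so a silent typo in the path is an easy way to think the patch is active when it is not.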
I ran smart's code on my testbed, and it brought a huge performance boost. My testbed uses a ConnectX-6 NIC and two Intel(R) Xeon(R) Gold 5218 CPUs. When I tried to replicate the performance gains of owrs, I wrote my own test code, which simply posts `depth` WRs and then polls for all completions. But I couldn't get the same performance gain with my code. Here are the throughputs at 8 bytes using smart and my test code.
I turned off all optimization options for smart except thread_aware_alloc, and made my test code match the QP optimization and owrs optimization that smart uses as closely as possible. But no matter what, I can't get a similar performance improvement above 24 threads and above a depth of 8. Can you give me some idea of the source of the performance improvement in smart? Here is the smart_config I am using.

Also, my testing found that QP allocation optimization over the doorbell registers is not applied above 12 shared_uuars (which is the actual driver limit; I turned off the preload), yet the smart code still gets a higher performance boost with more than 12 shared_uuars. This is something I can't understand either.
```json
{
  "infiniband": { "name": "", "port": 1, "gid_idx": 1 },
  "qp_param": { "max_cqe_size": 256, "max_wqe_size": 256, "max_sge_size": 1, "max_inline_data": 64 },
  "max_nodes": 128,
  "initiator_cache_size": 4096,
  "use_thread_aware_alloc": true,
  "thread_aware_alloc": { "total_uuar": 100, "shared_uuar": 96, "shared_cq": true },
  "use_work_req_throt": false,
  "work_req_throt": { "initial_credit": 4, "max_credit": 12, "credit_step": 2, "execution_epochs": 60, "sample_cycles": 19200000, "inf_credit_weight": 1.05, "auto_tuning": false },
  "use_conflict_avoidance": false,
  "use_speculative_lookup": false,
  "experimental": { "qp_sharing": false }
}
```