minxinhao opened this issue 6 months ago
patch/
directory) so that the number of shared_uuars can exceed 12 (i.e., it becomes exactly the input value MLX5_NUM_SHARED_UUARS). The application uses the modified libibverbs by setting the LD_PRELOAD environment variable; you can try it in your own program.
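For reference, preloading the patched library typically looks like the following. The library path here is hypothetical; point it at wherever the patched libibverbs was actually built:

```shell
# Hypothetical library path; substitute the real build output.
# `env | grep` only demonstrates that the variable is visible to the
# child process; in practice you would launch your RDMA application
# instead of `env`.
LD_PRELOAD=/path/to/patched/libibverbs.so.1 env | grep '^LD_PRELOAD='
```

If the path is wrong, the dynamic loader prints a "cannot be preloaded" warning to stderr but still runs the program, so a silent typo in the path is an easy way to think the patch is active when it is not.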
I ran smart's code on my testbed, and it brought a huge performance boost. My testbed uses a ConnectX-6 NIC and two Intel(R) Xeon(R) Gold 5218 CPUs. When I tried to replicate the performance gains of owrs, I wrote my own test code, which simply posts `depth` WRs and then polls for all completions. But I couldn't get the same performance gain with my code. Here are the throughputs at 8 bytes using smart and my test code.
I turned off all optimization options for smart except thread_aware_alloc, and made my test code match the QP optimization and owrs optimization that smart uses as closely as possible. But no matter what, I can't get a similar performance improvement above 24 threads and above a depth of 8. Can you give me some idea of the source of the performance improvement in smart? Here is the smart_config I am using.

Also, my testing found that QP allocation optimization over the doorbell registers is not applied above 12 shared_uuars (which is the actual driver limit; I turned off the preload), yet the smart code still gets a higher performance boost with more than 12 shared_uuars. This is something I can't understand either.
```json
{
  "infiniband": { "name": "", "port": 1, "gid_idx": 1 },
  "qp_param": { "max_cqe_size": 256, "max_wqe_size": 256, "max_sge_size": 1, "max_inline_data": 64 },
  "max_nodes": 128,
  "initiator_cache_size": 4096,
  "use_thread_aware_alloc": true,
  "thread_aware_alloc": { "total_uuar": 100, "shared_uuar": 96, "shared_cq": true },
  "use_work_req_throt": false,
  "work_req_throt": { "initial_credit": 4, "max_credit": 12, "credit_step": 2, "execution_epochs": 60, "sample_cycles": 19200000, "inf_credit_weight": 1.05, "auto_tuning": false },
  "use_conflict_avoidance": false,
  "use_speculative_lookup": false,
  "experimental": { "qp_sharing": false }
}
```