Performance regression caused by read lock in brgemm

zhczhong commented 2 weeks ago

According to the perf test from @yifeizh2, the read lock will bring 10-20% performance regression https://github.com/intel/graph-compiler/blob/bc0014b12bb58b554277ab74c248535fff8c0cec/lib/gc/ExecutionEngine/CPURuntime/Microkernel/BrgemmOnednn.cpp#L145

data type| bs | hidden_list | without lock | with lock | -- | -- | -- | -- | -- | -- bf16 | 128 | 16x512 | 0.021597 | 0.01967 | 109.79% bf16 | 128 | 512x256 | 0.025172 | 0.029239 | 86.09% bf16 | 128 | 256x128 | 0.015969 | 0.026034 | 61.34% bf16 | 128 | 512x1024 | 0.046416 | 0.055459 | 83.69% bf16 | 128 | 1024x1024 | 0.082655 | 0.089712 | 92.13% bf16 | 128 | 1024x512 | 0.066204 | 0.066523 | 99.52% bf16 | 128 | 512x256 | 0.025992 | 0.029516 | 88.06%

Menooker commented 1 day ago

read_lock_guard_t g(g_brgemm_lock);

The reader-writer lock will result in serialization in memory even all users of the lock are readers. I think we might need to re-design this, to make it totally lock-free.

crazydemo commented 1 day ago

read_lock_guard_t g(g_brgemm_lock);

The reader-writer lock will result in serialization in memory even all users of the lock are readers. I think we might need to re-design this, to make it totally lock-free.

It might be because the current BRGEMM data is not exclusive to individual threads, resulting in significant contention between threads.

intel / graph-compiler

Performance regression caused by read lock in brgemm #323