intel / graph-compiler

Apache License 2.0
26 stars 14 forks source link

Performance regression caused by read lock in brgemm #323

Open zhczhong opened 2 weeks ago

zhczhong commented 2 weeks ago

According to the perf test from @yifeizh2, the read lock will bring 10-20% performance regression https://github.com/intel/graph-compiler/blob/bc0014b12bb58b554277ab74c248535fff8c0cec/lib/gc/ExecutionEngine/CPURuntime/Microkernel/BrgemmOnednn.cpp#L145

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40">

  data type| bs  | hidden_list  | without lock | with lock |   -- | -- | -- | -- | -- | -- bf16 | 128 | 16x512 | 0.021597 | 0.01967 | 109.79% bf16 | 128 | 512x256 | 0.025172 | 0.029239 | 86.09% bf16 | 128 | 256x128 | 0.015969 | 0.026034 | 61.34% bf16 | 128 | 512x1024 | 0.046416 | 0.055459 | 83.69% bf16 | 128 | 1024x1024 | 0.082655 | 0.089712 | 92.13% bf16 | 128 | 1024x512 | 0.066204 | 0.066523 | 99.52% bf16 | 128 | 512x256 | 0.025992 | 0.029516 | 88.06%

Menooker commented 1 day ago

read_lock_guard_t g(g_brgemm_lock);

The reader-writer lock will result in serialization in memory even all users of the lock are readers. I think we might need to re-design this, to make it totally lock-free.

crazydemo commented 1 day ago

read_lock_guard_t g(g_brgemm_lock);

The reader-writer lock will result in serialization in memory even all users of the lock are readers. I think we might need to re-design this, to make it totally lock-free.

It might be because the current BRGEMM data is not exclusive to individual threads, resulting in significant contention between threads.