intel / graph-compiler

MLIR-based toolkit targeting Intel heterogeneous hardware
Apache License 2.0

optimize thread local cache for brgemm #353

Closed. crazydemo closed this 1 month ago.

crazydemo commented 1 month ago

This PR uses a `std::vector` for the thread-local cache.
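A hedged sketch of what a thread-local `std::vector` cache could look like (the names `brgemm_kernel_t` and `get_or_create` are illustrative, not the actual graph-compiler API): each thread owns its own vector indexed by `kernel_idx`, so the hot path needs no locking.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

struct brgemm_kernel_t {  // stand-in for the real kernel object
  int64_t id;
};

static brgemm_kernel_t *get_or_create(
    int64_t kernel_idx,
    const std::function<brgemm_kernel_t *(int64_t)> &create) {
  // One cache per thread: lookups are plain indexed loads, no mutex.
  thread_local std::vector<brgemm_kernel_t *> cache;
  if (kernel_idx >= static_cast<int64_t>(cache.size()))
    cache.resize(kernel_idx + 1, nullptr);  // grow lazily; unused slots stay null
  if (!cache[kernel_idx])
    cache[kernel_idx] = create(kernel_idx);  // first use on this thread
  return cache[kernel_idx];
}
```

Note the lazy `resize`: a thread that only ever sees a few `kernel_idx` values still allocates slots up to the largest index it touches, which is the "bubbles" concern raised below.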

crazydemo commented 1 month ago

This PR brings a slight performance gain.

ciyongch commented 1 month ago

How much perf gain does this patch bring? Does part of the gain also come from changing `desc` to `*desc`? Could there be bubbles (unused slots) in the new vector, since each individual thread may touch fewer `kernel_idx` values than the global cache holds?

crazydemo commented 1 month ago

The perf gain comes partly from `desc` -> `*desc` (about 1%), and more significantly from using `std::vector` (another ~5%).

All commands have the form `OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 python -m benchgc --mode=P --driver=pattern --case mlp <flags> --act_type=relu --warm_up 50 --repeat 50`; the `cmd` column lists the varying `<flags>`.

| main | main+ptr | this PR | main / main+ptr | main / this PR | dtype | cmd |
| -- | -- | -- | -- | -- | -- | -- |
| 0.103666 | 0.106254 | 0.103906 | 0.975646 | 0.99769 | f32 | `--batch_size=128 --hidden_size_list=16x512x256x128 --has_bias=1x1x1` |
| 0.343486 | 0.318651 | 0.31864 | 1.077938 | 1.077972 | f32 | `--batch_size=128 --hidden_size_list=512x1024x1024x512x256 --has_bias=1x1x1x1` |
| 2.5479 | 2.518856 | 2.525348 | 1.011531 | 1.00893 | f32 | `--batch_size=32 --hidden_size_list=4096x4096x11008x4096 --has_bias=1x1x1` |
| 7.437604 | 7.396616 | 7.19349 | 1.005541 | 1.033935 | f32 | `--batch_size=128 --hidden_size_list=4096x4096x11008x4096 --has_bias=1x1x1` |
| 0.080631 | 0.079522 | 0.058027 | 1.013946 | 1.389532 | bf16 | `--batch_size=128 --hidden_size_list=16x512x256x128 --dtype=bf16 --has_bias=1x1x1` |
| 0.171087 | 0.16779 | 0.167892 | 1.019654 | 1.019032 | bf16 | `--batch_size=128 --hidden_size_list=512x1024x1024x512x256 --dtype=bf16 --has_bias=1x1x1x1` |
| 0.77902 | 0.788544 | 0.788094 | 0.987921 | 0.988486 | bf16 | `--batch_size=32 --hidden_size_list=4096x4096x11008x4096 --dtype=bf16 --has_bias=1x1x1` |
| 2.101495 | 2.119327 | 2.059455 | 0.991586 | 1.020413 | bf16 | `--batch_size=128 --hidden_size_list=4096x4096x11008x4096 --dtype=bf16 --has_bias=1x1x1` |
| geomean | | | 1.010062 | 1.060706 | | |
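The `desc` -> `*desc` part of the gain can be pictured as follows: caching pointers to heap-allocated descriptors instead of descriptors by value keeps cache entries small (one pointer each) and keeps descriptor addresses stable when the container grows. The struct name and size below are hypothetical, not the real oneDNN descriptor.

```cpp
#include <memory>
#include <vector>

struct brgemm_desc_t {  // stand-in; real descriptors are sizable PODs
  char blob[512];
};

// By-value storage: every insertion or vector growth copies 512 bytes
// per entry, and growth moves every descriptor to a new address.
std::vector<brgemm_desc_t> by_value_cache;

// By-pointer storage: the vector holds pointer-sized entries; each
// descriptor is allocated once and never moves when the vector grows.
std::vector<std::unique_ptr<brgemm_desc_t>> by_pointer_cache;
```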

Menooker commented 1 month ago

It occurs to me that we might not need any locking at run time... Maybe we can fall back to the original solution. Here is my justification:
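The full justification is not quoted above, but one common pattern that makes run-time lookups lock-free is to serialize only the (rare) kernel-creation writers and publish new entries with a release store, while readers index already-published slots after an acquire load. `std::deque` fits here because `push_back` never invalidates references to existing elements. This is a hedged sketch with illustrative names, not the code actually merged.

```cpp
#include <atomic>
#include <cstddef>
#include <deque>
#include <mutex>

template <typename Kernel> class LockFreeReadCache {
  std::deque<Kernel> storage_;        // stable element addresses on growth
  std::atomic<size_t> published_{0};  // number of slots safe to read
  std::mutex write_mu_;               // serializes writers only

public:
  // Called rarely (kernel creation); returns the new kernel's index.
  size_t append(Kernel k) {
    std::lock_guard<std::mutex> g(write_mu_);
    storage_.push_back(std::move(k));
    published_.store(storage_.size(), std::memory_order_release);
    return storage_.size() - 1;
  }

  // Hot path: no lock, just an acquire load plus an indexed read.
  const Kernel *get(size_t idx) const {
    if (idx >= published_.load(std::memory_order_acquire))
      return nullptr;  // not yet published
    return &storage_[idx];
  }
};
```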

crazydemo commented 1 month ago

Checked the current thread_local solution against the deque solution. The performance comparison is below: thread_local has a 2.4% perf gain on the following cases.

All commands have the form `OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 python -m benchgc --mode=P --driver=pattern --case mlp <flags> --act_type=relu --warm_up 50 --repeat 50`; the `cmd` column lists the varying `<flags>`.

| this PR | deque | this PR / deque | dtype | cmd |
| -- | -- | -- | -- | -- |
| 0.102334 | 0.105171 | 0.973028 | f32 | `--batch_size=128 --hidden_size_list=16x512x256x128 --has_bias=1x1x1` |
| 0.289466 | 0.346502 | 0.835395 | f32 | `--batch_size=128 --hidden_size_list=512x1024x1024x512x256 --has_bias=1x1x1x1` |
| 2.52066 | 2.520477 | 1.000073 | f32 | `--batch_size=32 --hidden_size_list=4096x4096x11008x4096 --has_bias=1x1x1` |
| 7.481903 | 7.219393 | 1.036362 | f32 | `--batch_size=128 --hidden_size_list=4096x4096x11008x4096 --has_bias=1x1x1` |
| 0.081189 | 0.080492 | 1.00866 | bf16 | `--batch_size=128 --hidden_size_list=16x512x256x128 --dtype=bf16 --has_bias=1x1x1` |
| 0.166998 | 0.169265 | 0.986609 | bf16 | `--batch_size=128 --hidden_size_list=512x1024x1024x512x256 --dtype=bf16 --has_bias=1x1x1x1` |
| 0.775384 | 0.787742 | 0.984313 | bf16 | `--batch_size=32 --hidden_size_list=4096x4096x11008x4096 --dtype=bf16 --has_bias=1x1x1` |
| 2.018033 | 2.014215 | 1.001896 | bf16 | `--batch_size=128 --hidden_size_list=4096x4096x11008x4096 --dtype=bf16 --has_bias=1x1x1` |
| geomean | | 0.976508 | | |

crazydemo commented 1 month ago

With @zhczhong's help, we got the following updated data on bf16, showing that the deque and thread-local solutions have similar performance.

| data type | bs | deque | thread local | deque / thread local |
| -- | -- | -- | -- | -- |
| bf16 | 128 | 0.005458 | 0.005775 | 0.945164 |
| bf16 | 128 | 0.010434 | 0.010398 | 1.003462 |
| bf16 | 128 | 0.006627 | 0.006268 | 1.057243 |
| bf16 | 128 | 0.018755 | 0.0182 | 1.030526 |
| bf16 | 128 | 0.013196 | 0.013407 | 0.984251 |
| bf16 | 128 | 0.01079 | 0.0109 | 0.989908 |
| bf16 | 128 | 0.009203 | 0.009742 | 0.944711 |
| geomean | | | | 0.992869 |

crazydemo commented 1 month ago

Using a static vector with a default size as the global cache to store brgemm kernels gives the best performance.