huggingface / optimum-benchmark

🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of Optimum's hardware optimizations & quantization schemes.
Apache License 2.0

ipex backend enhancements #272

Closed yao-matrix closed 1 month ago

yao-matrix commented 1 month ago
  1. add a feature-extraction task mapping for the ipex backend, to support benchmarking embedding models (see the sketch below this list)
  2. change the examples' no_weights to false; no_weights allocates the weight buffers and randomly initializes them, which ruins performance in NUMA cases: ~2x perf drop for the decoding phase
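A minimal sketch of how the new mapping could be exercised via the library's Python API; the class and field names (`IPEXConfig`, `ProcessConfig`, `task`, `input_shapes`) and the example model are assumptions for illustration, not taken from this PR:

```python
from optimum_benchmark import Benchmark, BenchmarkConfig, InferenceConfig, ProcessConfig, IPEXConfig
from optimum_benchmark.logging_utils import setup_logging

setup_logging(level="INFO")

if __name__ == "__main__":
    # feature-extraction task -> benchmark an embedding model on the ipex backend
    backend_config = IPEXConfig(
        model="BAAI/bge-base-en-v1.5",  # illustrative embedding model choice
        task="feature-extraction",
        device="cpu",
        no_weights=False,  # load real weights, per point 2 above
    )
    scenario_config = InferenceConfig(
        latency=True,
        memory=True,
        input_shapes={"batch_size": 1, "sequence_length": 256},
    )
    launcher_config = ProcessConfig()  # run the benchmark in an isolated process
    benchmark_config = BenchmarkConfig(
        name="ipex_feature_extraction",
        backend=backend_config,
        scenario=scenario_config,
        launcher=launcher_config,
    )
    benchmark_report = Benchmark.launch(benchmark_config)
    benchmark_report.log()
```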
yao-matrix commented 1 month ago

@IlyasMoutawwakil, pls help review, thx.

IlyasMoutawwakil commented 1 month ago

Hi! Are you sure about "no_weights will allocate weight buffers and randomly initialize them"? no_weights only interferes with the random generators used inside the model, so instead of using these methods

https://github.com/huggingface/optimum-benchmark/blob/01e4e599381bbd166de56057d1649aba9e7d2100/optimum_benchmark/backends/transformers_utils.py#L190-L205

it'll use the fastest one of them

https://github.com/huggingface/optimum-benchmark/blob/01e4e599381bbd166de56057d1649aba9e7d2100/optimum_benchmark/backends/transformers_utils.py#L207-L208

How does that ruin performance?
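For context, no_weights builds the model from its config and randomly initializes the parameters instead of loading a pretrained checkpoint. A rough conceptual sketch of that difference (my own illustration, not the library's actual code in transformers_utils.py):

```python
import torch
from transformers import AutoConfig, AutoModel

model_id = "bert-base-uncased"  # illustrative model choice

# no_weights=False: load the pretrained checkpoint from the hub / cache
model_real = AutoModel.from_pretrained(model_id)

# no_weights=True (conceptually): build the model from config only and fill
# the parameter buffers with random values in place, no checkpoint involved
config = AutoConfig.from_pretrained(model_id)
model_random = AutoModel.from_config(config)
with torch.no_grad():
    for param in model_random.parameters():
        param.normal_(mean=0.0, std=0.02)
```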

yao-matrix commented 1 month ago

> Hi! Are you sure about "no_weights will allocate weight buffers and randomly initialize them"? no_weights only interferes with the random generators used inside the model, so instead of using these methods
>
> https://github.com/huggingface/optimum-benchmark/blob/01e4e599381bbd166de56057d1649aba9e7d2100/optimum_benchmark/backends/transformers_utils.py#L190-L205
>
> it'll use the fastest one of them https://github.com/huggingface/optimum-benchmark/blob/01e4e599381bbd166de56057d1649aba9e7d2100/optimum_benchmark/backends/transformers_utils.py#L207-L208
>
> How does that ruin performance?

It seems the random init functions are mostly single-threaded (e.g. here), and the NUMA memory allocation policy roughly follows a first-touch ("allocate on first write") scheme. So with random initialization, the weight memory gets allocated close to the core executing the random initialization logic, which is, for example, done entirely in NUMA domain 0. Then when the model forward computation runs, which may spread across NUMA domains 0 and 1, the compute in domain 1 has to fetch the data from domain 0 (since it was already placed there during random initialization), which incurs remote memory access cost.
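A toy torch sketch of the first-touch effect described above (my own illustration, assuming a default local-allocation NUMA policy on a two-socket machine; the numactl commands in the comments are a common mitigation, not something this PR adds):

```python
import torch

hidden = 8192

# Single-threaded random init: the main thread is the first to write every page,
# so under a first-touch policy all weight pages are physically placed on that
# thread's NUMA node (say, node 0).
weight = torch.empty(hidden, hidden)
weight.normal_()

# A subsequent multi-threaded matmul spreads its workers across both sockets;
# workers running on node 1 must fetch every weight element from node 0,
# paying remote-memory latency on each access.
x = torch.randn(16, hidden)
y = x @ weight

# Common mitigations outside the code: interleave pages across nodes, e.g.
#   numactl --interleave=all python run_benchmark.py
# or bind the whole benchmark to a single node:
#   numactl --cpunodebind=0 --membind=0 python run_benchmark.py
```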

Data evidence, using a GCP c4-standard-96 instance to run meta-llama/Meta-Llama-3-8B with bs=1, in_seq=256, out_seq=64:

no_weights=false: decoding throughput 16.37
no_weights=true: decoding throughput 8.69

IlyasMoutawwakil commented 1 month ago

very interesting behavior ! thanks for investigating it