Closed by yao-matrix 1 month ago
@IlyasMoutawwakil, please help review, thanks.
Hi! Are you sure about this?

> no_weights will allocate weight buffers and randomly initialize them

`no_weights` only interferes with the random generators used inside the model: instead of using these methods
https://github.com/huggingface/optimum-benchmark/blob/01e4e599381bbd166de56057d1649aba9e7d2100/optimum_benchmark/backends/transformers_utils.py#L190-L205
it'll use the fastest one of them:
https://github.com/huggingface/optimum-benchmark/blob/01e4e599381bbd166de56057d1649aba9e7d2100/optimum_benchmark/backends/transformers_utils.py#L207-L208
How does that ruin performance?
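For context, a rough sketch (my own illustration, not the actual optimum-benchmark code at the links above) of the general `no_weights` idea: build the model architecture from its config without reading a checkpoint, then randomly initialize the parameters in place. The model name and init values here are just placeholders.

```python
# Hypothetical sketch of a "no weights" setup: allocate the architecture's
# weight buffers from the config and fill them with random values, instead
# of loading real weights from disk.
import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Instantiate the architecture only; no checkpoint is read from disk.
model = AutoModelForCausalLM.from_config(config)

# In-place random initialization of every parameter. Each `normal_` call
# runs on whichever thread executes this loop, so the first write of every
# weight page happens on that core.
with torch.no_grad():
    for param in model.parameters():
        param.normal_(mean=0.0, std=0.02)
```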
It seems the random-init functions are mostly single-threaded (e.g. here), and the NUMA memory allocation policy roughly follows "first touch": a page is allocated on the node of the core that first writes to it. So in the random-initialization case, the weight memory is allocated near the core executing the random-initialization logic, e.g. entirely in NUMA domain 0. When the model forward computation happens, which may spread across NUMA domains 0 and 1, the compute running on NUMA domain 1 has to fetch the data from NUMA domain 0 (since the data was already allocated during random initialization), which incurs remote memory access cost.
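To make the first-touch effect concrete, here is a small sketch (my own illustration, not from this thread) that allocates and initializes a large buffer from a single thread and then reads the per-node page counters from `/proc/self/numa_maps` on Linux; the buffer size is an arbitrary example. On a multi-socket machine, essentially all of the newly touched pages should show up on the node of the core running the init.

```python
# Check where the "first touch" placed the pages of a freshly initialized buffer.
import torch

def numa_page_counts():
    """Sum the per-node page counters (N0=..., N1=...) over all mappings of this process."""
    counts = {}
    with open("/proc/self/numa_maps") as f:
        for line in f:
            for token in line.split():
                if token.startswith("N") and "=" in token:
                    node, pages = token[1:].split("=")
                    counts[node] = counts.get(node, 0) + int(pages)
    return counts

before = numa_page_counts()

# Allocate ~512 MB and randomly initialize it from this single thread; under
# the default first-touch policy the backing pages land on the NUMA node of
# the CPU this thread is currently running on.
weights = torch.empty(512 * 1024 * 1024 // 4)  # float32 elements
weights.normal_()

after = numa_page_counts()
growth = {node: after.get(node, 0) - before.get(node, 0) for node in after}
print("pages added per NUMA node:", growth)
```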
Data evidence, using a GCP c4-standard-96 instance to run meta-llama/Meta-Llama-3-8B with bs=1, in_seq=256, out_seq=64:

- no_weights=false: decoding throughput 16.37
- no_weights=true: decoding throughput 8.69
Very interesting behavior! Thanks for investigating it.