microsoft / BitNet

Official inference framework for 1-bit LLMs
MIT License

Extremely slow #50

Open seghier opened 1 month ago

seghier commented 1 month ago

Extremely slow in CPU mode

dawnmsg commented 1 month ago

Could you please provide more details? Which command is extremely slow?

alexeyvolkoff commented 1 month ago

Which compiler?

sunzj commented 1 month ago

Ubuntu 20.04 Clang-18

main: llama threadpool init, n_threads = 2

system_info: n_threads = 2 (n_threads_batch = 2) / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

sampler seed: 4294967295
sampler params:
    repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> greedy
generate: n_ctx = 2048, n_batch = 1, n_predict = 6, n_keep = 1

Daniel went back to the the the garden. Mary travelled to the kitchen. Sandra journeyed to the kitchen. Sandra went to the hallway. John went to the bedroom. Mary went back to the garden. Where is Mary? Answer: Mary is in the garden.

llama_perf_sampler_print: sampling time = 1.56 ms / 54 runs ( 0.03 ms per token, 34526.85 tokens per second)
llama_perf_context_print: load time = 1756.28 ms
llama_perf_context_print: prompt eval time = 36718.06 ms / 48 tokens ( 764.96 ms per token, 1.31 tokens per second)
llama_perf_context_print: eval time = 3840.11 ms / 5 runs ( 768.02 ms per token, 1.30 tokens per second)
llama_perf_context_print: total time = 40564.05 ms / 53 tokens
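
For anyone comparing runs, here is a minimal sketch that recomputes the per-token figures from the timing output above. The `PERF_LOG` string, the regex, and the `throughput` helper are illustrative names I made up for this comment. They are not part of BitNet or llama.cpp, and the parsing assumes the log keeps the `... time = X ms / N tokens` shape shown in this thread.

```python
import re

# Timing lines copied from the log in this thread (trailing summary omitted).
PERF_LOG = """
llama_perf_context_print: prompt eval time = 36718.06 ms / 48 tokens
llama_perf_context_print: eval time = 3840.11 ms / 5 runs
"""

# Hypothetical pattern: "<label> time = <ms> ms / <count> tokens|runs".
PATTERN = re.compile(
    r"(prompt eval|eval) time\s*=\s*([\d.]+) ms\s*/\s*(\d+) (?:tokens|runs)"
)


def throughput(log):
    """Return {label: (ms_per_token, tokens_per_second)} for each timing line."""
    results = {}
    for label, ms, count in PATTERN.findall(log):
        ms_per_token = float(ms) / int(count)
        results[label] = (ms_per_token, 1000.0 / ms_per_token)
    return results


for label, (ms_tok, tok_s) in throughput(PERF_LOG).items():
    print(f"{label}: {ms_tok:.2f} ms per token, {tok_s:.2f} tokens per second")
```

Both phases come out at roughly 1.3 tokens per second, which matches the numbers llama.cpp printed. Note that the log also reports n_threads = 2 out of 16 and n_batch = 1 for this run.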