microsoft / BitNet

Official inference framework for 1-bit LLMs
MIT License

Died with `Signals.SIGKILL: 9` when running `setup_env.py` #27

Closed. avcode-exe closed this issue 1 month ago.

avcode-exe commented 1 month ago

Hi guys!

I got the following error:

INFO:root:Compiling the code using CMake.
INFO:root:Downloading model HF1BitLLM/Llama3-8B-1.58-100B-tokens from HuggingFace to models/Llama3-8B-1.58-100B-tokens...
INFO:root:Converting HF model to GGUF format...
ERROR:root:Error occurred while running command: Command '['/home/airtau/miniconda3/envs/bitnet-cpp/bin/python', 'utils/convert-hf-to-gguf-bitnet.py', 'models/Llama3-8B-1.58-100B-tokens', '--outtype', 'f32']' died with <Signals.SIGKILL: 9>., check details in logs/convert_to_f32_gguf.log

When running the following command:

python setup_env.py --hf-repo HF1BitLLM/Llama3-8B-1.58-100B-tokens -q i2_s

Environment:

Hardware:

logs/convert_to_f32_gguf.log:

INFO:hf-to-gguf:Loading model: Llama3-8B-1.58-100B-tokens
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:gguf: context length = 8192
INFO:hf-to-gguf:gguf: embedding length = 4096
INFO:hf-to-gguf:gguf: feed forward length = 14336
INFO:hf-to-gguf:gguf: head count = 32
INFO:hf-to-gguf:gguf: key-value head count = 8
INFO:hf-to-gguf:gguf: rope theta = 500000.0
INFO:hf-to-gguf:gguf: rms norm epsilon = 1e-05
INFO:hf-to-gguf:gguf: file type = 0
INFO:hf-to-gguf:Set model tokenizer
INFO:gguf.vocab:Adding 280147 merge(s).
INFO:gguf.vocab:Setting special token type bos to 128000
INFO:gguf.vocab:Setting special token type eos to 128009
INFO:gguf.vocab:Setting chat_template to {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>

'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>

' }}{% endif %}
INFO:hf-to-gguf:Exporting model to 'models/Llama3-8B-1.58-100B-tokens/ggml-model-f32.gguf'
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
INFO:hf-to-gguf:output.weight,               torch.bfloat16 --> F32, shape = {4096, 128256}
INFO:hf-to-gguf:token_embd.weight,           torch.bfloat16 --> F32, shape = {4096, 128256}
INFO:hf-to-gguf:blk.0.attn_norm.weight,      torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.0.ffn_down.weight,       torch.uint8 --> F32, shape = {14336, 4096}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,       torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.0.ffn_up.weight,         torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.0.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.0.attn_k.weight,         torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.0.attn_output.weight,    torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.0.attn_q.weight,         torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.0.attn_v.weight,         torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.1.attn_norm.weight,      torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.1.ffn_down.weight,       torch.uint8 --> F32, shape = {14336, 4096}
INFO:hf-to-gguf:blk.1.ffn_gate.weight,       torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.1.ffn_up.weight,         torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.1.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.1.attn_k.weight,         torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.1.attn_output.weight,    torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.1.attn_q.weight,         torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.1.attn_v.weight,         torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.10.attn_norm.weight,     torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.10.ffn_down.weight,      torch.uint8 --> F32, shape = {14336, 4096}
INFO:hf-to-gguf:blk.10.ffn_gate.weight,      torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.10.ffn_up.weight,        torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.10.ffn_norm.weight,      torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.10.attn_k.weight,        torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.10.attn_output.weight,   torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.10.attn_q.weight,        torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.10.attn_v.weight,        torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.11.attn_norm.weight,     torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.11.ffn_down.weight,      torch.uint8 --> F32, shape = {14336, 4096}
INFO:hf-to-gguf:blk.11.ffn_gate.weight,      torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.11.ffn_up.weight,        torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.11.ffn_norm.weight,      torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.11.attn_k.weight,        torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.11.attn_output.weight,   torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.11.attn_q.weight,        torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.11.attn_v.weight,        torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.12.attn_norm.weight,     torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.12.ffn_down.weight,      torch.uint8 --> F32, shape = {14336, 4096}
INFO:hf-to-gguf:blk.12.ffn_gate.weight,      torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.12.ffn_up.weight,        torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.12.ffn_norm.weight,      torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.12.attn_k.weight,        torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.12.attn_output.weight,   torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.12.attn_q.weight,        torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.12.attn_v.weight,        torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.13.attn_norm.weight,     torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.13.ffn_down.weight,      torch.uint8 --> F32, shape = {14336, 4096}
INFO:hf-to-gguf:blk.13.ffn_gate.weight,      torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.13.ffn_up.weight,        torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.13.ffn_norm.weight,      torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.13.attn_k.weight,        torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.13.attn_output.weight,   torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.13.attn_q.weight,        torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.13.attn_v.weight,        torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.14.attn_norm.weight,     torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.14.ffn_down.weight,      torch.uint8 --> F32, shape = {14336, 4096}
INFO:hf-to-gguf:blk.14.ffn_gate.weight,      torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.14.ffn_up.weight,        torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.14.ffn_norm.weight,      torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.14.attn_k.weight,        torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.14.attn_output.weight,   torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.14.attn_q.weight,        torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.14.attn_v.weight,        torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.15.attn_norm.weight,     torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.15.ffn_down.weight,      torch.uint8 --> F32, shape = {14336, 4096}
INFO:hf-to-gguf:blk.15.ffn_gate.weight,      torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.15.ffn_up.weight,        torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.15.ffn_norm.weight,      torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.15.attn_k.weight,        torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.15.attn_output.weight,   torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.15.attn_q.weight,        torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.15.attn_v.weight,        torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.16.attn_norm.weight,     torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.16.ffn_down.weight,      torch.uint8 --> F32, shape = {14336, 4096}
INFO:hf-to-gguf:blk.16.ffn_gate.weight,      torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.16.ffn_up.weight,        torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.16.ffn_norm.weight,      torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.16.attn_k.weight,        torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.16.attn_output.weight,   torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.16.attn_q.weight,        torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.16.attn_v.weight,        torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.17.attn_norm.weight,     torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.17.ffn_down.weight,      torch.uint8 --> F32, shape = {14336, 4096}
INFO:hf-to-gguf:blk.17.ffn_gate.weight,      torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.17.ffn_up.weight,        torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.17.ffn_norm.weight,      torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.17.attn_k.weight,        torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.17.attn_output.weight,   torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.17.attn_q.weight,        torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.17.attn_v.weight,        torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.18.attn_norm.weight,     torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.18.ffn_down.weight,      torch.uint8 --> F32, shape = {14336, 4096}
INFO:hf-to-gguf:blk.18.ffn_gate.weight,      torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.18.ffn_up.weight,        torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.18.ffn_norm.weight,      torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.18.attn_k.weight,        torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.18.attn_output.weight,   torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.18.attn_q.weight,        torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.18.attn_v.weight,        torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.19.attn_norm.weight,     torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.19.ffn_down.weight,      torch.uint8 --> F32, shape = {14336, 4096}
INFO:hf-to-gguf:blk.19.ffn_gate.weight,      torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.19.ffn_up.weight,        torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.19.ffn_norm.weight,      torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.19.attn_k.weight,        torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.19.attn_output.weight,   torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.19.attn_q.weight,        torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.19.attn_v.weight,        torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.2.attn_norm.weight,      torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.2.ffn_down.weight,       torch.uint8 --> F32, shape = {14336, 4096}
INFO:hf-to-gguf:blk.2.ffn_gate.weight,       torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.2.ffn_up.weight,         torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.2.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.2.attn_k.weight,         torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.2.attn_output.weight,    torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.2.attn_q.weight,         torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.2.attn_v.weight,         torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.20.attn_norm.weight,     torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.20.ffn_down.weight,      torch.uint8 --> F32, shape = {14336, 4096}
INFO:hf-to-gguf:blk.20.ffn_gate.weight,      torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.20.ffn_up.weight,        torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.20.ffn_norm.weight,      torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.20.attn_k.weight,        torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.20.attn_output.weight,   torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.20.attn_q.weight,        torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.20.attn_v.weight,        torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.21.attn_norm.weight,     torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.21.ffn_down.weight,      torch.uint8 --> F32, shape = {14336, 4096}
INFO:hf-to-gguf:blk.21.ffn_gate.weight,      torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.21.ffn_up.weight,        torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.21.ffn_norm.weight,      torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.21.attn_k.weight,        torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.21.attn_output.weight,   torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.21.attn_q.weight,        torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.21.attn_v.weight,        torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.22.attn_norm.weight,     torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.22.ffn_down.weight,      torch.uint8 --> F32, shape = {14336, 4096}
INFO:hf-to-gguf:blk.22.ffn_gate.weight,      torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.22.ffn_up.weight,        torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.22.ffn_norm.weight,      torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.22.attn_k.weight,        torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.22.attn_output.weight,   torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.22.attn_q.weight,        torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.22.attn_v.weight,        torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.23.attn_norm.weight,     torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.23.ffn_down.weight,      torch.uint8 --> F32, shape = {14336, 4096}
INFO:hf-to-gguf:blk.23.ffn_gate.weight,      torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.23.ffn_up.weight,        torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.23.ffn_norm.weight,      torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.23.attn_k.weight,        torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.23.attn_output.weight,   torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.23.attn_q.weight,        torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.23.attn_v.weight,        torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.24.attn_norm.weight,     torch.bfloat16 --> F32, shape = {4096}
Dead-Bytes commented 1 month ago

I faced the same issue. Check your memory consumption: your RAM was exhausted.

alexeyvolkoff commented 1 month ago

Same here:

ERROR:root:Error occurred while running command: Command '['/usr/bin/python3', 'utils/convert-hf-to-gguf-bitnet.py', 'models/Llama3-8B-1.58-100B-tokens', '--outtype', 'f32']' died with <Signals.SIGKILL: 9>., check details in logs/convert_to_f32_gguf.log

Log (cat logs/convert_to_f32_gguf.log):

INFO:hf-to-gguf:Loading model: Llama3-8B-1.58-100B-tokens
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:gguf: context length = 8192
INFO:hf-to-gguf:gguf: embedding length = 4096
INFO:hf-to-gguf:gguf: feed forward length = 14336
INFO:hf-to-gguf:gguf: head count = 32
INFO:hf-to-gguf:gguf: key-value head count = 8
INFO:hf-to-gguf:gguf: rope theta = 500000.0
INFO:hf-to-gguf:gguf: rms norm epsilon = 1e-05
INFO:hf-to-gguf:gguf: file type = 0
INFO:hf-to-gguf:Set model tokenizer
INFO:gguf.vocab:Adding 280147 merge(s).
INFO:gguf.vocab:Setting special token type bos to 128000
INFO:gguf.vocab:Setting special token type eos to 128009
INFO:gguf.vocab:Setting chat_template to {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>

'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>

' }}{% endif %}
INFO:hf-to-gguf:Exporting model to 'models/Llama3-8B-1.58-100B-tokens/ggml-model-f32.gguf'
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
INFO:hf-to-gguf:output.weight, torch.bfloat16 --> F32, shape = {4096, 128256}
INFO:hf-to-gguf:token_embd.weight, torch.bfloat16 --> F32, shape = {4096, 128256}
INFO:hf-to-gguf:blk.0.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.0.ffn_down.weight, torch.uint8 --> F32, shape = {14336, 4096}
INFO:hf-to-gguf:blk.0.ffn_gate.weight, torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.0.ffn_up.weight, torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.0.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.0.attn_k.weight, torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.0.attn_output.weight, torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.0.attn_q.weight, torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.0.attn_v.weight, torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.1.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.1.ffn_down.weight, torch.uint8 --> F32, shape = {14336, 4096}
INFO:hf-to-gguf:blk.1.ffn_gate.weight, torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.1.ffn_up.weight, torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.1.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.1.attn_k.weight, torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.1.attn_output.weight, torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.1.attn_q.weight, torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.1.attn_v.weight, torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.10.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.10.ffn_down.weight, torch.uint8 --> F32, shape = {14336, 4096}
INFO:hf-to-gguf:blk.10.ffn_gate.weight, torch.uint8 --> F32, shape = {4096, 14336}

bwv988 commented 1 month ago

I tried increasing the memory allocated to WSL by changing the .wslconfig file:

[wsl2]
memory = 16GB

But it's still insufficient.

free shows the following:

 total        used        free      shared  buff/cache   available
Mem:        16019620      295044    15855528          68       64684    15724576
Swap:        4194304           0     4194304

I am going to see whether more swap space does the trick, or else convert the model on another box.
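
For what it's worth, the same .wslconfig file can also set the VM's swap size, so swap can be grown alongside the memory limit. A minimal sketch, assuming a layout like the one above (the 24GB figure is only an example, not a value from this thread); WSL needs to be restarted with wsl --shutdown before the new limits apply:

[wsl2]
memory=16GB
# example value only; size it to whatever the 8B conversion actually needs
swap=24GB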

avcode-exe commented 1 month ago

With my default memory settings (16GB + 4GB), at most I can run the 3B version, not 8B.

Dead-Bytes commented 1 month ago

You can run the 3B? Can you share more details?

avcode-exe commented 1 month ago

Yeah, I can run the 3B version with 16GB RAM + 4GB swap on WSL2 with a 13th Gen Intel(R) Core(TM) i7-13700H.

Dead-Bytes commented 1 month ago

Also, did you get any error like `FileNotFoundError: [Errno 2] No such file or directory: './build/bin/llama-quantize'`?

avcode-exe commented 1 month ago

No, everything ran smoothly except the 8B model, and the fact that the model keeps repeating itself. I am trying 20GB RAM + 20GB swap on WSL2.

avcode-exe commented 1 month ago

I got the 8B version working with 20GB RAM + 20GB of swap. Max memory usage when converting:

I got around 6 tokens/second when running. Just increase the memory available to the system.
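
For readers who would rather not edit .wslconfig (or who are on bare-metal Linux), a temporary swap file is another way to get the conversion through. A rough sketch only, assuming around 20 GB of free disk space; the path and size are placeholders:

# create and enable a 20 GB swap file (placeholder size and path)
sudo fallocate -l 20G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# confirm the new swap shows up before re-running setup_env.py
free -h

Afterwards the swap file can be removed again with sudo swapoff /swapfile and deleting the file.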

Dead-Bytes commented 1 month ago

Is it hallucinating?

alexeyvolkoff commented 1 month ago

I've downloaded the pre-quantized model (GGUF format):

wget https://huggingface.co/brunopio/Llama3-8B-1.58-100B-tokens-GGUF/resolve/main/Llama3-8B-1.58-100B-tokens-TQ2_0.gguf

and got it running:

python3 run_inference.py -m Llama3-8B-1.58-100B-tokens-TQ2_0.gguf -p "Explain to me the second Newton's law" -n 10 -t 8 -temp 0.8

Result stats:

llama_perf_sampler_print:    sampling time =     378.61 ms /   210 runs   (    1.80 ms per token,   554.67 tokens per second)
llama_perf_context_print:        load time =    3251.41 ms
llama_perf_context_print: prompt eval time =   22899.40 ms /    10 tokens ( 2289.94 ms per token,     0.44 tokens per second)
llama_perf_context_print:        eval time =  544179.25 ms /   199 runs   ( 2734.57 ms per token,     0.37 tokens per second)
llama_perf_context_print:       total time =  567770.70 ms /   209 tokens

@avcode-exe which one of these is your 6 tokens per second?

avcode-exe commented 1 month ago

Is it hallucinating?

Yeah.

avcode-exe commented 1 month ago

@avcode-exe which one of these is your 6 tokens per second?

llama_perf_context_print (overall)

bmerkle commented 1 month ago

I have BitNet running on native Windows 10 and via WSL. There seems to be a performance problem: WSL is an order of magnitude slower.

Sample:

python run_inference.py -m Llama3-8B-1.58-100B-tokens-TQ2_0.gguf -p "Explain to me the second Newton's law" -n 20 -t 8 -temp 0.8

Windows 10:

llama_perf_sampler_print:    sampling time =       1.87 ms /    30 runs   (    0.06 ms per token, 16068.56 tokens per second)
llama_perf_context_print:        load time =    1246.32 ms
llama_perf_context_print: prompt eval time =     482.70 ms /    10 tokens (   48.27 ms per token,    20.72 tokens per second)
llama_perf_context_print:        eval time =     922.53 ms /    19 runs   (   48.55 ms per token,    20.60 tokens per second)
llama_perf_context_print:       total time =    1417.05 ms /    29 tokens

WSL 2.0:

llama_perf_sampler_print:    sampling time =      16.02 ms /    30 runs   (    0.53 ms per token,  1872.66 tokens per second)
llama_perf_context_print:        load time =    1970.14 ms
llama_perf_context_print: prompt eval time =   10212.70 ms /    10 tokens ( 1021.27 ms per token,     0.98 tokens per second)
llama_perf_context_print:        eval time =   22100.21 ms /    19 runs   ( 1163.17 ms per token,     0.86 tokens per second)
llama_perf_context_print:       total time =   32358.76 ms /    29 tokens

Aavtic commented 1 month ago

Thanks for this solution (the pre-quantized TQ2_0 GGUF). I tried it on my system too. I am on bare-metal Linux and don't have great specs (8 GB RAM), nothing more. Here is my result:

sampler seed: 124082454
sampler params: 
    repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist 
generate: n_ctx = 2048, n_batch = 1, n_predict = 10, n_keep = 1

what is 1 + 1, you know it's 2, you know

llama_perf_sampler_print:    sampling time =       1.26 ms /    18 runs   (    0.07 ms per token, 14285.71 tokens per second)
llama_perf_context_print:        load time =   16325.24 ms
llama_perf_context_print: prompt eval time =  139224.74 ms /     8 tokens (17403.09 ms per token,     0.06 tokens per second)
llama_perf_context_print:        eval time =  158417.50 ms /     9 runs   (17601.94 ms per token,     0.06 tokens per second)
llama_perf_context_print:       total time =  297646.82 ms /    17 tokens

LOL

dawnmsg commented 1 month ago

Model conversion (fp16 -> bitnet, in convert-hf-to-gguf-bitnet.py) requires substantial memory. We recommend performing the conversion on a machine with a large memory capacity. Inference, however, can be conducted on a device with much less memory.
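
A sketch of that split workflow, assuming a second, large-memory machine reachable over SSH; the host name, paths, and quantized file name below are guesses based on the logs in this thread, not something confirmed by the maintainers:

# on the large-memory machine: download, convert and quantize as usual
python setup_env.py --hf-repo HF1BitLLM/Llama3-8B-1.58-100B-tokens -q i2_s

# copy only the quantized GGUF to the low-memory box (file name is a guess)
scp bigbox:BitNet/models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf models/

# on the low-memory box: inference only, no conversion step needed
python run_inference.py -m models/ggml-model-i2_s.gguf -p "Explain to me the second Newton's law" -n 20 -t 8 -temp 0.8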

avcode-exe commented 1 month ago

Thanks for the clarification!

AgungPambudi commented 4 weeks ago

You can run the 3B? Can you share more details?

You can check the solution I found at https://github.com/microsoft/BitNet/issues/77#issuecomment-2436022313.

wamiqraza commented 3 weeks ago

Running the 3B is not the solution, @AgungPambudi. The issue still exists when running Llama3-8B-1.58-100B-tokens-TQ2_0.gguf.

I found that the approach in @Aavtic's comment is a possible solution, but there is still an issue: it gives half an answer and then crashes. Attached is a screenshot; the model also hallucinates.

So, @everyone, what are the possible actions to take, and what is the minimum CPU or RAM specification?