avcode-exe closed this issue 1 month ago.
I faced the same issue. Check your memory consumption; your RAM is being exhausted.
Same here.
ERROR:root:Error occurred while running command: Command '['/usr/bin/python3', 'utils/convert-hf-to-gguf-bitnet.py', 'models/Llama3-8B-1.58-100B-tokens', '--outtype', 'f32']' died with <Signals.SIGKILL: 9>., check details in logs/convert_to_f32_gguf.log
Log: cat logs/convert_to_f32_gguf.log
INFO:hf-to-gguf:Loading model: Llama3-8B-1.58-100B-tokens
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:gguf: context length = 8192
INFO:hf-to-gguf:gguf: embedding length = 4096
INFO:hf-to-gguf:gguf: feed forward length = 14336
INFO:hf-to-gguf:gguf: head count = 32
INFO:hf-to-gguf:gguf: key-value head count = 8
INFO:hf-to-gguf:gguf: rope theta = 500000.0
INFO:hf-to-gguf:gguf: rms norm epsilon = 1e-05
INFO:hf-to-gguf:gguf: file type = 0
INFO:hf-to-gguf:Set model tokenizer
INFO:gguf.vocab:Adding 280147 merge(s).
INFO:gguf.vocab:Setting special token type bos to 128000
INFO:gguf.vocab:Setting special token type eos to 128009
INFO:gguf.vocab:Setting chat_template to {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>
'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>
' }}{% endif %}
INFO:hf-to-gguf:Exporting model to 'models/Llama3-8B-1.58-100B-tokens/ggml-model-f32.gguf'
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
INFO:hf-to-gguf:output.weight, torch.bfloat16 --> F32, shape = {4096, 128256}
INFO:hf-to-gguf:token_embd.weight, torch.bfloat16 --> F32, shape = {4096, 128256}
INFO:hf-to-gguf:blk.0.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.0.ffn_down.weight, torch.uint8 --> F32, shape = {14336, 4096}
INFO:hf-to-gguf:blk.0.ffn_gate.weight, torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.0.ffn_up.weight, torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.0.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.0.attn_k.weight, torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.0.attn_output.weight, torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.0.attn_q.weight, torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.0.attn_v.weight, torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.1.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.1.ffn_down.weight, torch.uint8 --> F32, shape = {14336, 4096}
INFO:hf-to-gguf:blk.1.ffn_gate.weight, torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.1.ffn_up.weight, torch.uint8 --> F32, shape = {4096, 14336}
INFO:hf-to-gguf:blk.1.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.1.attn_k.weight, torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.1.attn_output.weight, torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.1.attn_q.weight, torch.uint8 --> F32, shape = {4096, 4096}
INFO:hf-to-gguf:blk.1.attn_v.weight, torch.uint8 --> F32, shape = {4096, 1024}
INFO:hf-to-gguf:blk.10.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.10.ffn_down.weight, torch.uint8 --> F32, shape = {14336, 4096}
INFO:hf-to-gguf:blk.10.ffn_gate.weight, torch.uint8 --> F32, shape = {4096, 14336}
I tried increasing the memory allocated to WSL by changing the .wslconfig file:
[wsl2]
memory = 16GB
But it's still insufficient.
free shows the following:
total used free shared buff/cache available
Mem: 16019620 295044 15855528 68 64684 15724576
Swap: 4194304 0 4194304
Going to see if more swap space does the trick (see the .wslconfig sketch below), or try converting the model on another box.
With my default memory settings (16GB + 4GB), at most I can run the 3B version, not 8B.
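For reference, the WSL2 swap size can be raised in the same .wslconfig file. A minimal sketch, assuming the 20 GB RAM + 20 GB swap figures reported to work later in this thread (run wsl --shutdown afterwards so the new limits take effect):

[wsl2]
# RAM handed to the WSL2 VM
memory = 20GB
# size of the VM's swap file (0 disables swap)
swap = 20GB

After restarting WSL, free -h inside the distro should show the new totals.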
You can run 3B? Can you share more details?
Yeah, I can run the 3B version on 16 GB RAM + 4 GB swap on WSL2 with a 13th Gen Intel(R) Core(TM) i7-13700H.
Also, did you get any error like FileNotFoundError: [Errno 2] No such file or directory: './build/bin/llama-quantize'?
No, everything ran smoothly except the 8B model, and the fact that the model keeps repeating itself. I am trying 20 GB RAM + 20 GB swap on WSL2.
I got the 8B version working on 20 GB RAM + 20 GB of swap. Max memory usage when converting:
I got around 6 tokens/second when running. Just increase the memory available to the system.
Is it hallucinating?
I've downloaded the pre-quantized model (GGUF format):
wget https://huggingface.co/brunopio/Llama3-8B-1.58-100B-tokens-GGUF/resolve/main/Llama3-8B-1.58-100B-tokens-TQ2_0.gguf
and got it running:
python3 run_inference.py -m Llama3-8B-1.58-100B-tokens-TQ2_0.gguf -p "Explain to me the second Newton's law" -n 10 -t 8 -temp 0.8
Result stats:
llama_perf_sampler_print: sampling time = 378.61 ms / 210 runs ( 1.80 ms per token, 554.67 tokens per second)
llama_perf_context_print: load time = 3251.41 ms
llama_perf_context_print: prompt eval time = 22899.40 ms / 10 tokens ( 2289.94 ms per token, 0.44 tokens per second)
llama_perf_context_print: eval time = 544179.25 ms / 199 runs ( 2734.57 ms per token, 0.37 tokens per second)
llama_perf_context_print: total time = 567770.70 ms / 209 tokens
@avcode-exe which one of these is your 6 tokens per second?
Is it hallucinating?
Yeah.
@avcode-exe which one of these is your 6 tokens per second?
llama_perf_context_print (overall)
I have BitNet running on native Windows 10 and via WSL. There seems to be a performance problem: WSL is an order of magnitude slower.
Sample: python run_inference.py -m Llama3-8B-1.58-100B-tokens-TQ2_0.gguf -p "Explain to me the second Newton's law" -n 20 -t 8 -temp 0.8
Windows 10:
llama_perf_sampler_print: sampling time = 1.87 ms / 30 runs ( 0.06 ms per token, 16068.56 tokens per second)
llama_perf_context_print: load time = 1246.32 ms
llama_perf_context_print: prompt eval time = 482.70 ms / 10 tokens ( 48.27 ms per token, 20.72 tokens per second)
llama_perf_context_print: eval time = 922.53 ms / 19 runs ( 48.55 ms per token, 20.60 tokens per second)
llama_perf_context_print: total time = 1417.05 ms / 29 tokens
WSL 2.0:
llama_perf_sampler_print: sampling time = 16.02 ms / 30 runs ( 0.53 ms per token, 1872.66 tokens per second)
llama_perf_context_print: load time = 1970.14 ms
llama_perf_context_print: prompt eval time = 10212.70 ms / 10 tokens ( 1021.27 ms per token, 0.98 tokens per second)
llama_perf_context_print: eval time = 22100.21 ms / 19 runs ( 1163.17 ms per token, 0.86 tokens per second)
llama_perf_context_print: total time = 32358.76 ms / 29 tokens
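One factor worth ruling out (a guess on this end, not something confirmed in the thread): if the .gguf file lives on the Windows drive under /mnt/c, WSL reads it through the 9P file share, which is much slower than WSL's native ext4 filesystem. A quick test with a placeholder path:

# copy the model onto the WSL filesystem and run it from there
mkdir -p ~/models
cp /mnt/c/Users/<you>/Downloads/Llama3-8B-1.58-100B-tokens-TQ2_0.gguf ~/models/
python run_inference.py -m ~/models/Llama3-8B-1.58-100B-tokens-TQ2_0.gguf -p "Explain to me the second Newton's law" -n 20 -t 8 -temp 0.8

If the numbers barely move, the gap lies elsewhere.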
I've downloaded the pre-quantized model (GGUF format) and got it running: wget https://huggingface.co/brunopio/Llama3-8B-1.58-100B-tokens-GGUF/resolve/main/Llama3-8B-1.58-100B-tokens-TQ2_0.gguf
Thanks for this solution. I tried it on my system too. I am using bare-metal Linux without great specs (8 GB RAM), nothing more. Here is my result:
sampler seed: 124082454
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist
generate: n_ctx = 2048, n_batch = 1, n_predict = 10, n_keep = 1
what is 1 + 1, you know it's 2, you know
llama_perf_sampler_print: sampling time = 1.26 ms / 18 runs ( 0.07 ms per token, 14285.71 tokens per second)
llama_perf_context_print: load time = 16325.24 ms
llama_perf_context_print: prompt eval time = 139224.74 ms / 8 tokens (17403.09 ms per token, 0.06 tokens per second)
llama_perf_context_print: eval time = 158417.50 ms / 9 runs (17601.94 ms per token, 0.06 tokens per second)
llama_perf_context_print: total time = 297646.82 ms / 17 tokens
LOL
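For the SIGKILL during conversion reported at the top of this thread, the usual workaround on a bare-metal Linux box with little RAM is a temporary swap file. A generic sketch (size and path are arbitrary, not a tested recommendation for BitNet):

# create and enable a 16 GB swap file
sudo fallocate -l 16G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
free -h   # the extra swap should now be visible

Conversion will still be slow while it swaps, but it should no longer be killed by the OOM killer.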
Model conversion (fp16 -> bitnet in convert-hf-to-gguf-bitnet.py) requires substantial memory. We recommend performing the conversion on a machine with a large memory capacity. Inference, however, can be conducted on a device with lower memory capacity.
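A minimal sketch of that split, reusing only the commands already shown in this thread (the host name and paths are placeholders, and the quantize step that follows the f32 conversion is only hinted at, not spelled out):

# on a high-memory machine: the memory-hungry f32 conversion
python3 utils/convert-hf-to-gguf-bitnet.py models/Llama3-8B-1.58-100B-tokens --outtype f32
# ...then the usual quantize step (./build/bin/llama-quantize) to produce the small model file

# copy the final quantized .gguf to the low-memory device
scp models/Llama3-8B-1.58-100B-tokens/<quantized-model>.gguf user@small-box:~/models/

# on the low-memory device: inference only
python3 run_inference.py -m ~/models/<quantized-model>.gguf -p "Explain to me the second Newton's law" -n 10 -t 8 -temp 0.8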
Thanks for the clarification!
You can run 3B? Can you share more details?
You can check the solution I found at https://github.com/microsoft/BitNet/issues/77#issuecomment-2436022313
Running the 3B is not the solution, @AgungPambudi. The issue still exists when running Llama3-8B-1.58-100B-tokens-TQ2_0.gguf.
I found that the comments made by @Aavtic are a possible solution, but there is still an issue: it gives half an answer and then crashes. A screenshot is attached; the model also hallucinates.
So, @everyone, what are the possible actions to take, and what is the minimum CPU or RAM specification?
Hi guys!
I got an error when running the conversion command; the environment and hardware details were in the original post, and the log is in logs/convert_to_f32_gguf.log.