Closed ChuXNobody closed 11 months ago
(Disclaimer: I used google translate)
@VatsaDev I think vLLM allocates the whole GPU RAM by default to facilitate fast inference (It even occupies 40GB memory on my A40). It might not indicate the model needs that much memory. You should verify on your end.
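A rough arithmetic sketch of why the 40GB figure says little about the model itself. As far as I know, vLLM's `gpu_memory_utilization` engine argument (default 0.9) sets the fraction of GPU memory the engine claims up front, and most of that goes to KV cache rather than weights. The helper name and the 45 GiB usable-memory figure for an A40 are assumptions for illustration, not measurements.

```python
def vllm_memory_budget(total_gib, n_params, bytes_per_param=2, gpu_memory_utilization=0.9):
    """Split the up-front reservation into weights and the remainder (mostly KV cache).

    Assumes fp16 weights (2 bytes per parameter) unless told otherwise.
    """
    reserved = total_gib * gpu_memory_utilization
    weights = n_params * bytes_per_param / 2**30  # bytes -> GiB
    return reserved, weights, reserved - weights


# Hypothetical numbers: ~45 GiB usable on an A40, ~1.1B parameters.
reserved, weights, remainder = vllm_memory_budget(total_gib=45.0, n_params=1.1e9)
print(f"reserved {reserved:.1f} GiB, weights {weights:.1f} GiB, remainder {remainder:.1f} GiB")
```

If that reading is right, passing a smaller value, e.g. `LLM(model=..., gpu_memory_utilization=0.3)`, should shrink the reservation without changing what the weights need.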
yeah vLLM does appear to just take up all the ram.
I tried the Full 32bit GGUF model by @Green-Sky and it took 1GB of cpu ram, but garbage response
testing it on llama-cpp-python appears to not work
@jzhang38 any way to run without vLLM?
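@VatsaDev I think vLLM allocates the whole GPU RAM by default to facilitate fast inference (It even occupies 40GB memory on my A40). It might not indicate the model needs that much memory. You should verify on your end.
How does it perform after quantization? I want to deploy this model, but I don't have a GPU good enough for fine-tuning. Is there a good strategy for deployment? As for the dataset format, the documentation has no clear example. Could you give a clear example and a few training samples?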
@ChuXNobody the model isn't fully finished yet, so there is no finetune script. I've raised an issue for adding data at #22. Quantized models are found at https://huggingface.co/Green-Sky/TinyLlama-1.1B-step-50K-105b-GGUF/ and can be run with llama.cpp
I just uploaded them to demonstrate and to make testing easier. But as @VatsaDev says, the 105B tokens checkpoint is pretty much unusable quality-wise. :smile:
Well, the model's output is comprehensible on GPU. @Green-Sky How do we run the model on GPU without vLLM?
Ok, after using the code on huggingface with chat v1, can confirm it runs on a GPU using only ~3GB RAM. Really optimized in comparison to nanoGPT taking 4-5GB for a 345M model
I have uploaded a correctly converted GGUF weight to https://huggingface.co/PY007/TinyLlama-1.1B-Chat-v0.2-GGUF.
The 4-bit quantized weight takes around 600MB to 700MB RAM during inference using llama.cpp
Finetuning using QLora consumes less than 4GB RAM. I will publish a script for low RAM full fine-tuning soon.
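As a sanity check on the 600MB to 700MB figure, here is a back-of-the-envelope sketch of weight storage alone. It assumes roughly 1.1B parameters and the ggml `q4_0` layout (blocks of 32 weights stored as one fp16 scale plus 32 four-bit values, 18 bytes per block). Real GGUF files may keep some tensors at higher precision, and inference adds KV cache and scratch buffers on top, so treat this as an estimate only.

```python
N_PARAMS = 1.1e9  # approximate TinyLlama parameter count (assumption)

# Bytes per weight for each storage format.
# q4_0: 2-byte fp16 scale + 16 bytes of packed 4-bit values per 32 weights.
BYTES_PER_WEIGHT = {
    "fp32": 4.0,
    "fp16": 2.0,
    "q4_0": 18 / 32,
}


def weight_megabytes(fmt, n_params=N_PARAMS):
    """Weight storage only, in MB (10**6 bytes); excludes KV cache and buffers."""
    return n_params * BYTES_PER_WEIGHT[fmt] / 1e6


for fmt in BYTES_PER_WEIGHT:
    print(f"{fmt}: {weight_megabytes(fmt):.0f} MB")
```

The `q4_0` estimate lands a little above 600 MB, which is in line with the range reported above.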
The converted model weights appear to speak gibberish Spanish, even when prompted correctly
ex: when asked "whats a joke?" response: Ci sono molte risposte in cui puoi provare:\n- "La mia terra lo sapevo, ma ora s\'è arrivata la strada. Sono un angelino che mi ha vista durante la mia gara di tiro. E \'rai? E \'rai per sempre?
I got a gguf model working nicely: https://huggingface.co/Trelis/TinyLlama-1.1B-Chat-v0.1-GGUF
I find the v0.1 chat with Guanaco-style prompting a bit easier to set up than the ChatML-style prompting with v0.2. It would be nice to have some models in the original Llama prompt style too. I'll maybe see about helping with that if someone else doesn't do it. I like the LoRA fine-tune approach. Probably best to avoid QLoRA just to keep the perplexity down if it's a chat model that others will later use/quantize.
Well done to the team on TinyLlama, the inference I got at the 250-500k checkpoint was pretty nice. I'm excited to see what happens with more training.
Same, the original format seemed better. I've used a format similar to ChatML, but it's probably better to just use the Guanaco style, as it's just as effective
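For anyone unsure what the Guanaco style looks like, a minimal single-turn sketch follows. The `### Human:` / `### Assistant:` markers are the usual Guanaco convention; the function name and the newline between the two markers are my assumptions, so check the model card for the exact whitespace the v0.1 chat model was trained with.

```python
def guanaco_prompt(user_message):
    """Build a single-turn Guanaco-style prompt.

    The newline separator before '### Assistant:' is an assumption;
    the model card is the authority on exact formatting.
    """
    return f"### Human: {user_message}\n### Assistant:"


print(guanaco_prompt("What is a joke?"))
```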
@VatsaDev This is what I got with my gguf 4 bit checkpoint:
./main -m /Users/peiyuanzhang/Downloads/llama.cpp/models/TinyLlama-1.1B-Chat-v0.2-GGUF/ggml-model-q4_0.gguf \
-e -ngl 1 -t 4 -n 256 -c 4096 -s 8 --top_k 150 \
-p "<|im_start|>user\nWhat is a joke?<|im_end|>\n<|im_start|>assistant\n"
system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 1, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 4096, n_batch = 512, n_predict = 256, n_keep = 0
<|im_start|>user
What is a joke?<|im_end|>
<|im_start|>assistant
It depends on what you mean by joke. If you are talking about a funny situation, situation that makes people laugh, or something that is just ridiculous in general, then I can tell you a joke.<|im_end|>
[end of text]
llama_print_timings: load time = 138.80 ms
llama_print_timings: sample time = 39.64 ms / 54 runs ( 0.73 ms per token, 1362.12 tokens per second)
llama_print_timings: prompt eval time = 27.77 ms / 34 tokens ( 0.82 ms per token, 1224.43 tokens per second)
llama_print_timings: eval time = 299.32 ms / 53 runs ( 5.65 ms per token, 177.07 tokens per second)
llama_print_timings: total time = 407.88 ms
ggml_metal_free: deallocating
Log end
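To compare runs like the one above, the throughput can be recomputed from the `llama_print_timings` lines. This is a small sketch whose regex is written against the log format shown in this thread only; the names `TIMING_RE` and `tokens_per_second` are made up for illustration. Recomputed values can differ slightly from the printed ones because the milliseconds in the log are already rounded.

```python
import re

# Matches e.g. "llama_print_timings: eval time = 299.32 ms / 53 runs ( ... )".
TIMING_RE = re.compile(
    r"llama_print_timings:\s+(?P<name>[a-z ]+?) time\s*=\s*(?P<ms>[\d.]+) ms"
    r"(?:\s*/\s*(?P<count>\d+) (?:runs|tokens))?"
)


def tokens_per_second(line):
    """Return count / seconds for a timing line, or None if it has no count."""
    m = TIMING_RE.search(line)
    if m is None or m.group("count") is None:
        return None
    return int(m.group("count")) / (float(m.group("ms")) / 1000.0)


line = ("llama_print_timings: eval time = 299.32 ms / 53 runs "
        "( 5.65 ms per token, 177.07 tokens per second)")
print(f"{tokens_per_second(line):.2f} tokens/s")
```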
@jzhang38 your parameters are kinda whack.
"max_position_embeddings": 2048
-2, which limits the number of predictions until the context fills up :)
@jzhang38 I see. I'm on 32-bit, and the ChatML prompt format seems unusually important? I was just missing the newline tokens from the ChatML format, and adding the newlines fixed everything.
With newlines
<|im_start|>user\nExplain huggingface.<|im_end|><|im_start|>assistant\n HuggingFace is an open-source project that aims to make the process of building own machine learning models and other software more efficient. It was created by a group of researchers at OpenAI, including CEO Sam Altman
Without newlines
<|im_start|>user Explain huggingface.<|im_end|><|im_start|>assistant \nQuoi ?<|im_end|>\n<|im_start|>assistant\nPourquoi ?<|im_end|>\n
https://colab.research.google.com/drive/1G2K8ABCS8Khkc6rY0RTvxp4xp878i1IT?usp=sharing
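Since the newlines turned out to matter, here is a small sketch of a ChatML prompt builder that always emits them. The function name and the `(role, content)` tuple shape are my own choices for illustration; the output for a single user turn matches the `-p` string used in the llama.cpp command earlier in the thread (after escape processing).

```python
def chatml_prompt(messages):
    """Render (role, content) pairs as ChatML and open the assistant turn.

    Each turn is '<|im_start|>role\\ncontent<|im_end|>\\n'; the newline after
    the role and after '<|im_end|>' is what was missing in the failing case.
    """
    parts = [f"<|im_start|>{role}\n{content}<|im_end|>\n" for role, content in messages]
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = chatml_prompt([("user", "What is a joke?")])
print(repr(prompt))
```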
This is what I got with my gguf 4 bit checkpoint:
./main -m /Users/peiyuanzhang/Downloads/llama.cpp/models/TinyLlama-1.1B-Chat-v0.2-GGUF/ggml-model-q4_0.gguf \
-e -ngl 1 -t 4 -n 256 -c 4096 -s 8 --top_k 150 \
-p "<|im_start|>user\nWhat is a joke?<|im_end|>\n<|im_start|>assistant\n"
system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 1, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 4096, n_batch = 512, n_predict = 256, n_keep = 0
<|im_start|>user
What is a joke?<|im_end|>
<|im_start|>assistant
It depends on what you mean by joke. If you are talking about a funny situation, situation that makes people laugh, or something that is just ridiculous in general, then I can tell you a joke.<|im_end|>
[end of text]
llama_print_timings: load time = 138.80 ms
llama_print_timings: sample time = 39.64 ms / 54 runs ( 0.73 ms per token, 1362.12 tokens per second)
llama_print_timings: prompt eval time = 27.77 ms / 34 tokens ( 0.82 ms per token, 1224.43 tokens per second)
llama_print_timings: eval time = 299.32 ms / 53 runs ( 5.65 ms per token, 177.07 tokens per second)
llama_print_timings: total time = 407.88 ms
ggml_metal_free: deallocating
Log end
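For a 4-bit model, reaching 1.5 tokens/s is already a decent result. Regarding the fine-tuning parameters: in your parameters I saw a context length beyond 4096. Could you give details on that parameter, and is there a context-memory feature? Has the training changed? If so, please update the repository so fine-tuning can be reproduced. If the 1.1B Llama performs really well, practicality is the priority: logical chat, which languages it supports, and whether we can steer these with prompts.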
@ChuXNobody did you try with and without blas? (did you use openblas or some other provider?)
What is the minimum GPU requirement? Can it run CPU inference? Can the model be fine-tuned?