jzhang38 / TinyLlama

The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.
Apache License 2.0

I want to use this model #20

Closed ChuXNobody closed 11 months ago

ChuXNobody commented 1 year ago

What is the minimum GPU requirement? Can it run inference on CPU? Can the model be fine-tuned?

VatsaDev commented 1 year ago

(Disclaimer: I used Google Translate)

  1. From the Colab, the full model takes up a whole T4 GPU with vLLM; I can't say for any of the quantized models.
  2. To my understanding, GGUF models can run on CPU; try it with llama.cpp (see the sketch below).
  3. There is no direct finetune script right now, though you could try to fine-tune a checkpoint yourself.
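For point 2, a minimal llama-cpp-python sketch of CPU-only GGUF inference (the model file name is just a placeholder; point it at whichever quantized TinyLlama GGUF you downloaded):

    from llama_cpp import Llama

    # Runs entirely on CPU; n_threads controls how many cores llama.cpp uses.
    llm = Llama(model_path="tinyllama-1.1b.q4_0.gguf", n_ctx=2048, n_threads=4)
    out = llm("What is a joke?", max_tokens=64)
    print(out["choices"][0]["text"])

The same GGUF file also runs through the plain llama.cpp ./main CLI, as shown later in the thread.
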
jzhang38 commented 1 year ago

@VatsaDev I think vLLM allocates the whole GPU RAM by default to facilitate fast inference (It even occupies 40GB memory on my A40). It might not indicate the model needs that much memory. You should verify on your end.
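For what it's worth, vLLM exposes a gpu_memory_utilization knob that caps how much of the GPU it pre-allocates. A rough sketch, assuming a TinyLlama checkpoint on Hugging Face (the model id below is only an example):

    from vllm import LLM, SamplingParams

    # gpu_memory_utilization < 1.0 keeps vLLM from grabbing the whole GPU up front.
    llm = LLM(model="PY007/TinyLlama-1.1B-Chat-v0.2", gpu_memory_utilization=0.3)
    params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)
    print(llm.generate(["What is a joke?"], params)[0].outputs[0].text)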

VatsaDev commented 1 year ago

Yeah, vLLM does appear to just take up all the RAM. I tried the full 32-bit GGUF model by @Green-Sky and it took 1GB of CPU RAM, but gave garbage responses; testing it on llama-cpp-python appears not to work. @jzhang38, is there any way to run it without vLLM?

ChuXNobody commented 1 year ago

@VatsaDev I think vLLM allocates the whole GPU RAM by default to facilitate fast inference (It even occupies 40GB memory on my A40). It might not indicate the model needs that much memory. You should verify on your end.

How does it perform after quantization? I want to deploy this model, but I don't have a good enough GPU for fine-tuning. Is there a good strategy for deployment? As for the dataset format, the documentation doesn't give a clear example; could you provide one, along with a few training samples?

VatsaDev commented 1 year ago

@ChuXNobody the model isn't fully finished yet, so there is no finetune script. I've raised an issue about adding data at #22. Quantized models are available at https://huggingface.co/Green-Sky/TinyLlama-1.1B-step-50K-105b-GGUF/ and can be run with llama.cpp.

Green-Sky commented 1 year ago

I just uploaded them to demonstrate and to make testing easier, but as @VatsaDev says, the 105B-token checkpoint is pretty much unusable quality-wise. :smile:

VatsaDev commented 1 year ago

Well, the model is comprehensible on GPU. @Green-Sky, how do we run the model on GPU without vLLM?

VatsaDev commented 1 year ago

OK, after using the code on Hugging Face with chat v1, I can confirm it runs on a GPU using only ~3GB of RAM. Really optimized in comparison to nanoGPT, which takes 4-5GB for a 345M model.
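For anyone who wants to reproduce that, a plain transformers sketch of fp16 GPU inference (the model id is my guess at the chat v1 checkpoint; swap in whichever one you are testing):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "PY007/TinyLlama-1.1B-Chat-v0.1"  # assumed id; use the checkpoint you mean to test
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )

    inputs = tok("What is a joke?", return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64)
    print(tok.decode(out[0], skip_special_tokens=True))

In fp16 the 1.1B parameters alone are roughly 2.2GB, which is consistent with the ~3GB figure above.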

jzhang38 commented 1 year ago

I have uploaded a correctly converted GGUF weight to https://huggingface.co/PY007/TinyLlama-1.1B-Chat-v0.2-GGUF.

The 4-bit quantized weights take around 600MB to 700MB of RAM during inference with llama.cpp.

Fine-tuning with QLoRA consumes less than 4GB of RAM. I will publish a script for low-RAM full fine-tuning soon.
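Until that script lands, a rough QLoRA sketch with peft and bitsandbytes (this is not the official TinyLlama recipe; the model id and LoRA hyperparameters are placeholders):

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    model_id = "PY007/TinyLlama-1.1B-Chat-v0.2"  # placeholder; any TinyLlama checkpoint works the same way

    # Load the frozen base model in 4-bit, then attach small trainable LoRA adapters on top.
    bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
    model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb, device_map="auto")
    model = prepare_model_for_kbit_training(model)

    lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # only the adapter weights are updated during training

From here the model can be handed to a normal Trainer loop; the 4-bit base plus tiny adapters is what keeps the memory footprint in the few-GB range.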

VatsaDev commented 1 year ago

The converted model weights appear to output gibberish Spanish, even when prompted correctly.

e.g. when asked "what's a joke?", the response was: Ci sono molte risposte in cui puoi provare:\n- "La mia terra lo sapevo, ma ora s'è arrivata la strada. Sono un angelino che mi ha vista durante la mia gara di tiro. E 'rai? E 'rai per sempre?

RonanKMcGovern commented 1 year ago

I got a gguf model working nicely: https://huggingface.co/Trelis/TinyLlama-1.1B-Chat-v0.1-GGUF

I find the v0.1 chat with the guanaco style a bit easier to set up than the ChatML-style prompting with v0.2. It would be nice to have some models in the original Llama prompt style too; I'll maybe see about helping with that if someone else doesn't do it. I like the LoRA fine-tune approach. Probably best to avoid QLoRA, just to keep the perplexity down, if it's a chat model that others will later use/quantize.

Well done to the team on TinyLlama; the inference I got at the 250-500k checkpoint was pretty nice. I'm excited to see what happens with more training.

VatsaDev commented 1 year ago

Same, the original format seemed better. I've used a format similar to ChatML, but it's probably better to just use the guanaco style, as it's just as effective.
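For comparison, the two prompt styles under discussion, sketched as plain Python helpers (the guanaco template here is my paraphrase rather than an official spec; the ChatML one matches the prompt used further down the thread):

    def guanaco_prompt(user_msg: str) -> str:
        # Guanaco-style turns, roughly "### Human: ..." / "### Assistant:"
        return f"### Human: {user_msg}\n### Assistant:"

    def chatml_prompt(user_msg: str) -> str:
        # ChatML-style turns; note the explicit newlines (their importance comes up later in the thread).
        return f"<|im_start|>user\n{user_msg}<|im_end|>\n<|im_start|>assistant\n"

    print(chatml_prompt("What is a joke?"))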

jzhang38 commented 1 year ago

@VatsaDev This is what I got with my GGUF 4-bit checkpoint:

 ./main -m /Users/peiyuanzhang/Downloads/llama.cpp/models/TinyLlama-1.1B-Chat-v0.2-GGUF/ggml-model-q4_0.gguf  \
        -e -ngl 1 -t 4 -n 256 -c 4096 -s 8 --top_k 150 \
        -p "<|im_start|>user\nWhat is a joke?<|im_end|>\n<|im_start|>assistant\n"

system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 1, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 4096, n_batch = 512, n_predict = 256, n_keep = 0

 <|im_start|>user
What is a joke?<|im_end|>
<|im_start|>assistant
It depends on what you mean by joke. If you are talking about a funny situation, situation that makes people laugh, or something that is just ridiculous in general, then I can tell you a joke.<|im_end|>
 [end of text]

llama_print_timings:        load time =   138.80 ms
llama_print_timings:      sample time =    39.64 ms /    54 runs   (    0.73 ms per token,  1362.12 tokens per second)
llama_print_timings: prompt eval time =    27.77 ms /    34 tokens (    0.82 ms per token,  1224.43 tokens per second)
llama_print_timings:        eval time =   299.32 ms /    53 runs   (    5.65 ms per token,   177.07 tokens per second)
llama_print_timings:       total time =   407.88 ms
ggml_metal_free: deallocating
Log end
Green-Sky commented 1 year ago

@jzhang38 your parameters are kinda whack.

  1. Your context size is way past what TinyLlama was trained on (max_position_embeddings: 2048).
  2. The number of tokens to predict is way less than your context. I suggest -n -2, which limits the number of predictions until the context fills up :)
  3. Minor, but your top_k is kinda large; the extra, improbable possibilities are usually trash. (A sketch with these fixes applied follows below.)
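A rough llama-cpp-python equivalent with those corrections applied (file name and sampling values are placeholders):

    from llama_cpp import Llama

    # Keep the context at the 2048 tokens TinyLlama is trained with.
    llm = Llama(model_path="ggml-model-q4_0.gguf", n_ctx=2048, n_threads=4)

    prompt = "<|im_start|>user\nWhat is a joke?<|im_end|>\n<|im_start|>assistant\n"
    # Modest top_k; with the ./main CLI you would pass -n -2 to generate until the context fills.
    out = llm(prompt, max_tokens=256, top_k=40, temperature=0.8,
              repeat_penalty=1.1, stop=["<|im_end|>"])
    print(out["choices"][0]["text"])
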
VatsaDev commented 1 year ago

@jzhang38 I see. I'm on 32-bit, and the ChatML prompt format turns out to be unusually important: I was just missing the newline tokens from the ChatML format, and adding the newlines fixed everything.

With newlines

<|im_start|>user\nExplain huggingface.<|im_end|><|im_start|>assistant\n HuggingFace is an open-source project that aims to make the process of building own machine learning models and other software more efficient. It was created by a group of researchers at OpenAI, including CEO Sam Altman

Without newlines

<|im_start|>user Explain huggingface.<|im_end|><|im_start|>assistant \nQuoi ?<|im_end|>\n<|im_start|>assistant\nPourquoi ?<|im_end|>\n

https://colab.research.google.com/drive/1G2K8ABCS8Khkc6rY0RTvxp4xp878i1IT?usp=sharing

ChuXNobody commented 12 months ago

(quoting @jzhang38's gguf 4-bit checkpoint command and llama.cpp log from above)

Getting 1.5 tokens/s out of a 4-bit model is already a decent result. On fine-tuning, though: in your parameters I noticed a context length over 4096. Could you share details on that setting? Is there a context-memory feature, and does training change it? If so, please update the repo so fine-tuning can be reproduced. If the 1.1B Llama really performs this well, practicality is the priority: chat reasoning, which languages are supported, and whether we can steer these things with prompts.

Green-Sky commented 12 months ago

@ChuXNobody did you try with and without blas? (did you use openblas or some other provider?)