jzhang38 / TinyLlama

The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.
Apache License 2.0

The 4-bit-quantized TinyLlama-1.1B's weights only take up 550 MB of RAM? #114

Closed: TapendraBaduwal closed this issue 9 months ago

TapendraBaduwal commented 9 months ago

After fine-tuning the model, I obtained a 2.2 GB PyTorch model.bin file. Is it possible to reduce this model size to 550 MB, and if so, how and when can we achieve this?

RonanKMcGovern commented 9 months ago

Did you fine-tune a quantized model or the original model?

You probably fine-tuned the original model, so now you need to quantize it.

Go to the llama.cpp repo and find the quantize folder. There are YouTube videos if you need help.
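For reference, a minimal sketch of that flow, driven from Python via subprocess; the llama.cpp checkout path, checkpoint directory, and output file names are assumptions to adapt:

```python
# Hypothetical paths: a local clone of ggerganov/llama.cpp and a merged
# fp16 Hugging Face checkpoint directory.
import subprocess

LLAMA_CPP = "llama.cpp"
HF_MODEL = "my-tinyllama-checkpoint"

# 1. Convert the Hugging Face checkpoint to a GGUF file in fp16.
subprocess.run(
    ["python", f"{LLAMA_CPP}/convert.py", HF_MODEL,
     "--outtype", "f16", "--outfile", "tinyllama-f16.gguf"],
    check=True,
)

# 2. Quantize the fp16 GGUF down to 4-bit (q4_0); this is the step that
#    shrinks a ~2.2 GB fp16 model to roughly 600 MB.
subprocess.run(
    [f"{LLAMA_CPP}/quantize", "tinyllama-f16.gguf",
     "tinyllama-q4_0.gguf", "q4_0"],
    check=True,
)
```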

jzhang38 commented 9 months ago

It actually takes up around 600 MB on disk and around 700 MB during inference once activations are taken into account (https://huggingface.co/TinyLlama/TinyLlama-1.1B-python-v0.1/blob/main/ggml-model-q4_0.gguf). I will update the README.
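For anyone who wants to check that footprint themselves, a small sketch assuming the llama-cpp-python bindings are installed (any GGUF runner works):

```python
# pip install llama-cpp-python  (an assumption; any GGUF runner works)
from llama_cpp import Llama

# Loading the 4-bit GGUF plus a small context window is what lands around
# the ~700 MB inference footprint mentioned above.
llm = Llama(model_path="ggml-model-q4_0.gguf", n_ctx=512)
out = llm("def fibonacci(n):", max_tokens=64)
print(out["choices"][0]["text"])
```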

TapendraBaduwal commented 9 months ago

@jzhang38 After fine-tuning model_name = "TinyLlama/TinyLlama-1.1B-python-v0.1" on my own dataset with LoRA and SFTTrainer, I got a 2.05 GB model. Will this model take 600-700 MB, or how can we reduce the model size to 600-700 MB?

jzhang38 commented 9 months ago

@TapendraBaduwal You can check out llama.cpp.
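One detail worth noting: llama.cpp's converter expects a plain Hugging Face checkpoint, so a LoRA adapter trained with PEFT is typically merged into the base model first. A minimal sketch, with "lora-checkpoint" as a placeholder adapter path:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the fp16 base model, apply the saved LoRA adapter, then fold the
# adapter deltas back into the base weights.
base = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-python-v0.1", torch_dtype=torch.float16
)
model = PeftModel.from_pretrained(base, "lora-checkpoint")
merged = model.merge_and_unload()
merged.save_pretrained("merged-tinyllama")  # ready for llama.cpp conversion
```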

TapendraBaduwal commented 9 months ago

@jzhang38 Thank you. Also, what is the best practice for continuing training after loading from a LoRA checkpoint? I want to load a LoRA adapter checkpoint and continue training it on a new dataset.

jzhang38 commented 9 months ago

@TapendraBaduwal I recommend you check out https://github.com/OpenAccess-AI-Collective/axolotl
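For reference, continuing from a LoRA checkpoint can also be sketched directly with PEFT and TRL's SFTTrainer; the dataset file and checkpoint paths below are placeholders:

```python
from datasets import load_dataset
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer

base_id = "TinyLlama/TinyLlama-1.1B-python-v0.1"
base = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# is_trainable=True keeps the loaded adapter weights updatable instead of
# freezing them for inference.
model = PeftModel.from_pretrained(base, "lora-checkpoint", is_trainable=True)

new_data = load_dataset("json", data_files="new_dataset.jsonl", split="train")
trainer = SFTTrainer(
    model=model,
    train_dataset=new_data,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=2048,
)
trainer.train()
trainer.save_model("lora-checkpoint-continued")  # saves the updated adapter
```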