janhq / jan

Jan is an open source alternative to ChatGPT that runs 100% offline on your computer. Multiple engine support (llama.cpp, TensorRT-LLM)
https://jan.ai/
GNU Affero General Public License v3.0
22.52k stars 1.3k forks

feat: Jan can load large model with multiple gguf files #2898

Closed: hahuyhoang411 closed this issue 2 months ago

hahuyhoang411 commented 4 months ago

Problem: Jan only supports loading a single GGUF model file at a time.

Success Criteria: Help users merge split GGUF files into one and load the resulting model for them.

Additional context: approach described at https://www.reddit.com/r/LocalLLaMA/comments/1cf6n18/how_to_use_merge_70b_split_model_ggufpart1of2/
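The approach in the linked thread boils down to joining the parts back into a single file. A minimal sketch (the file names are placeholders; plain `cat` only works for byte-level `*.partNofM` splits, while shards produced by `llama-gguf-split` are proper GGUF files and should be merged with the tool's own `--merge` mode):

```shell
# Dummy stand-in parts (the real files would be the downloaded *.gguf.partNofM pieces)
printf 'AAA' > model.gguf.part1of2
printf 'BBB' > model.gguf.part2of2

# Byte-level parts like *.part1of2 are plain splits; join them with cat
cat model.gguf.part1of2 model.gguf.part2of2 > model.gguf

# Shards named model-00001-of-00003.gguf etc. are NOT plain splits;
# merge those with llama.cpp's own tool instead:
# llama-gguf-split --merge model-00001-of-00003.gguf merged.gguf
```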

SwiftIllusion commented 4 months ago

Would also appreciate this, as I have run into the same limitation when trying to use the larger split models here: https://huggingface.co/MaziyarPanahi/WizardLM-2-8x22B-GGUF#load-sharded-model, where it is specifically mentioned to load them as sharded files rather than combining them. Another reddit thread mentioning this, https://www.reddit.com/r/LocalLLaMA/comments/1c2dfv6/loading_multipart_gguf_files_in/, references a fix for text-generation-webui: https://github.com/oobabooga/text-generation-webui/commit/e158299fb469dce8f11c45a4d6b710e239778bea (just for context; the steps to make it compatible here may naturally differ).

nguyenhoangthuan99 commented 2 months ago

This feature is already supported in llama.cpp, and Jan supports it as well. First, however, we need to make sure the multi-part GGUF file is split properly. Example:

./llama-gguf-split /path/to/model.gguf  /path/to/splitted/model

After it finishes, the model shards appear in the target folder:

model-00001-of-00003.gguf
model-00002-of-00003.gguf
model-00003-of-00003.gguf
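The shard names follow a fixed zero-padded pattern, so the path to the first shard (the one the server needs below) can be derived in a script. A small sketch, assuming the default `-%05d-of-%05d.gguf` suffix shown above:

```shell
# Build the first-shard filename from the split prefix and shard count
prefix="/path/to/splitted/model"
total=3
first_shard=$(printf '%s-%05d-of-%05d.gguf' "$prefix" 1 "$total")
echo "$first_shard"   # /path/to/splitted/model-00001-of-00003.gguf
```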

When sending a request to the server to load the model, we only need to pass the path to the first shard of the multi-part GGUF model:

curl http://localhost:3928/inferences/server/loadmodel \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "/path/to/model-00001-of-00003.gguf",
    "model": "meta-llama3.1-8b-instruct",
    "ctx_len": 512,
    "ngl": 300,
    "n_parallel":4,
  }'
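One thing to watch: the request body must be strict JSON (no trailing comma after the last field), or the server's parser may reject it. A quick way to check the payload before POSTing it, using `python3 -m json.tool` as a validator:

```shell
# Save the body to a file so it can be validated and reused
cat > body.json <<'EOF'
{
  "llama_model_path": "/path/to/model-00001-of-00003.gguf",
  "model": "meta-llama3.1-8b-instruct",
  "ctx_len": 512,
  "ngl": 300,
  "n_parallel": 4
}
EOF

# Exits non-zero with a parse error if the JSON is malformed
python3 -m json.tool body.json > /dev/null && echo "body is valid JSON"

# Then send it:
# curl http://localhost:3928/inferences/server/loadmodel \
#   -H 'Content-Type: application/json' -d @body.json
```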

The model can then be loaded successfully. I'll close this issue.