Would also appreciate this, as I've run into the same limitation when trying to use the larger split models here: https://huggingface.co/MaziyarPanahi/WizardLM-2-8x22B-GGUF#load-sharded-model, where it is specifically mentioned to load them as shards and not to combine the files. Another reddit thread mentioning this, https://www.reddit.com/r/LocalLLaMA/comments/1c2dfv6/loading_multipart_gguf_files_in/, referenced a fix for text-generation-webui: https://github.com/oobabooga/text-generation-webui/commit/e158299fb469dce8f11c45a4d6b710e239778bea (just for context; the steps to make it compatible here may naturally differ).
This feature is already supported in llama.cpp, and Jan supports it as well. First, though, we need to make sure the multi-part GGUF file is split properly. Example:
./llama-gguf-split /path/to/model.gguf /path/to/splitted/model
After it finishes, the shards of the model appear in the target folder:
model-00001-of-00003.gguf
model-00002-of-00003.gguf
model-00003-of-00003.gguf
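By default, llama-gguf-split decides how the tensors are divided across shards. If you need control over shard sizes, the tool also accepts sizing options (a sketch; the /path/to values are placeholders):

./llama-gguf-split --split --split-max-size 4G /path/to/model.gguf /path/to/splitted/model

Here --split-max-size caps each shard at roughly the given size, while --split-max-tensors N instead caps the number of tensors per shard.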
When sending the load-model request to the server, we only need to pass the path to the first shard of the multi-part GGUF model:
curl http://localhost:3928/inferences/server/loadmodel \
-H 'Content-Type: application/json' \
-d '{
"llama_model_path": "/path/to/model-00001-of-00003.gguf",
"model": "meta-llama3.1-8b-instruct",
"ctx_len": 512,
"ngl": 300,
"n_parallel":4,
}'
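Note that although only the first shard is passed, the remaining shards are located alongside it by their file names, so all parts must sit in the same folder. A quick pre-load sanity check (a sketch; adjust the pattern to your shard count):

for i in 00001 00002 00003; do
  test -f "/path/to/model-${i}-of-00003.gguf" || echo "missing shard ${i}"
done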
Then the model can be loaded successfully. I'll close this issue.
Problem: Jan only supports loading a single GGUF model file at a time.
Success Criteria: We can help users merge multiple GGUF files into one and load the model for them.
Additional context: Approach https://www.reddit.com/r/LocalLLaMA/comments/1cf6n18/how_to_use_merge_70b_split_model_ggufpart1of2/
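For the merge approach, llama-gguf-split can also recombine shards into a single file (a sketch; paths are placeholders):

./llama-gguf-split --merge /path/to/model-00001-of-00003.gguf /path/to/merged-model.gguf

Only the first shard is named; the tool discovers the remaining parts from the naming pattern and writes the merged model to the output path. Merging is usually unnecessary, though, since llama.cpp can load the shards directly as shown above.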