Closed nilvaes closed 1 year ago
I'm sorry if these questions/problems are easy. I'm still a beginner on this subject, but I really love the work you're putting in.
Hey, no worries. This actually isn't an easy problem to figure out. I think the gpt4all team changed their models and is now using a custom format instead of the standard GGML format, so they don't work with the GGML library.
Any reason you want to use the gpt4all-j model? I think the default model, Wizard-Vicuna-7B-Uncensored, is better than gpt4all-j and has a similar size. Please note that only llama-based models like Wizard-Vicuna support GPU, so gpt4all-j doesn't support GPU. If you want to use a gpt4all model, you can try https://huggingface.co/TheBloke/GPT4All-13B-snoozy-GGML/tree/main, which is also better than gpt4all-j.
Also, any of the GGML models from https://huggingface.co/TheBloke will work.
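If you go the snoozy route, a chatdocs.yml entry along these lines should work. The `model_file` name below is a hypothetical example; pick an actual `.bin` from the repo's file list:

```yaml
ctransformers:
  model: TheBloke/GPT4All-13B-snoozy-GGML
  # model_file is illustrative; choose a quantization that fits your RAM
  model_file: GPT4All-13B-snoozy.ggmlv3.q4_0.bin
  model_type: llama
```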
My specs: CPU: AMD Ryzen 5 2600X (6-core), GPU: GTX 1660 SUPER
I wanted to get faster responses. For now, with `gpu_layers: 30`, I'm using nearly all my VRAM as well as my CPU, and I get a response in 37 seconds.
What do you think about this one? `ggml-gpt4all-l13b-snoozy.bin`
I don't think gpt4all-j will be faster than the default llama model. On the Open LLM Leaderboard, gpt4all-13b-snoozy doesn't appear to be good compared to other 13B models like Wizard-Vicuna-13B-Uncensored. Depending on your RAM, you may or may not be able to run 13B models; RAM requirements are mentioned in the model card.
Recently, some new quantization formats were released which significantly reduce model size and require less memory.
Try the `...q2_K.bin` files from Wizard-Vicuna-7B-Uncensored-GGML and Wizard-Vicuna-13B-Uncensored-GGML. They will be faster but will have lower quality.
`chatdocs.yml`:

```yaml
ctransformers:
  model: TheBloke/Wizard-Vicuna-7B-Uncensored-GGML
  model_file: Wizard-Vicuna-7B-Uncensored.ggmlv3.q2_K.bin
  model_type: llama
```
Also try running with `gpu_layers: 0`. Sometimes running on just the CPU can be faster if VRAM isn't enough.
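As a sketch, a CPU-only setup might look like the following; note that placing `gpu_layers` under a `config` key is an assumption about how chatdocs passes options through to ctransformers, so check the chatdocs README for the exact layout:

```yaml
ctransformers:
  model: TheBloke/Wizard-Vicuna-7B-Uncensored-GGML
  model_file: Wizard-Vicuna-7B-Uncensored.ggmlv3.q2_K.bin
  model_type: llama
  config:
    gpu_layers: 0  # run entirely on CPU; raise this to offload layers to the GPU
```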
First of all, great work @marella. This library makes it so easy to install and run.
So, I have a similar issue where my 2 GB Nvidia Quadro P620 runs out of memory. I'm making a chatbot app for commercial usage, so which model can I use for it? I know that gpt4all-j models can be used, but the results are very poor with them. So how can I achieve that? (This is just for testing until I buy cloud infrastructure for commercial use.)
Hi @mt-v. There is a list of commercially usable LLMs: https://github.com/eugeneyan/open-llms
I would recommend you check and research which models are suitable for your project. If you want faster responses, you need a better CPU, more RAM, or a better GPU if you're going to run it locally for now.
If your GPU runs out of VRAM, you should play with `gpu_layers: 50`. I have a GTX 1660 SUPER (6 GB VRAM), and `gpu_layers: 30` was the best setting for me.
I would love to hear about your progress on the project; keep me notified.
Thanks! @mt-v, I hope nilvaes's comment answered your questions.
@nilvaes if you are still looking for a gpt4all-j model, you can use this file: https://huggingface.co/rustformers/gpt4all-j-ggml/blob/main/gpt4all-j-q4_0.bin, which is in the standard GGML format.
`chatdocs.yml`:

```yaml
ctransformers:
  model: rustformers/gpt4all-j-ggml
  model_file: gpt4all-j-q4_0.bin
  model_type: gptj
```
@nilvaes @marella Thank you very much guys! This is exactly what I was looking for :)
I will keep you posted on the project. Once again, thanks for your response and this wonderful project!
Hey guys! So I have upgraded to an RTX 3060 (12 GB) for testing the models. Is there support for configuring this as an API server, as you see in LocalAI, so that you can switch between different models, backends, and the OpenAI API?
I wanted to use another LLM, but I had some errors:
and this is my chatdocs.yml:
I already did `pip install ctransformers` and `CT_CUBLAS=1 pip install ctransformers --no-binary ctransformers`.
How can I use `ggml-gpt4all-j-v1.3-groovy.bin`?