marella / chatdocs

Chat with your documents offline using AI.
MIT License

can't use ggml-gpt4all-j-v1.3-groovy.bin #18

Closed nilvaes closed 1 year ago

nilvaes commented 1 year ago

I wanted to use another LLM but I got some errors:

[Screenshot of the error output, 2023-06-14]

and this is my chatdocs.yml:

[Screenshot of my chatdocs.yml, 2023-06-14]

I already ran `pip install ctransformers` and `CT_CUBLAS=1 pip install ctransformers --no-binary ctransformers`.

How can I use ggml-gpt4all-j-v1.3-groovy.bin? I'm sorry if these questions/problems are easy. I'm still a beginner on this subject, but I really love the work you're putting in.

marella commented 1 year ago

> I'm sorry if these questions/problems are easy. I'm still a beginner on this subject, but I really love the work you're putting in.

Hey, no worries. Actually, this is not an easy problem to figure out. I think the gpt4all team changed their models and is now using a custom format instead of the standard ggml format, so it doesn't work with the ggml library.

Any reason you want to use the gpt4all-j model? I think the default model, Wizard-Vicuna-7B-Uncensored, is better than gpt4all-j and is of a similar size. Please note that only llama-based models like Wizard-Vicuna support GPU, so gpt4all-j doesn't support GPU. If you want to use a gpt4all model, you can try https://huggingface.co/TheBloke/GPT4All-13B-snoozy-GGML/tree/main which is also better than gpt4all-j.

Also, any of the GGML models from https://huggingface.co/TheBloke will work.
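
For example, a chatdocs.yml for the snoozy model could look roughly like this (the model_file name below is only an example; check the repo's file list for the exact quantized .bin file you want):

chatdocs.yml:

ctransformers:
  model: TheBloke/GPT4All-13B-snoozy-GGML
  model_file: GPT4All-13B-snoozy.ggmlv3.q4_0.bin
  model_type: llama

Since snoozy is llama-based, it also works with GPU.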

nilvaes commented 1 year ago

My specs: CPU: AMD Ryzen 5 2600X (6 cores), GPU: GTX 1660 Super

I wanted to get faster responses. For now, with gpu_layers: 30, I'm using nearly all my VRAM as well as my CPU, and I get a response in 37 seconds.

What do you think about this one? ggml-gpt4all-l13b-snoozy.bin

marella commented 1 year ago

I don't think gpt4all-j will be faster than the default llama model. On the Open LLM Leaderboard, gpt4all-13b-snoozy doesn't appear to be good compared to other 13B models like Wizard-Vicuna-13B-Uncensored. Depending on your RAM, you may or may not be able to run 13B models; RAM requirements are mentioned in the model card.

Recently, some new quantization formats were released which significantly reduce model size and require less memory. Try the ...q2_K.bin files from Wizard-Vicuna-7B-Uncensored-GGML and Wizard-Vicuna-13B-Uncensored-GGML. They will be faster but will have lower quality.

chatdocs.yml:

ctransformers:
  model: TheBloke/Wizard-Vicuna-7B-Uncensored-GGML
  model_file: Wizard-Vicuna-7B-Uncensored.ggmlv3.q2_K.bin
  model_type: llama

Also try running with gpu_layers: 0. Sometimes running on just the CPU can be faster if there is not enough VRAM.
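
For reference, a rough sketch of how gpu_layers could be set in chatdocs.yml (the config: nesting here is an assumption about how options get passed through to ctransformers, so adjust it to match your setup):

ctransformers:
  model: TheBloke/Wizard-Vicuna-7B-Uncensored-GGML
  model_file: Wizard-Vicuna-7B-Uncensored.ggmlv3.q2_K.bin
  model_type: llama
  config:
    gpu_layers: 0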

v4rm3t commented 1 year ago

First of all, great work @marella. This library makes it so easy to install and run.

So, I have a similar issue where my 2GB Nvidia Quadro P620 runs out of memory. I am making a chatbot app for commercial usage, so which model can I use for it? I know that gpt4all-j models can be used, but the results with them are very poor. So how can I achieve that? (This is just for testing until I buy cloud compute for commercial use.)

nilvaes commented 1 year ago

Hi @mt-v. There is a list of commercially usable LLMs here: https://github.com/eugeneyan/open-llms

I would recommend you check and research which models are suitable for your project. If you want faster responses and are going to run it locally for now, you need a better CPU, more RAM, or a better GPU.

If your GPU runs out of VRAM, you should play with the gpu_layers setting (e.g. gpu_layers: 50). I have a GTX 1660 SUPER (6GB VRAM) and gpu_layers: 30 was the best setting for me.

I would love to hear how your project goes, so keep me posted.

marella commented 1 year ago

Thanks @mt-v. I hope nilvaes' comment answered your questions.


@nilvaes if you are still looking for a gpt4all-j model, you can use this file: https://huggingface.co/rustformers/gpt4all-j-ggml/blob/main/gpt4all-j-q4_0.bin which is in the standard ggml format.

chatdocs.yml:

ctransformers:
  model: rustformers/gpt4all-j-ggml
  model_file: gpt4all-j-q4_0.bin
  model_type: gptj

v4rm3t commented 1 year ago

@nilvaes @marella Thank you very much guys! This is exactly what I was looking for :)

I will keep you posted on the project. Once again, thanks for your response and this wonderful project!

v4rm3t commented 1 year ago

Hey guys! So I have upgraded to an RTX 3060 12GB for testing the models. Is there support for configuring this as an API server, as in LocalAI, so that you can switch between different models, backends, and the OpenAI API?

marella commented 1 year ago

Hey, it uses WebSockets, so it doesn't have a REST API. See the backend and frontend code for reference. Switching models might not be feasible because each loaded model would require additional memory.