nomic-ai / gpt4all

GPT4All: Chat with Local LLMs on Any Device
https://gpt4all.io
MIT License

GPT4ALL prompt taking too long #973

Open VishnuAK9000 opened 1 year ago

VishnuAK9000 commented 1 year ago

Issue you'd like to raise.

I am trying to use GPT4All prompting via LangChain, following this link:

https://python.langchain.com/en/latest/modules/models/llms/integrations/gpt4all.html

I am running it in Google Colab in normal mode, but the response is taking too long, like 30-40 minutes. Why is that? Am I doing something wrong?

Suggestion:

No response

DJMo13 commented 1 year ago

gpt4all currently uses the CPU, which is very limited in Google Colab. Try using Hugging Face Spaces or run it on your own computer.
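If you want to confirm what the Colab runtime actually provides, a quick check like this (standard library only; the numbers vary per runtime, and `/proc/meminfo` is Linux-specific, which Colab is) shows the available cores and memory:

```python
import os

# CPU cores the runtime exposes; gpt4all's CPU inference speed depends heavily on this.
print("CPU cores available:", os.cpu_count())

# Total system memory, read from /proc/meminfo (Linux-only).
with open("/proc/meminfo") as f:
    print(f.readline().strip())  # e.g. "MemTotal: 13297228 kB"
```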

berkut1 commented 1 year ago

@DJMo13 it is also taking too long on a local PC (especially with a long context), for example with an i9 9900K, whereas the alternative solution "oobabooga" is literally ~10x faster (and ~30x faster with a GPU).

VishnuAK9000 commented 1 year ago

from langchain.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains.question_answering import load_qa_chain
from langchain.chains import RetrievalQA
from langchain.llms import GPT4All
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain import PromptTemplate, LLMChain

callbacks = [StreamingStdOutCallbackHandler()]

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# `loader` is the document loader set up earlier (not shown in this snippet)
documents = loader.load()

text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
texts = text_splitter.split_documents(documents)

db = Chroma.from_documents(texts, embeddings)

local_path = './ggml-gpt4all-j-v1.3-groovy.bin'
llm = GPT4All(model=local_path, backend='gptj', callbacks=callbacks, verbose=False)

chain = load_qa_chain(llm, chain_type="stuff")

query = "What is the command to start angular application?"
docs = db.similarity_search(query)

chain.run(input_documents=docs, question=query)

Above is the code I am trying to run; it is taking more than 15-20 minutes for a response. I am running it on my local system.

RAM: 16 GB, Processor: Intel Core i5-10310U @ 1.70 GHz

Am I doing something wrong?

cosmic-snow commented 1 year ago

maybe try this first:

Also, a general tip: monitor your RAM usage while testing. Although it shouldn't be a problem with 16GB.

VishnuAK9000 commented 1 year ago

Hi, I tried that but I am still getting a slow response. I think it may be an issue with my CPU. But I also have one more doubt: I am just starting with LLMs, so maybe I have the wrong idea. I have a CSV file with Company, City, and Starting Year. When I query GPT4All with "name the location of company X", it works fine. But when I say "name all companies in City X", it gives a completely wrong answer.

Also, what is the difference between snoozy and groovy? And I have also seen some bin files which have "quantised" in their names. What is the difference there?

Sherlock907 commented 1 year ago

I think it may be an issue with my CPU.

What's your CPU? I'm on a 10th-gen i3 with 4 cores and 8 threads, and generating 3 sentences takes 10 minutes. All we can hope for is that they add CUDA/GPU support soon or improve the algorithm. If I take cpu.userbenchmarks into account, the fastest available Intel CPU is about 2.8x faster than mine, which would only reduce generation time from 10 minutes down to roughly 3.5 minutes for 3 sentences, which is still extremely slow. CUDA support is needed for everyone who is not on Apple Silicon!

cosmic-snow commented 1 year ago

Have you actually tried playing around with batch size and thread number in the code? 10 minutes sounds like something's wrong. Do you have enough RAM? What model were you using?
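For reference, both settings are plain constructor arguments in the LangChain wrapper used earlier in this thread. A minimal sketch, assuming the installed `langchain` version exposes `n_threads` and `n_batch` on its `GPT4All` class (check the docs for your version if it doesn't):

```python
from langchain.llms import GPT4All

# n_threads: CPU threads inference may use (leaving one core free often keeps
#            the machine responsive); n_batch: prompt tokens processed per batch
#            (a larger value can speed up long prompts).
llm = GPT4All(
    model="./ggml-gpt4all-j-v1.3-groovy.bin",
    backend="gptj",
    n_threads=7,   # e.g. on an 8-thread CPU
    n_batch=32,
    verbose=False,
)
```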

Sherlock907 commented 1 year ago

Have you actually tried playing around with batch size and thread number in the code? 10 minutes sounds like something's wrong. Do you have enough RAM? What model were you using?

Just set it from 9 to 12 to 120; nothing changed. The next generated message still used 100% CPU (all threads), and all settings used 12 GB of RAM (16 GB available) without any noticeable change in terms of speed. I'm using the StableVicuna 13B model. I think groovy is a bit faster but still very slow.

When I prompt "explain dynamic programming", it takes 15 seconds for the first word to appear and almost a minute to finish the first sentence. After 6:50 minutes it is finished and I get the text below; it's fairly long this time, but it also makes my computer almost unusable for the duration.

"Dynamic Programming is a technique used in computer science to solve optimization problems by breaking them down into smaller sub-problems. It involves storing the solutions of sub-problems and using them to calculate the optimal solution for the larger problem. The key idea behind Dynamic Programming is that it allows us to avoid recomputing the same values multiple times, which can be very computationally expensive. Dynamic Programming is often used in algorithms such as the Longest Common Subsequence (LCS) algorithm and the Knapsack Problem. It is also used in game theory, where it can be used to calculate optimal strategies for games like chess or tic-tac-toe. In summary, Dynamic Programming is a powerful technique that allows us to solve complex optimization problems by breaking them down into smaller sub-problems and using the solutions of those sub-problems to compute the optimal solution for the larger problem."

Even with some optimisations I don't see a way around CUDA support; it's the same with OpenAI's Whisper or Stable Diffusion: they can run on the CPU, but are so much faster using the GPU.

cosmic-snow commented 1 year ago

Just set it from 9 to 12 to 120; nothing changed. The next generated message still used 100% CPU (all threads), and all settings used 12 GB of RAM (16 GB available) without any noticeable change in terms of speed. I'm using the StableVicuna 13B model. I think groovy is a bit faster but still very slow.

At least we can rule out a RAM problem then. Yes, things can be slow, especially on weaker/older CPUs, but 10 minutes for 3 sentences struck me as a little too much.

Also, I don't know if that thing supports AVX2, either.

Your best bet is probably to use 6/7B models instead. Bigger can be better, but it is certainly slower.

By the way, this issue was originally opened with the bindings in mind. Are you using those or the chat application? Don't only set the thread count, but also play around with the batch size; that might help a bit, too.

And yes, GPU is definitely faster, I've already seen that. But it's WIP so far.

berkut1 commented 1 year ago

All we can hope for is that they add Cuda/GPU support soon or improve the algorithm.

@Sherlock907 the problem is in the code. You can check out the "oobabooga" alternative client and see how much faster it is on the CPU with GGML models.

There are at least two problems:

  1. The client does not immediately load the model into RAM, so the first run of the model can take at least 5 minutes. It also loads the model very slowly; it's not normal to take 4 minutes to load 9 GB from an SSD into RAM.
  2. It always clears the cache (at least it looks like that), even if the context has not changed, which is why you constantly need to wait at least 4 minutes to get a response.

PS: My HW is an i9 9900K and 32 GB RAM.

Sherlock907 commented 1 year ago

You can check out the "oobabooga"

Thanks, just installed it, but I can't compare CPU times as I installed it for Nvidia right away, and now I get answers within seconds and not minutes. Despite that, you are probably right that something is wrong in the code, as gpt4all at times caused 100% CPU usage even when it was idle and no request was being handled.

Also, I don't know if that thing supports AVX2, either.

Yes, my CPU supports AVX2. Despite being just an i3 (Gen 10), it can be compared with an i7 from Gen 7/8 (or earlier), as it has 4 cores / 8 threads and roughly the same performance. It might not be a beast, but it isn't exactly slow either.

berkut1 commented 1 year ago

i installed it for nvidia right away and now i get answers within seconds and not minutes

I don't think so; GPU support for GGML is disabled by default, and you would have to enable it yourself by building your own library (you can check the llama.cpp documentation). Or if you did that, then fine :).

P.S. You can also get an answer in seconds with the CPU.

100% cpu usage even when it was in idle and no request was handled.

That can happen if you set too many threads in the settings. I recommend keeping at least 1 thread free.

cosmic-snow commented 1 year ago

The client does not immediately load the model into RAM, so the first run of the model can take at least 5 minutes. It also loads the model very slowly; it's not normal to take 4 minutes to load 9 GB from an SSD into RAM.

That's definitely not normal and not what happens on my end.

... Despite that, you are probably right that something is wrong in the code, as gpt4all at times caused 100% CPU usage even when it was idle and no request was being handled.

That's definitely not normal, either.

I'm a bit puzzled by the problems you guys are having.

berkut1 commented 1 year ago

@cosmic-snow I exaggerated a little bit.

To be more accurate:

Client: GPT4All, model: stable-vicuna-13b

  • load time into RAM: ~2 minutes and 30 seconds (that's extremely slow)
  • time to response with a 600-token context: ~3 minutes and 3 seconds

Client: oobabooga in CPU-only mode, model: wizard-vicuna-13b-ggml

  • load time into RAM: 10 seconds
  • time to response with a 600-token context: the first attempt takes ~30 seconds, subsequent attempts generate a response after 2 seconds, and if the context has been changed, after ~10 seconds

I even reinstalled GPT4All and reset all settings to be sure that it's not something with the software/settings.

CPU: i9 9900K, OS: Windows 10

cosmic-snow commented 1 year ago

@cosmic-snow I exaggerated a little bit.

To be more accurate:

Client: GPT4All, model: stable-vicuna-13b

  • load time into RAM: ~2 minutes and 30 seconds (that's extremely slow)
  • time to response with a 600-token context: ~3 minutes and 3 seconds

Client: oobabooga in CPU-only mode, model: wizard-vicuna-13b-ggml

  • load time into RAM: 10 seconds
  • time to response with a 600-token context: the first attempt takes ~30 seconds, subsequent attempts generate a response after 2 seconds, and if the context has been changed, after ~10 seconds

I even reinstalled GPT4All and reset all settings to be sure that it's not something with the software/settings.

CPU: i9 9900K, OS: Windows 10

I really don't know what's happening on your machine, but here's a comparison with mine:

Maybe try and compare with only the bindings, or even upstream llama.cpp?

Edit: Also, what happens if you restart your PC and run nothing else, to make sure the RAM isn't already in use by anything?

Edit 2: If you have a 600-token context then yes, that will take some time to get started.

berkut1 commented 1 year ago

The results are exactly the same after a reboot.

I think oobabooga uses llama.cpp through llama-cpp-python.

Edit 2: If you have a 600 token context then yes, that will take some time to get started.

Does it take minutes too? If so, then that's not good either.

cosmic-snow commented 1 year ago

Does it take minutes too? If so, then that's not good either.

The time it takes is in relation to how fast it generates afterwards. And it depends on a number of factors: the model/size/quantisation/available resources/threads/batch size/... There is no single answer to that. Input is processed in a similar way to output. You should read up on the transformer architecture to get a better idea about that.
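To illustrate that with a rough back-of-envelope model (the rates below are made-up placeholders, not measurements; prompt ingestion and generation each run at their own tokens-per-second rate):

```python
def estimate_response_time(prompt_tokens, output_tokens,
                           prompt_tok_per_s, gen_tok_per_s):
    """Rough estimate: the prompt is evaluated token by token, much like the
    output is generated, so both contribute to the total wait."""
    return prompt_tokens / prompt_tok_per_s + output_tokens / gen_tok_per_s

# Hypothetical numbers for a 13B model on a mid-range CPU (placeholders only):
# ingest a 600-token context at 10 tok/s, then generate 200 tokens at 3 tok/s.
print(estimate_response_time(600, 200, 10, 3))  # ~127 seconds
```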

Edit: But anyway, right now I'm unsure why you're seeing the problem you're seeing. Maybe I'll have to try with that Vicuna model and maybe you should try with snoozy or the smaller groovy to have something to compare.

berkut1 commented 1 year ago

The time it takes is in relation to how fast it generates afterwards. And it depends on a number of factors: the model/size/quantisation/available resources/threads/batch size/... There is no single answer to that.

Of course you're right. Over the past month I have tried at least fifty models, all of which are offered by GPT4All. In all cases, the "oobabooga" client started generating responses with large contexts (about 400-700 tokens) almost instantly, or within no more than 1 minute.

So I have an obvious reason to doubt that GPT4All works efficiently.

Edit: Or the reason is in my system, but then it is strange that the other client works fine.

cosmic-snow commented 1 year ago

So I have an obvious reason to doubt that GPT4All works efficiently.

I guess from your perspective it looks like that. And I do wonder why that is.

As I said, maybe try the bindings with the Python example? I could upload a video of how that looks here and how it isn't slow, but that won't really help you with your problem.
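For anyone who wants to try that, here is a minimal sketch using the `gpt4all` Python package, assuming a version where `GPT4All(...).generate(...)` and the optional `n_threads` argument are available (older releases used a `chat_completion()` API instead):

```python
import time

from gpt4all import GPT4All

# Load the model by filename; the bindings look it up in (or download it to)
# the default model directory. Leaving one core free tends to keep the rest
# of the system usable.
model = GPT4All("ggml-gpt4all-j-v1.3-groovy.bin", n_threads=7)

# Time a single prompt so the result can be compared against the chat
# application or another client.
start = time.perf_counter()
response = model.generate("Explain dynamic programming in two sentences.", max_tokens=128)
print(response)
print(f"took {time.perf_counter() - start:.1f} s")
```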

berkut1 commented 1 year ago

@cosmic-snow Fine. You have given some ideas and I have just tested them.

You're right. The model loads quickly and generates a response instantly, but only if the context is no more than 10 tokens (or 1-2 sentences).

With a large context, the model loads slowly because, for some strange reason, the client immediately starts trying to generate a response without waiting for the entire model to load, which overloads the CPU too much for it to load the model into RAM efficiently. But why the client works so terribly with a huge context is not clear to me.

cosmic-snow commented 1 year ago

With a large context, the model loads slowly because, for some strange reason, the client immediately starts trying to generate a response without waiting for the entire model to load, which overloads the CPU too much for it to load the model into RAM efficiently. But why the client works so terribly with a huge context is not clear to me.

Ah, you're saying the model loading process and generation are getting in each other's way? I'm not too familiar with the C++ code of this project (not a maintainer, either), but at least that'd be something to investigate further.

berkut1 commented 1 year ago

Ah, you're saying the model loading process and generation are getting in each other's way?

@cosmic-snow Yes, you can test it too. Write a big context (600 tokens or ~20 sentences) and watch the RAM in Task Manager.

Also, I have a very strong reason to suspect that the model is reloaded with each response, though this is not visible in Task Manager. I think so because the behavior is very similar to the initial start.

cosmic-snow commented 1 year ago

Yes, you can test it too. Write a big context (600 tokens or ~20 sentences) and watch the RAM in Task Manager.

Also, I have a very strong reason to suspect that the model is reloaded with each response, though this is not visible in Task Manager. I think so because the behavior is very similar to the initial start.

I did test a few things, but it doesn't look like anything is out of the ordinary. As I said, if you're inserting a long input, it's going to take a while. It basically goes through the input in much the same way as it does when producing output; you just don't see what it's working on behind the scenes.

Now maybe there's another thing that's not clear: there were breaking changes to the file format in llama.cpp, but GPT4All keeps supporting older files through older versions of llama.cpp. If you've downloaded your StableVicuna through GPT4All, which is likely, you have a model in the old format. That's not a problem in itself, but it won't see any iterative improvements made in llama.cpp since they broke compatibility.

Maybe try one of the models with v3 in its name and see how that goes.

Also, you can try to enable the llama.cpp GPU support in this project, too, if you're building it yourself. But you'll have to figure it out by yourself; I won't be trying to help with that (I have not attempted it myself recently). You'd need to look at gpt4all-backend\llama.cpp.cmake.

I have not tried the other project yet and I'm not going to do that today, nor am I going to compare what might be different between the two right now. But if you can figure it out, then of course don't hesitate to let others here know, too.

VishnuAK9000 commented 1 year ago

Hi,

I don't know what the issue is. When I run privateGPT I am able to get a response in 130 seconds, but when I run the same code in Jupyter it takes about 10 minutes for a response. Also, I have seen that GPT4All really struggles with Excel data. I tried to create a question-answering bot, and what I found was that the data retrieved using similarity search is correct, but when the same data is fed to GPT4All with the query, even with the source data included, it gives a wrong answer. Same for the privateGPT implementation.

cosmic-snow commented 1 year ago

Hi,

I don't know what the issue is. When I run privateGPT I am able to get a response in 130 seconds, but when I run the same code in Jupyter it takes about 10 minutes for a response. Also, I have seen that GPT4All really struggles with Excel data. I tried to create a question-answering bot, and what I found was that the data retrieved using similarity search is correct, but when the same data is fed to GPT4All with the query, even with the source data included, it gives a wrong answer. Same for the privateGPT implementation.

It's really hard to tell what exactly goes wrong in these cases. For example, it could be that you had two models open at the same time to compare and were then left with not enough RAM so one of them would try to swap and be heavily slowed down. Or there might be a process limit somewhere. Or one of them has the thread count set properly, while the other doesn't.

Also, unless a problem is tested directly on this project rather than on what others build on top of it, it could be a problem at any point in that stack. It might not be something that's possible to track down without the exact same setup and a debugger/profiler.

I can't really help you with privateGPT or LocalDocs, either. I haven't used those yet.
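One way to narrow down where the time goes in the LangChain snippet from earlier in this thread is to time the retrieval and the LLM call separately. A sketch reusing the `db`, `chain`, and `query` names from that snippet (they are assumed to already exist):

```python
import time

start = time.perf_counter()
docs = db.similarity_search(query)  # vector-store retrieval only
retrieval_s = time.perf_counter() - start

start = time.perf_counter()
answer = chain.run(input_documents=docs, question=query)  # LLM call only
generation_s = time.perf_counter() - start

print(f"retrieval: {retrieval_s:.1f}s, generation: {generation_s:.1f}s")
```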

berkut1 commented 1 year ago

Well, version 2.4.9 seems to have fixed some issues, and the initial response with a big context is 2-3 times faster (now it takes ~1 minute, versus 2-3 minutes in previous versions). However, there is still a problem: if you regenerate with the same context text, it does not generate a response immediately (as it should), but again needs to wait ~1 minute.