Same problem here while trying to run the GPT4All lib on a VPS (virtual private server).
Same:
Modified the example program to include this model: Meta-Llama-3.1-8B-Instruct-128k-Q4_0.gguf
from gpt4all import GPT4All
model = GPT4All("Meta-Llama-3.1-8B-Instruct-128k-Q4_0.gguf")
#model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf") # downloads / loads a 4.66GB LLM
with model.chat_session():
    print(model.generate("How can I run LLMs efficiently on my laptop?", max_tokens=1024))
Failed to load llamamodel-mainline-cuda-avxonly.dll: LoadLibraryExW failed with error 0x7e
Failed to load llamamodel-mainline-cuda.dll: LoadLibraryExW failed with error 0x7e
But then it runs:
To run Large Language Models (LLMs) efficiently on your laptop, consider the following suggestions:
Hardware Upgrades:
Software Optimizations:
Model Pruning and Quantization:
Batching and Parallel Processing:
Model Compilation and Deployment:
Regular Maintenance:
By implementing these strategies, you can efficiently run LLMs on your laptop and take advantage of their capabilities for various applications.
Edit: Windows 11, 32 GB RAM, RTX 2080 GPU, Python 3.11.6. The GPT4All application itself seems to work fine.
@burhop You can ignore the two lines, they're harmless:
Failed to load llamamodel-mainline-cuda-avxonly.dll: LoadLibraryExW failed with error 0x7e
Failed to load llamamodel-mainline-cuda.dll: LoadLibraryExW failed with error 0x7e
They only matter if you have an Nvidia card and want to use CUDA. See #2521.
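Relatedly, if you want to be sure the CUDA backends are never used, you can request CPU inference explicitly. A minimal sketch, assuming your installed version of the Python bindings supports the constructor's device parameter:

from gpt4all import GPT4All

# Request CPU inference explicitly so the CUDA backends are never used.
# ("cpu" is one of the documented values for the device parameter; adjust
# if your bindings version differs.)
model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf", device="cpu")
with model.chat_session():
    print(model.generate("Hello!", max_tokens=64))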
This issue is about something else, though.
@BjoernAkAManf @h3ck4 Sorry, I guess this got kind of lost with other things going on. Have you been able to figure this out on your own? It's probably because the first prompt is outside a chat session, so no prompt templates are applied to it. Have a look at the corresponding wiki page here.
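For illustration, a minimal sketch of the difference (model name reused from later in this thread; adjust to whatever you have installed):

from gpt4all import GPT4All

model = GPT4All('Meta-Llama-3-8B-Instruct.Q4_0.gguf')

# Outside a chat session: the prompt is fed to the model as-is, with no
# system prompt and no chat template, i.e. plain text completion.
raw = model.generate('What is the capital of France?', max_tokens=16)

# Inside a chat session: the bindings wrap the prompt in the model's chat
# template, which is what instruction-tuned models expect.
with model.chat_session():
    templated = model.generate('What is the capital of France?', max_tokens=16)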
So I am using the following code. While using it I was quite confused why the answers are not good. I know the computer I am using is sub-optimal, but for most workloads it's fine.
There were some poor examples in the documentation for different LLMs in the past that may have falsely given users the impression that this is something you should expect from all LLMs in general, without special prompting. You would like the model to generate a short response, so you specify max_tokens=3. That is not how it works: the model generates whichever response it believes to be most likely given the context provided, and max_tokens merely cuts that response off. If you allow it to generate more tokens, you will see that it produces completely reasonable sentences and paragraphs; just maybe not the words you are looking for.
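To see the truncation for yourself, a short sketch (output varies by model and sampling settings):

from gpt4all import GPT4All

x = GPT4All('Meta-Llama-3-8B-Instruct.Q4_0.gguf')
with x.chat_session():
    # max_tokens only caps the length of the output; it does not ask the
    # model to be brief. This prints just the first three tokens of
    # whatever longer answer the model was going to produce.
    print(x.generate('What is the capital of France?', max_tokens=3))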
There are a few ways to get what you want. One of them is to lead by example, using no chat session:
>>> from gpt4all import GPT4All
>>> x = GPT4All('Meta-Llama-3-8B-Instruct.Q4_0.gguf')
>>> x.generate('The capital of Germany is Berlin. The capital of France is', max_tokens=1, temp=0)
' Paris'
Or, you can use a chat session and tell the model what you're actually expecting:
>>> with x.chat_session(): x.generate('What is the capital of France? Respond in exactly one word.', max_tokens=1, temp=0)
'Paris'
Both of these techniques should work regardless of system prompt.
Documentation
So I am using the following code. While using it I was quite confused why the answers are not good. I know the computer I am using is sub-optimal, but for most workloads it's fine.
Anyway, I am just using the default example (index.html) and am able to replicate getting a "2,0" whenever I ask about the capital of France. Without any system prompt specified, the result is not helpful either. However, if I specify an empty system prompt, it works.
I included multiple runs of the program. I might be doing something wrong here, but I feel like the default code should work out of the box? Maybe it's just my machine though ¯\_(ツ)_/¯.
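For reference, a minimal sketch of the empty-system-prompt workaround described above, assuming a bindings version whose chat_session() accepts a system_prompt argument:

from gpt4all import GPT4All

model = GPT4All('Meta-Llama-3-8B-Instruct.Q4_0.gguf')

# Passing an explicit empty system prompt; per the report above, this
# behaves differently from leaving the argument at its default.
with model.chat_session(system_prompt=''):
    print(model.generate('What is the capital of France?', max_tokens=16))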