Same problem here while trying to run the GPT4All lib on a VPS (virtual private server).
Same:
Modified the example program to include this model: Meta-Llama-3.1-8B-Instruct-128k-Q4_0.gguf
from gpt4all import GPT4All
model = GPT4All("Meta-Llama-3.1-8B-Instruct-128k-Q4_0.gguf")
#model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf") # downloads / loads a 4.66GB LLM
with model.chat_session():
    print(model.generate("How can I run LLMs efficiently on my laptop?", max_tokens=1024))
Failed to load llamamodel-mainline-cuda-avxonly.dll: LoadLibraryExW failed with error 0x7e
Failed to load llamamodel-mainline-cuda.dll: LoadLibraryExW failed with error 0x7e
But then it runs:
To run Large Language Models (LLMs) efficiently on your laptop, consider the following suggestions:
Hardware Upgrades:
Software Optimizations:
Model Pruning and Quantization:
Batching and Parallel Processing:
Model Compilation and Deployment:
Regular Maintenance:
By implementing these strategies, you can efficiently run LLMs on your laptop and take advantage of their capabilities for various applications.
Edit: Windows 11, 32 GB RAM, RTX 2080 GPU, Python 3.11.6. The GPT4All application itself seems to work fine.
@burhop You can ignore the two lines, they're harmless:
Failed to load llamamodel-mainline-cuda-avxonly.dll: LoadLibraryExW failed with error 0x7e
Failed to load llamamodel-mainline-cuda.dll: LoadLibraryExW failed with error 0x7e
They only matter if you have an Nvidia card and want to use CUDA. See #2521.
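Relatedly, if you want to be sure the CUDA backends are never used, you can request CPU inference explicitly. A minimal sketch, assuming your installed version of the Python bindings supports the constructor's device parameter:

from gpt4all import GPT4All

# Request CPU inference explicitly so the CUDA backends are never used.
# ("cpu" is one of the documented values for the device parameter; adjust
# if your bindings version differs.)
model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf", device="cpu")
with model.chat_session():
    print(model.generate("Hello!", max_tokens=64))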
This issue is about something else, though.
@BjoernAkAManf @h3ck4 Sorry, I guess this got kind of lost with other things going on. Have you been able to figure this out on your own? It's probably because the first prompt is outside a chat session, so no prompt templates are applied to it. Have a look at the corresponding wiki page here.
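For illustration, a minimal sketch of the difference (model name reused from later in this thread; adjust to whatever you have installed):

from gpt4all import GPT4All

model = GPT4All('Meta-Llama-3-8B-Instruct.Q4_0.gguf')

# Outside a chat session: the prompt is fed to the model as-is, with no
# system prompt and no chat template, i.e. plain text completion.
raw = model.generate('What is the capital of France?', max_tokens=16)

# Inside a chat session: the bindings wrap the prompt in the model's chat
# template, which is what instruction-tuned models expect.
with model.chat_session():
    templated = model.generate('What is the capital of France?', max_tokens=16)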
So I am using the following code. While using it I was quite confused why the answers are not good. I know the computer I am using is sub-optimal, but for most workloads it's fine.
There were some poor examples in the documentation for different LLMs in the past that may have falsely given users the impression that this is something you should expect from all LLMs in general, without special prompting. You would like the model to generate a short response, so you specify max_tokens=3. That is not how it works: the model generates whichever response it believes to be most likely given the context provided, and max_tokens merely cuts that response off. If you allow it to generate more tokens, you will see that it produces completely reasonable sentences and paragraphs; just maybe not the words you are looking for.
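To see the truncation for yourself, a short sketch (output varies by model and sampling settings):

from gpt4all import GPT4All

x = GPT4All('Meta-Llama-3-8B-Instruct.Q4_0.gguf')
with x.chat_session():
    # max_tokens only caps the length of the output; it does not ask the
    # model to be brief. This prints just the first three tokens of
    # whatever longer answer the model was going to produce.
    print(x.generate('What is the capital of France?', max_tokens=3))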
There are a few ways to get what you want. One of them is to lead by example, using no chat session:
>>> from gpt4all import GPT4All
>>> x = GPT4All('Meta-Llama-3-8B-Instruct.Q4_0.gguf')
>>> x.generate('The capital of Germany is Berlin. The capital of France is', max_tokens=1, temp=0)
' Paris'
Or, you can use a chat session and tell the model what you're actually expecting:
>>> with x.chat_session(): x.generate('What is the capital of France? Respond in exactly one word.', max_tokens=1, temp=0)
'Paris'
Both of these techniques should work regardless of system prompt.
Documentation
So I am using the following code. While using it I was quite confused why the answers are not good. I know the computer I am using is sub-optimal, but for most workloads it's fine.
Anyway, I am just using the default example (index.html) and am able to replicate getting a "2,0" whenever I ask about the capital of France. Without any system prompt specified, the result is not helpful either. However, if I specify an empty system prompt, it works.
I included multiple runs of the program. I might be doing something wrong here, but I feel like the default code should work out of the box? Maybe it's just my machine though ¯\_(ツ)_/¯.
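For reference, a minimal sketch of the empty-system-prompt workaround described above, assuming a bindings version whose chat_session() accepts a system_prompt argument:

from gpt4all import GPT4All

model = GPT4All('Meta-Llama-3-8B-Instruct.Q4_0.gguf')

# Passing an explicit empty system prompt; per the report above, this
# behaves differently from leaving the argument at its default.
with model.chat_session(system_prompt=''):
    print(model.generate('What is the capital of France?', max_tokens=16))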