nomic-ai / gpt4all

GPT4All: Chat with Local LLMs on Any Device
https://gpt4all.io
MIT License

The tokens are not produced when running model.generate in GPT4All Python Generation API #1796

Closed ElMouse closed 6 months ago

ElMouse commented 6 months ago

System Info

Windows 11, Python 310, GPT4All Python Generation API

Reproduction

Using the GPT4All Python Generation API, I am running into strange behavior that I cannot explain, and it is really frustrating. It seems like I am missing something obvious, and I feel like an idiot.

Simply running

from gpt4all import GPT4All

model = GPT4All(model_name="mistral-7b-instruct-v0.1.Q4_0.gguf", verbose=True)
output = model.generate("prompt", max_tokens=300)
print(output)

sometimes produces an empty result, literally output == ''.

I looked into the gpt4all code, and it seems that

        def _callback_wrapper(
            callback: pyllmodel.ResponseCallbackType,
            output_collector: List[MessageType],
        ) -> pyllmodel.ResponseCallbackType:
            def _callback(token_id: int, response: str) -> bool:
                nonlocal callback, output_collector
                #print(f'Token ID: {token_id}, Response: {response}')

                output_collector[-1]["content"] += response

                return callback(token_id, response)

            return _callback

is not called... or does not work... every time. The tokens are sometimes just not produced. And by "not produced" I do not mean that the generation runs indefinitely: output = model.generate("prompt", max_tokens=300) runs for a while and simply produces no tokens and an empty result. While doing this, it takes about the same amount of time as when it IS producing an answer.

What is even weirder is that if I loop output = model.generate("prompt", max_tokens=300), it can produce a result on the first try, sometimes on the 3rd or 4th try, or sometimes not even on the 100th try, with the same prompt (a minimal version of that loop is sketched below).
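
A minimal sketch of that loop, for reference (the prompt text and attempt count are placeholders):

from gpt4all import GPT4All

model = GPT4All(model_name="mistral-7b-instruct-v0.1.Q4_0.gguf", verbose=True)

for attempt in range(1, 11):
    output = model.generate("prompt", max_tokens=300)
    # An empty string means no tokens came back on this attempt.
    print(f"attempt {attempt}: {len(output)} characters")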

It seems I am missing something. I have tried different parameters, different prompts, and different models.

For the life of me, I can't figure out the logic of when it works and when it doesn't.

At the same time, the desktop version produces answers every time, all the time, without problems, with the same models and the same prompts.

And output = model.generate("prompt", max_tokens=300) doesn't produce any logs, even in verbose mode. I am really puzzled.
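
One thing that can at least show whether any tokens arrive is the streaming form of generate. This is only a sketch, and it assumes streaming=True yields token strings as described in the Python bindings documentation:

from gpt4all import GPT4All

model = GPT4All(model_name="mistral-7b-instruct-v0.1.Q4_0.gguf", verbose=True)

# Collect the raw token stream to see whether anything is produced at all.
# Assumes generate(..., streaming=True) yields token strings.
tokens = list(model.generate("prompt", max_tokens=300, streaming=True))
print(f"received {len(tokens)} tokens")
print("".join(tokens))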

Any pointers are appreciated. I am desperate here. Thanks in advance.

Expected behavior

At least some answer is produced every time output = model.generate("prompt", max_tokens=300) is executed.

ElMouse commented 6 months ago

OK, it has to be something around the callback logic in pyllmodel.py, but that code looks like rocket science to me for now.

ElMouse commented 6 months ago

OK, I found a solution.

Changing

output = model.generate("prompt", max_tokens=300)
print(output)

to

with model.chat_session():
    model.generate("prompt", max_tokens=300)
    output = model.current_chat_session[2]['content']
print(output)

resolves the issue, and I get output every time. I am not closing the issue, as I suppose output = model.generate("prompt", max_tokens=300) was also intended to work normally on its own?

wrinkledeth commented 6 months ago

@ElMouse You're a real one for troubleshooting this, 19 hours before I needed it as well. Thanks so much!!!

cebtenzzre commented 6 months ago

This is not unexpected - note that the parameter is "max_tokens", not "tokens". If an LLM decides that the output should end, it sends the EOS token before max_tokens is hit. A chat session is usually what you want - and you should be able to just take the output from the generate call when it is wrapped, without touching model.current_chat_session.
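
In other words, the wrapped call can look roughly like this (a minimal sketch based on the comment above; the prompt is a placeholder):

from gpt4all import GPT4All

model = GPT4All(model_name="mistral-7b-instruct-v0.1.Q4_0.gguf")

# Read the return value of generate() directly instead of indexing current_chat_session.
with model.chat_session():
    output = model.generate("prompt", max_tokens=300)

print(output)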