nomic-ai / gpt4all

GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
https://nomic.ai/gpt4all
MIT License

Problem with phi-3.5-mini-instruct chat template and endless generation #2930

Open dlippold opened 4 weeks ago

dlippold commented 4 weeks ago

Bug Report

I wanted to use the new model Phi-3.5-mini-instruct, so I downloaded the file Phi-3.5-mini-instruct-Q5_K_M.gguf from https://huggingface.co/bartowski/Phi-3.5-mini-instruct-GGUF

The model card states the following prompt format:

<|system|> {system_prompt}<|end|><|user|> {prompt}<|end|><|assistant|>

Therefore I used that information in the settings. As a result, I have the following section in the file GPT4All.ini:

[model-Phi-3.5-mini-instruct-Q5_K_M.gguf]
filename=Phi-3.5-mini-instruct-Q5_K_M.gguf
name=Phi-3.5-mini-instruct
promptTemplate=<|user|>\n%1<|end|>\n<|assistant|>\n
systemPrompt=<|system|>\nYou are a helpful assistant.<|end|>\n
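
For reference, here is a minimal Python sketch of how the %1 placeholder in promptTemplate gets expanded into the final prompt (illustrative only; render_prompt is not GPT4All's actual internal function):

# Illustrative sketch of the %1 placeholder expansion, not GPT4All's code.
def render_prompt(system_prompt: str, prompt_template: str, user_text: str) -> str:
    # %1 is replaced by the user's message; %2 (where present) is replaced
    # by the assistant's reply when past turns are re-serialized.
    return system_prompt + prompt_template.replace("%1", user_text)

system_prompt = "<|system|>\nYou are a helpful assistant.<|end|>\n"
prompt_template = "<|user|>\n%1<|end|>\n<|assistant|>\n"
print(render_prompt(system_prompt, prompt_template, "Who was Alan Turing?"))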

But when I use the model, sometimes after the answer a new question is automatically generated and answered. I suppose the reason for that has to do with the prompt template or with how the prompt template is processed.

Can I modify the prompt template so that this model works correctly (and similarly for other models I download from Hugging Face)?

There seems to be information about the prompt template in the GGUF metadata. Would it be possible for GPT4All to use this information automatically?
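
For illustration, the relevant metadata can be inspected with the gguf Python package from the llama.cpp repository (a sketch; the field-access details are my assumption and may differ between package versions):

# pip install gguf  (the gguf-py package maintained in the llama.cpp repo)
from gguf import GGUFReader

reader = GGUFReader("Phi-3.5-mini-instruct-Q5_K_M.gguf")

# For string-valued fields the value bytes are stored in the field's parts;
# taking the last part is an assumption about the current gguf layout.
template_field = reader.fields.get("tokenizer.chat_template")
if template_field is not None:
    print(template_field.parts[-1].tobytes().decode("utf-8"))

# The end-of-sequence token id is stored under its own key.
eos_field = reader.fields.get("tokenizer.ggml.eos_token_id")
if eos_field is not None:
    print(eos_field.parts[-1])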

Steps to Reproduce

  1. Download the model stated above
  2. Add the lines cited above to the file GPT4All.ini
  3. Start GPT4All and load the model Phi-3.5-mini-instruct
  4. Ask a simple question (maybe several times)

Expected Behavior

Only the user's questions should be answered; no new question or task should be generated.

Your Environment

vap0rtranz commented 3 weeks ago

But when I use the model, sometimes after the answer a new question is automatically generated and answered.

Are you talking about the new auto-generated follow-up questions?

If you are, then those can be disabled.

In the GUI, under Settings -> Application, set Suggestion Mode to Never:

[screenshot of the Suggestion Mode setting in the Application settings]

dlippold commented 3 weeks ago

Thanks for the question, but I had already set that to Never.

I did some more tests, and interestingly, now only new answers were generated, without new questions (in the previous tests I had modified the templates several times; maybe questions were generated only for certain prompt templates). Here is one new example (I had to stop the generation):

Prompt: Who was Alan Turing?

Response: Alan Turing was an English mathematician, logician, cryptanalyst, and computer scientist widely considered to be the father of theoretical computer science and artificial intelligence (AI). Born on June 23, 1912, in Maida Vale, London, he made significant contributions during World War II by working at Bletchley Park where his work led directly or indirectly into saving countless lives.

[some text deleted]

Alan's story serves as an inspiration not only in science but also humanity at large—a reminder that no matter how daunting challenges may seem, with passionate dedication one can overcome them & leave behind a lasting legacy for others to follow suit!<|end|><|assistant|> Alan Turing was indeed among the most influential figures of his time. His contributions in mathematics and computer science laid down many foundational principles which are still being used today:

1) Turing Machines - [text deleted]

Please note the generated text <|end|><|assistant|> between the two answers.

Then I extended the prompt template so that the file GPT4All.ini now contains the following lines:

[model-Phi-3.5-mini-instruct-Q5_K_M.gguf]
filename=Phi-3.5-mini-instruct-Q5_K_M.gguf
name=Phi-3.5-mini-instruct
promptTemplate=<|user|>\n%1<|end|>\n<|assistant|>\n%2<|end|>\n
systemPrompt=<|system|>\nYou are a helpful assistant.<|end|>\n

That didn't resolve the problem. Here is an example (I had to stop the generation again):

Prompt: Who was Alan Turing?

Response: Alan Turing was an English mathematician, logician, cryptanalyst, and computer scientist widely considered to be the father of theoretical computer science and artificial intelligence (AI). Born on June 23, 1912, in Maida Vale, London, he made significant contributions during World War II by working at Bletchley Park where his work was crucial for deciphering German Enigma-encrypted messages.

[some text deleted]

His story continues to inspire people across disciplines and generational divides—his life serves both cautionary tale about societal prejudices but also testament of human potential when nurtured in an environment that values intellectual curiosity, innovation & progress over conformity or bigotry; his memory lives on as a beacon for those who dare to dream and push boundaries beyond conventional limits. Turing's life serves both cautionary tale about societal prejudices but also testament of human potential when nurtured in an environment that values intellectual curiosity, innovation & progress over conformity or bigotry; his memory lives on as a beacon for those who dare to dream and push boundaries beyond conventional limits. Turing's life serves both cautionary tale about societal prejudices but also testament of human potential when nurtured in an environment that values intellectual curiosity, innovation & progress over conformity or bigotry; his memory lives on as a beacon for those who dare to dream and push boundaries beyond conventional limits. Turing's life serves both cautionary tale about societal prejudices but also testament of human potential when nurtured in an environment that values intellectual curiosity, innovation & progress over conformity or bigotry; his memory lives on as a beacon for those who dare to dream and push boundaries beyond conventional limits.

[the last sentence is repeated again and again]

Please note that the text <|end|><|assistant|> is no longer generated.

My assumption now is that the model's stop token is not handled correctly.
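
To illustrate the suspected failure mode: if the generation loop compares each sampled token against a single EOS id only, an alternate stop token never terminates the loop. A toy sketch (the token ids are assumptions for illustration):

# Toy sketch of the suspected failure mode; the ids are illustrative.
EOS_TOKEN_ID = 32000       # e.g. <|endoftext|>
ALT_STOP_TOKEN_ID = 32007  # e.g. <|end|>

def should_stop(sampled_token_id: int) -> bool:
    # Only one EOS id is checked, so ALT_STOP_TOKEN_ID falls through and
    # generation continues past the intended end of the answer.
    return sampled_token_id == EOS_TOKEN_ID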

ThiloteE commented 3 weeks ago

Please try https://huggingface.co/GPT4All-Community/Phi-3.5-mini-instruct-GGUF

vap0rtranz commented 3 weeks ago

That didn't resolve the problem. Here is an example (I had to stop the generation again):

Ah, I understand now.

I reproduced this by downloading that model (the Phi-3-medium variant) from Hugging Face and setting the standard prompt per the model card, and I had the same issue with the prompt text inserting itself into the output/reply of my session.

It looks like ThiloteE is suggesting the community-built variant of this model.

dlippold commented 3 weeks ago

@ThiloteE: Thanks for supplying the model. I tried the Q6_K version.

The result is mixed: sometimes the generation stops by itself, and sometimes it does not.

Here is an example of the second case:

Prompt: Who was Alan Turing?

Response: Alan Turing was an English mathematician, logician, cryptanalyst, and computer scientist who is widely considered to be the father of theoretical computing and artificial intelligence (AI). Born on June 23, 1912, in Maida Vale, London, he made significant contributions during World War II by helping decipher German codes at Bletchley Park. His work was crucial for shortening the war but remained classified until decades later due to its sensitive nature and potential impact on national security if disclosed prematurely or publicized without proper contextual understanding of his role in wartime efforts, which is why he became known as a "gentleman spy."

[some text deleted]

His story serves as both cautionary tale about how prejudice can stifle innovation and progress but also hopeful reminder that change is possible if people are willing to fight for it—even when facing seemingly insurmountable odds against them like Alan Turing did during his lifetime. In essence, without the genius of this remarkable individual whose life was cut tragically short by circumstances beyond anyone's control yet still managed not only survive but thrive despite being marginalized because he lived at a time when society didn’t accept him—we might have been living in an entirely different world today with less technological advancement than we enjoy now. His story serves as both cautionary tale about how prejudice can stifle innovation and progress but also hopeful reminder that change is possible if people are willing to fight for it—even when facing seemingly insurmountable odds against them like Alan Turing did during his lifetime. In essence, without the genius of this remarkable individual whose life was cut tragically short by circumstances beyond anyone's control yet still managed not only survive but thrive despite being marginalized because he lived at a time when society didn’t accept him—we might have been living in an entirely different world today with less technological advancement than we enjoy now. His story serves as both cautionary tale about how prejudice can stifle innovation and progress but also hopeful reminder that change is possible if people are willing to fight for it—even when facing seemingly insurmountable odds against them like Alan Turing did during his lifetime.

[the last two sentences are repeated again and again]

In my test I did the following:

  1. Started GPT4All
  2. Asked the question above
  3. Waited for the answer to complete, or stopped the generation of text.
  4. Executed the function Erase and reset chat session
  5. Repeated from step 2 on

My impression was that the first few cycles (steps 2 to 4) after starting GPT4All worked well but that the subsequent cycles didn't stop. But that could have been a coincidence.

Could it be that the model produces several stop tokens and only some of them are processed correctly by GPT4All?

ThiloteE commented 3 weeks ago

Have you also changed the prompt template to the one that I suggested? Or do you still use the old one from Bartowski?

I would suggest this chat template for the GPT4All-Community version:

<|user|>
%1<|end|>
<|assistant|>
%2<|end|>

Yes, the model is trained to use multiple stop tokens, and unfortunately GPT4All can only parse one of them. I believe the stop token in the template I provided is the one that is triggered earlier in the model's chat template, and more often than the other one, but stopping the generation early could confuse the model, so it is probably a bit of a hack. In my personal tests I encountered zero problems, otherwise I would not have uploaded the model. It is a little disheartening to hear that the model still behaves abnormally for you, because that means my methodology for creating those quants might be far from perfect.
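
As a workaround outside the GUI, you can stop on several strings manually with the Python bindings (a sketch; the model filename and the list of stop strings are assumptions):

# Workaround sketch using the gpt4all Python bindings: stream tokens and
# stop manually on any of several stop strings.
from gpt4all import GPT4All

STOP_STRINGS = ["<|end|>", "<|endoftext|>", "<|user|>"]  # assumed stop list

model = GPT4All("Phi-3.5-mini-instruct-Q6_K.gguf")  # filename assumed
out = ""
for token in model.generate("Who was Alan Turing?", max_tokens=512, streaming=True):
    out += token
    hit = next((s for s in STOP_STRINGS if s in out), None)
    if hit is not None:
        out = out[: out.index(hit)]  # trim at the first stop string
        break
print(out)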

The devs at Nomic know about the problem with parsing EOS tokens in the chat templates and are (probably) working on a fix, but it is not ready at the present time.

dlippold commented 3 weeks ago

Yes, I used the prompt template defined on the model's page and cited in your comment.

I just checked my impression that the first answer after starting GPT4All works well, i.e. that the generation stops automatically. Unfortunately, the impression is not generally true.

Thanks again for your work. I think it is an important part of investigating the problem and, in the end, solving it.

ThiloteE commented 3 weeks ago

This seems to be an upstream issue with llama.cpp; see https://github.com/ggerganov/llama.cpp/issues/9127. It has been reported to start at 4096 tokens. There are also more reports in the KoboldCpp repository. It has been suggested that the issue might be related to RoPE scaling, which is a technique to extend context length.
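
For context, here is a toy sketch of the idea behind linear RoPE scaling (illustrative only, not llama.cpp's implementation): the rotary embedding rotates each channel pair by an angle proportional to the token position, and linear scaling divides the position by a factor so that a longer context maps into the position range the model was trained on.

# Toy illustration of linear RoPE (position interpolation) scaling;
# not llama.cpp's actual code.
def rope_angle(pos: int, pair_index: int, head_dim: int,
               base: float = 10000.0, scale: float = 1.0) -> float:
    theta = base ** (-2.0 * pair_index / head_dim)  # per-pair frequency
    return (pos / scale) * theta  # scaled position times frequency

# With scale=2.0, position 8192 produces the same angle as position 4096.
print(rope_angle(8192, 0, 96, scale=2.0) == rope_angle(4096, 0, 96))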

Since people with the default prompt template have these issues too, the cause is probably not my quantization method. Phew

eltay89 commented 2 weeks ago

Try this: leave the system prompt empty.

Use this for the prompt template:

### Human:  %1

### Assistant: %2