oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.

API results way off compared to the UI, same parameters, model, preset #6064

Open klaasvk opened 4 months ago

klaasvk commented 4 months ago

Describe the bug

Like the title says. I'm trying to run the API with a Python script I made and have it write the output to a CSV file. The prompt is the same as in the web UI and it produces results according to the prompt, but it looks like the model's intelligence got cut in half. It hallucinates way more and doesn't follow the prompt as well as the web UI does with the same parameters.

I don't know what I'm doing wrong. Can someone help me out or relate? I'm using the default 'chat' mode and settings even though the console says: 'It seems to be an instruction-following model.' Would I need to specify that in the Python script too?

Thanks 🗡️

Is there an existing issue for this?

Reproduction

Model: nous-hermes-2-solar-10.7b.Q5_K_M.gguf

Python config (I tried the same settings as in the web UI):

data = {
    "prompt": prompt,
    "max_tokens": 4000,
    "temperature": 1,
    "preset": "min_p",
    "top_p": 1,
    "min_p": 0.05,
    "stream": False
}

Screenshot

No response

Logs

No logs, but the API also returns shorter token lengths than the web UI.

System Info

RTX 3060 12 GB baby

Koesn commented 4 months ago

I was struggling with this, but found a fix. It happens because the system prompt and the last message are not added to chat_dialogue in the completions.py script. I added a bit of code to append them; check the completions.py patch. See if it helps.
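
A rough illustration of the kind of change described here, not the actual patch, which may differ; the names `chat_dialogue`, `system_message`, `history`, and `user_input` are assumptions rather than the exact variables in `extensions/openai/completions.py`:

```python
# Illustrative only: make sure the system prompt and the final user message
# actually end up in the dialogue that gets rendered by the chat template.
def build_chat_dialogue(system_message, history, user_input):
    chat_dialogue = []

    # Carry the system prompt over as the first turn, if there is one.
    if system_message:
        chat_dialogue.append({"role": "system", "content": system_message})

    # Keep the earlier turns of the conversation.
    chat_dialogue.extend(history)

    # Append the latest user message so it is not silently dropped.
    chat_dialogue.append({"role": "user", "content": user_input})

    return chat_dialogue
```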

klaasvk commented 3 months ago

> I was struggling with this, but found a fix. It happens because the system prompt and the last message are not added to chat_dialogue in the completions.py script. I added a bit of code to append them; check the completions.py patch. See if it helps.

I worked around it by specifying the ChatML template for the model in the Python script myself. It now runs well, but output lengths are still a bit inconsistent: sometimes long, mostly shorter.
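
A minimal sketch of that workaround, assuming the model expects ChatML (as Nous Hermes 2 does): build the ChatML prompt yourself and send it to the plain `/v1/completions` endpoint instead of relying on the server-side chat template. The system prompt and user message below are placeholders.

```python
# Format the conversation as ChatML manually before sending it to the
# completions endpoint.
def build_chatml_prompt(system_prompt: str, user_message: str) -> str:
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompt = build_chatml_prompt(
    "You are a helpful assistant that outputs CSV rows.",
    "Summarize the following records as CSV: ...",
)
```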

Koesn commented 3 months ago

> I worked around it by specifying the ChatML template for the model in the Python script myself. It now runs well, but output lengths are still a bit inconsistent: sometimes long, mostly shorter.

For some models like Miqu, the system prompt is not carried over because the chat template in tokenizer_config.json does not specify a system role. So I also modify the template inside it.
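
A hypothetical helper, not from this thread, showing the general idea: rewrite a model's `chat_template` in tokenizer_config.json so it carries the system role. The path is a placeholder, the template shown is ChatML-style purely as an example, and the right template depends on the model's actual prompt format, so back up the original file first.

```python
# Replace the chat_template in a model's tokenizer_config.json with a
# template that includes the system role (ChatML-style here as an example).
import json

CHATML_TEMPLATE = (
    "{% for message in messages %}"
    "{{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
)

config_path = "models/miqu/tokenizer_config.json"  # placeholder path

with open(config_path, "r", encoding="utf-8") as f:
    config = json.load(f)

config["chat_template"] = CHATML_TEMPLATE

with open(config_path, "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2, ensure_ascii=False)
```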