ollama / ollama

Get up and running with Llama 3.1, Mistral, Gemma 2, and other large language models.
https://ollama.com
MIT License

Why Ollama is so terribly slow when I set format="json" #3851

Open marksalpeter opened 4 months ago

marksalpeter commented 4 months ago

What is the issue?

This is a duplicate of #3154, which I'm assuming was closed by mistake. Inference with the format="json" param is roughly 10x slower than regular inference when additional context is included.

A prompt like this takes ~24s to return on an NVIDIA T4 with CUDA enabled when format="json" is set. The exact same prompt without format="json" takes ~2s to return. This has got to be a bug, right?

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

${context}

Please respond in the following JSON schema
{
   "${schema.fieldName}": {
      "type": ${schema.type},
      "description": ${schema.description}
     }
}

Question: ${schema.description}
Helpful Answer:
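
For reference, here is a minimal Go sketch that times the same request with and without format="json" against the /api/generate endpoint. It assumes the default 127.0.0.1:11434 address and uses llama3:8b as a stand-in model; actual timings depend on hardware and context length.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// timeGenerate sends one non-streaming /api/generate request and reports how long it took.
func timeGenerate(format string) time.Duration {
	body := map[string]any{
		"model":  "llama3:8b",
		"prompt": "Respond with a US address as JSON with the keys street, city and state.",
		"stream": false,
	}
	if format != "" {
		body["format"] = format
	}
	payload, _ := json.Marshal(body)

	start := time.Now()
	resp, err := http.Post("http://127.0.0.1:11434/api/generate", "application/json", bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body) // wait for the full response before stopping the clock
	return time.Since(start)
}

func main() {
	fmt.Println("plain:       ", timeGenerate(""))
	fmt.Println("format=json: ", timeGenerate("json"))
}
```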

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.1.32

sebdg commented 4 months ago

I think this is related to a loop-detection piece of code in server.go. The detection code allows the model to cycle over whitespace for a number of tokens; if the last token is repeated a number of times, or only whitespace is detected for around 30 iterations, it aborts the prediction. It would be useful if you could run your request with Ollama in debug mode; more about this here: troubleshooting

If this relates to the loop-detection logic, you will see a line like 'prediction aborted, token repeat limit reached' in the log. On the other hand, some other bugs around stop detection relate to llama.cpp and could also be a cause.
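
Roughly, the kind of guard being described looks like the sketch below. This is only an illustration of the idea, not the actual server.go code; the real threshold and token handling live in ollama's server.

```go
package main

import (
	"fmt"
	"strings"
)

// maxRepeats stands in for the "around 30 iterations" threshold mentioned above.
const maxRepeats = 30

// shouldAbort reports whether the last maxRepeats predicted tokens are either
// all identical or all whitespace, which is the abort condition described here.
func shouldAbort(tokens []string) bool {
	if len(tokens) < maxRepeats {
		return false
	}
	tail := tokens[len(tokens)-maxRepeats:]
	allSame, allSpace := true, true
	for _, t := range tail {
		if t != tail[0] {
			allSame = false
		}
		if strings.TrimSpace(t) != "" {
			allSpace = false
		}
	}
	return allSame || allSpace
}

func main() {
	// A run of whitespace-only tokens trips the guard.
	spaces := make([]string, maxRepeats)
	for i := range spaces {
		spaces[i] = " "
	}
	fmt.Println(shouldAbort(spaces)) // true
}
```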

It would also be useful to try your request with streaming enabled; this will show what the model returns:

```
curl http://127.0.0.1:11434/api/generate -d '{
  "model": "llama3:8b",
  "prompt": "You are a helpful writer, respond with an address in the US in JSON format.",
  "stream": true,
  "format": "json"
}'
```

As a workaround, I would recommend not using format=json for now and just mentioning the JSON requirement in the prompt itself. Depending on your integration, you might be better off capturing the JSON part of the response with a regex or similar. I've had some flaws and inconsistent behavior using format=json across different models, and the regex might be a more robust solution to this.
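
If you go the regex route, something like this sketch works for single-object responses (the example reply and field names are placeholders; running the match through json.Unmarshal afterwards catches malformed output):

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// jsonBlock greedily grabs everything between the first '{' and the last '}'.
// Good enough for a single JSON object surrounded by prose; not a full parser.
var jsonBlock = regexp.MustCompile(`(?s)\{.*\}`)

func extractJSON(response string) (map[string]any, error) {
	match := jsonBlock.FindString(response)
	if match == "" {
		return nil, fmt.Errorf("no JSON object found in response")
	}
	var out map[string]any
	if err := json.Unmarshal([]byte(match), &out); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	reply := "Sure! Here is the address:\n{\"street\": \"1 Main St\", \"city\": \"Springfield\"}\nHope that helps."
	parsed, err := extractJSON(reply)
	fmt.Println(parsed, err)
}
```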

coder543 commented 4 months ago

I will say... I've observed that some models are slower with json mode than others. I'm not sure if it is a bug in the implementation, or if the models themselves are just trained in interesting ways.

Observing the streaming response, it seems to respond quickly, but then it waits around for a while before deciding the message is complete. A well-defined grammar would realize that the JSON message is over and terminate immediately, rather than waiting for some kind of end-of-stream token, and this could be the issue here.
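
One client-side mitigation along those lines is to read the streaming response and stop as soon as the accumulated output forms a balanced JSON object, rather than waiting for the model's own end-of-stream token. The sketch below assumes the default endpoint and llama3:8b, and it does not account for braces inside JSON string values, so treat it as a rough workaround rather than a fix for the grammar itself.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

// streamUntilBalanced reads a streaming /api/generate response and stops once
// the accumulated text forms a balanced JSON object.
func streamUntilBalanced(prompt string) (string, error) {
	payload, _ := json.Marshal(map[string]any{
		"model":  "llama3:8b",
		"prompt": prompt,
		"stream": true,
		"format": "json",
	})
	resp, err := http.Post("http://127.0.0.1:11434/api/generate", "application/json", bytes.NewReader(payload))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	var out strings.Builder
	depth, started := 0, false
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		var chunk struct {
			Response string `json:"response"`
			Done     bool   `json:"done"`
		}
		if err := json.Unmarshal(scanner.Bytes(), &chunk); err != nil {
			continue
		}
		out.WriteString(chunk.Response)
		for _, r := range chunk.Response {
			switch r {
			case '{':
				depth++
				started = true
			case '}':
				depth--
			}
		}
		if chunk.Done || (started && depth == 0) {
			break // complete object received; stop instead of waiting for trailing whitespace
		}
	}
	return out.String(), scanner.Err()
}

func main() {
	text, err := streamUntilBalanced("Respond with a US address in JSON.")
	fmt.Println(text, err)
}
```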

not-nullptr commented 4 months ago

I'm getting this exact issue on llama3:8b but not with mistral:latest, weirdly enough. Speeds for regular text are exactly the same between both models on my 3080. I think @coder543 is correct, except that this is a bug in the implementation. Why are we outputting whitespace in the JSON in the first place?

not-nullptr commented 4 months ago

https://github.com/ollama/ollama/assets/62841684/a91cb579-4160-445d-ad47-caf888f17a39

https://github.com/ollama/ollama/assets/62841684/fbc5a9b4-0113-4d2a-8467-5b24083433f7

The first video demonstrates my function calling without "format": "json", and the second demonstrates it with "format": "json". You can see the speed difference is insane; same prompt and everything.

coder543 commented 4 months ago

Unfortunately your video isn’t visually showing it generating JSON in both modes. If the model can’t respond with the correct JSON without JSON mode at least some of the time, it makes it harder to know for sure where the issue is in a piece of software like this.

It would also be helpful if the response were streaming (with a clear visual indication of when the streaming response has finished) so we could see if it pauses after generating the JSON, or if it is just generating the JSON really slowly character by character.

I would definitely like this slow JSON situation to be fixed.

mitar commented 3 months ago

There are some known upstream issues with grammar restrictions (which JSON format uses): https://github.com/ggerganov/llama.cpp/issues/4218