simonw / llm-llama-cpp

LLM plugin for running models using llama.cpp
Apache License 2.0

Output does not stream #11

Closed. bvanslyke closed this issue 1 year ago.

bvanslyke commented 1 year ago

Rather than streaming, all of the output shows up at the end, all at once.

But if I add a print statement just before each item is yielded, then I do see the text generated line by line from my print().

for item in stream:
    # Each item looks like this:
    # {'id': 'cmpl-00...', 'object': 'text_completion', 'created': .., 'model': '/path', 'choices': [
    #   {'text': '\n', 'index': 0, 'logprobs': None, 'finish_reason': None}
    # ]}
    print(item["choices"][0]["text"], end="")  # added this line for debugging
    yield item["choices"][0]["text"]
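For reference, here is a minimal standalone script that consumes the same kind of stream with llama-cpp-python directly; the model path and prompt are placeholders, not taken from the plugin. If this prints token by token, the buffering is happening above llama-cpp-python, which matches what the print() debugging suggests.

from llama_cpp import Llama

# Placeholder path: point this at any local model file llama.cpp can load.
model = Llama(model_path="/path/to/model.bin")

stream = model("Three reasons to stream output:", max_tokens=200, stream=True)
for item in stream:
    # Each item has the same dict shape as in the snippet above
    print(item["choices"][0]["text"], end="", flush=True)
print()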
vividfog commented 1 year ago

Simply setting can_stream = True enables streaming support in non-chat mode for me.

In llm_llama_cpp.py, at line 170 plus a new line 171:

class LlamaModel(llm.Model):
    can_stream = True

Together with adding max_tokens=4000 to streaming responses, as described in #6, the plugin appears to work as advertised. Those two would be good fixes to have in a release.
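For anyone patching this locally, here is a rough sketch of how the two fixes might fit together. The execute() signature comes from llm's plugin API; the model loading and prompt handling details are assumptions rather than the plugin's exact code, and in a real fix max_tokens would presumably be exposed as an option instead of hard-coded.

import llm
from llama_cpp import Llama


class LlamaModel(llm.Model):
    # Without this, llm treats the model as non-streaming and only
    # shows output once the whole response has been generated.
    can_stream = True

    def __init__(self, model_id, path):
        self.model_id = model_id
        self.path = path  # assumed: path to a local llama.cpp model file

    def execute(self, prompt, stream, response, conversation):
        model = Llama(model_path=self.path)
        completion = model(
            prompt.prompt,
            max_tokens=4000,  # the workaround described in #6
            stream=True,
        )
        for item in completion:
            yield item["choices"][0]["text"]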