marella / ctransformers

Python bindings for the Transformer models implemented in C/C++ using GGML library.
MIT License

how to apply stream into MPT-7B-Instruct-GGML model #17

Open SergeyTokarevHYS opened 1 year ago

SergeyTokarevHYS commented 1 year ago

I'm trying to pass the arguments listed in the documentation, but I'm getting nowhere.

handler = StdOutCallbackHandler()
llm = CTransformers(
    model='TheBloke/MPT-7B-Instruct-GGML',
    model_file='mpt-7b-instruct.ggmlv3.q4_0.bin',
    model_type='mpt',
    config={"stream": True, "max_new_tokens": 256, "threads": 6},
    callbacks=[StreamingStdOutCallbackHandler()],
)
llm(PROMPT_FOR_GENERATION_FORMAT.format(context=content, query=query))

But it doesn't seem to work. It does not return a generator; instead it returns a string. The model takes an extremely long time to think before it starts printing, and the response speed is about the same as without streaming.

marella commented 1 year ago

LangChain LLMs must return a str (see method signature), so it won't return a generator because other LangChain modules that expect a str will break if they get a generator object. But the callbacks=[StreamingStdOutCallbackHandler()] should work and print text as it gets generated token by token.
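As an aside, a custom callback handler is one way to capture the streamed tokens instead of only printing them. A minimal sketch, assuming LangChain's BaseCallbackHandler interface; TokenCollector is a hypothetical name, not part of ctransformers or LangChain:

from langchain.callbacks.base import BaseCallbackHandler

class TokenCollector(BaseCallbackHandler):
    # Hypothetical handler: collects tokens as they are generated.
    def __init__(self):
        self.tokens = []

    def on_llm_new_token(self, token: str, **kwargs) -> None:
        # Called once per generated token when streaming is enabled.
        self.tokens.append(token)
        print(token, end='', flush=True)

Passing callbacks=[TokenCollector()] to CTransformers would then let you inspect the collected .tokens list after the call while still printing text as it arrives.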

Some LLMs have a stream() method (see this) which returns a generator, but it is an experimental feature, so I didn't add it to the CTransformers class.

It is possible to get a generator using the core library without LangChain:

from ctransformers import AutoModelForCausalLM

# Load the GGML model directly with the core library.
llm = AutoModelForCausalLM.from_pretrained('marella/gpt-2-ggml')

# stream=True returns a generator that yields text as it is generated.
for chunk in llm('AI is going to', stream=True):
    print(chunk, end='', flush=True)

But if you want to use it only with LangChain, I can send a PR to add the stream() method to the CTransformers class in LangChain.
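For reference, a rough sketch of what such a stream() method could look like; this is an assumption, not the actual PR, and it presumes the LangChain wrapper keeps the underlying ctransformers model on a client attribute:

def stream(self, prompt: str):
    # Sketch only: yield text chunks as the underlying model generates them.
    # `self.client` is assumed to be the ctransformers model instance.
    for chunk in self.client(prompt, stream=True):
        yield chunk

With something like that in place, for chunk in llm.stream(prompt): ... would behave like the core-library example above.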