Open jpzhangvincent opened 1 year ago
The issue from https://github.com/hwchase17/langchain/issues/5210, led me here!
I actually managed to implement a fix for the gpt4all wrapper, but I suspect the model itself is streaming too, since it executes the default stdout streamer and then executes my async callback. (I'm still investigating why I'm unable to remove the default callback, even though I overwrote the callbacks argument.)
So it seems the model is streaming to the stdout streamer. Does that mean the model itself generates its response token by token? In that case, could we implement an async version of the normal call, with a list of callbacks as an argument? Can someone confirm whether the model generates its output one token/word at a time?
The `generate(..., streaming=True)` method of v1.0.3 of the bindings is almost, but not quite, there. I guess it doesn't hurt to leave some examples here so people can use it with async code:
```python
import asyncio
import sys

from gpt4all import GPT4All


async def response1(g4a, prompt):
    # this is an asynchronous generator, see PEP 525
    # simple, neat; the problem is, this is still blocking
    for token in g4a.generate(prompt, streaming=True):
        yield token


async def response2(g4a, prompt):
    # a bit better; although blocking, it's interleaved
    for token in g4a.generate(prompt, streaming=True):
        yield token
        await asyncio.sleep(0.01)


async def response3(g4a, prompt):
    # doing it somehow with an executor, but not properly:
    generator = g4a.generate(prompt, streaming=True)
    event_loop = asyncio.get_running_loop()
    while True:
        try:
            yield await event_loop.run_in_executor(None, next, generator)
        except StopIteration:
            break


async def response4(g4a, prompt):
    # a better way to use an executor:
    generator = g4a.generate(prompt, streaming=True)
    event_loop = asyncio.get_running_loop()
    has_tokens = True

    def consume(generator):
        nonlocal has_tokens
        try:
            return next(generator)
        except StopIteration:
            has_tokens = False

    while has_tokens:
        token = await event_loop.run_in_executor(None, consume, generator)
        if token is not None:
            yield token


async def main1(model_name, model_folder, prompt):
    g4a = GPT4All(model_name, model_folder, allow_download=False)
    await asyncio.sleep(1)
    async for token in response1(g4a, prompt):
        print(token, end='', flush=True)
    print()


async def main2(model_name, model_folder, prompt):
    g4a = GPT4All(model_name, model_folder, allow_download=False)
    await asyncio.sleep(1)
    # asynchronous comprehension example
    [print('>', token) async for token in response1(g4a, prompt)]


async def main3(model_name, model_folder, prompt):
    g4a = GPT4All(model_name, model_folder, allow_download=False)
    await asyncio.sleep(1)
    # another asynchronous comprehension
    tokens = [token async for token in response1(g4a, prompt)]
    print('|'.join(tokens))


async def ticker():
    # the more 'tick' outputs interleave the response, the better
    for i in range(20):
        await asyncio.sleep(0.5)
        print(f" -> tick {i} <-")


if __name__ == '__main__':
    event_loop = asyncio.get_event_loop()
    try:
        model_name = sys.argv[1]
        model_folder = sys.argv[2]
        prompt = input("prompt: ")
        tasks = asyncio.gather(main1(model_name, model_folder, prompt), ticker())
        event_loop.run_until_complete(tasks)
    except Exception as exc:
        sys.exit(exc)
    finally:
        event_loop.run_until_complete(event_loop.shutdown_asyncgens())
        event_loop.close()
```
Save as e.g. `async_example.py`. Run the example with e.g. `python3 async_example.py "ggml-gpt4all-j-v1.3-groovy.bin" "/path/to/models/"` (I used groovy, with a simple "what is a moon?" as the prompt).

First, check the line `tasks = asyncio.gather(...)`: `ticker()` is there to make it clear where things are blocking. Remove it if you just want output.

Then play around with the different `main` and `response` functions. The `main` functions have a 1s sleep so that the ticker can already run a bit; they're otherwise not needed.

The `response` functions:
- `response1()` and `response2()` use a simple asynchronous generator. Using `yield` like that on the underlying `Queue` still blocks, however (not the `async` kind of block). The `response1()` approach is not that great; `response2()` is a bit better.
- `response3()` and `response4()` use an executor to get rid of the blocking. `response3()` shows the naïve approach, but this won't work well; it causes an exception. `response4()`
is probably the best solution.
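As an aside (not tied to that bindings version): on Python 3.9+, the same executor idea as in `response4()` can be written with `asyncio.to_thread` and `next()`'s default argument, which avoids juggling `StopIteration` across the executor boundary. A sketch, with a dummy blocking generator standing in for `g4a.generate(prompt, streaming=True)`:

```python
import asyncio
import time


def blocking_token_source():
    # dummy stand-in for g4a.generate(prompt, streaming=True);
    # any blocking iterator of tokens works here
    for token in ["Hello", " ", "world"]:
        time.sleep(0.05)  # simulate per-token model latency
        yield token


_DONE = object()  # sentinel; safer than None if the stream could yield falsy values


async def astream(generator):
    # same idea as response4(), but via asyncio.to_thread (Python 3.9+);
    # next(gen, default) returns the sentinel instead of raising StopIteration
    while True:
        token = await asyncio.to_thread(next, generator, _DONE)
        if token is _DONE:
            break
        yield token


async def main():
    tokens = [t async for t in astream(blocking_token_source())]
    return "".join(tokens)


if __name__ == '__main__':
    print(asyncio.run(main()))  # prints: Hello world
```

The sentinel trick matters because `StopIteration` raised inside a thread/future does not propagate cleanly through `await`, which is exactly what trips up the naïve `response3()` pattern.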
P.S. I had to brush up my knowledge for this, too. Hope there are no glaring mistakes.
Thank you very much for sharing this @cosmic-snow! That's exactly what I've been looking for these past few days.
There is one last thing I could not figure out from the gpt4all documentation. In the GUI/desktop app, responses can be very long. This is why we don't even need a [Continue] button (like the one that appears in ChatGPT when the response is incomplete). Do you have any recommendation/suggestion on how to reproduce this? In other words, automatically ask for more tokens until the real "end of response" occurs?
> Do you have any recommendation/suggestion on how to reproduce this? In other words, automatically ask for more tokens until the real "end of response" occurs?
I have not tried that myself, but I guess you could just call again to let it generate more:
Now it depends a bit on what your code looks like. With v1.x of the bindings, you'll have to use a session if you want to simply continue; otherwise, you'd have to feed the whole input/output back into `generate()`.
Next, you need to be aware of what the templates are like. If you use any, disable them and try to just send an empty string as the prompt. See if that actually gets it going again. If not, maybe tell it explicitly to "go on" or "continue" (maybe try that with templates, though, as otherwise it'd probably interpret it as part of "its own text").
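To make that concrete, the "keep asking until the real end" loop might look like the sketch below. `generate_fn` is a hypothetical placeholder for whatever calls the model (e.g. `g4a.generate` inside a session), and treating an empty response as the true end is an assumption you would have to tune:

```python
def generate_until_done(generate_fn, prompt, max_rounds=5):
    # generate_fn is hypothetical: any callable taking a prompt string and
    # returning the next chunk of text ("" once there is nothing left to say)
    parts = [generate_fn(prompt)]
    for _ in range(max_rounds - 1):
        more = generate_fn("")  # empty prompt as a "continue" signal (templates off)
        if not more.strip():
            break  # assumed end-of-response condition
        parts.append(more)
    return "".join(parts)


# quick check with a fake model that runs dry after three chunks
chunks = iter(["The moon ", "is a natural ", "satellite."])
fake_generate = lambda prompt: next(chunks, "")
print(generate_until_done(fake_generate, "what is a moon?"))
# prints: The moon is a natural satellite.
```

The `max_rounds` cap is there as a safety net, since a model that keeps repeating itself (as described below) would otherwise loop forever.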
Thank you for your quick reply, I appreciate it. The 'empty string' approach kind of works. I've tried it, and even when it's used within a template, the results aren't that bad. The only issue is that once it reaches the end, it seems to repeatedly generate the exact same response over and over again. I will disable the template as you suggested to experiment more.
Before overthinking anything, I will start by studying more open-source code to avoid wasting a significant amount of time attempting to reinvent the wheel. However, if I end up implementing my own solution, your suggested approaches will certainly be useful ;)
> The only issue is that once it reaches the end, it seems to repeatedly generate the exact same response over and over again. I will disable the template as you suggested to experiment more.
>
> Before overthinking anything, I will start by studying more open-source code to avoid wasting a significant amount of time attempting to reinvent the wheel. However, if I end up implementing my own solution, your suggested approaches will certainly be useful ;)
Well, a problem is that it depends a lot on what model you're using, and many models expect specific templates to behave nicely. So it's hard to give general advice, and you will have to tinker for a bit to get good results.
Additionally, there are several parameters that influence the result in one way or another (the big ones are temperature, top-p, top-k). Also, by the way, if you haven't already, take a closer look at the API. There is a "cut-off" parameter called `max_tokens` (earlier called `n_predict`).
This issue seems to have gotten stale; please reopen if that isn't actually the case.
Please always feel free to open more issues as needed.
This has not yet been implemented.
The examples I provided in the comment above are more of a workaround and require an additional thread (`response3` and `response4`). Or they are blocking (`response1` and `response2`), which is not a good thing in Python asyncio.
Hi, when I try to `arun` the chain below using a Sagemaker Endpoint, I'm receiving the following error. The chain is `chain = LLMChain(llm=SagemakerEndpoint(endpoint_name=llm_ENDPOINT, region_name=REGION_NAME, content_handler=content_handler), prompt=prompt)`
`NotImplementedError: Async generation not implemented for this LLM.`
Are async calls available for the Sagemaker Endpoint? If not, is there a workaround?
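Not an answer for `SagemakerEndpoint` specifically, but a common generic workaround while an LLM class lacks native async support is to push the blocking call onto a thread so it at least doesn't block the event loop. Below, `FakeChain` is a stand-in for the real `LLMChain`; only a synchronous `.run(prompt)` method is assumed:

```python
import asyncio


class FakeChain:
    # stand-in for LLMChain(llm=SagemakerEndpoint(...), prompt=prompt);
    # only a blocking .run(prompt) -> str interface is assumed
    def run(self, prompt):
        return f"echo: {prompt}"


async def arun_workaround(chain, prompt):
    # run the blocking call in a worker thread (Python 3.9+);
    # on older Pythons: loop.run_in_executor(None, chain.run, prompt)
    return await asyncio.to_thread(chain.run, prompt)


if __name__ == '__main__':
    print(asyncio.run(arun_workaround(FakeChain(), "hello")))  # prints: echo: hello
```

This doesn't make the underlying HTTP call truly asynchronous, but it lets the rest of the event loop (e.g. websocket handlers) keep running while the endpoint responds.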
Thanks in advance.
Feature request
Allow for better production integration for chatbot use cases, to support websocket async calls.
Motivation
The OpenAI API also has an async generate method. It would allow better integration with langchain as well.
Your contribution
Suggestion: