nomic-ai / gpt4all

GPT4All: Chat with Local LLMs on Any Device
https://gpt4all.io
MIT License

Async support for the `generate` method #752

Open jpzhangvincent opened 1 year ago

jpzhangvincent commented 1 year ago

Feature request

Allow better production integration for chatbot use cases by supporting asynchronous calls, e.g. over websockets.

Motivation

The OpenAI API also has an async generate method. It would allow better integration with LangChain as well.

Your contribution

Suggestion:

khaledadrani commented 1 year ago

The issue at https://github.com/hwchase17/langchain/issues/5210 led me here!

I actually managed to implement a fix for the gpt4all wrapper, but I suspect the model itself is streaming on its own as well, since it runs the default stdout streamer and only then executes my async callback. (I'm still investigating why I'm unable to remove the default callback, even though I overrode the callbacks argument.)

So it seems that the model is streaming to the stdout streamer. Does that mean the model itself generates its response token by token? In that case, could we implement an async version of the normal call, with a list of callbacks as an argument? Can someone confirm whether the model generates its output one token/word at a time?
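Roughly what I have in mind, just as a sketch (the callback signature and the way I drive the blocking generator here are my own assumptions, not gpt4all's actual API):

import asyncio
from typing import Awaitable, Callable, Iterable, Iterator, List

async def agenerate(blocking_generate: Callable[[str], Iterable[str]],
                    prompt: str,
                    callbacks: List[Callable[[str], Awaitable[None]]]) -> str:
    # hypothetical async version of the normal call: drives a blocking,
    # token-by-token generator from a worker thread and awaits each callback
    loop = asyncio.get_running_loop()
    iterator: Iterator[str] = iter(blocking_generate(prompt))
    sentinel = object()
    tokens = []
    while True:
        # pull the next token in the default executor so the event loop stays free
        token = await loop.run_in_executor(None, next, iterator, sentinel)
        if token is sentinel:
            break
        tokens.append(token)
        for callback in callbacks:
            await callback(token)  # notify every async callback per token
    return ''.join(tokens)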

cosmic-snow commented 1 year ago

The generate(..., streaming=True) method of v1.0.3 of the bindings is almost, but not quite, there. I guess it doesn't hurt to leave some examples here so people can use it with async code:

import asyncio
import sys
from gpt4all import GPT4All

async def response1(g4a, prompt):
    # this is an asynchronous generator, see PEP 525
    # simple, neat; the problem is, this is still blocking
    for token in g4a.generate(prompt, streaming=True):
        yield token

async def response2(g4a, prompt):
    # a bit better, although blocking, it's interleaved
    for token in g4a.generate(prompt, streaming=True):
        yield token
        await asyncio.sleep(0.01)

async def response3(g4a, prompt):
    # doing it somehow with an executor, but not properly:
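    # (note: on newer Python versions a StopIteration escaping into a Future
    #  may not propagate cleanly, so prefer the approach in response4)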
    generator = g4a.generate(prompt, streaming=True)
    event_loop = asyncio.get_running_loop()
    while True:
        try:
            yield await event_loop.run_in_executor(None, next, generator)
        except StopIteration:
            break

async def response4(g4a, prompt):
    # a better way to use an executor:
    generator = g4a.generate(prompt, streaming=True)
    event_loop = asyncio.get_running_loop()
    has_tokens = True

    def consume(generator):
        nonlocal has_tokens
        try:
            return next(generator)
        except StopIteration:
            has_tokens = False

    while has_tokens:
        token = await event_loop.run_in_executor(None, consume, generator)
        if token is not None:
            yield token

async def main1(model_name, model_folder, prompt):
    g4a = GPT4All(model_name, model_folder, allow_download=False)
    await asyncio.sleep(1)
    async for token in response1(g4a, prompt):
        print(token, end='', flush=True)
    print()

async def main2(model_name, model_folder, prompt):
    g4a = GPT4All(model_name, model_folder, allow_download=False)
    await asyncio.sleep(1)
    # asynchronous comprehension example
    [print('>', token) async for token in response1(g4a, prompt)]

async def main3(model_name, model_folder, prompt):
    g4a = GPT4All(model_name, model_folder, allow_download=False)
    await asyncio.sleep(1)
    # another asynchronous comprehension
    tokens = [token async for token in response1(g4a, prompt)]
    print('|'.join(tokens))

async def ticker():
    # The more 'tick' outputs are interleaving the response, the better
    for i in range(20):
        await asyncio.sleep(0.5)
        print(f"  -> tick {i} <-")

if __name__ == '__main__':
    event_loop = asyncio.get_event_loop()
    try:
        model_name = sys.argv[1]
        model_folder = sys.argv[2]
        prompt = input("prompt: ")
        tasks = asyncio.gather(main1(model_name, model_folder, prompt), ticker())
        event_loop.run_until_complete(tasks)
    except Exception as exc:
        sys.exit(exc)
    finally:
        event_loop.run_until_complete(event_loop.shutdown_asyncgens())
        event_loop.close()

Notes:

P.S. I had to brush up on my knowledge for this, too. Hope there are no glaring mistakes.

OxaD commented 12 months ago

Thank you very much for sharing this @cosmic-snow! That's exactly what I've been looking for these past few days.

There is one last thing I could not figure out from the gpt4all documentation. In the GUI/desktop app, responses can be very long, which is why there isn't even a need for a [Continue] button (like the one that appears in ChatGPT when the response is incomplete). Do you have any recommendation/suggestion on how to reproduce this? In other words, how to automatically ask for more tokens until the real "end of response" occurs?

cosmic-snow commented 12 months ago

Do you have any recommendation/suggestion on how to reproduce this? In other words, how to automatically ask for more tokens until the real "end of response" occurs?

I have not tried that myself, but I guess you could just call again to let it generate more:
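A rough, untested sketch of what I mean, assuming the chat_session() helper from the v1.x bindings (adjust the model name/folder and templates to your setup):

from gpt4all import GPT4All

model = GPT4All("<model file>", "<model folder>", allow_download=False)

with model.chat_session():
    # within a session the previous output stays in the context,
    # so a follow-up call can pick up where the last one stopped
    first_part = model.generate("Write a long story about a dragon.", max_tokens=200)
    # an empty prompt may or may not get it going again (see below);
    # an explicit "continue" is the fallback
    second_part = model.generate("continue", max_tokens=200)
    print(first_part + second_part)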

Now it depends a bit on what your code looks like. With v1.x of the bindings, you'll have to use a session if you want to simply continue, otherwise you'd have to feed the whole input/output back into generate().

Next, you need to be aware of what the templates look like. If you use any, disable them and try just sending an empty string as the prompt. See if that actually gets it going again. If not, maybe tell it explicitly to "go on" or "continue" (maybe try that with templates, though, as otherwise it would probably interpret it as part of its own text).

OxaD commented 12 months ago

Thank you for your quick reply, I appreciate it. The 'empty string' approach kind of works. I've tried it, and even when it's used within a template, the results aren't that bad. The only issue is that once it reaches the end, it seems to repeatedly generate the exact same response over and over again. I will disable the template as you suggested to experiment more.

Before overthinking anything, I will start with studying more open-source code to avoid wasting a significant amount of time attempting to reinvent the wheel. However, if I end up implementing my own solution, your suggested approaches will certainly be useful ;)

cosmic-snow commented 12 months ago

The only issue is that once it reaches the end, it seems to repeatedly generate the exact same response over and over again. I will disable the template as you suggested to experiment more.

Before overthinking anything, I will start with studying more open-source code to avoid wasting a significant amount of time attempting to reinvent the wheel. However, if I end up implementing my own solution, your suggested approaches will certainly be useful ;)

Well, a problem is that it depends a lot on which model you're using, and many models expect specific templates to behave nicely. So it's hard to give general advice, and you will have to tinker for a bit to get good results.

Additionally, there are several parameters that influence the result in one way or another (the big ones are temperature, top-P and top-K). Also, by the way, if you haven't already, take a closer look at the API: there is a "cut-off" parameter called max_tokens (earlier called n_predict).
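For example, something like this (parameter names as in the Python bindings' generate() signature, values purely illustrative):

from gpt4all import GPT4All

model = GPT4All("<model file>", "<model folder>", allow_download=False)

response = model.generate(
    "Explain what a websocket is.",
    max_tokens=250,  # cut-off for the length of the response (earlier: n_predict)
    temp=0.7,        # temperature: higher means more randomness
    top_k=40,
    top_p=0.4,
)
print(response)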

niansa commented 11 months ago

This issue seems to have gone stale; please reopen if that isn't actually the case.

Please always feel free to open more issues as needed.

cosmic-snow commented 11 months ago

This has not yet been implemented.

The examples I provided in the comment above are more of a workaround: they either require an additional thread (response3 and response4) or they are blocking (response1 and response2), which is not a good thing in Python asyncio.

varshasathya commented 11 months ago

Hi, when I try to 'arun' the chain below using a SageMaker endpoint, I'm receiving the following error.

chain = LLMChain(llm=SagemakerEndpoint(endpoint_name=llm_ENDPOINT, region_name=REGION_NAME, content_handler=content_handler), prompt=prompt)

NotImplementedError: Async generation not implemented for this LLM.

Are async calls available for SagemakerEndpoint? If not, is there a workaround for this?

Thanks in advance.