WDYT of such a feature @Narsil?
I like the idea.
How would that look code-wise, though?
pipe = pipeline('text-generation")
# Regular usage
generated = pipe("This is my prompt")
for out in pipe("This is my prompt", continuous=True):
# out = [{"generated_text": " and"}]
#
The biggest caveat with this idea is that this parameter will probably be hard to combine with things like batch_size and num_beams. We can disable some options when certain combinations of arguments are provided, but in general I prefer when all combinations of parameters are available.
Another idea would be to somehow add a callback within generate to receive the ids as they come in.
What I don't like about callbacks is that they are not easy to work with and debug, but this could be much easier to implement, since we're just injecting something within generate.
def print_intermediate_results(out):
    print(out)

pipe = pipeline("text-generation")
out = pipe("This is my prompt", continuous_fn=print_intermediate_results)
Pinging @patrickvonplaten to see if you have ideas to get continuous tokens within generate.
@gante @patil-suraj could you take a look here?
As @Narsil said, in greedy search/sample generation, we can loop over and call generation with one new token at a time. The performance penalty is not that big, a bit over 2x (on colab, the penalty probably grows with sequence length), and is trivial to implement.
For beam search generation, either there is some channel to push sequences as they are generated, or the whole generation logic is somehow exposed to correctly keep track of running sequences/scores. The latter seems unfeasible, the former could be done e.g. with some form of asynchronous queue (one thread runs generate and pushes to the queue, another reads from the queue).
I'm not experienced in these matters, but... the cost/benefit ratio doesn't seem good (for beam search) 😅
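For illustration, here is a minimal sketch (not from the thread) of the naive one-token-at-a-time greedy loop described above; the model name and prompt are placeholders:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("This is my prompt", return_tensors="pt").input_ids
for _ in range(20):
    # Ask for a single new token, then feed the extended sequence back in.
    # Each call re-encodes the full prefix because the cache is not carried
    # over between calls, which is where the slowdown discussed above comes from.
    input_ids = model.generate(input_ids, max_new_tokens=1, do_sample=False)
    print(tokenizer.decode(input_ids[0, -1:]), end="", flush=True)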
I like the idea, but I think it won't be trivial to implement given the current complexity of generate. Even for greedy search/sampling, simply calling generate for one token at a time will be very slow, as it won't be able to take advantage of caching.
Adding a callback seems a good idea IMO, as it won't clutter generate much. wdyt @patrickvonplaten @gante
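To illustrate the caching point, here is a rough sketch (again not from the thread) of a hand-rolled greedy loop that keeps past_key_values between steps, which is exactly what repeated generate calls cannot do; model name and step count are placeholders:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("This is my prompt", return_tensors="pt").input_ids
past_key_values = None
next_input = input_ids
with torch.no_grad():
    for _ in range(20):
        out = model(next_input, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values  # reuse the cache on the next step
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        next_input = next_token  # only the new token is fed in from now on
        print(tokenizer.decode(next_token[0]), end="", flush=True)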
Both can leverage the current generate and do NOT call generate one step at a time in my mind. Both would use a callback within generate, but the idea is to understand how a user would use those results.
I was merely asking how it should look from a pipeline user's perspective.
As a user, I think OpenAI deals with this quite well.
They use server-sent events to send over partial completions, i.e. what the JavaScript EventSource library consumes.
See "stream" in https://beta.openai.com/docs/api-reference/completions/create
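For context, a rough sketch of consuming that stream from Python rather than JavaScript (the endpoint, "stream" flag, and "data:"-prefixed event format are as documented at the link above; the API key and model name are placeholders):

import json
import requests

resp = requests.post(
    "https://api.openai.com/v1/completions",
    headers={"Authorization": "Bearer <API_KEY>"},
    json={"model": "text-davinci-003", "prompt": "This is my prompt",
          "max_tokens": 16, "stream": True},
    stream=True,
)
for line in resp.iter_lines():
    if not line.startswith(b"data: "):
        continue
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":  # end-of-stream sentinel
        break
    chunk = json.loads(payload)
    print(chunk["choices"][0]["text"], end="", flush=True)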
To be honest, I'm not in favor of adding this to generate - it's too much of a nice-to-have feature, would unnecessarily increase maintenance, and would make generate much harder to understand than it already is.
If it's possible to make it easy and clean with a general callbacks: Optional[GenerationCallback] = None function arg, I think I'd be fine with it though, but I would need to see a PR for it.
Then inside generate(), ideally we only have a single if callbacks is not None: check that calls all the callbacks.
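A hypothetical sketch of what that could look like (GenerationCallback is not an existing transformers class; the name, signature, and the toy loop below are assumptions for illustration only):

from typing import List, Optional

class GenerationCallback:
    """Hypothetical base class: invoked whenever new token ids are available."""
    def __call__(self, input_ids):
        raise NotImplementedError

def toy_generate(input_ids, max_new_tokens, callbacks: Optional[List[GenerationCallback]] = None):
    # Stand-in for the real generation loop; appends dummy token ids.
    for step in range(max_new_tokens):
        input_ids = input_ids + [step]
        # The single hook described above: one `if` inside the loop.
        if callbacks is not None:
            for callback in callbacks:
                callback(input_ids)
    return input_ids

class PrintCallback(GenerationCallback):
    def __call__(self, input_ids):
        print("tokens so far:", input_ids)

toy_generate([0, 1, 2], max_new_tokens=3, callbacks=[PrintCallback()])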
Code:
from transformers import pipeline
import torch
import threading
from transformers.generation_stopping_criteria import StoppingCriteria, StoppingCriteriaList
from queue import Queue

pipe = pipeline(model="hf-internal-testing/tiny-random-gpt2", task="text-generation", device=0)


class Stream(StoppingCriteria):
    def __init__(self, q):
        self.q = q

    def __call__(self, input_ids, scores) -> bool:
        # Push the ids generated so far; never actually stop generation.
        self.q.put(input_ids)
        return False


queue = Queue()


def gen():
    pipe.model.generate(
        torch.LongTensor([[0, 1, 2]]).cuda(),
        stopping_criteria=StoppingCriteriaList([Stream(queue)]),
        max_new_tokens=10,
    )
    print("Finished generation")
    queue.put(False)  # sentinel so the consumer loop can exit


threading.Thread(target=gen).start()

while True:
    i = queue.get()
    if i is False:
        break
    else:
        print("Got i", pipe.tokenizer.decode(i[0]))
What do you think about this?
I thought this would be an elegant solution to the problem. Basically, send generate to another thread and wait for the results as they come in.
The main drawback for pipelines, as I said, is the other parameter combinations + backward compatibility support. (Also, threads are a nightmare, and if users are already using pipelines within threads/async/multiprocessing, bad things might happen.)
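One way a pipeline user could consume this without touching threads directly is a small generator wrapper around the snippet above (stream_generate is a hypothetical helper, not an existing API; it reuses Stream, Queue, StoppingCriteriaList, and pipe from the code above):

def stream_generate(pipe, input_ids, **generate_kwargs):
    # Run generate in a background thread and yield decoded prefixes as they arrive.
    q = Queue()

    def worker():
        pipe.model.generate(
            input_ids,
            stopping_criteria=StoppingCriteriaList([Stream(q)]),
            **generate_kwargs,
        )
        q.put(None)  # sentinel: generation is done

    threading.Thread(target=worker, daemon=True).start()
    while True:
        ids = q.get()
        if ids is None:
            break
        yield pipe.tokenizer.decode(ids[0])

# Usage:
# for text in stream_generate(pipe, torch.LongTensor([[0, 1, 2]]).cuda(), max_new_tokens=10):
#     print(text)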
I'd be fine with this design - think it's nice! Think we should maybe put it under a new class though, called Callback instead of StoppingCriteria?
Think we should maybe put it under a new class though, called Callback instead of StoppingCriteria?
Yes for sure, this was the minimal code, definitely not fit for merge. Again, lots of caveats too with this approach, but at least it could be implemented relatively fast.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Has there been any progress on this since last year?
I am interested in generating one token at a time for an interactive text generation web UI. But simply calling model.generate with max_new_tokens=1 multiple times is a lot slower (about 2x) than generating all the tokens at once.
@oobabooga no progress, but I have it in my backlog for exploration. Very unlikely that it will see the light of day in the next ~6 months, though :)
FYI, I made a streaming generation service for Hugging Face transformers that is fully compatible with the OpenAI API: https://github.com/hyperonym/basaran
Feature request
When using the text-generation pipeline, we would like to be able to export each token as it is generated. Currently, we have to wait for the generation to be completed to view the results.
Motivation
When using text-generation in a production environment, this would greatly improve the user experience. Users currently have to wait for the full text to be generated; if we are able to implement this, they could read the text as it is generated by the models.
Your contribution
I would be able to bug-check this feature if it were added!