WDYT of such a feature @Narsil?
I like the idea.
How would that look code-wise, though?
pipe = pipeline('text-generation")
# Regular usage
generated = pipe("This is my prompt")
for out in pipe("This is my prompt", continuous=True):
# out = [{"generated_text": " and"}]
#
The biggest caveat with this idea is that this parameter will probably be hard to combine with things like batch_size and num_beams. We can disable some options when certain combinations of arguments are provided, but in general I prefer when all combinations of parameters are available.
Another idea would be to somehow add a callback within generate to receive the ids as they come in.
What I don't like about callbacks is that they are not easy to work with and debug, but this could be much easier to implement, since we're just injecting something within generate.
def print_intermediate_results(out):
    print(out)

pipe = pipeline("text-generation")
out = pipe("This is my prompt", continuous_fn=print_intermediate_results)
Pinging @patrickvonplaten to see if you have ideas to get continuous tokens within generate.
@gante @patil-suraj could you take a look here?
As @Narsil said, in greedy search/sample generation, we can loop over and call generation with one new token at a time. The performance penalty is not that big, a bit over 2x (on colab, the penalty probably grows with sequence length), and is trivial to implement.
For beam search generation, either there is some channel to push sequences as they are generated, or the whole generation logic is somehow exposed to correctly keep track of running sequences/scores. The latter seems unfeasible, the former could be done e.g. with some form of asynchronous queue (one thread runs generate and pushes to the queue, another reads from the queue).
I'm not experienced in these matters, but... the cost/benefit ratio doesn't seem good (for beam search) 😅
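For illustration, here is a minimal sketch (not from the thread) of the naive one-token-at-a-time greedy loop described above; the model name and prompt are placeholders:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("This is my prompt", return_tensors="pt").input_ids
for _ in range(20):
    # Ask for a single new token, then feed the extended sequence back in.
    # Each call re-encodes the full prefix because the cache is not carried
    # over between calls, which is where the slowdown discussed above comes from.
    input_ids = model.generate(input_ids, max_new_tokens=1, do_sample=False)
    print(tokenizer.decode(input_ids[0, -1:]), end="", flush=True)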
I like the idea, but I think it won't be trivial to implement given the current complexity of generate. Even for greedy search/sampling, simply calling generate for one token at a time will be very slow, as it won't be able to take advantage of caching.
Adding a callback seems a good idea IMO, as it won't clutter generate much. wdyt @patrickvonplaten @gante
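To illustrate the caching point, here is a rough sketch (again not from the thread) of a hand-rolled greedy loop that keeps past_key_values between steps, which is exactly what repeated generate calls cannot do; model name and step count are placeholders:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("This is my prompt", return_tensors="pt").input_ids
past_key_values = None
next_input = input_ids
with torch.no_grad():
    for _ in range(20):
        out = model(next_input, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values  # reuse the cache on the next step
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        next_input = next_token  # only the new token is fed in from now on
        print(tokenizer.decode(next_token[0]), end="", flush=True)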
Both can leverage the current generate and do NOT call generate one step at a time in my mind. Both would use a callback within generate, but the idea is to understand how a user would use those results.
I was merely asking how it should look from a pipeline user's perspective.
As a user, I think OpenAI deals with this quite well.
They use server-sent events to send over partial completions, i.e. what the JavaScript EventSource library consumes.
See "stream" in https://beta.openai.com/docs/api-reference/completions/create
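For context, a rough sketch of consuming that stream from Python rather than JavaScript (the endpoint, "stream" flag, and "data:"-prefixed event format are as documented at the link above; the API key and model name are placeholders):

import json
import requests

resp = requests.post(
    "https://api.openai.com/v1/completions",
    headers={"Authorization": "Bearer <API_KEY>"},
    json={"model": "text-davinci-003", "prompt": "This is my prompt",
          "max_tokens": 16, "stream": True},
    stream=True,
)
for line in resp.iter_lines():
    if not line.startswith(b"data: "):
        continue
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":  # end-of-stream sentinel
        break
    chunk = json.loads(payload)
    print(chunk["choices"][0]["text"], end="", flush=True)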
To be honest, I'm not in favor of adding this to generate - it's too much of a nice-to-have feature, would unnecessarily increase maintenance, and would make generate much harder to understand than it already is.
If it's possible to make it easy and clean with a general callbacks: Optional[GenerationCallback] = None function arg, I think I'd be fine with it though, but I would need to see a PR for it.
Then inside generate(), ideally we only have a single if callbacks is not None: check that calls all the callbacks.
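A hypothetical sketch of what that could look like (GenerationCallback is not an existing transformers class; the name, signature, and the toy loop below are assumptions for illustration only):

from typing import List, Optional

class GenerationCallback:
    """Hypothetical base class: invoked whenever new token ids are available."""
    def __call__(self, input_ids):
        raise NotImplementedError

def toy_generate(input_ids, max_new_tokens, callbacks: Optional[List[GenerationCallback]] = None):
    # Stand-in for the real generation loop; appends dummy token ids.
    for step in range(max_new_tokens):
        input_ids = input_ids + [step]
        # The single hook described above: one `if` inside the loop.
        if callbacks is not None:
            for callback in callbacks:
                callback(input_ids)
    return input_ids

class PrintCallback(GenerationCallback):
    def __call__(self, input_ids):
        print("tokens so far:", input_ids)

toy_generate([0, 1, 2], max_new_tokens=3, callbacks=[PrintCallback()])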
Code:
from transformers import pipeline
import torch
import threading
from transformers.generation_stopping_criteria import StoppingCriteria, StoppingCriteriaList
from queue import Queue

pipe = pipeline(model="hf-internal-testing/tiny-random-gpt2", task="text-generation", device=0)


class Stream(StoppingCriteria):
    def __init__(self, q):
        self.q = q

    def __call__(self, input_ids, scores) -> bool:
        # Push the ids generated so far; never actually stop generation.
        self.q.put(input_ids)
        return False


queue = Queue()


def gen():
    pipe.model.generate(
        torch.LongTensor([[0, 1, 2]]).cuda(),
        stopping_criteria=StoppingCriteriaList([Stream(queue)]),
        max_new_tokens=10,
    )
    print("Finished generation")
    queue.put(False)  # sentinel so the consumer loop can exit


threading.Thread(target=gen).start()

while True:
    i = queue.get()
    if i is False:
        break
    else:
        print("Got i", pipe.tokenizer.decode(i[0]))
What do you think about this?
I thought this would be an elegant solution to the problem. Basically, send generate to another thread and wait for the results as they come in.
The main drawback for pipelines, as I said, is the other parameter combinations + backward compatibility support. (Also, threads are a nightmare, and if users are already using pipelines within threads/async/multiprocessing, bad things might happen.)
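One way a pipeline user could consume this without touching threads directly is a small generator wrapper around the snippet above (stream_generate is a hypothetical helper, not an existing API; it reuses Stream, Queue, StoppingCriteriaList, and pipe from the code above):

def stream_generate(pipe, input_ids, **generate_kwargs):
    # Run generate in a background thread and yield decoded prefixes as they arrive.
    q = Queue()

    def worker():
        pipe.model.generate(
            input_ids,
            stopping_criteria=StoppingCriteriaList([Stream(q)]),
            **generate_kwargs,
        )
        q.put(None)  # sentinel: generation is done

    threading.Thread(target=worker, daemon=True).start()
    while True:
        ids = q.get()
        if ids is None:
            break
        yield pipe.tokenizer.decode(ids[0])

# Usage:
# for text in stream_generate(pipe, torch.LongTensor([[0, 1, 2]]).cuda(), max_new_tokens=10):
#     print(text)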
I'd be fine with this design - think it's nice! Think we should maybe put it under a new class though, called Callback instead of StoppingCriteria?
Think we should maybe put it under a new class though, called Callback instead of StoppingCriteria?
Yes for sure, this was the minimal code, definitely not fit for merge. Again, lots of caveats too with this approach, but at least it could be implemented relatively fast.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Has there been any progress on this since last year?
I am interested in generating one token at a time for an interactive text generation web UI. But simply calling model.generate with max_new_tokens=1 multiple times is a lot slower (about 2x) than generating all the tokens at once.
@oobabooga no progress, but I have it in my backlog for exploration. Very unlikely that it will see the light of day in the next ~6 months, though :)
FYI, I made a streaming generation service for Hugging Face transformers that is fully compatible with the OpenAI API: https://github.com/hyperonym/basaran
Feature request
When using the text-generation pipeline, we would like to be able to export each token as it is generated. Currently, we have to wait for the generation to be completed to view the results.
Motivation
When using text-generation in a production environment, this would greatly improve the user experience. Users currently have to wait for the full text to be generated; if we are able to implement this, they could read the text as it is generated by the models.
Your contribution
I would be able to bug-check this feature if it were added!