huggingface / transformers

šŸ¤— Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Use python generator instead of streamer for generation #23640

Open JamesDConley opened 1 year ago

JamesDConley commented 1 year ago

Feature request

Add an option for receiving tokens (or similar) as they are generated via a python generator as an alternative to needing a streamer object.

Motivation

There is a new feature, streamers, for accessing tokens as they are generated. Using a streamer requires you to run your processing code in parallel while the model.generate function blocks its current thread; your processing code must instead be defined as callbacks within the streamer object you are using.

A much simpler interface that solves the same problem is to yield token sequences from a Python generator as they are produced. Below is example usage for either case.

Proposed Generator Implementation

for token in model.generate(**inputs, max_new_tokens=20, yield_tokens=True):
   print(f"The next token is {token}")

Current Streamer Implementation

from transformers import AutoModelForCausalLM

class MyStreamer:
   def __init__(self):
      pass

   def put(self, token):
      print(f"The next token is {token}")

   def end(self):
      pass

streamer = MyStreamer()
_ = model.generate(**inputs, streamer=streamer, max_new_tokens=20)
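As an aside, the callback-style streamer can already be adapted into a generator today with a queue and a background thread. A minimal sketch of the pattern follows; `fake_generate` and `QueueStreamer` are hypothetical stand-ins (the pattern is independent of the actual model), not part of the transformers API:

```python
import queue
import threading

def fake_generate(streamer, max_new_tokens):
    # Hypothetical stand-in for model.generate: pushes tokens to the
    # streamer callback, then signals the end of generation.
    for i in range(max_new_tokens):
        streamer.put(f"token_{i}")
    streamer.end()

class QueueStreamer:
    """Streamer that forwards tokens into a queue so a consumer can iterate."""
    _SENTINEL = object()

    def __init__(self):
        self.q = queue.Queue()

    def put(self, token):
        self.q.put(token)

    def end(self):
        self.q.put(self._SENTINEL)

    def __iter__(self):
        # Block on the queue until the end-of-generation sentinel arrives.
        while True:
            item = self.q.get()
            if item is self._SENTINEL:
                return
            yield item

def generate_stream(max_new_tokens):
    # Run generation in a background thread and yield tokens as they arrive.
    streamer = QueueStreamer()
    thread = threading.Thread(target=fake_generate, args=(streamer, max_new_tokens))
    thread.start()
    yield from streamer
    thread.join()

print(list(generate_stream(3)))  # ['token_0', 'token_1', 'token_2']
```

This works, but it forces every caller to manage a thread and a queue; a built-in generator interface would make the simple case simple.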

Not only does the generator implementation save lines of code and simplify the syntax, but Python generators are iterables, which makes it easy to use all sorts of existing Python tools without modification. For example, you can

Enumerate

for idx, token in enumerate(model.generate(**inputs, max_new_tokens=20, yield_tokens=True)):
   print(f"The {idx}'th token is {token}")

Progress bar with TQDM

Progress bar appears in CLI or jupyter notebook, updating in real time

from tqdm import tqdm

for token in tqdm(model.generate(**inputs, max_new_tokens=20, yield_tokens=True)):
   my_endpoint.post(token)

And many more existing tools would integrate just as easily!

I proposed yielding tokens here because it is easier to reason about and matches the current streamer implementation, but it may be easier to implement yielding a list of token lists, since beam search and similar strategies consider multiple beams (multiple sequences) at any given time. That would also enable more features on the developer side, especially when you want to generate multiple sequences in one call. This is a side note, though; either this or the base implementation would be really awesome.
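To make the list-of-lists variant concrete, here is a hypothetical sketch of a decoding loop that yields one candidate token per beam at each step; `step_fn` is an assumed stand-in for the model forward pass and beam scoring, not anything in transformers:

```python
def beam_search_sketch(step_fn, num_beams, max_new_tokens):
    # Hypothetical decoding loop: each step yields a list of tokens,
    # one per beam, instead of a single token.
    beams = [[] for _ in range(num_beams)]
    for _ in range(max_new_tokens):
        # step_fn stands in for the forward pass: it picks the next
        # token for a given beam's partial sequence.
        step_tokens = [step_fn(beam) for beam in beams]
        for beam, token in zip(beams, step_tokens):
            beam.append(token)
        yield step_tokens

# Toy usage: step_fn = len just returns the current beam length.
for step in beam_search_sketch(len, num_beams=2, max_new_tokens=3):
    print(step)  # [0, 0], then [1, 1], then [2, 2]
```

A consumer that only cares about a single sequence would simply index into each yielded list.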

Your contribution

I'm not planning to put in a PR anytime soon, but I did look through the code before finding the new streamer WIP feature. It seems fairly easy to implement a version of what I am describing: you just need to add a flag to optionally

yield new_token

inside each of beam_search, beam_sample, greedy_search, etc., and then update the model.generate wrapper to optionally yield the results from each of these.
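One wrinkle is that any Python function containing a `yield` is always a generator, so the flag has to live in the wrapper rather than the inner loop. A hypothetical sketch of that structure (`_greedy_loop`, `next_token_fn`, and this `generate` are illustrative names, not the actual transformers internals):

```python
def _greedy_loop(next_token_fn, max_new_tokens):
    # Inner decoding loop, always a generator -- this mirrors adding
    # `yield new_token` inside greedy_search. next_token_fn stands in
    # for the model forward pass + argmax.
    tokens = []
    for _ in range(max_new_tokens):
        token = next_token_fn(tokens)
        tokens.append(token)
        yield token

def generate(next_token_fn, max_new_tokens=20, yield_tokens=False):
    # Hypothetical wrapper: with yield_tokens=True the caller gets a lazy
    # iterator; otherwise the stream is drained and returned as today.
    stream = _greedy_loop(next_token_fn, max_new_tokens)
    if yield_tokens:
        return stream
    return list(stream)

# Toy usage: the "model" just returns the current sequence length.
next_token = lambda tokens: len(tokens)
print(generate(next_token, max_new_tokens=3))                           # [0, 1, 2]
print(list(generate(next_token, max_new_tokens=3, yield_tokens=True)))  # [0, 1, 2]
```

Keeping the inner loops as generators and draining them in the wrapper preserves the current return behavior while exposing the lazy path behind the flag.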


amyeroberts commented 1 year ago

cc @gante

ambiSk commented 1 year ago

This has been mentioned in PR: 2249. We need this feature. @gante @oobabooga, can you provide a short script showing how to try this out when calling model.generate, so that the function works like a Python generator object?

ambiSk commented 1 year ago

@JamesDConley I found this: https://huggingface.co/spaces/joaogante/transformers_streaming. I think this could be a great start for your problem.

sgugger commented 1 year ago

@ambiSk This is on @gante's roadmap, but note he is on vacation for two weeks, so you will have to be a bit patient :-)

gante commented 1 year ago

Hey @JamesDConley @ambiSk -- I agree the generator structure is superior, and that is why you see a warning in the docs saying the existing API is temporary (e.g. here).

Back when I was exploring the MVP of the feature, I managed to get an iterator going. However, it required significant changes to .generate: adding yield from statements in a few places and restructuring a few bits so that tokens could be piped out correctly. The branch is in a very incomplete state (see here), and I don't expect to be able to pick it up in the next ~2 months -- if anyone would like to get their hands dirty, feel free to pick this feature up šŸ™Œ

(just let me know if you decide to work on it :) )