marcoripa96 opened 1 year ago

Is it possible to stream each token of the output as soon as it is generated by the model? I guess it depends on the Hugging Face transformers classes and methods. Any solution to this?
Yeah, you can do that. You need to create a LogitsWarper:

import torch
from transformers import LogitsWarper

class CallbackLogitsWarper(LogitsWarper):
    def __init__(self, tokenizer, callback):
        self.tokenizer = tokenizer
        self.callback = callback
        self.res_tokens = []

    def __call__(self, input_ids: torch.Tensor, scores: torch.Tensor) -> torch.FloatTensor:
        self.res_tokens.append(input_ids[0][-1])
        result = self.tokenizer.decode(self.res_tokens).lstrip()
        self.callback(result)  # send the current generation back to the caller, already decoded to text
        return scores
Then add the logits_processor param to your model.generate():

def callback(result):
    print(result)

generation_output = model.generate(
    input_ids=input_ids,
    generation_config=generation_config,
    return_dict_in_generate=True,
    output_scores=True,
    max_new_tokens=100,
    logits_processor=[CallbackLogitsWarper(tokenizer, callback)],
)
Thanks for the fast reply! I tried it, but those don't seem to be the final tokens produced in the generation_output variable.
This is what I did:
import sys
import torch
from transformers import LogitsWarper, GenerationConfig

class CallbackLogitsWarper(LogitsWarper):
    def __init__(self, tokenizer, callback):
        self.tokenizer = tokenizer
        self.callback = callback
        self.res_tokens = []

    def __call__(self, input_ids: torch.Tensor, scores: torch.Tensor) -> torch.FloatTensor:
        self.res_tokens.append(input_ids[0][-1])
        # result = self.tokenizer.decode(self.res_tokens).lstrip()
        result = self.tokenizer.decode(input_ids[0][-1])
        self.callback(result)
        return scores

generation_config = GenerationConfig(
    temperature=0.1,
    top_p=0.75,
    num_beams=4,
)

def callback(result):
    sys.stdout.write(result)

def evaluate(instruction, input=None):
    prompt = generate_prompt(instruction, input)
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].cuda()
    generation_output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=256,
        logits_processor=[CallbackLogitsWarper(tokenizer, callback)],
    )
    for s in generation_output.sequences:
        output = tokenizer.decode(s)
        print()
        print("Response:", output.split("### Response:")[1].strip())
And this is what "streaming" each token produces, compared to the final generation_output:
Instruction: Give me a random sentence
: <0x0A> The sun was sh ining bright ly in the sky, casting a ating the low over the landscape. </s> <0x0A>
Response: The sun was shining brightly in the clear blue sky.
In GenerationConfig, remove the num_beams parameter entirely (or set it to 1).
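Presumably this matters because beam search keeps several candidate sequences in flight, so the last token seen by the logits processor can belong to a beam that doesn't survive into the final output. Applied to the config above, the only change would be (a sketch, values copied from the earlier snippet):

generation_config = GenerationConfig(
    temperature=0.1,
    top_p=0.75,
    num_beams=1,  # single sequence, so the callback sees the tokens that actually end up in the output
)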
It kinda works now, but I still don't get why I get partial tokens, and I don't really have a way to know which tokens need to be joined ("ch icken" in the example below). Additionally, what's with the ":" and "<0x0A>" strings at the beginning of the generated sentence?
Instruction: Tell me a funny joke
Response obtained decoding one token at a time : <0x0A> Why did the ch icken cross the road? To get to the other side! </s>
Response obtained decoding all tokens together: Why did the chicken cross the road? To get to the other side!
The ":" and "<0x0A>" are part of the trained response (it's actually ":\n"). You can replace(":\n", "") before printing so you won't see it. As for the strange "ch icken": I'm not 100% sure, but different token combinations can produce different results from tokenizer.decode(), and if you try to decode each token on its own you will see that often. That's why in my example I decode the whole accumulated result and send it instead of the separate tokens; that way the result is stable and expected.
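A possible middle ground is to keep decoding the accumulated tokens (so subword stitching stays correct) but only send the newly appended text. A minimal sketch of that idea, with a made-up class name, assuming the already-emitted prefix never changes once decoded (which can break around multi-byte characters):

import torch
from transformers import LogitsWarper

class DeltaCallbackLogitsWarper(LogitsWarper):
    def __init__(self, tokenizer, callback):
        self.tokenizer = tokenizer
        self.callback = callback
        self.res_tokens = []
        self.emitted = ""  # text already sent to the callback

    def __call__(self, input_ids: torch.Tensor, scores: torch.Tensor) -> torch.FloatTensor:
        self.res_tokens.append(input_ids[0][-1])
        # Decode the whole accumulated sequence so pieces like "ch" + "icken" join correctly...
        text = self.tokenizer.decode(self.res_tokens).lstrip()
        # ...but only emit the part that was not emitted before.
        self.callback(text[len(self.emitted):])
        self.emitted = text
        return scores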
I see. The problem with decoding the whole sequence each time is that it's really not optimal. Imagine a server streaming the response: ideally you would send one token each time, and a possible client would concatenate the responses to form the complete output.
I tried looking around, and it seems there are open issues about supporting streaming in the generate function of the HF transformers library. I guess for now I can use this.
I agree, it's not optimal. I'm using this hack in my local app/GUI, so I don't really care for now.
Please, let us know if you find a way to stream it one at a time without those artifacts.
Look at the approach used in this repo. It is a different model but you may be able to replicate the streaming approach.
Can we yield the value of the result variable in CallbackLogitsWarper.__call__ to a generator?
You can; I used a queue for that. Enqueue the result in the callback, then loop indefinitely, waiting until there is a new item in the queue, and yield that item. I'll give an example later on.
So in this way, we need multiprocessing to run the generate and the yield simultaneously?
Exactly! This is how I did it:

import json
from threading import Thread
from queue import Queue

import torch

def generate_streaming_completion(options):
    model = options.pop("model")
    tokenizer = options.pop("tokenizer")
    model_options = options.pop("model_options")

    # Streaming only works with a single beam (see the num_beams discussion above).
    stream = model_options.stream and model_options.num_beams == 1

    q = Queue()

    generation_config = GenerationConfig(
        temperature=model_options.temperature,
        top_p=model_options.top_p,
        top_k=model_options.top_k,
        num_beams=model_options.num_beams,
        max_new_tokens=model_options.max_new_tokens,
    )

    prompt = generate_prompt(model_options.instruction, model_options.input)
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].cuda()

    def stream_callback(res):
        q.put(json.dumps({"text": res}) + "\n")

    logits_processor = [CallbackLogitsWarper(tokenizer, stream_callback)] if stream else None

    def generate():
        with torch.no_grad():
            model.eval()
            model.generate(
                input_ids=input_ids,
                generation_config=generation_config,
                logits_processor=logits_processor,
                return_dict_in_generate=True,
                # output_scores=True,
                # max_new_tokens=600
            )
        print("STREAMING DONE")
        torch.cuda.empty_cache()
        q.put("[DONE]")

    # Start the generate function in a new thread so that the code doesn't stop executing here.
    Thread(target=generate, args=()).start()

    while True:
        next_item = q.get(True, 10000)  # Blocks until an item is available
        if next_item == "[DONE]":
            yield next_item
            break
        yield next_item
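For completeness, the consumer side just iterates over the generator; a minimal sketch (model, tokenizer, and model_options are assumed to be defined elsewhere, and each chunk is whatever stream_callback enqueued for that step). In a server, the same JSON lines could be written straight to the response:

import json

for chunk in generate_streaming_completion(
    {"model": model, "tokenizer": tokenizer, "model_options": model_options}
):
    if chunk == "[DONE]":  # sentinel enqueued once generation finishes
        break
    print(json.loads(chunk)["text"], flush=True)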
That's awesome! Solved my problem perfectly! Greatly appreciated😊
Looks like the issue with incomplete words has been solved with the latest PR to transformers:
https://github.com/huggingface/transformers/pull/22449
Going to give this a shot this evening.
@fragro some extra context behind https://github.com/huggingface/transformers/pull/22449 -- different tokenizers have different strategies to stitch tokens together, so the solution I've built there is a simple heuristic that (AFAIK) works with all models/tokenizers.
If you notice it is causing slowdowns, I'm sure better model-specific strategies can be found. However, since decoding is not a heavy task, it should be fine :)
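For later readers: newer transformers releases ship a streamer API that handles the token stitching for you. A minimal sketch of how it is typically used (assuming a version that includes TextIteratorStreamer, and reusing model, tokenizer, and input_ids from the snippets above):

from threading import Thread
from transformers import TextIteratorStreamer

# skip_prompt drops the prompt tokens from the stream; skip_special_tokens is
# forwarded to tokenizer.decode().
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

generation_kwargs = dict(input_ids=input_ids, max_new_tokens=256, streamer=streamer)

# generate() blocks, so run it in a background thread and consume the streamer here.
Thread(target=model.generate, kwargs=generation_kwargs).start()

for text_chunk in streamer:
    print(text_chunk, end="", flush=True)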