tomaarsen / attention_sinks

Extend existing LLMs way beyond the original training length with constant memory usage, without retraining
https://huggingface.co/blog/tomaarsen/attention-sinks
Apache License 2.0

Use with `pipeline` or `generate` #7

Closed · helleuch closed this 10 months ago

helleuch commented 10 months ago

Hello, thank you very much for making this work available. Does this work with `transformers.pipeline` or `model.generate`, or do we have to use a `TextStreamer` as in the example?

tomaarsen commented 10 months ago

`TextStreamer` is a very simple class from `transformers` with a single goal: printing tokens as they are generated. It is essentially separate from the generation process itself.
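For instance, here is a minimal sketch. It assumes the `attention_sinks` drop-in `AutoModelForCausalLM` class and an example model name (the model choice is not prescribed by this thread). The streamer only prints tokens as they appear; dropping the `streamer` argument leaves generation itself unchanged.

```python
from transformers import AutoTokenizer, TextStreamer
from attention_sinks import AutoModelForCausalLM  # drop-in for transformers' class

model_name = "meta-llama/Llama-2-7b-hf"  # example model, not prescribed by this thread
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("The capital of France is", return_tensors="pt")

# TextStreamer just prints tokens as they are generated; it is optional.
streamer = TextStreamer(tokenizer)
model.generate(**inputs, streamer=streamer, max_new_tokens=20)
```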

You can generate with `model.generate` just like in plain `transformers`. Only if you want multi-step generation (i.e. where previous outputs are added as history to new prompts) should you use a manual generation loop: call `model(input_ids, use_cache=True, past_key_values=past_key_values)`, extract the `past_key_values` for the next step, and read the logits to determine the generated token. See the sketch below.
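A sketch of such a manual loop, under the same assumptions as above (the `attention_sinks` drop-in class and an example model name) plus greedy decoding, which this thread does not prescribe:

```python
import torch
from transformers import AutoTokenizer
from attention_sinks import AutoModelForCausalLM  # drop-in for transformers' class

model_name = "meta-llama/Llama-2-7b-hf"  # example model, not prescribed by this thread
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

past_key_values = None
input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):  # generate up to 20 tokens, one step at a time
        outputs = model(input_ids, past_key_values=past_key_values, use_cache=True)
        # Reuse the cache so the next step only needs the newly generated token.
        past_key_values = outputs.past_key_values
        # Greedy decoding: take the most likely token from the last position's logits.
        next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        if next_token.item() == tokenizer.eos_token_id:
            break
        print(tokenizer.decode(next_token[0]), end="", flush=True)
        input_ids = next_token  # only the new token is fed in on subsequent steps
```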

helleuch commented 10 months ago

Thank you very much!