mit-han-lab / streaming-llm

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
https://arxiv.org/abs/2309.17453
MIT License

Google Colab installation #8

Closed · narita63755930 closed this issue 9 months ago

narita63755930 commented 9 months ago

Hi

https://colab.research.google.com/drive/1YtXE_JKVntkGK14Yo9thjCjPMVzhA71d?usp=sharing

Here is the Colab notebook, but it doesn't run as-is: it stops after a while, apparently because it runs out of memory. A few changes also have to be made to the files downloaded in the setup steps before it will run, so it can't be used unmodified.

If you already have a working solution, please share it.

tomaarsen commented 9 months ago

The 13B model that you're using in the test is likely too large for Google Colab; a 7B one is much more likely to work. Consider the following command:

!CUDA_VISIBLE_DEVICES=0 python3 streaming-llm/examples/run_streaming_llama.py  --enable_streaming --model_name_or_path lmsys/vicuna-7b-v1.3
zeeshanali-k commented 9 months ago

@tomaarsen After switching to the suggested model I am hitting the error below; the model starts inference and then crashes:

USER: Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions.

ASSISTANT:
Traceback (most recent call last):
  File "/content/streaming-llm/examples/run_streaming_llama.py", line 122, in <module>
    main(args)
  File "/content/streaming-llm/examples/run_streaming_llama.py", line 103, in main
    streaming_inference(
  File "/usr/local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/content/streaming-llm/examples/run_streaming_llama.py", line 73, in streaming_inference
    past_key_values = greedy_generate(
  File "/usr/local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/content/streaming-llm/examples/run_streaming_llama.py", line 20, in greedy_generate
    outputs = model(
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 1038, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 925, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 635, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: llama_pos_shift_attention_forward() got an unexpected keyword argument 'padding_mask'

tomaarsen commented 9 months ago

You must downgrade transformers to below 4.34.0. I suspect that 4.33.0 will work.
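
In a Colab cell, the downgrade would look something like the line below (pinning to 4.33.0 as suggested above; restart the runtime afterwards so the older version is actually picked up):

!pip install "transformers==4.33.0"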

narita63755930 commented 9 months ago

Hi tomaarsen

You must downgrade transformers to below 4.34.0. I suspect that 4.33.0 will work.

Thanks for the great support. Zeeshan is the engineer who is developing this for me. We took your advice and it worked. Thank you very much. https://colab.research.google.com/drive/1YtXE_JKVntkGK14Yo9thjCjPMVzhA71d?usp=sharing

But here is the problem.

We want to read a zip file (or multiple files) and parse and debug their contents, like GPT's code interpreter. In this case, can we make use of the recommended chatbot or something similar?

Is it possible to achieve our goal on Colab from here?

DiTo97 commented 9 months ago

Hi @tomaarsen,

Why not just add offload_folder="offload", offload_state_dict=True to the from_pretrained call, as shown here, to mitigate the memory issue even when loading larger models (13B) on the Colab free tier?
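
For anyone trying this, a minimal sketch of what such a loading call could look like (this is not the repo's actual loading code; the model id is only an example, and device_map / offload_folder / offload_state_dict are the standard transformers + accelerate arguments, which require the accelerate package):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lmsys/vicuna-13b-v1.3"  # example model id, swap in whatever you are running
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",          # let accelerate place layers on GPU/CPU automatically
    offload_folder="offload",   # spill weights that do not fit to disk
    offload_state_dict=True,    # avoid keeping a full CPU copy of the state dict while loading
    torch_dtype="auto",
)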

tomaarsen commented 9 months ago

That seems wise! Good recommendation. I don't tend to run into these issues as I don't generally work on Colab.

narita63755930 commented 9 months ago

@DiTo97 @tomaarsen

Thanks for the great feedback. We will try to get the 13B model running.

If any other members have successfully run the 13B model, please comment.

zeeshanali-k commented 9 months ago

@DiTo97 @tomaarsen Thanks for the suggestions; they were really helpful. We have now bought Colab Pro+. The issue I am facing is that inference is very slow for 34B or larger models, even with a 50-60 GB GPU. I am not an expert ML or deep learning engineer, so my issues might seem basic. Can anyone explain why this happens and whether there is a way to solve it, or does StreamingLLM simply need much more compute to run smoothly on such large models?

tomaarsen commented 9 months ago

StreamingLLM is not noticeably slower than regular transformers, but such large models are indeed quite slow to run. There are methods to speed this up, like quantization or using non-Python runners (e.g. llama.cpp), but they might not be compatible with the StreamingLLM approach out of the box.
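
As a concrete illustration of the quantization route, something like the sketch below could be tried; whether bitsandbytes-quantized layers play nicely with the StreamingLLM attention patch is untested here, so treat it as a starting point rather than a verified recipe (the model id is just an example):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantized loading via bitsandbytes; requires the bitsandbytes and accelerate packages.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-13b-v1.3",          # example model id
    quantization_config=quant_config,
    device_map="auto",
)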

66RING commented 9 months ago

@zeeshanali-k @narita63755930

You may need to reduce max_gen_len and recent_size. The max_gen_len default is in the function signature:

def streaming_inference(model, tokenizer, prompts, kv_cache=None, max_gen_len=1000):
    # body omitted; lower the max_gen_len default (or pass a smaller value) to cap generation length
    pass

and the recent_size default is set by the argument parser:

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # ...
    parser.add_argument("--recent_size", type=int, default=2000)  # lower this to keep fewer recent tokens in the KV cache
    args = parser.parse_args()

because the KV cache retained on the GPU is roughly k[0:start] + k[seq_len - recent_size:].
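
In other words, only the first few "sink" tokens plus the most recent window are kept, so shrinking recent_size directly shrinks GPU memory use. A rough sketch of that eviction on a single key or value tensor (the repo's KV cache helper does this per layer; start_size=4 and recent_size=2000 are just the example script's defaults, and the tensor layout is assumed):

import torch

def evict_middle(k: torch.Tensor, start_size: int = 4, recent_size: int = 2000) -> torch.Tensor:
    # k is assumed to have shape [batch, num_heads, seq_len, head_dim]; seq_len is dim 2 for Llama.
    seq_len = k.size(2)
    if seq_len <= start_size + recent_size:
        return k  # cache still fits, nothing to evict
    # keep the attention-sink tokens and the most recent window, drop the middle
    return torch.cat([k[:, :, :start_size], k[:, :, seq_len - recent_size:]], dim=2)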