Closed — narita63755930 closed this issue 9 months ago.
The 13B model that you're using in the test is likely too large for Google Colab; a 7B one is much more likely to work. You should consider the following command:
!CUDA_VISIBLE_DEVICES=0 python3 streaming-llm/examples/run_streaming_llama.py --enable_streaming --model_name_or_path lmsys/vicuna-7b-v1.3
@tomaarsen After using the suggested model I am facing the error below; it seems that the model starts the inference and then fails: USER: Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions.
ASSISTANT: Traceback (most recent call last):
File "/content/streaming-llm/examples/run_streaming_llama.py", line 122, in
You must downgrade transformers to below 4.34.0. I suspect that 4.33.0 does work.
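For example, in a Colab cell (a minimal sketch; the pin below is simply the version suggested above):

!pip install transformers==4.33.0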
Hi tomaarsen
Thanks for the great support; Zeeshan is the engineer I'm asking to develop this for me. We took your advice and it worked. Thank you! https://colab.research.google.com/drive/1YtXE_JKVntkGK14Yo9thjCjPMVzhA71d?usp=sharing
But here is the problem.
We want to read a zip file (or multiple files), then parse and debug their contents, like GPT's code interpreter. In this case, can we make use of the recommended chatbot or something similar?
Is it possible to achieve our goal using Colab after this?
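Not an official answer, but as a rough illustration of one possible approach: extract the archive with Python's standard zipfile module and turn each file's text into a prompt for the chatbot (e.g. the streaming_inference helper shown later in this thread). The function name and prompt wording below are assumptions for illustration only:

import zipfile

def prompts_from_zip(zip_path, max_chars=4000):
    """Read text files from a zip and turn each into a chat prompt."""
    prompts = []
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if name.endswith("/"):  # skip directory entries
                continue
            text = zf.read(name).decode("utf-8", errors="replace")[:max_chars]
            prompts.append(
                f"Here is the content of {name}:\n{text}\n"
                "Explain what this file does and point out any bugs."
            )
    return prompts

# these prompts could then be passed to the example's streaming_inference(...)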
Hi @tomaarsen,
Why don't you just add offload_folder="offload", offload_state_dict=True to from_pretrained, as shown here, to try and mitigate that issue even when loading larger models (13B) on the Colab free tier?
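A minimal sketch of what that call might look like (the model name, dtype, and device_map are assumptions; offload_folder and offload_state_dict are the accelerate-backed from_pretrained arguments being suggested):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-13b-v1.3",     # 13B model from this thread
    torch_dtype=torch.float16,
    device_map="auto",           # let accelerate place layers on GPU/CPU
    offload_folder="offload",    # spill weights that don't fit to disk
    offload_state_dict=True,     # offload the state dict while loading to save RAM
)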
That seems wise! Good recommendation. I don't tend to run into these issues as I don't generally work on Colab.
@DiTo97 @tomaarsen
Thanks for the great feedback. We will try to implement 13B.
If any other members have successfully implemented 13B, please comment.
@DiTo97 @tomaarsen Thanks for the suggestions, they were really helpful. We have now bought Colab Pro+. The issue I am facing is that inference is very slow for 34B or larger models, even with a 50-60 GB GPU. I am not an expert ML or deep learning engineer, so my issues might seem basic. Can anyone explain why this happens and whether there is a way to solve it, or does streaming require much more compute to run smoothly on such large models?
StreamingLLM is not noticeably slower than regular transformers, but such large models are indeed quite slow to run. There are methods to speed this up, like quantization or using non-Python runners (e.g. llama.cpp), but they might not be compatible with the StreamingLLM approach out of the box.
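For reference, a minimal sketch of 4-bit quantization with transformers and bitsandbytes; whether the StreamingLLM cache changes still apply cleanly on top of a quantized model is an open assumption you would have to verify:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16 while storing 4-bit weights
)

tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.3")
model = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-7b-v1.3",
    quantization_config=quant_config,
    device_map="auto",
)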
@zeeshanali-k @narita63755930
You may need to reduce the max_gen_len:

def streaming_inference(model, tokenizer, prompts, kv_cache=None, max_gen_len=1000):
    pass

and the recent_size:
if __name__ == "__main__":
parser = argparse.ArgumentParser()
# ...
parser.add_argument("--recent_size", type=int, default=2000)
args = parser.parse_args()
because the cache retained on the GPU is k[0:start] + k[seq_len - recent_size:], so a larger recent_size means more memory.
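To illustrate the idea (a hand-rolled sketch, not the repo's actual KV-cache implementation), the rolling cache keeps a few attention-sink tokens from the start plus only the most recent window along the sequence dimension:

import torch

def trim_kv(k, v, start_size=4, recent_size=2000):
    """Keep the first start_size (attention sink) and last recent_size positions.

    k, v: [batch, num_heads, seq_len, head_dim]
    """
    seq_len = k.size(2)
    if seq_len <= start_size + recent_size:
        return k, v  # nothing to evict yet
    k = torch.cat([k[:, :, :start_size], k[:, :, seq_len - recent_size:]], dim=2)
    v = torch.cat([v[:, :, :start_size], v[:, :, seq_len - recent_size:]], dim=2)
    return k, v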
Hi
https://colab.research.google.com/drive/1YtXE_JKVntkGK14Yo9thjCjPMVzhA71d?usp=sharing
Here is the Colab, but it doesn't run there; it stops after a while due to memory overload or something like that. Also, a few changes need to be made to the files downloaded in the steps for it to run, so you can't run it as-is.
If you already have a good solution, please share it.