mit-han-lab / streaming-llm

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
https://arxiv.org/abs/2309.17453
MIT License
6.59k stars 361 forks

Implementation of Llama-2-7b-chat-hf model #50

Open MuhammadIshaq-AI opened 11 months ago

MuhammadIshaq-AI commented 11 months ago

How can I integrate the Llama-2-7B model with streaming-llm? The model is an already pre-trained version; will it work here?

Guangxuan-Xiao commented 11 months ago

Hello,

Certainly! The pre-trained Llama-2-7B-chat model can be integrated using the streaming LLM method. To run the Llama-2-7B-chat model with streaming enabled, use the following command:

CUDA_VISIBLE_DEVICES=0 python examples/run_streaming_llama.py --enable_streaming --model_name_or_path meta-llama/Llama-2-7b-chat-hf
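
For reference, here is a minimal sketch of what that script does internally, mirroring examples/run_streaming_llama.py and the repo's enable_streaming_llm helper (the start_size/recent_size values below are the script's defaults):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from streaming_llm.enable_streaming_llm import enable_streaming_llm

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# Keep 4 attention-sink tokens plus a 2000-token recent window
# (the defaults used by examples/run_streaming_llama.py).
kv_cache = enable_streaming_llm(model, start_size=4, recent_size=2000)

The returned kv_cache is then used during generation to evict old key/value entries while always retaining the initial attention-sink tokens, which is what keeps memory bounded on long streams.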

Guangxuan

MuhammadIshaq-AI commented 11 months ago

Can I use the model for deployment purposes?

Guangxuan-Xiao commented 11 months ago

Yes, you can!

MuhammadIshaq-AI commented 11 months ago

Thank you so much for your quick response to every question. One last thing I want to clarify: whenever I call the model, it is downloaded again, which takes a long time whether I use an API access token or just call it by name. Can I download the weights from Hugging Face and then use them locally in VS Code with streaming-llm? Please guide me about the paths, etc.

MuhammadIshaq-AI commented 11 months ago

Need your immediate assistance

Guangxuan-Xiao commented 11 months ago

This is probably because your system doesn't have a fixed cache folder. You can download the model to a local folder such as path_to_model and use:

CUDA_VISIBLE_DEVICES=0 python examples/run_streaming_llama.py --enable_streaming --model_name_or_path path_to_model
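
If the weights keep re-downloading, one option is to fetch them once with huggingface_hub and point --model_name_or_path at the resulting folder. A minimal sketch, assuming huggingface_hub is installed and path_to_model is just a placeholder folder name:

from huggingface_hub import snapshot_download

# One-time download of the weights; requires accepting the Llama-2
# license on the Hub and, for gated repos, being logged in
# (huggingface-cli login).
snapshot_download(
    repo_id="meta-llama/Llama-2-7b-chat-hf",
    local_dir="path_to_model",
)

After that, every run that passes --model_name_or_path path_to_model loads from disk instead of re-downloading.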

MuhammadIshaq-AI commented 11 months ago

I am trying to implement the Llama-2 model with a vector database to fetch queries. I want my model to interact with the vector DB using streaming-llm, but I am getting this error when I send a query to the vector database:

\streaming-llm>python examples/run_streaming_llama.py --enable_streaming
Loading model from lama2weights ...
Loading checkpoint shards: 100%|████████████████████████████████████████| 2/2 [00:00<00:00, 17.34it/s]
Enter your question (or type 'exit' to quit): hi
USER: hi
Traceback (most recent call last):
  File "examples/run_streaming_llama.py", line 131, in <module>
    main(args)
  File "examples/run_streaming_llama.py", line 118, in main
    streaming_inference(model, tokenizer, prompts, pinecone_index_name)
  File "examples/run_streaming_llama.py", line 81, in streaming_inference
    query_results = pinecone_index.query(queries=[user_query], top_k=3)
  File "C:\Users\Zara\anaconda3\envs\streaming\lib\site-packages\pinecone\core\utils\error_handling.py", line 17, in inner_func
    return func(*args, **kwargs)
  File "C:\Users\Zara\anaconda3\envs\streaming\lib\site-packages\pinecone\index.py", line 455, in query
    response = self._vector_api.query(
  File "C:\Users\Zara\anaconda3\envs\streaming\lib\site-packages\pinecone\core\client\api_client.py", line 776, in __call__
    return self.callable(self, *args, **kwargs)
  File "C:\Users\Zara\anaconda3\envs\streaming\lib\site-packages\pinecone\core\client\api\vector_operations_api.py", line 716, in __query
    return self.call_with_http_info(**kwargs)
  File "C:\Users\Zara\anaconda3\envs\streaming\lib\site-packages\pinecone\core\client\api_client.py", line 838, in call_with_http_info
    return self.api_client.call_api(
  File "C:\Users\Zara\anaconda3\envs\streaming\lib\site-packages\pinecone\core\client\api_client.py", line 413, in call_api
    return self.__call_api(resource_path, method,
  File "C:\Users\Zara\anaconda3\envs\streaming\lib\site-packages\pinecone\core\client\api_client.py", line 207, in __call_api
    raise e
  File "C:\Users\Zara\anaconda3\envs\streaming\lib\site-packages\pinecone\core\client\api_client.py", line 200, in __call_api
    response_data = self.request(
  File "C:\Users\Zara\anaconda3\envs\streaming\lib\site-packages\pinecone\core\client\api_client.py", line 459, in request
    return self.rest_client.POST(url,
  File "C:\Users\Zara\anaconda3\envs\streaming\lib\site-packages\pinecone\core\client\rest.py", line 271, in POST
    return self.request("POST", url,
  File "C:\Users\Zara\anaconda3\envs\streaming\lib\site-packages\pinecone\core\client\rest.py", line 230, in request
    raise ApiException(http_resp=r)
pinecone.core.client.exceptions.ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'content-type': 'text/plain', 'content-length': '57', 'date': 'Fri, 03 Nov 2023 06:46:04 GMT', 'server': 'envoy', 'connection': 'close'})
HTTP response body: queries[0].values: invalid value "hi" for type TYPE_FLOAT
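
The 400 response is Pinecone rejecting the raw string "hi": the query endpoint expects a list of floats (an embedding vector), not text. A minimal sketch of embedding the query before calling Pinecone, assuming the index was populated with sentence-transformers embeddings of matching dimension (the model name is illustrative) and that pinecone_index is the Index object already created in the script:

from sentence_transformers import SentenceTransformer

# Must produce vectors of the same dimension the index was created with;
# all-MiniLM-L6-v2 (384-dim) is only an example choice.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

user_query = "hi"
query_vector = embedder.encode(user_query).tolist()  # plain list of floats

# Pass the embedding, not the raw string, to Pinecone.
query_results = pinecone_index.query(vector=query_vector, top_k=3)

Note that vector= takes a single query embedding; the queries= parameter used in the traceback is an older batch form, and either way each entry must be a float vector rather than text.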