intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.

how to switch to load multiple llm models in a streamlit page? #11019

Open JamieVC opened 4 months ago

JamieVC commented 4 months ago

I want to switch between the llama2-7b-chat and llama3-8b models, but loading both costs a lot of memory. How do I clear one model before loading the second?

#model_name = 'meta-llama/Llama-2-7b-chat-hf'
model_name = 'meta-llama/Meta-Llama-3-8B-Instruct'

#tokenizer_name = 'meta-llama/Llama-2-7b-chat-hf'
tokenizer_name = 'meta-llama/Meta-Llama-3-8B-Instruct'

llm_model = IpexLLM.from_model_id(
    model_name=model_name,
    tokenizer_name=tokenizer_name,
    context_window=4096,
    max_new_tokens=512,
    load_in_low_bit='asym_int4',
    completion_to_prompt=completion_to_prompt,
    generate_kwargs={
        "do_sample": True,
        "temperature": 0.1,
        "eos_token_id": [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")],
    },
    messages_to_prompt=messages_to_prompt,
    device_map='xpu',
)
sgwhat commented 4 months ago

You may clear the model with del llm_model.
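
If the goal is to release the XPU memory and not just the Python reference, a minimal sketch (assuming llm_model was loaded on an Intel XPU as above, and that the torch.xpu namespace is available, e.g. via intel_extension_for_pytorch) could look like this:

import gc
import torch

# Drop the Python reference to the model, then force garbage collection
# so the underlying tensors are actually freed.
del llm_model
gc.collect()

# Release cached XPU memory back to the driver, if an XPU is present.
if hasattr(torch, "xpu") and torch.xpu.is_available():
    torch.xpu.empty_cache()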

JamieVC commented 4 months ago

Thanks for the good idea of using del llm_model, but I have another question. create_model() is decorated with @st.cache_resource, as in the source code below. In my understanding, create_model() only runs once. After I delete the old model, I'd like to create a new model with create_model(). How do I make it run again?

@st.cache_resource
def create_model(model_name):
    llm_model = IpexLLM.from_model_id(
        model_name=model_name,
        tokenizer_name=tokenizer_name,
        context_window=4096,
        max_new_tokens=512,
        load_in_low_bit='asym_int4',
        completion_to_prompt=completion_to_prompt,
        generate_kwargs={
            "do_sample": True,
            "temperature": 0.1,
            "eos_token_id": [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")],
        },
        #messages_to_prompt=messages_to_prompt,
        device_map='xpu',
    )
    return llm_model
sgwhat commented 4 months ago

You may use st.cache_resource.clear() to clear the cached resource so that create_model() reruns and creates a new model, as below:

model = create_model(name1)

del model
st.cache_resource.clear()

model = create_model(name2)
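
Putting the two suggestions together, a rough sketch of a model-switching flow in a Streamlit page could look like the following. The selectbox, the "loaded_model_name" session-state key, and the trimmed-down from_model_id arguments are illustrative assumptions based on the snippets above, and the IpexLLM import path assumes the llama-index-llms-ipex-llm integration is installed:

import gc
import streamlit as st
import torch
from llama_index.llms.ipex_llm import IpexLLM  # import path is an assumption

@st.cache_resource
def create_model(model_name):
    # Load the selected model in low-bit format on the Intel XPU.
    return IpexLLM.from_model_id(
        model_name=model_name,
        tokenizer_name=model_name,
        context_window=4096,
        max_new_tokens=512,
        load_in_low_bit='asym_int4',
        device_map='xpu',
    )

choice = st.selectbox(
    "Model",
    ['meta-llama/Llama-2-7b-chat-hf', 'meta-llama/Meta-Llama-3-8B-Instruct'],
)

# When the user switches models, clear the cached model first so that only
# one model occupies XPU memory at a time.
if st.session_state.get("loaded_model_name") != choice:
    st.cache_resource.clear()        # forget the previously cached create_model() result
    gc.collect()                     # collect the released model objects
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        torch.xpu.empty_cache()      # return cached XPU memory to the driver
    st.session_state["loaded_model_name"] = choice

llm_model = create_model(choice)

On each Streamlit rerun the cache is only cleared when the selected model actually changes, so repeated interactions with the same model keep hitting the cached instance.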