huggingface / optimum-intel

🤗 Optimum Intel: Accelerate inference with Intel optimization tools
https://huggingface.co/docs/optimum/main/en/intel/index
Apache License 2.0

add IPEX-XPU support for Llama2 model Inference (greedy search) #701

Closed faaany closed 5 months ago

faaany commented 5 months ago

What does this PR do?

This PR enables Intel GPU (XPU) support for Llama2 model inference in optimum-intel. It covers greedy-search generation only. Below is an example:

import torch
from transformers import AutoTokenizer, pipeline
from optimum.intel import IPEXModelForCausalLM

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# export=True applies the IPEX optimizations at load time; device_map="xpu"
# places the model on the Intel GPU, in half precision via torch_dtype.
model = IPEXModelForCausalLM.from_pretrained(model_id, device_map="xpu", torch_dtype=torch.float16, export=True)

# do_sample=False with num_beams=1 selects greedy search, the decoding
# strategy this PR enables.
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, do_sample=False, num_beams=1, use_cache=True)
results = pipe("He's a dreadful magician and")
print(results)
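
For reference, greedy search is the decoding path generate() follows when do_sample=False and num_beams=1. Below is a minimal sketch of the equivalent lower-level call, assuming IPEXModelForCausalLM exposes the standard transformers generate() API; max_new_tokens is an illustrative choice, not taken from this PR:

# Sketch: the same greedy-search generation without the pipeline wrapper.
# Assumes the IPEX model supports the standard transformers generate() API.
inputs = tokenizer("He's a dreadful magician and", return_tensors="pt").to("xpu")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False, num_beams=1)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
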
faaany commented 5 months ago

Closing this PR due to its messy commit history; please go to https://github.com/huggingface/optimum-intel/pull/703 for more info.