What does this PR do?

This PR enables Intel GPU support for Llama 2 model inference in optimum-intel. It covers greedy-search generation only. Below is an example:
```python
import torch
from transformers import AutoTokenizer, pipeline

from optimum.intel import IPEXModelForCausalLM

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the model on the Intel GPU ("xpu") in fp16; export=True converts
# the vanilla transformers checkpoint to the IPEX-optimized format on load.
model = IPEXModelForCausalLM.from_pretrained(
    model_id, device_map="xpu", torch_dtype=torch.float16, export=True
)

# do_sample=False with num_beams=1 selects greedy search,
# the only generation mode covered by this PR.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    do_sample=False,
    num_beams=1,
    use_cache=True,
)
results = pipe("He's a dreadful magician and")
print(results)
```
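For reference, the same greedy decoding can also be driven through `generate()` directly instead of the pipeline helper. This is a minimal sketch assuming the `model` and `tokenizer` loaded above, and assuming `IPEXModelForCausalLM` follows the standard transformers `generate()` interface (an assumption, not confirmed by this PR):

```python
# Minimal sketch: greedy decoding without the pipeline helper.
# Assumes `model` and `tokenizer` from the example above, and that
# IPEXModelForCausalLM exposes the standard transformers generate() API.
inputs = tokenizer("He's a dreadful magician and", return_tensors="pt").to("xpu")
output_ids = model.generate(
    **inputs,
    max_new_tokens=32,   # illustrative cap on generated tokens
    do_sample=False,     # greedy search: always pick the argmax token
    num_beams=1,         # single beam, i.e. no beam search
    use_cache=True,      # reuse the KV cache across decoding steps
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```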
Closing this PR due to a messy commit history; please see https://github.com/huggingface/optimum-intel/pull/703 for more info.