pytorch-labs / gpt-fast

Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.
BSD 3-Clause "New" or "Revised" License

Intel GPU: enable Intel GPU #79

Open xiaowangintel opened 8 months ago

xiaowangintel commented 8 months ago

This PR adds initial Intel GPU support to gpt-fast via the device option "xpu" (i.e., --device "xpu"). Both single-device and multi-device (tensor-parallel) execution are functionally supported; performance is still being improved. Follow the steps below to run generation on an Intel GPU. We will update the tutorial once performance has improved.
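Not part of the PR itself, but a minimal sketch of how the "xpu" device option resolves in practice. It assumes the IPEX module name `intel_extension_for_pytorch` and the `torch.xpu` API from the linked docs, and falls back to CPU when the backend is absent:

```python
import torch

# Intel GPU ("xpu") support comes from Intel Extension for PyTorch; importing it
# registers the backend. Module name assumed from the IPEX docs; recent PyTorch
# builds also ship a native torch.xpu module, hence the hasattr guard.
try:
    import intel_extension_for_pytorch  # noqa: F401
except ImportError:
    pass

def pick_device(preferred: str = "xpu") -> torch.device:
    """Return the preferred backend when available, else CPU."""
    if preferred == "xpu" and hasattr(torch, "xpu") and torch.xpu.is_available():
        return torch.device("xpu")
    return torch.device("cpu")

device = pick_device()
x = torch.ones(2, 2, device=device)  # lands on the Intel GPU when one is present
```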

Installation

  1. Install pytorch and Intel® Extension for PyTorch: https://intel.github.io/intel-extension-for-pytorch/xpu/latest/tutorials/introduction.html#
  2. Install oneCCL (needed for distributed runs): https://github.com/oneapi-src/oneCCL
  3. Install Intel® Extension for Triton (needed by torch.compile): https://intel.github.io/intel-extension-for-pytorch/xpu/latest/tutorials/features/torch_compile_gpu.html
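The three steps above can be sketched as shell commands. Package names are assumptions based on the linked tutorials; the exact wheel index URLs and supported versions are documented there:

```shell
# 1. PyTorch + Intel Extension for PyTorch
#    (exact pip index/versions per the linked introduction tutorial)
pip install torch intel-extension-for-pytorch

# 2. oneCCL bindings so torch.distributed can target Intel GPUs
#    (package name assumed; the oneCCL repo also covers building from source)
pip install oneccl_bind_pt

# 3. Intel Extension for Triton, required by torch.compile on xpu
#    (install method per the linked torch_compile_gpu tutorial)

# Sanity check: the xpu backend should now be visible
python -c "import torch, intel_extension_for_pytorch; print(torch.xpu.is_available())"
```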

How to run gpt-fast on Intel GPUs

  1. Single device: python generate.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --speculate_k 5 --prompt "Hi my name is" --device xpu
  2. Multiple devices via tensor parallelism: ENABLE_INTRA_NODE_COMM=1 torchrun --standalone --nproc_per_node=2 generate.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --device xpu

Note:

  1. Please export UR_L0_IN_ORDER_BARRIER_BY_SIGNAL=0 (a temporary configuration) to avoid spurious errors when running gpt-fast with torch.compile.
  2. Please export IPEX_ZE_TRACING=1 (a temporary configuration) to capture events when running gpt-fast with profiling.
  3. Currently, only bf16 is supported; int4/int8 will be supported later via IPEX without requiring code changes in gpt-fast.
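Putting the notes together, a compiled single-device run might look like the following sketch. The generate.py flags are taken from the commands above; the --compile flag is an assumption based on gpt-fast's usual CLI:

```shell
# Temporary workarounds from the notes above
export UR_L0_IN_ORDER_BARRIER_BY_SIGNAL=0  # avoid errors under torch.compile
export IPEX_ZE_TRACING=1                   # emit tracing events when profiling

python generate.py \
    --checkpoint_path checkpoints/$MODEL_REPO/model.pth \
    --compile \
    --prompt "Hi my name is" \
    --device xpu
```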
jgong5 commented 8 months ago

Please add to the PR description 1) how to build/install the pre-requisite software components; 2) how to run inference with and without tensor parallel.

jgong5 commented 8 months ago

@Chillee This is the initial PR to support Intel GPU. Most of the needed code changes should be in place; further performance optimizations will be applied inside IPEX. May I ask for your review? Thanks!