intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

Segmentation fault during QLoRA fine-tuning on Arc A770 #9412

Open tsantra opened 1 year ago

tsantra commented 1 year ago

Model: llama-2-7b-hf
Ubuntu: 22.04

xpu-smi discovery:

image

uname -r

image

Steps followed:

  1. Created a conda env (followed the instructions in the repo example).
  2. Initialized oneAPI:
     image
  3. Used the default dataset for model finetuning:
     image
  4. sycl-ls info:
     image
  5. lscpu:
     image
  6. xpu-smi stats -d 1:
     image

rnwang04 commented 1 year ago

Yeah, I have reproduced this error. On your machine inference works fine, but finetuning segfaults at rope. I still have no idea about the root cause. By the way, below are our machine's Linux version and driver version: image

Not sure whether this error is caused by the above version mismatch.

@qiuxin2012 @yangw1234 any suggestions?
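To make the version comparison above easier to repeat, the kernel and device details can be gathered into one report on each machine. This is only a minimal sketch: the helper names `run_if_present` and `collect_env_report` are hypothetical, and it assumes `xpu-smi` and `sycl-ls` are on `PATH` (`sycl-ls` usually needs oneAPI to be initialized first). Commands that are missing are noted rather than treated as errors.

```shell
#!/bin/sh
# Sketch: collect the version details discussed in this thread into one report.
# Helper names are hypothetical; only uname, xpu-smi and sycl-ls come from the thread.

# Print a header, then run the command if it exists; otherwise note that it is missing.
run_if_present() {
  echo "== $* =="
  if command -v "$1" >/dev/null 2>&1; then
    "$@" 2>&1 || echo "(exited with status $?)"
  else
    echo "($1 not found)"
  fi
}

# Kernel version, GPU discovery and SYCL device list, in that order.
collect_env_report() {
  run_if_present uname -r
  run_if_present xpu-smi discovery
  run_if_present sycl-ls
}
```

Calling `collect_env_report > env_report.txt` on both machines (the file name is just a placeholder) gives two text files that can be compared side by side.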

tsantra commented 12 months ago

Any suggestions on what to do next? Thank you!

rnwang04 commented 12 months ago

We have verified locally that 6.2.0-35-generic + Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26690] can run QLoRA on our A770 machine. So it does not seem to be a version issue, and I have checked that the Python package versions are all the same. But I found that on your machine, the output of sudo xpu-smi stats -d 0 is a little strange: image

On ours, it is: image

Not sure whether this issue is caused by some installation error or anything else.
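One way to compare the two `xpu-smi stats` outputs without relying on screenshots is to save each to a text file and diff them. The sketch below assumes both captures have been copied to the same directory; `compare_stats` and the file names are hypothetical. Live readings such as temperature or power will naturally differ, so the lines worth attention are fields that are missing or unreadable on one machine only.

```shell
#!/bin/sh
# Sketch: show the lines that differ between two saved `xpu-smi stats` captures.
# compare_stats and the file names in the comments are hypothetical placeholders.
compare_stats() {
  # diff exits with 1 when the files differ; treat that as a normal result.
  # Any other non-zero status (for example a missing file) is still reported as failure.
  diff -u "$1" "$2" || [ "$?" -eq 1 ]
}

# On each machine (the command is the one quoted above):
#   sudo xpu-smi stats -d 0 > stats_machine_name.txt
# Then, with both files in one place:
#   compare_stats stats_working.txt stats_failing.txt
```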

rnwang04 commented 12 months ago

Below are the steps we used to set up our Arc env on Ubuntu 22.04.3; maybe you can refer to them.

Commands on Ubuntu 22.04.3:
```bash
# install arc driver 
sudo apt-get install -y gpg-agent wget
wget -qO - https://repositories.intel.com/graphics/intel-graphics.key | \
  sudo gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg
echo 'deb [arch=amd64,i386 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/graphics/ubuntu jammy arc' | \
  sudo tee  /etc/apt/sources.list.d/intel.gpu.jammy.list

# downgrade kernel

sudo apt-get update && sudo apt-get install  -y --install-suggests  linux-image-5.19.0-41-generic

sudo sed -i "s/GRUB_DEFAULT=.*/GRUB_DEFAULT=\"1> $(echo $(($(awk -F\' '/menuentry / {print $2}' /boot/grub/grub.cfg \
| grep -no '5.19.0-41' | sed 's/:/\n/g' | head -n 1)-2)))\"/" /etc/default/grub

sudo  update-grub

sudo reboot
# The 5.19 kernel does not have any Arc graphics driver yet, so the machine may not start the desktop correctly, but you can still log in over ssh.
# Or you can select the 5.19 recovery mode in GRUB, then choose resume to continue the normal boot directly.

# remove latest kernel

# quote the pattern so the shell passes it to apt unexpanded
sudo apt purge 'linux-image-6.2.0-*'

sudo apt autoremove

sudo reboot

# install drivers

sudo apt-get update

sudo apt-get -y install \
    gawk \
    dkms \
    linux-headers-$(uname -r) \
    libc6-dev

sudo apt-get install -y intel-platform-vsec-dkms intel-platform-cse-dkms intel-i915-dkms intel-fw-gpu

sudo apt-get install -y gawk libc6-dev udev \
  intel-opencl-icd intel-level-zero-gpu level-zero \
  intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2 \
  libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \
  libglapi-mesa libgles2-mesa-dev libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \
  mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo

sudo reboot

# Configuring permissions

sudo gpasswd -a ${USER} render
newgrp render

# Verify the device is working with i915 driver
sudo apt-get install -y hwinfo
hwinfo --display

# install oneAPI

wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt update
sudo apt install intel-basekit