intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Heavy CPU bottleneck when working with Intel ARC A770 16GB GPU Inference #10668

Closed: ElliottDyson closed this issue 6 months ago

ElliottDyson commented 6 months ago

Hello, I get extremely low GPU utilisation, and I've noticed that during generation one of my CPU cores gets pinned at 100% while all the other cores sit mostly idle. I can confirm the GPU compute is definitely being used, but whatever operations are being done on the CPU are very inefficient at the moment. Maybe it's because it's an AMD CPU (a 2700X); if that is the case, it would be nice not to have to buy a new CPU and motherboard just to get decent inference performance from my GPU.
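
For reference, this is one quick way to watch where the load lands on a Linux host (just a sketch; it assumes the intel-gpu-tools package is available for your distribution):

sudo apt install intel-gpu-tools   # provides intel_gpu_top
sudo intel_gpu_top                 # per-engine busy percentages for the Arc GPU during generation
htop                               # per-core CPU view; a single core pinned at 100% points to a host-side bottleneck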

I also have an issue where really long input prompts produce allocations greater than 4GB in size, which crashes the program.

If there are any more details you need from me, just ask. Thank you.

hkvision commented 6 months ago

Hi,

We haven't tested on AMD CPUs, but I suspect the CPU may not be the bottleneck... You can follow our benchmarking guide to verify the performance: https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/benchmark_quickstart.html
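
In practice that guide boils down to running the all-in-one benchmark script against your model. Roughly (a sketch, assuming an XPU installation of ipex-llm with the benchmark's dependencies already in place):

git clone https://github.com/intel-analytics/ipex-llm.git
cd ipex-llm/python/llm/dev/benchmark/all-in-one
# edit config.yaml: point repo_id at your local model and pick an XPU test_api such as transformer_int4_gpu
source /opt/intel/oneapi/setvars.sh
python run.py   # latency results are written to a CSV in the current directory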

For long input prompts, this is due to a limitation on allocating large tensors. We are working on a fix; you can follow the progress in https://github.com/intel-analytics/ipex-llm/issues/10513 and https://github.com/intel-analytics/ipex-llm/issues/10511

digitalscream commented 6 months ago

I'm having a similar issue - running Mistral Instruct v0.2 with an A770 16GB, I get no more than 10.4 tokens/s with the vllm_worker, and GPU utilisation hovers around 6%. Using model_worker, it drops to 4-6 tokens/s.

Hardware spec: Ryzen 3600, 96GB RAM, A770 16GB, with Resizable BAR and >4G decoding enabled.

It's certainly nowhere near the speed of the demos (I'm comparing it visually to the demos here https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/continue_quickstart.html).

All my tests have been using Docker, which I guess adds another wrinkle (I'm not too familiar with the details of GPU passthrough, although I am passing --device /dev/dri to the container - I presume this is sufficient?). I've tried it using both an image built based on Ubuntu base and the intelanalytics/ipex-llm-xpu:2.1.0-SNAPSHOT image - the result is the same.

If it helps, this is my Dockerfile:

FROM intelanalytics/ipex-llm-xpu:2.1.0-SNAPSHOT

# Disable pip's cache behavior
ARG PIP_NO_CACHE_DIR=false

# Install Serving Dependencies
RUN cd /llm && \
    pip install --pre --upgrade ipex-llm[serving] && \
    pip install transformers==4.36.2 gradio==4.19.2

RUN apt-get update -y && apt-get install -y git-lfs
RUN pip3 install fschat
RUN pip3 install fschat[model_worker,webui] pydantic==1.10.15

VOLUME [ "/llm", "/root/.cache/huggingface" ]

COPY config.py /usr/local/lib/python3.9/dist-packages/ipex_llm/vllm/
COPY ./entrypoint.sh /opt/entrypoint.sh
RUN chmod +x /opt/entrypoint.sh

WORKDIR /llm/
EXPOSE 7860 8000
ENTRYPOINT [ "/opt/entrypoint.sh" ]

entrypoint.sh is just the one from ipex-llm/docker/llm/serving/xpu/docker.
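
A minimal sketch of the kind of docker run invocation this setup implies, with --device /dev/dri for GPU access as mentioned above (the image tag and host cache path here are illustrative, not the exact command I use):

# Ports match the EXPOSE lines in the Dockerfile; /dev/dri exposes the Intel
# GPU's DRM render nodes to the container.
docker run -it --rm \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 7860:7860 -p 8000:8000 \
  my-ipex-llm-serving-image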

digitalscream commented 6 months ago

@ElliottDyson - and, as soon as I posted the above, I found the problem. Follow the instructions here:

https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html

You need both the updated driver and the updated oneAPI installation, if your problem is the same as mine. This was the result: well over twice as fast, with GPU usage going from 6% to pegged at 25%:

Screencast from 2024-04-07 11-28-07.webm

EDIT: I've just been doing some testing using the Continue extension in VS Code - it's averaging around 55 t/s with Mistral-Instruct-7b-v0.2 and 25 t/s with laser-dolphin-mixtral-2x7b-dpo-AWQ. That's way better than "acceptable", and I can't imagine it'll get much better with the current generation of Arc GPUs.
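
For anyone checking the same thing, a quick sanity check after updating the driver and oneAPI (a sketch; sycl-ls ships with oneAPI and clinfo comes from the standard OpenCL utility package):

source /opt/intel/oneapi/setvars.sh   # load the updated oneAPI environment
sycl-ls                               # the Arc A770 should be listed as a Level Zero GPU device
clinfo | grep -i "driver version"     # shows which compute runtime/driver version is actually in use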

ElliottDyson commented 6 months ago

> You need both the updated driver and the updated oneAPI installation, if your problem is the same as mine.

That may be it, actually. Whilst I do have the latest oneAPI installation, as I followed this guide, when I updated my drivers I postponed the restart (I'm on Windows) and simply forgot to do it before attempting all of this 🤦‍♂️.

I'll report back soon. Thank you

digitalscream commented 6 months ago

Hey, @ElliottDyson - just curious, did the driver update/restart solve this for you?

ElliottDyson commented 6 months ago

> Hey, @ElliottDyson - just curious, did the driver update/restart solve this for you?

Hello, it did indeed solve the issue. Thanks

digitalscream commented 6 months ago

> Hey, @ElliottDyson - just curious, did the driver update/restart solve this for you?
>
> Hello, it did indeed solve the issue. Thanks

Ace! Glad to know my experience wasn't unique, then (EDIT: translation...at least I wasn't the only one ;) ).