Closed ElliottDyson closed 6 months ago
Hi,
We haven't tested on AMD CPUs, but I suspect the CPU may not be the bottleneck. You can follow our benchmarking guide to verify the performance: https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/benchmark_quickstart.html
For long input prompts, this is due to a limitation on allocating large tensors. We are working on a fix; you can follow the issues at https://github.com/intel-analytics/ipex-llm/issues/10513 and https://github.com/intel-analytics/ipex-llm/issues/10511
I'm having a similar issue - running Mistral Instruct v0.2 on an A770 16GB, I get no more than 10.4 tokens/s with the vllm_worker, and GPU utilisation hovers around 6%. Using model_worker, it drops to 4-6 tokens/s.
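For anyone comparing numbers: this is roughly how I'm quoting tokens/s, i.e. generated tokens divided by wall-clock time. The token count and elapsed time below are placeholders, not from a real run:

```shell
# Back-of-the-envelope throughput: generated tokens / wall-clock seconds.
# TOKENS and ELAPSED are placeholder values - substitute your own measurements.
TOKENS=256
ELAPSED=24.6
THROUGHPUT=$(awk -v t="$TOKENS" -v s="$ELAPSED" 'BEGIN { printf "%.1f tokens/s", t / s }')
echo "$THROUGHPUT"
```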
Hardware spec is Ryzen 3600, 96GB RAM, A770 16GB, with ReBAR and >4G decoding enabled.
It's certainly nowhere near the speed of the demos (I'm comparing it visually to the demos here https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/continue_quickstart.html).
All my tests have been in Docker, which I guess adds another wrinkle (I'm not too familiar with the details of GPU passthrough, although I am passing --device /dev/dri to the container - I presume this is sufficient?). I've tried both an image built on an Ubuntu base and the intelanalytics/ipex-llm-xpu:2.1.0-SNAPSHOT image - the result is the same.
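For reference, this is a sketch of the passthrough I'm using - built as a string and echoed so it can be inspected before running. The volume mount and port mappings are illustrative, from my own setup; only --device /dev/dri and the image tag come from the thread:

```shell
# --device /dev/dri passes the whole DRI directory (card + render nodes)
# through to the container; ports match the Dockerfile's EXPOSE lines.
DOCKER_CMD='docker run -it --device /dev/dri \
  -p 7860:7860 -p 8000:8000 \
  -v "$PWD/models:/llm/models" \
  intelanalytics/ipex-llm-xpu:2.1.0-SNAPSHOT'
echo "$DOCKER_CMD"
```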
If it helps, this is my Dockerfile:
FROM intelanalytics/ipex-llm-xpu:2.1.0-SNAPSHOT
# Disable pip's cache behavior
ARG PIP_NO_CACHE_DIR=false
# Install Serving Dependencies
RUN cd /llm && \
    pip install --pre --upgrade ipex-llm[serving] && \
    pip install transformers==4.36.2 gradio==4.19.2
RUN apt-get update -y && apt-get install -y git-lfs
RUN pip3 install fschat
RUN pip3 install fschat[model_worker,webui] pydantic==1.10.15
VOLUME [ "/llm", "/root/.cache/huggingface" ]
COPY config.py /usr/local/lib/python3.9/dist-packages/ipex_llm/vllm/
COPY ./entrypoint.sh /opt/entrypoint.sh
RUN chmod +x /opt/entrypoint.sh
WORKDIR /llm/
EXPOSE 7860 8000
ENTRYPOINT [ "/opt/entrypoint.sh" ]
entrypoint.sh is just the one from ipex-llm/docker/llm/serving/xpu/docker.
@ElliottDyson - and, as soon as I posted the above, I found the problem. Follow the instructions here:
https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html
You need both the updated driver and the updated oneAPI installations, if your problem is the same as mine. This was the result, well over twice as fast and GPU usage went from 6% to pegged at 25%:
Screencast from 2024-04-07 11-28-07.webm
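In case it helps anyone else diagnose the same thing, this is roughly how I sanity-check the stack after updating. The path is the default oneAPI install location, and sycl-ls ships with the oneAPI toolkit - adjust if your install differs:

```shell
# Verify the compute stack after a driver/oneAPI update. Path assumes the
# default oneAPI install location; adjust if you installed elsewhere.
ONEAPI_VARS=/opt/intel/oneapi/setvars.sh
if [ -f "$ONEAPI_VARS" ]; then
  . "$ONEAPI_VARS" >/dev/null
  sycl-ls    # the A770 should show up as a level_zero GPU device
else
  echo "oneAPI not found at $ONEAPI_VARS"
fi
```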
EDIT: Just been doing some testing using the Continue extension in VS Code - it's averaging around 55t/s with Mistral-Instruct-7b-v0.2, and 25t/s with laser-dolphin-mixtral-2x7b-dpo-AWQ. That's way better than "acceptable", and I can't imagine it'll get much better with the current generation of Arc GPUs.
That may be it, actually. Whilst I do have the latest oneAPI installation (I followed this guide), when I updated my drivers I postponed the restart (I'm on Windows) and simply forgot to do it before attempting all of this 🤦‍♂️.
I'll report back soon. Thank you
Hey, @ElliottDyson - just curious, did the driver update/restart solve this for you?
Hello, it did indeed solve the issue. Thanks
Ace! Glad to know my experience wasn't unique, then (EDIT: translation...at least I wasn't the only one ;) ).
Hello, I get extremely low GPU utilisation, and I noticed that during generation one of my CPU cores gets pinned at 100% while all the other cores sit mostly idle. I can confirm it is definitely using the GPU for compute, but whatever operations are being done on the CPU are very inefficient at the moment. Maybe it's because it's an AMD CPU (2700X), but if that's the case, it would be nice not to have to buy a new CPU and motherboard to get decent inference performance from my GPU.
I also have an issue where really long input prompts trigger allocations greater than 4GB in size and therefore crash the program.
If there are any more details you need from me, just ask. Thank you.
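If it helps with reproducing this, the one-core-pinned pattern is easy to spot by watching CPU and GPU utilisation side by side. intel_gpu_top is from the intel-gpu-tools package, not part of ipex-llm; this sketch just checks for it and prints what to run:

```shell
# See whether generation work lands on the GPU or stays pinned to one CPU
# core. intel_gpu_top needs the intel-gpu-tools package and root privileges.
HAVE_IGT=$(command -v intel_gpu_top || echo "missing")
if [ "$HAVE_IGT" = "missing" ]; then
  echo "intel_gpu_top not found - install the intel-gpu-tools package"
else
  # 'sudo intel_gpu_top' shows per-engine GPU busy %; run 'top' and press 1
  # in another terminal to see per-core CPU usage while generating.
  echo "found intel_gpu_top at $HAVE_IGT - run it (as root) during generation"
fi
```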