LocalAI version:
v2.21.1 (33b2d38dd0198d78dbc26aa020acfb6ff4c4048c) localai/localai:latest-gpu-hipblas
Environment, CPU architecture, OS, and Version:
Docker version 27.3.1 (build ce12230) running localai:latest-gpu-hipblas, on Ubuntu 22.04.5 LTS (6.11.0-x64v4-xanmod) with an AMD EPYC 9000 series CPU and an AMD Radeon RX 7800 XT
Describe the bug:
There is a weird behavior with the release image (localai/localai:latest-gpu-hipblas = 2.21.1): only partial GPU functionality when using the AIO-defined models and specifying the llama-cpp-grpc backend.
It is possible to enter the Docker container and manually run the full HIP conformance test suite successfully.
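For reference, a quick way to confirm the GPU is visible from inside the running container (a rough sketch; it assumes rocminfo/rocm-smi are present in the image, and the container name local-ai is hypothetical):
docker exec -it local-ai /bin/bash
# inside the container:
rocminfo | grep -i gfx    # the RX 7800 XT should show up as gfx1101
rocm-smi                  # basic GPU utilization overview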
- Embedding generation (embeddings.yaml) runs with GPU acceleration
- Audio transcription (speech-to-text.yaml) runs with whisper GPU acceleration
- Image generation (image-gen.yaml) runs with GPU stablediffusion
BUT, nothing based on llama-cpp runs on the GPU. It just hangs with 100% CPU utilization at the stage of loading the model.
gpt-4 (text-to-text.yaml) with backend=llama-cpp-grpc
10:07PM INF Loading model 'Hermes-2-Pro-Llama-3-8B-Q4_K_M.gguf' with backend llama-cpp-fallback
10:07PM DBG Loading model in memory from file: /build/models/Hermes-2-Pro-Llama-3-8B-Q4_K_M.gguf
10:07PM DBG GRPC(Hermes-2-Pro-Llama-3-8B-Q4_K_M.gguf-127.0.0.1:43109): stderr llm_load_print_meta: max token length = 256
.... CPU 100%
The next step would be loading the model, which it never reaches. It just stays at 100% CPU forever.
The same behavior is observed for each of
- llama-cpp-fallback
- llama-cpp-grpc
- llama-cpp-hipblas
which is to be expected since it is the same file. However, it is odd that llama-cpp-fallback also locks up in exactly the same way, given that it should run on the CPU yet apparently still tries to use the GPU.
Rebuilding local-ai does not solve it either. However, building the llama-cpp-fallback backend without BUILD_TYPE=hipblas, i.e. with make backend-assets/grpc/llama-cpp-fallback, at least enables the CPU version of llama-cpp to run.
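For clarity, these are roughly the two builds that were compared (a sketch; only BUILD_TYPE differs, all other Makefile variables are left at their defaults):
# hipblas build of the fallback backend - this variant hangs while loading the model
BUILD_TYPE=hipblas make backend-assets/grpc/llama-cpp-fallback
# plain build of the same target - this CPU-only variant works
make backend-assets/grpc/llama-cpp-fallback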
To Reproduce
There seems to be a bug in the docker image: it presents the user with an error unless the runtime ROCm LLVM OpenMP libraries (LD_LIBRARY_PATH=/opt/rocm/lib/llvm/lib) are added at start (the openmp-extras-runtime package does not set up the ld paths):
--> ERROR libomp.so not found.
Startup with the workaround applied:
docker pull localai/localai:latest-gpu-hipblas
docker run -ti --rm \
--privileged \
-p 8080:8080 \
-e DEBUG=true \
-e LD_LIBRARY_PATH=/opt/rocm/lib/llvm/lib \
--security-opt seccomp=unconfined \
--device /dev/dri \
--device /dev/kfd \
--group-add video \
-v /mnt/raid6/local-ai/models:/build/models \
localai/localai:latest-gpu-hipblas
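With the container up, a single chat completion against the AIO gpt-4 model is enough to trigger the hang (a minimal example; the prompt content is arbitrary):
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4", "messages": [{"role": "user", "content": "Hello"}]}'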
Expected behavior
Since gfx1100/gfx1101 are supported, llama-cpp should run on the GPU. It is possible to run the entire HIP conformance test suite manually from the image, so the GPU is definitely working, as the successful operation of whisper, embeddings, and stablediffusion also indicates.
Logs
Additional context
Has anyone else seen something similar?