mudler / LocalAI

:robot: The free, Open Source alternative to OpenAI, Claude and others. Self-hosted and local-first. Drop-in replacement for OpenAI, running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many more model architectures. Features: Generate Text, Audio, Video, Images, Voice Cloning, Distributed inference
https://localai.io
MIT License

[p2p, cuda, compatibility] Compatibility issue between GPU and CPU instances when building a p2p network #2735

Closed. JackBekket closed this issue 1 month ago.

JackBekket commented 2 months ago

LocalAI version:

local-aio-gpu-nvidia-cuda-12, local-aio-cpu

Environment, CPU architecture, OS, and Version:

Describe the bug

I am trying to build a p2p network, and it's actually working: peers can discover each other and exchange tasks (https://localai.io/features/distribute/).

There is a problem if you try to connect aio-cpu images to aio-gpu images and vice versa. It looks like we can only have CPU-only networks and GPU-only networks, because if you launch local-ai-gpu as the host and aio-cpu as a worker, the host will try to assemble a CUDA backend on the worker side and hit CUDA errors, because the worker device has no GPU.

The same thing possibly happens in reverse: if you have aio-cpu as the host and aio-gpu as a worker, the GPU instance will get a backend with CPU-only output.

4:10PM DBG GRPC(code-13b.Q5_K_M.gguf-127.0.0.1:39375): stderr GGML_ASSERT: /build/backend/cpp/llama-avx2/llama.cpp/ggml/src/ggml-cuda.cu:100: !"CUDA error"
4:10PM ERR Server error error="rpc error: code = Unavailable desc = error reading from server: EOF" ip=172.20.0.2 latency=17.191049654s method=POST status=500 url=/chat/completions

To Reproduce

Start local-ai-gpu-nvidia-cuda-12 as the host node and local-ai-cpu as a worker node.

Expected behavior

The host side should recognize that the worker is a CPU-only instance and should not try to build a GPU backend for it.
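
A minimal Go sketch of what such a capability check could look like on the host side, assuming the worker advertises whether it has a GPU. The `WorkerInfo` struct, `HasGPU` field, `pickBackendFor` function and the backend names are hypothetical, for illustration only; they are not part of the LocalAI codebase.

```go
package main

import "fmt"

// WorkerInfo is a hypothetical capability record a worker could advertise
// to the host. Neither the type nor the field names exist in LocalAI today.
type WorkerInfo struct {
	ID     string
	HasGPU bool
}

// pickBackendFor is an illustrative stand-in for the host's backend
// selection: choose a CUDA build only when the worker reports a GPU.
func pickBackendFor(w WorkerInfo) string {
	if w.HasGPU {
		return "llama-cpp-cuda" // placeholder backend name
	}
	return "llama-cpp-avx2" // placeholder backend name
}

func main() {
	cpuWorker := WorkerInfo{ID: "worker-1", HasGPU: false}
	// Prints the CPU backend, so no CUDA assert is triggered on the worker.
	fmt.Println(pickBackendFor(cpuWorker))
}
```

With something like this, the host in the reproduction above would fall back to a CPU build for the local-ai-cpu worker instead of hitting the CUDA assert.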

Logs

Host side:
4:10PM DBG GRPC(code-13b.Q5_K_M.gguf-127.0.0.1:39375): stderr ggml_cuda_compute_forward: RMS_NORM failed
4:10PM DBG GRPC(code-13b.Q5_K_M.gguf-127.0.0.1:39375): stderr CUDA error: the provided PTX was compiled with an unsupported toolchain.
4:10PM DBG GRPC(code-13b.Q5_K_M.gguf-127.0.0.1:39375): stderr   current device: 0, in function ggml_cuda_compute_forward at /build/backend/cpp/llama-avx2/llama.cpp/ggml/src/ggml-cuda.cu:2283
4:10PM DBG GRPC(code-13b.Q5_K_M.gguf-127.0.0.1:39375): stderr   err
4:10PM DBG GRPC(code-13b.Q5_K_M.gguf-127.0.0.1:39375): stderr GGML_ASSERT: /build/backend/cpp/llama-avx2/llama.cpp/ggml/src/ggml-cuda.cu:100: !"CUDA error"
4:10PM ERR Server error error="rpc error: code = Unavailable desc = error reading from server: EOF" ip=172.20.0.2 latency=17.191049654s method=POST status=500 url=/chat/completions
4:10PM DBG Searching for workers
Worker side:
Client connection closed

Additional context

JackBekket commented 2 months ago

If we are using libp2p, maybe we should consider adding PubSub? That way we could exchange info messages between nodes over p2p.
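
A minimal sketch of that idea using go-libp2p-pubsub (gossipsub), assuming each node publishes a small capability message on a shared topic and the host subscribes to it before scheduling work. The topic name `localai-node-info` and the `NodeInfo` struct are made up for illustration; this is not how LocalAI currently wires its p2p layer.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"

	"github.com/libp2p/go-libp2p"
	pubsub "github.com/libp2p/go-libp2p-pubsub"
)

// NodeInfo is a hypothetical capability message; field names are illustrative.
type NodeInfo struct {
	NodeID string `json:"node_id"`
	HasGPU bool   `json:"has_gpu"`
}

func main() {
	ctx := context.Background()

	// Create a libp2p host (peer discovery/bootstrapping omitted for brevity).
	h, err := libp2p.New()
	if err != nil {
		log.Fatal(err)
	}

	// Join a gossipsub topic shared by all nodes in the network.
	ps, err := pubsub.NewGossipSub(ctx, h)
	if err != nil {
		log.Fatal(err)
	}
	topic, err := ps.Join("localai-node-info") // hypothetical topic name
	if err != nil {
		log.Fatal(err)
	}
	sub, err := topic.Subscribe()
	if err != nil {
		log.Fatal(err)
	}

	// Each node announces its capabilities once (could also be periodic).
	self := NodeInfo{NodeID: h.ID().String(), HasGPU: false}
	data, _ := json.Marshal(self)
	if err := topic.Publish(ctx, data); err != nil {
		log.Fatal(err)
	}

	// The host side reads announcements and can avoid GPU backends
	// for peers that report has_gpu: false.
	for {
		msg, err := sub.Next(ctx)
		if err != nil {
			log.Fatal(err)
		}
		var info NodeInfo
		if err := json.Unmarshal(msg.Data, &info); err != nil {
			continue
		}
		fmt.Printf("peer %s has GPU: %v\n", info.NodeID, info.HasGPU)
	}
}
```

On top of a mesh like this, the host could simply skip CUDA backends for any peer that reports has_gpu: false, which is essentially the expected behavior described above.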