mudler / LocalAI

:robot: The free, Open Source alternative to OpenAI, Claude and others. Self-hosted and local-first. Drop-in replacement for OpenAI, running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many more model architectures. Features: Generate Text, Audio, Video, Images, Voice Cloning, Distributed inference
https://localai.io
MIT License

Inferencing not working with P2P in latest version. #3968

Open j4ys0n opened 1 week ago

j4ys0n commented 1 week ago

LocalAI version:

localai/localai:latest-gpu-nvidia-cuda-12 LocalAI version: v2.22.1 (015835dba2854572d50e167b7cade05af41ed214)

Environment, CPU architecture, OS, and Version:

Linux localai3 6.8.12-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-2 (2024-09-05T10:03Z) x86_64 GNU/Linux (Proxmox LXC, Debian). AMD EPYC 7302P (16 cores allocated) / 64 GB RAM

Describe the bug

When testing distributed inferencing, I select a model (Qwen 2.5 14B) and send a chat message. The model loads on both instances (main and worker), but it never responds, and the model then unloads on the worker (observed with nvitop).

To Reproduce

The description above should reproduce it; I tried a few times. A request like the one sketched below is enough to trigger it.
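
For example (a minimal sketch; the host and port are placeholders for my setup, the model name is the one that appears in the logs below):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-14b-instruct",
    "messages": [{"role": "user", "content": "Hello, can you introduce yourself?"}]
  }'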

Expected behavior

The model should not unload, and the chat completion should finish.

Logs

Worker logs:

{"level":"INFO","time":"2024-10-26T05:07:23.924Z","caller":"discovery/dht.go:115","message":" Bootstrapping DHT"}
create_backend: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX 2000 Ada Generation, compute capability 8.9, VMM: yes
Starting RPC server on 127.0.0.1:46609, backend memory: 16380 MB
Accepted client connection, free_mem=17175674880, total_mem=17175674880
Client connection closed
Accepted client connection, free_mem=17175674880, total_mem=17175674880
Client connection closed
Accepted client connection, free_mem=17175674880, total_mem=17175674880
Client connection closed
Accepted client connection, free_mem=17175674880, total_mem=17175674880
Client connection closed

Accepted client connection, free_mem=17175674880, total_mem=17175674880
Client connection closed
Accepted client connection, free_mem=17175674880, total_mem=17175674880
Client connection closed
Accepted client connection, free_mem=17175674880, total_mem=17175674880
Client connection closed
Accepted client connection, free_mem=17175674880, total_mem=17175674880
Client connection closed

Main logs:

5:25AM INF Success ip=my.ip.address latency="960.876µs" method=POST status=200 url=/v1/chat/completions
5:25AM INF Trying to load the model 'qwen2.5-14b-instruct' with the backend '[llama-cpp llama-ggml llama-cpp-fallback rwkv stablediffusion whisper piper huggingface bert-embeddings /build/backend/python/rerankers/run.sh /build/backend/python/diffusers/run.sh /build/backend/python/vall-e-x/run.sh /build/backend/python/parler-tts/run.sh /build/backend/python/sentencetransformers/run.sh /build/backend/python/mamba/run.sh /build/backend/python/openvoice/run.sh /build/backend/python/coqui/run.sh /build/backend/python/bark/run.sh /build/backend/python/transformers-musicgen/run.sh /build/backend/python/transformers/run.sh /build/backend/python/exllama2/run.sh /build/backend/python/sentencetransformers/run.sh /build/backend/python/autogptq/run.sh /build/backend/python/vllm/run.sh]'
5:25AM INF [llama-cpp] Attempting to load
5:25AM INF Loading model 'qwen2.5-14b-instruct' with backend llama-cpp
5:25AM INF [llama-cpp-grpc] attempting to load with GRPC variant
5:25AM INF Redirecting 127.0.0.1:35625 to /ip4/worker-ip/udp/44701/quic-v1
5:25AM INF Redirecting 127.0.0.1:35625 to /ip4/worker-ip/udp/44701/quic-v1
5:25AM INF Redirecting 127.0.0.1:35625 to /ip4/worker-ip/udp/44701/quic-v1
5:25AM INF Redirecting 127.0.0.1:35625 to /ip4/worker-ip/udp/44701/quic-v1
5:25AM INF Success ip=127.0.0.1 latency="35.55µs" method=GET status=200 url=/readyz
5:26AM INF Node localai-oYURMqpWCR is offline, deleting
Error accepting:  accept tcp 127.0.0.1:35625: use of closed network connection

Additional context

This worked in the previous version, though I'm not sure which one that was at this point (~2 weeks ago). The model loads and works fine without the worker.

j4ys0n commented 1 week ago

I'm using Docker Compose; here's the config: https://github.com/j4ys0n/local-ai-stack. The relevant part is roughly sketched below.
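
Simplified sketch of the two services (the worker subcommand and the p2p-related variable names here are assumptions based on my reading of the LocalAI distributed-inference docs, so they may not match the repo above exactly; GPU passthrough and volumes are omitted for brevity):

services:
  localai:
    image: localai/localai:latest-gpu-nvidia-cuda-12
    ports:
      - "8080:8080"
    environment:
      # assumed variable names: enable p2p mode and share the p2p token
      - LOCALAI_P2P=true
      - TOKEN=${P2P_TOKEN}
  localai-worker:
    image: localai/localai:latest-gpu-nvidia-cuda-12
    # assumed worker subcommand for the llama.cpp RPC worker
    command: worker p2p-llama-cpp-rpc
    environment:
      - TOKEN=${P2P_TOKEN}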