mudler / LocalAI

:robot: The free, Open Source alternative to OpenAI, Claude and others. Self-hosted and local-first. Drop-in replacement for OpenAI, running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many other model architectures. Features: Generate Text, Audio, Video, Images, Voice Cloning, Distributed, P2P inference
https://localai.io
MIT License

Listen address and port ignored when using local-ai worker llama-cpp-rpc #3427

Closed: sfxworks closed this issue 2 months ago

sfxworks commented 2 months ago

LocalAI version: quay.io/go-skynet/local-ai:latest-aio-gpu-hipblas

Environment, CPU architecture, OS, and Version: Kubernetes (k8s), deployed with the following manifest:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: local-ai-epyc7713
  namespace: local-ai
  labels:
    app: local-ai-epyc7713
spec:
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: local-ai-epyc7713
  replicas: 1
  template:
    metadata:
      labels:
        app: local-ai-epyc7713
      name: local-ai-epyc7713
    spec:
      nodeSelector:
        kubernetes.io/hostname: epyc7713
      containers:
        - env:
          - name: DEBUG
            value: "true"
          - name: LOCALAI_THREADS
            value: "4"
          args:
          - worker 
          - llama-cpp-rpc
          - "0.0.0.0"
          name: local-ai
          ports:
          - containerPort: 50052
            name: rpc
          image: quay.io/go-skynet/local-ai:latest-aio-gpu-hipblas
          imagePullPolicy: IfNotPresent
          resources:
            limits:
              amd.com/gpu: "1"
              cpu: "4"
              memory: 16Gi
            requests:
              cpu: "4"
              memory: 4Gi
          volumeMounts:
            - name: models-volume
              mountPath: /build/models
      volumes:
        - name: models-volume
          persistentVolumeClaim:
            claimName: models-epyc7713

Describe the bug

When running local-ai worker llama-cpp-rpc and trying to tell it to listen on all addresses, the command fails with an unknown-flag error.

To Reproduce

./local-ai worker llama-cpp-rpc -H 0.0.0.0 -P 50052
...
Usage: local-ai worker llama-cpp-rpc [<models> ...] [flags]

Starts a llama.cpp worker in standalone mode

Arguments:
  [<models> ...]    Model configuration URLs to load

Flags:
  -h, --help                   Show context-sensitive help.
      --log-level=LOG-LEVEL    Set the level of logs to output [error,warn,info,debug,trace] ($LOCALAI_LOG_LEVEL)

storage
  --backend-assets-path="/tmp/localai/backend_data"    Path used to extract libraries that are required by some of the backends in runtime ($LOCALAI_BACKEND_ASSETS_PATH,
                                                       $BACKEND_ASSETS_PATH)

local-ai: error: unknown flag -H, did you mean "-h"?

Expected behavior

The worker should honor the requested listen address and port, as the underlying llama-cpp-rpc-server binary does when invoked directly:

root@local-ai-epyc7713-cpu-56685fd776-c9kk6:/build# /tmp/localai/backend_data/backend-assets/util/llama-cpp-rpc-server -H 0.0.0.0 -p 50053
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: Host ('0.0.0.0') is != '127.0.0.1'
         Never expose the RPC server to an open network!
         This is an experimental feature and is not secure!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

create_backend: using CPU backend
Starting RPC server on 0.0.0.0:50053, backend memory: 128643 MB

Logs

Starts completely fine in the pod, but only listens on localhost, which is useless here.

kubectl logs -f -n local-ai local-ai-epyc7713-7dc95b889f-tlk68
===> LocalAI All-in-One (AIO) container starting...
AMD GPU detected
Non-NVIDIA GPU detected. Specific GPU memory size detection is not implemented.
===> Starting LocalAI[gpu-8g] with the following models: /aio/gpu-8g/embeddings.yaml,/aio/gpu-8g/rerank.yaml,/aio/gpu-8g/text-to-speech.yaml,/aio/gpu-8g/image-gen.yaml,/aio/gpu-8g/text-to-text.yaml,/aio/gpu-8g/speech-to-text.yaml,/aio/gpu-8g/vision.yaml
@@@@@
Skipping rebuild
@@@@@
If you are experiencing issues with the pre-compiled builds, try setting REBUILD=true
If you are still experiencing issues with the build, try setting CMAKE_ARGS and disable the instructions set as needed:
CMAKE_ARGS="-DGGML_F16C=OFF -DGGML_AVX512=OFF -DGGML_AVX2=OFF -DGGML_FMA=OFF"
see the documentation at: https://localai.io/basics/build/index.html
Note: See also https://github.com/go-skynet/LocalAI/issues/288
@@@@@
CPU info:
model name      : AMD EPYC 7713 64-Core Processor
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin brs arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap
CPU:    AVX    found OK
CPU:    AVX2   found OK
CPU: no AVX512 found
@@@@@
1:45AM INF env file found, loading environment variables from file envFile=.env
1:45AM DBG Setting logging to debug
1:45AM DBG Extracting backend assets files to /tmp/localai/backend_data
create_backend: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon PRO W6800, compute capability 10.3, VMM: no
sfxworks commented 2 months ago

Current hack: mount the /tmp path of the llama.cpp RPC server and adjust the container command.
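
For reference, a rough sketch of what that hack can look like on the Deployment above, assuming the extracted backend assets are available under /tmp/localai (the path shown in the expected-behavior output); the volume name is illustrative:

        - name: local-ai
          image: quay.io/go-skynet/local-ai:latest-aio-gpu-hipblas
          # Bypass the local-ai CLI and run the extracted RPC binary directly,
          # since it accepts -H/-p as shown in the expected-behavior output above
          command:
          - /tmp/localai/backend_data/backend-assets/util/llama-cpp-rpc-server
          args:
          - "-H"
          - "0.0.0.0"
          - "-p"
          - "50052"
          volumeMounts:
            - name: backend-assets   # illustrative; must provide the extracted /tmp/localai contents
              mountPath: /tmp/localai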

mudler commented 2 months ago

@sfxworks did you try passing -- in the args?

E.g. this works here:

./local-ai worker llama-cpp-rpc -- -H 1.1.1.1 -p 50052 -m 20
9:49AM INF env file found, loading environment variables from file envFile=.env
9:49AM INF Setting logging to info
create_backend: using CPU backend
Starting RPC server on 1.1.1.1:50052, backend memory: 20 MB
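
Applied to the Deployment above, that would mean adding the -- separator to the container args so everything after it is forwarded to the RPC server. A sketch, assuming the image entrypoint passes these args through to local-ai unchanged, as the original manifest already relies on:

          args:
          - worker
          - llama-cpp-rpc
          - "--"
          - "-H"
          - "0.0.0.0"
          - "-p"
          - "50052"
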
mudler commented 2 months ago

There is a bit of misalignment in how this works without P2P. My bad, I neglected this piece as I usually run it with P2P. I think it would be better to be consistent across both commands and then have:

./local-ai worker llama-cpp-rpc --llama-cpp-args="-H 1.1.1.1 -p 50052 -m 20"

I've opened https://github.com/mudler/LocalAI/pull/3428 to make it more consistent. Thanks for pointing it out, @sfxworks.
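
Once that change lands, the Deployment args could presumably be written with a single flag instead of the -- separator; a sketch assuming the --llama-cpp-args syntax shown above:

          args:
          - worker
          - llama-cpp-rpc
          - "--llama-cpp-args=-H 0.0.0.0 -p 50052"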

sfxworks commented 2 months ago

Gotcha, and np thank you!