mudler / LocalAI

:robot: The free, Open Source alternative to OpenAI, Claude and others. Self-hosted and local-first. Drop-in replacement for OpenAI, running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many more model architectures. Features: Generate Text, Audio, Video, Images, Voice Cloning, Distributed inference
https://localai.io
MIT License

Only use 4 CPU threads in P2P worker cluster #3410

Open titogrima opened 2 months ago

titogrima commented 2 months ago

LocalAI version: v2.20.1-ffmpeg-core Docker image for the two workers and latest-aio-cpu for the master

Environment, CPU architecture, OS, and Version: P2P cluster lab on Docker machines with heterogeneous CPUs (AMD64 and ARM):

Linux clusteria1 6.6.45-0-virt #1-Alpine SMP PREEMPT_DYNAMIC 2024-08-13 08:10:32 aarch64 Linux (8 CPUs, 7 GB RAM) runs one worker
Linux ia 6.1.0-23-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.99-1 (2024-07-15) x86_64 GNU/Linux (12 CPUs, 10 GB RAM) runs the master and one worker

Describe the bug P2P worker mode works fine, but inference always uses only 4 CPU threads on each node. I tried setting LOCALAI_THREADS=12 and --threads 12 on the 12-CPU node, and LOCALAI_THREADS=7 and --threads 7 on the 8-CPU node, and I also tried the THREADS variable in the env file (rough sketch below). If I run only a master without workers, the THREADS variable works without any problem.
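
For reference, a rough sketch of what I tried on the 12-CPU node (same idea with 7 on the 8-CPU node); the exact wiring of my Docker setup is simplified here:

    # .env loaded by the LocalAI containers
    LOCALAI_THREADS=12
    # also tried:
    # THREADS=12

    # and the equivalent flag when starting the master:
    # local-ai run --threads 12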

To Reproduce Launch a P2P worker cluster and set a thread count different from 4.

Expected behavior Each node uses the configured number of threads.

Logs Logs from one worker

create_backend: using CPU backend
Starting RPC server on 127.0.0.1:37885, backend memory: 9936 MB
^C
@@@@@
Skipping rebuild
@@@@@
If you are experiencing issues with the pre-compiled builds, try setting REBUILD=true
If you are still experiencing issues with the build, try setting CMAKE_ARGS and disable the instructions set as needed:
CMAKE_ARGS="-DGGML_F16C=OFF -DGGML_AVX512=OFF -DGGML_AVX2=OFF -DGGML_FMA=OFF"
see the documentation at: https://localai.io/basics/build/index.html
Note: See also https://github.com/go-skynet/LocalAI/issues/288
@@@@@
CPU info:
model name : AMD Ryzen 9 5900X 12-Core Processor
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat npt lbrv nrip_save tsc_scale vmcb_clean flushbyasid pausefilter pfthreshold v_vmsave_vmload vgif umip pku ospke vaes vpclmulqdq rdpid fsrm arch_capabilities
CPU: AVX found OK
CPU: AVX2 found OK
CPU: no AVX512 found
@@@@@
9:22PM INF env file found, loading environment variables from file envFile=.env
9:22PM DBG Setting logging to debug
9:22PM DBG Extracting backend assets files to /tmp/localai/backend_data
{"level":"INFO","time":"2024-08-26T21:22:10.977Z","caller":"config/config.go:288","message":"connmanager disabled\n"}
{"level":"INFO","time":"2024-08-26T21:22:10.977Z","caller":"config/config.go:292","message":" go-libp2p resource manager protection disabled"}
9:22PM INF Starting llama-cpp-rpc-server on '127.0.0.1:34015'
{"level":"INFO","time":"2024-08-26T21:22:10.978Z","caller":"node/node.go:118","message":" Starting EdgeVPN network"}
create_backend: using CPU backend
Starting RPC server on 127.0.0.1:34015, backend memory: 9936 MB
2024/08/26 21:22:10 failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 7168 kiB, got: 416 kiB). See https://github.com/quic-go/quic-go/wiki/UDP-Buffer-Sizes for details.
{"level":"INFO","time":"2024-08-26T21:22:10.987Z","caller":"node/node.go:172","message":" Node ID: 12D3KooWFvq7aNHpre5tyQDZN9Gn2tZh84E3Vf9tfBuCmB5ULJSB"} {"level":"INFO","time":"2024-08-26T21:22:10.987Z","caller":"node/node.go:173","message":" Node Addresses: [/ip4/127.0.0.1/tcp/41065 /ip4/127.0.0.1/udp/43346/quic-v1/webtransport/certhash/uEiA46crpiIhxfL7skSKai7WxlGHkv8mZNXzAYoogm_qhow/certhash/uEiCt91_kaygLCTKWpqX6PEOTzb617BIH7KHDTRrw_eyurw /ip4/127.0.0.1/udp/47629/webrtc-direct/certhash/uEiDbmMPnLfeQJBvFRcfp-zDNXx-_CjljBg0ia3Nr20Xs7g /ip4/127.0.0.1/udp/59911/quic-v1 /ip4/192.168.XX.XX/tcp/41065 /ip4/192.168.XX.XX/udp/43346/quic-v1/webtransport/certhash/uEiA46crpiIhxfL7skSKai7WxlGHkv8mZNXzAYoogm_qhow/certhash/uEiCt91_kaygLCTKWpqX6PEOTzb617BIH7KHDTRrw_eyurw /ip4/192.168.XX.XX/udp/47629/webrtc-direct/certhash/uEiDbmMPnLfeQJBvFRcfp-zDNXx-_CjljBg0ia3Nr20Xs7g /ip4/192.168.XX.XX/udp/59911/quic-v1 /ip6/::1/tcp/33785 /ip6/::1/udp/46892/webrtc-direct/certhash/uEiDbmMPnLfeQJBvFRcfp-zDNXx-_CjljBg0ia3Nr20Xs7g /ip6/::1/udp/49565/quic-v1/webtransport/certhash/uEiA46crpiIhxfL7skSKai7WxlGHkv8mZNXzAYoogm_qhow/certhash/uEiCt91_kaygLCTKWpqX6PEOTzb617BIH7KHDTRrw_eyurw /ip6/::1/udp/59078/quic-v1 /ip6/fda7:761c:127e:4::26/tcp/33785 /ip6/fda7:761c:127e:4::26/udp/46892/webrtc-direct/certhash/uEiDbmMPnLfeQJBvFRcfp-zDNXx-_CjljBg0ia3Nr20Xs7g /ip6/fda7:761c:XXXX:XX::XX/udp/49565/quic-v1/webtransport/certhash/uEiA46crpiIhxfL7skSKai7WxlGHkv8mZNXzAYoogm_qhow/certhash/uEiCt91_kaygLCTKWpqX6PEOTzb617BIH7KHDTRrw_eyurw /ip6/fda7:761c:XXX:XX::XX/udp/59078/quic-v1]"} {"level":"INFO","time":"2024-08-26T21:22:10.987Z","caller":"discovery/dht.go:104","message":" Bootstrapping DHT"} Accepted client connection, free_mem=10418868224, total_mem=10418868224 Client connection closed Accepted client connection, free_mem=10418868224, total_mem=10418868224 Client connection closed Accepted client connection, free_mem=10418868224, total_mem=10418868224 Client connection closed

Additional context

mudler commented 2 months ago

Hey @titogrima - LocalAI doesn't set any thread count when running in p2p mode. This sounds more like a bug in llama.cpp, as we just run the vanilla RPC server from the llama.cpp project. Did you check whether there are related bugs reported upstream?
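
(For context, what the worker launches under the hood is essentially the upstream rpc-server binary; a rough, hypothetical standalone invocation, with values taken from the logs above, would look like:

    /tmp/localai/backend_data/backend-assets/util/llama-cpp-rpc-server --host 127.0.0.1 --port 34015 --mem 9936

so anything thread-related would have to come from that binary itself, not from LocalAI.)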

titogrima commented 2 months ago

Hi!

I checked the llama.cpp repo https://github.com/ggerganov/llama.cpp but didn't see any issue reporting this problem. If LocalAI doesn't set any thread count in p2p mode, it's probably better to open an issue in the llama.cpp repo. I'm going to investigate further, but it's helpful to know that LocalAI doesn't set threads in p2p mode; maybe I can set the threads directly in llama.cpp.

Thanks and sorry for my english XD!!

mudler commented 2 months ago

It might also be worth noting that you can pass any llama.cpp command-line options from LocalAI with --llama-cpp-args or LOCALAI_EXTRA_LLAMA_CPP_ARGS; from the --help output:

./local-ai worker p2p-llama-cpp-rpc --help                        
Usage: local-ai worker p2p-llama-cpp-rpc [flags]

Starts a LocalAI llama.cpp worker in P2P mode (requires a token)

Flags:
  -h, --help                     Show context-sensitive help.
      --log-level=LOG-LEVEL      Set the level of logs to output [error,warn,info,debug,trace]
                                 ($LOCALAI_LOG_LEVEL)

      --token=STRING             P2P token to use ($LOCALAI_TOKEN, $LOCALAI_P2P_TOKEN, $TOKEN)
      --no-runner                Do not start the llama-cpp-rpc-server ($LOCALAI_NO_RUNNER, $NO_RUNNER)
      --runner-address=STRING    Address of the llama-cpp-rpc-server ($LOCALAI_RUNNER_ADDRESS,
                                 $RUNNER_ADDRESS)
      --runner-port=STRING       Port of the llama-cpp-rpc-server ($LOCALAI_RUNNER_PORT, $RUNNER_PORT)
      --llama-cpp-args=STRING    Extra arguments to pass to llama-cpp-rpc-server
                                 ($LOCALAI_EXTRA_LLAMA_CPP_ARGS, $EXTRA_LLAMA_CPP_ARGS)
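
For example, something along these lines forwards extra flags to the rpc server (the token and flag values here are placeholders, exact quoting/argument splitting may depend on the LocalAI version, and only flags that llama-cpp-rpc-server itself understands will be accepted):

    LOCALAI_EXTRA_LLAMA_CPP_ARGS="--mem 8192" \
      local-ai worker p2p-llama-cpp-rpc --token "<p2p-token>"
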
titogrima commented 2 months ago

Hi!

I tried LOCALAI_EXTRA_LLAMA_CPP_ARGS=--threads 7 (see https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md#number-of-threads), but llama-cpp-rpc-server only supports these arguments:

11:12AM INF Starting llama-cpp-rpc-server on '127.0.0.1:35291'
error: unknown argument: --threads 7
Usage: /tmp/localai/backend_data/backend-assets/util/llama-cpp-rpc-server [options]

options:
  -h, --help            show this help message and exit
  -H HOST, --host HOST  host to bind to (default: 127.0.0.1)
  -p PORT, --port PORT  port to bind to (default: 35291)
  -m MEM, --mem MEM     backend memory size (in MB)

So it fails to start when the threads option is passed.

titogrima commented 2 months ago

Well

Reading the llama.cpp code, the "problem" is in the rpc-server code https://github.com/ggerganov/llama.cpp/blob/master/examples/rpc/rpc-server.cpp: when the CPU backend is initialized (line 87), it calls the ggml_backend_cpu_init() function in the ggml code https://github.com/ggerganov/llama.cpp/blob/20f1789dfb4e535d64ba2f523c64929e7891f428/ggml/src/ggml-backend.c#L869 (line 869), and this function uses the GGML_DEFAULT_N_THREADS constant, which is 4 in the ggml header file https://github.com/ggerganov/llama.cpp/blob/20f1789dfb4e535d64ba2f523c64929e7891f428/ggml/include/ggml.h#L236 (line 236). Maybe I can recompile it with GGML_DEFAULT_N_THREADS changed, or something similar.

Thanks for your help!!

titogrima commented 2 months ago

I recompiled ggml with the GGML_DEFAULT_N_THREADS value in /build/backend/cpp/llama/llama.cpp/ggml/include/ggml.h changed, and it works. Obviously it's not the best solution, but it works....
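
Roughly what I did, in case it helps anyone (the path is the one from the LocalAI build tree mentioned above, the thread count is whatever fits your node, and the rebuild follows the REBUILD=true hint from the startup banner earlier in this issue):

    # bump the ggml default thread count before rebuilding the llama.cpp backend
    sed -i 's/#define GGML_DEFAULT_N_THREADS 4/#define GGML_DEFAULT_N_THREADS 12/' \
        /build/backend/cpp/llama/llama.cpp/ggml/include/ggml.h
    # then rebuild from source, e.g. by starting the container with REBUILD=true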

Regards!