Better Support for AMD and ROCM via docker containers.

jamiemoller commented 5 months ago

Presently it is very hard to get a docker container to build with the rocm backend, some elements seem to fail independently during the build process. There are other related projects with functional docker implementations that do work with rocm out of the box (aka llama.cpp). I would like to work on this myself however between the speed at which things change in this project and the amount of time I have free to work on this, I am left only to ask for this.

If there are good 'stable' methods for building a docker implementation with rocm underneath already it would be very appreciated if this could be better documented. 'arch' helps nobody that wants to run on a more enterprisy os like rhel or sles.

Presently I have defaulted back to using textgen as it has a mostly functional api but its featureset is kinda woeful. (better than running llama.cpp directly imo)

jamiemoller commented 5 months ago

ps. love the work @mudler

jamiemoller commented 5 months ago

it should be noted 1 - the documentation for rocm for some reason indicates make BUILD_TYPE=hipblas GPU_TARGETS=gfx1030 ... there is no build arg 2 - stablediffusion is the hardest thing to get working in any environment ive tested. as in i have yet to actually get it to build on arch, deb, or opensuse 3 - the following dockerfile is the smoothest ive had it build so far

FROM archlinux

# Install deps
# ncnn not required as stablediffusion build is broken
RUN pacman -Syu --noconfirm
RUN pacman -S --noconfirm base-devel git rocm-hip-sdk rocm-opencl-sdk opencv clblast grpc go ffmpeg ncnn

# Configure Lib links
ENV CGO_CFLAGS="-I/usr/include/opencv4" \
    CGO_CXXFLAGS="-I/usr/include/opencv4" \
    CGO_LDFLAGS="-L/opt/rocm/hip/lib -lamdhip64 -L/opt/rocm/lib -lOpenCL -L/usr/lib -lclblast -lrocblas -lhipblas -lrocrand -lomp -O3 --rtlib=compiler-rt -unwindlib=libgcc -lhipblas -lrocblas --hip-link"

# Configure Build settings
ARG BUILD_TYPE="hipblas"
ARG GPU_TARGETS="gfx906" # selected for RadeonVII
ARG GO_TAGS="tts" # stablediffusion is broken

# Build
RUN git clone https://github.com/go-skynet/LocalAI
WORKDIR /LocalAI
RUN make BUILD_TYPE=${BUILD_TYPE} GPU_TARGETS=${GPU_TARGETS} GO_TAGS=${GO_TAGS} build

# Clean up
RUN pacman -Scc --noconfirm

jamiemoller commented 5 months ago

it should be noted that while i do see models load onto the card whenever there is an api call and there are computations being performed pushing the card to 200W of consumption there is never any return from the api call and the apparent inference never terminates

mudler commented 5 months ago

Presently it is very hard to get a docker container to build with the rocm backend, some elements seem to fail independently during the build process. There are other related projects with functional docker implementations that do work with rocm out of the box (aka llama.cpp). I would like to work on this myself however between the speed at which things change in this project and the amount of time I have free to work on this, I am left only to ask for this.

I don't have an AMD card to test, so this card is up-for-grabs.

Things are moving fast, right, but building-wise this is a good time window, there aren't plans to do changes in that code area in the short-term.

If there are good 'stable' methods for building a docker implementation with rocm underneath already it would be very appreciated if this could be better documented. 'arch' helps nobody that wants to run on a more enterprisy os like rhel or sles.

A good starting point would be in this section: https://github.com/mudler/LocalAI/blob/9c2d2649796907006568925d96916437f5845aac/Dockerfile#L159 we can pull RocM dependencies in there if the appropriate flag was passed by

wuxxin commented 5 months ago

@jamiemoller you could use https://github.com/wuxxin/aur-packages/blob/main/localai-git/PKGBUILD as a starting point, its a (feature limited) archlinux package of localai for CPU, CUDA and ROCM. There are binaries available via arch4edu. See https://github.com/mudler/LocalAI/issues/1437

Expro commented 5 months ago

Please do work on that. I'm trying to put any load on AMD GPU for week now. Building from source on Ubuntu for clBlast fails in so many ways it's not even funny.

jamiemoller commented 4 months ago

i have a feeling that it will be better to start from here (or something) for amd builds now that 2.8 is on the ubu22.04

mudler commented 4 months ago

did some progress on https://github.com/mudler/LocalAI/pull/1595 (thanks to @fenfir to have started this up) but I don't have an AMD video card, however CI seems to pass and container images are being built just fine.

I will merge as soon as the v2.8.2 images are out - @jamiemoller @Expro could you give the images a shot as soon as they are on master?

Expro commented 4 months ago

Sure, I will take them for spin. Thanks for working on that.

mudler commented 4 months ago

hipblas images are pushed by now:

quay.io/go-skynet/local-ai:master-hipblas-ffmpeg-core

Expro commented 4 months ago

Unfortunately, not working as intended. GPU was detected, but nothing was offloaded:

4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr ggml_init_cublas: found 1 ROCm devices: 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr Device 0: AMD Radeon (TM) Pro VII, compute capability 9.0, VMM: no 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: loaded meta data with 20 key-value pairs and 325 tensors from /build/models/c0c3c83d0ec33ffe925657a56b06771b (version GGUF V3 (latest)) 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 0: general.architecture str = phi2 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 1: general.name str = Phi2 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 2: phi2.context_length u32 = 2048 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 3: phi2.embedding_length u32 = 2560 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 4: phi2.feed_forward_length u32 = 10240 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 5: phi2.block_count u32 = 32 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 6: phi2.attention.head_count u32 = 32 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 7: phi2.attention.head_count_kv u32 = 32 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 8: phi2.attention.layer_norm_epsilon f32 = 0.000010 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 9: phi2.rope.dimension_count u32 = 32 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 10: general.file_type u32 = 7 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 11: tokenizer.ggml.add_bos_token bool = false 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 12: tokenizer.ggml.model str = gpt2 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,51200] = ["!", "\"", "#", "$", "%", "&", "'", ... 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,51200] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,50000] = ["Ġ t", "Ġ a", "h e", "i n", "r e",... 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 50256 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 50256 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 50256 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - kv 19: general.quantization_version u32 = 2 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - type f32: 195 tensors 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_model_loader: - type q8_0: 130 tensors 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_vocab: mismatch in special tokens definition ( 910/51200 vs 944/51200 ). 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: format = GGUF V3 (latest) 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: arch = phi2 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: vocab type = BPE 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: n_vocab = 51200 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: n_merges = 50000 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: n_ctx_train = 2048 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: n_embd = 2560 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: n_head = 32 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: n_head_kv = 32 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: n_layer = 32 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: n_rot = 32 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: n_embd_head_k = 80 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: n_embd_head_v = 80 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: n_gqa = 1 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: n_embd_k_gqa = 2560 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: n_embd_v_gqa = 2560 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: f_norm_eps = 1.0e-05 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: f_norm_rms_eps = 0.0e+00 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: f_clamp_kqv = 0.0e+00 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: f_max_alibi_bias = 0.0e+00 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: n_ff = 10240 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: n_expert = 0 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: n_expert_used = 0 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: rope scaling = linear 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: freq_base_train = 10000.0 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: freq_scale_train = 1 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: n_yarn_orig_ctx = 2048 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: rope_finetuned = unknown 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: model type = 3B 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: model ftype = Q8_0 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: model params = 2.78 B 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: model size = 2.75 GiB (8.51 BPW) 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: general.name = Phi2 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: BOS token = 50256 '<|endoftext|>' 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: EOS token = 50256 '<|endoftext|>' 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: UNK token = 50256 '<|endoftext|>' 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_print_meta: LF token = 128 'Ä' 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_tensors: ggml ctx size = 0.12 MiB 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_tensors: offloading 0 repeating layers to GPU 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_tensors: offloaded 0/33 layers to GPU 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llm_load_tensors: ROCm_Host buffer size = 2819.28 MiB 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr ............................................................................................. 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_new_context_with_model: n_ctx = 512 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_new_context_with_model: freq_base = 10000.0 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_new_context_with_model: freq_scale = 1 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_kv_cache_init: ROCm_Host KV buffer size = 160.00 MiB 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_new_context_with_model: KV self size = 160.00 MiB, K (f16): 80.00 MiB, V (f16): 80.00 MiB 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_new_context_with_model: ROCm_Host input buffer size = 6.01 MiB 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_new_context_with_model: ROCm_Host compute buffer size = 115.50 MiB 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr llama_new_context_with_model: graph splits (measure): 1 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr Available slots: 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr -> Slot 0 - max context: 512 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr all slots are idle and system prompt is empty, clear the KV cache 4:14PM INF [llama-cpp] Loads OK 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr slot 0 is processing [task id: 0] 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr slot 0 : kv cache rm - [0, end) 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr CUDA error: shared object initialization failed 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr current device: 0, in function ggml_cuda_op_mul_mat at /build/backend/cpp/llama/llama.cpp/ggml-cuda.cu:9462 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr hipGetLastError() 4:14PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:41425): stderr GGML_ASSERT: /build/backend/cpp/llama/llama.cpp/ggml-cuda.cu:241: !"CUDA error"

Tested with integrated phi-2 model with gpu_layers specified:

` name: phi-2 context_size: 2048 f16: true gpu_layers: 90 mmap: true trimsuffix:

"\n" parameters: model: huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf temperature: 0.2 top_k: 40 top_p: 0.95 seed: -1 template: chat: &template | Instruct: {{.Input}} Output: completion: *template

usage: | To use this model, interact with the API (in another terminal) with curl for instance: curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "phi-2", "messages": [{"role": "user", "content": "How are you doing?", "temperature": 0.1}] }' `

jtwolfe commented 4 months ago

the rocm docker image does appear to load the model however there is a grpc error that I have encountered that causes the call to terminate before inference, i am moving to 22.04 with rocm 6.0.0 on the host make sure there are no version compatibility issues.

Note: the new vulkan implementation of llama.cpp seems to work flawlessly

derzahla commented 2 months ago

Im trying to work on the hipblas version but I am confused on where the Dockerfiles are located that are used to generate the latest images such as "quay.io/go-skynet/local-ai:master-hipblas" . One thing I noticed is that the latest hipblas images are still using rocm v6.0.0 while v6.0.3 is now out. But I have been unable to locate a Dockerfile in the git repo that is installing any version of rocm. So it would appear the Dockerfle being used is hosted elsewhere?

Would appreciate if someone could point me to the latest Dockerfile being used to generate the hipblas images. Thank you

jtwolfe commented 2 months ago

Im trying to work on the hipblas version but I am confused on where the Dockerfiles are located that are used to generate the latest images such as "quay.io/go-skynet/local-ai:master-hipblas" . One thing I noticed is that the latest hipblas images are still using rocm v6.0.0 while v6.0.3 is now out. But I have been unable to locate a Dockerfile in the git repo that is installing any version of rocm. So it would appear the Dockerfle being used is hosted elsewhere?

Would appreciate if someone could point me to the latest Dockerfile being used to generate the hipblas images. Thank you

newer does not equal better, this said, x.x.Y versions of Y variation are usually hotfixes and usually only apply to some very specific edge cases, can you clarify any issues you may have with 6.0.0 that are resolved with 6.0.3?

jtwolfe commented 2 months ago

the rocm docker image does appear to load the model however there is a grpc error that I have encountered that causes the call to terminate before inference, i am moving to 22.04 with rocm 6.0.0 on the host make sure there are no version compatibility issues.

Note: the new vulkan implementation of llama.cpp seems to work flawlessly

I think I just discovered the cause of my issue... I am running my Radeon VII for this workload this would be a gfx906 device presently i find only GPU_TARGETS ?= gfx900,gfx90a,gfx1030,gfx1031,gfx1100 in the makefile regarding this gfx900 is not supported for rocm v5.>> or v6.0.0

I have yet to test if a tailored build including gfx906 will work but this may be a good candidate for inclusion in the next hipblas build details

for reference currently under 6.0.0 the following llbm targets are supported gfx942,gfx90a,gfx908,gfx906,gfx1100,gfx1030 I would not for clarity that the gfx906 target is depreciated for the instinct MI50 but not for the radeon pro vii or the radeon vii, add to this that the instinct MI25 is the only gfx900 card and is noted as no longer supported, while I do think we should keep gfx900 in place for as long as possible it may impact future builds

I may not have time to test an amendment to the GPU_TARGETS for the next few weeks (I only have like 2 hrs free today and after building my gpu into a single node k8s cluster I need to configure a local container registry before I can test any custom builds :( )

@fenfir might you be able to test this?

jtwolfe commented 2 months ago

the rocm docker image does appear to load the model however there is a grpc error that I have encountered that causes the call to terminate before inference, i am moving to 22.04 with rocm 6.0.0 on the host make sure there are no version compatibility issues. Note: the new vulkan implementation of llama.cpp seems to work flawlessly

I think I just discovered the cause of my issue... I am running my Radeon VII for this workload this would be a gfx906 device presently i find only GPU_TARGETS ?= gfx900,gfx90a,gfx1030,gfx1031,gfx1100 in the makefile regarding this gfx900 is not supported for rocm v5.>> or v6.0.0

I have yet to test if a tailored build including gfx906 will work but this may be a good candidate for inclusion in the next hipblas build details

for reference currently under 6.0.0 the following llbm targets are supported gfx942,gfx90a,gfx908,gfx906,gfx1100,gfx1030 I would not for clarity that the gfx906 target is depreciated for the instinct MI50 but not for the radeon pro vii or the radeon vii, add to this that the instinct MI25 is the only gfx900 card and is noted as no longer supported, while I do think we should keep gfx900 in place for as long as possible it may impact future builds

I may not have time to test an amendment to the GPU_TARGETS for the next few weeks (I only have like 2 hrs free today and after building my gpu into a single node k8s cluster I need to configure a local container registry before I can test any custom builds :( )

@fenfir might you be able to test this?

ok so fyi current master-hipblas-ffmpeg-core image with GPU_TARGETS=gfx906 does not build

[  0%] Building C object CMakeFiles/ggml.dir/ggml.c.o
[  1%] Building C object CMakeFiles/ggml.dir/ggml-alloc.c.o
[  1%] Building C object CMakeFiles/ggml.dir/ggml-backend.c.o
[  2%] Building C object CMakeFiles/ggml.dir/ggml-quants.c.o
[  2%] Building CXX object CMakeFiles/ggml.dir/ggml-cuda/acc.cu.o
clang++: error: invalid target ID 'gfx903'; format is a processor name followed by an optional colon-delimited list of features followed by an enable/disable sign (e.g., 'gfx908:sramecc+:xnack-')
gmake[4]: *** [CMakeFiles/ggml.dir/build.make:132: CMakeFiles/ggml.dir/ggml-cuda/acc.cu.o] Error 1
2024-04-07T15:31:29.842216496+10:00 gmake[4]: Leaving directory '/build/backend/cpp/llama/llama.cpp/build'
gmake[3]: *** [CMakeFiles/Makefile2:842: CMakeFiles/ggml.dir/all] Error 2
gmake[3]: Leaving directory '/build/backend/cpp/llama/llama.cpp/build'
2024-04-07T15:31:29.842808442+10:00 gmake[2]: *** [Makefile:146: all] Error 2
2024-04-07T15:31:29.842836792+10:00 gmake[2]: Leaving directory '/build/backend/cpp/llama/llama.cpp/build'
make[1]: *** [Makefile:75: grpc-server] Error 2
make[1]: Leaving directory '/build/backend/cpp/llama'
make: *** [Makefile:517: backend/cpp/llama/grpc-server] Error 2

EDIT: 'waaaaaaiiiiit a second' I think im retarded... EDIT2: yep im definately retarded, setting the environment var GPU_TARGETS=gfx906 worked fine, not i just need to get my model and context right <3 @mudler @fenfir <3 can we pls get gfx906 added to the default targets pls

jtwolfe commented 2 months ago

@Expro take a look at my previous posts, maybe they will help you solve this, ping me if you like, maybe I can help

jtwolfe commented 2 months ago

@mudler before i spend the time, are there any immediate plans for expanded k8s docs or AMD specific docs?

mudler commented 2 months ago

@mudler before i spend the time, are there any immediate plans for expanded k8s docs or AMD specific docs?

Hey @jtwolfe , thanks for deep diving into this, I don't have an AMD card to test things out so I refrained to write documentation that I couldn't test with. Any help on that area is greatly appreciated.

jtwolfe commented 2 months ago

@mudler before i spend the time, are there any immediate plans for expanded k8s docs or AMD specific docs?

Hey @jtwolfe , thanks for deep diving into this, I don't have an AMD card to test things out so I refrained to write documentation that I couldn't test with. Any help on that area is greatly appreciated.

ack. I'll do my best to try and get some of AMD brethren to test some more edge cases so we can give some more details on modern cards but I will send a PR up for docs when I get time.

jamiemoller commented 2 months ago

Im trying to work on the hipblas version but I am confused on where the Dockerfiles are located that are used to generate the latest images such as "quay.io/go-skynet/local-ai:master-hipblas" . One thing I noticed is that the latest hipblas images are still using rocm v6.0.0 while v6.0.3 is now out. But I have been unable to locate a Dockerfile in the git repo that is installing any version of rocm. So it would appear the Dockerfle being used is hosted elsewhere? Would appreciate if someone could point me to the latest Dockerfile being used to generate the hipblas images. Thank you

newer does not equal better, this said, x.x.Y versions of Y variation are usually hotfixes and usually only apply to some very specific edge cases, can you clarify any issues you may have with 6.0.0 that are resolved with 6.0.3?

i hope you're using containers \winkyface

it appears that the AMD advice regarding 'dowards compatability" is correct ie. I am currently running 6.0.2 on my server and while the container works on 6.0.0 and have yet to have any issues

if you wish to keep your server driver up to date as long as the major version is the same between the host and the container and the host minor version is greater than the containers then you should not have any problems

really there should not be an issue in either direction with minor version updates however there is the potential for baser operations to be invalidated accidentally via implementation in whatever program that makes the calls. this said I would still recommend keeping to the AMD standard

i would recommend for comparability sake that we keep the container rocm version at 6.0.0 until such time that there is a breaking change that stops this backwards compatibility

jamiemoller commented 2 months ago

Im trying to work on the hipblas version but I am confused on where the Dockerfiles are located that are used to generate the latest images such as "quay.io/go-skynet/local-ai:master-hipblas" . One thing I noticed is that the latest hipblas images are still using rocm v6.0.0 while v6.0.3 is now out. But I have been unable to locate a Dockerfile in the git repo that is installing any version of rocm. So it would appear the Dockerfle being used is hosted elsewhere?

Would appreciate if someone could point me to the latest Dockerfile being used to generate the hipblas images. Thank you

@derzahla i would not recommend building it from scratch. grab the hipblas image and pass it the REBUILD=true var also if you have issues after the rebuild check the llvm target for your card and pass in GPU_TARGETS=gfx$WHATEVER

find the llvm target for your gpu https://llvm.org/docs/AMDGPUUsage.html#processors then check the comparability with rocm https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.0.0/reference/system-requirements.html

this should work, im lucky to have a card thats directly referenced on the rocm supported gpu list but I expect that any chip associated with the llvm target should work (ie gfx1030 includes rx6800, rx6800xt and rx6900xt but according to amd If a GPU is not listed on this table, It’s not officially supported by AMD.)

derzahla commented 2 months ago

newer does not equal better, this said, x.x.Y versions of Y variation are usually hotfixes and usually only apply to some very specific edge cases, can you clarify any issues you may have with 6.0.0 that are resolved with 6.0.3?

There still does not seem to be any release notes out for 6.0.3, but since I have a gfx1103 which isn't officially supported up through 6.0.2, I was hoping maybe it was added in 6.0.3.

However, I have had success with ollama by setting "HSA_OVERRIDE_GFX_VERSION=11.0.2" ( on rocm 6.0.2 & 6.0.3, at least)

I initially tried setting REBUILD=true and it didn't help. That's why I was trying to find the actual Dockerfile used to generate the hipblas registry containers. I can try running with REBUILD=true again and post details of the results

jamiemoller commented 2 months ago

newer does not equal better, this said, x.x.Y versions of Y variation are usually hotfixes and usually only apply to some very specific edge cases, can you clarify any issues you may have with 6.0.0 that are resolved with 6.0.3?

There still does not seem to be any release notes out for 6.0.3, but since I have a gfx1103 which isn't officially supported up through 6.0.2, I was hoping maybe it was added in 6.0.3.

However, I have had success with ollama by setting "HSA_OVERRIDE_GFX_VERSION=11.0.2" ( on rocm 6.0.2 & 6.0.3, at least)

I initially tried setting REBUILD=true and it didn't help. That's why I was trying to find the actual Dockerfile used to generate the hipblas registry containers. I can try running with REBUILD=true again and post details of the results

hmmmm

https://www.reddit.com/r/ROCm/comments/1b36sjj/support_for_gfx1103/ there is a note here indicating that maybe if compiled for gfx1100 there may be a path but from what i see the gfx1103 is an integrated graphics solution/mGPU (is that the case for you?).

if it is im inclined to think that this may be a harder problem than you'd like. as I understand it there are architectural changes regarding memory management for AMD APU that may proclude if from being easily compilable with rocm.

have you had a look at vllm with rocm? https://docs.vllm.ai/en/latest/getting_started/amd-installation.html you may have some success with a single inference tool? (beware I have had it eat >70GB of memory during the docker build for the rocm supporting image)

personally i would love to see a implementation of localai with vulkan however this is all dependent on upstream project support. and for this i expect that there may be a considerable amount of 'hackery' and 'overhead' related losses that may make this a considerable time sink for developers :(

PS. if this is a mobile gpu I would ask what the cost/benefit for this looks like? while it would be good for people without access to performant machines I expect a better solution would be to find an eGPU chassis on ebay and fill it with a cheap rx6600/rx7600 or the like.

PPS. I have used LMStudio on my Legion GO with its Z1 and while it did work 'sometimes' (memory allocation I think) i did not get any better performance than doing straight CPU inference on of my 7950X systems (~12+-5 tokens/s)

derzahla commented 2 months ago

@jamiemoller Interestingly the LLM function seems to work if I recompile for gfx1100 as you mentioned and change HSA_OVERRIDE_GFX_VERSION to 11.0.0. I wonder if gfx1102 and HSA_OVERRIDE_GFX_VERSION=11.0.2" would work with an upgraded rocm to >= 6.0.2.

Yes my gfx1103 is an iGPU but its not mobile. I have an Radeon 8600G in an ATX case, so I can upgrade to a more powerful GPU easy enough but I wanted to push the limits of this iGPU first and see if it would be sufficient.

I have not tried vLLM but thanks for making me aware of it. ollama works very nicely for LLM functionality. One of things I was looking forward to with localai is AI Art integrations with stable diffusion and tinydream. Stablediffusion still pukes on the rebuilt container with:

7:46PM DBG GRPC(stablediffusion_assets-127.0.0.1:35289): stderr /tmp/localai/backend_data/backend-assets/grpc/stablediffusion: error while loading shared libraries: libomp.so: cannot open shared object file: No such file or directory

So again, it would be nice if someone could point me to the Dockerfile's used to build the hipblas images so I could modify them for my needs.

jtwolfe commented 2 months ago

@derzahla

last question first:

im pretty sure all of the image configurations are set in the github actions workflows in the repo im more of a gitlab ci guy myself but it looks pretty simple, just check out https://github.com/mudler/LocalAI/tree/master/.github/workflows if you take a look at image.yml image_build.yml image-pr.yml and release.yaml you will find the all the details regarding overrides for the build process

good to know that there is a workaround for "hotfix'ed" target versions, strange tho that youre having the sd issue I'm currently looking into imagegen myself but havent had any luck so far from memory most image gen implementations use rocm 5.x and use a custom version of some python library i cant remember the name of that emulates cuda enablement (pytorch)

im working my way though the feature list now to test for docs

so far ive tested working textgen (gpu) tts (gpu) \ i think piper hit lik 5% of my gpu for about 2.5s to generate the first 20% of the turbo encabulator talk sst (cpu) / whisper is fast on anything vision (gpu)

embeddings - was doing something funny because transformers diffusion - \shrug - still investigating

edit; for some reason diffusers-rocm.yml does not note the --extra-index-url as per pytorch docs https://pytorch.org/get-started/locally/ unsure if this has any impact as /rocm6.0/* forwards to /* in the same index url

edit2: i have found and replicated your limomp.so issue I'm having a hard time whats calling it tho also no the easy 'just install the library' solution doesn't seem to work atm i think there another dependency somewhere thats expecting it as a prereq

2024-04-13T15:55:25.964632811+10:00 5:55AM DBG GRPC(stablediffusion_assets-127.0.0.1:41555): stderr /tmp/localai/backend_data/backend-assets/grpc/stablediffusion: error while loading shared libraries: libomp.so: cannot open shared object file: No such file or directory

edit3; so it appears that the libomp.so library issue only occurs on SD in cpu mode (ai the aio/cpu/image-gen.yaml) on using the aio/gpu-8g/image-gen.yaml there appears another error which results in a connection error from grpc

7:13AM DBG GRPC(DreamShaper_8_pruned.safetensors-127.0.0.1:40605): stderr /build/backend/python/diffusers/run.sh: line 13: activate: No such file or directory
7:13AM DBG GRPC(DreamShaper_8_pruned.safetensors-127.0.0.1:40605): stderr /build/backend/python/diffusers/run.sh: line 19: python: command not found

this specifically refers to

ln 13 source activate diffusers

ln 19 python $DIR/backend_diffusers.py $@

i have found that /opt/conda is straight up not availible bingo so it looks like theres some python stuff missing then, so what next? I have switched to the non-core image since if memory serves core removes some python related things to slim down the image.

now my problem is downloading a 20gb image at 'quay speed'

edit4: ... yep

non-core image fixes gpu requirements for diffusion, cpu didnt work for some reason still (i expect its the model)
some other minor build weirdness has disappeared too so i expect things should be smooth-ish for most of the python implementations
still haven't tested embeddings
can we re-add the python elements post hoc? rather than having a huge image?

image so large i have to move my models to another disk :|

jtwolfe commented 2 months ago

@derzahla I think there might be some reduced featureset for igpu (my bets on memory adjacent) that is a bit of a sticking point atm in drivers. news was that rocm5.7? was dropping support of a bunch of cards soon so im really not sure how much compatibility were going to get with older chips with non-"ai-specific-architecture".

cheeky soln - if you can make vgpu work on your host system just find some good tools and run the independently, automatic1111 and oobagooba come to mind ;) split the api with a proxy

jtwolfe commented 2 months ago

@derzahla i apologize but I was incorrect there is still an issue with SD i see in my testing it appears that as I was testing the aio models i did not realise that the cpu and gpu examples actually use different backends. the functional gpu model makes use of diffusers meanwhile the cpu mpdel makes use of stablediffusion, presently I trust the diffusers backend more than the stablediffusion one as it seems that sd is just a prebuilt repo and executes it entirely separately to the diffusers backend, as such the bug is probably in the upstream repo from @EdVince

I am inclined to ask @mudler if he is aware of any reasons why this may not be working? (also if youre listening @mudler i seem to recall my testing around v2.0 on cpu would jettison unused models if there was not enough memory, then complete loading the model, this is not working for gpu atm :| any ideas? like @derzahla noted about SD could it possibly be the rebuild)

But either way the gpu accelerated model using the diffusers backend seems to be working without issue

its also worth noting that the intel solution has a different configuration again so im unsure if that will work either

I swear my headstone will read still testing

bunder2015 commented 2 months ago

Hi, I have a Radeon VII and was able to get it working on localai. I did have to make some tweaks to get it to build and use gfx906 however...

# docker-compose.yaml
    image: quay.io/go-skynet/local-ai:v2.12.4-aio-gpu-hipblas
    environment:
      - DEBUG=true
      - REBUILD=true
      - BUILD_TYPE=hipblas
      - GPU_TARGETS=gfx906
    devices:
      - /dev/dri
      - /dev/kfd

Cheers

jamiemoller commented 2 months ago

@bunder2015 when you say 'it' do you mean the container or the 'stablediffusion' backend. also would you mind listing any of the aio defined models and if they offload to gpu? any details you can confirm with testing would be appreciated.

also also; i have had issues with using the 'cloned-voice' backend. it is currently giving me an error due to a missing opencl library this is in the same fashion s the missing as the libopm.so issue for sd.

any detail would be appreciated

also fyi i am using the GO_TAGS="stablediffusion tinydream tts" and DEBUG="true" for my rebuild of the 'non-core' and 'non-aio' 'latest' master image

bunder2015 commented 2 months ago

Sorry for the confusion, I meant that I was able to get the localai container with gpu offloading to work.

I tried the following models: bakllava gpt-4 hermes-2-pro-mistral llava-1.6-mistral mixtral-instruct phi-2 stablediffusion tts (gpt-4 seems to also be an alias to hermes-2-pro-mistral)

To my knowledge, they all offloaded to the gpu. I had issues getting them to offload at first (some error about not being able to find tensiles?)... but I tried ollama's docker container and noticed it had the same devices setup, and offload was working there... so I tried it here and offload started working here as well.

It appears I also set GO_TAGS="stablediffusion tts" in .env... I think I had issues adding tinydream there, although Dockerfile has all three set. :shrug:

I hope that helps some, let me know if you need more... Cheers

edit: I tried bark tts and unfortunately it's not offloaded... piper seems to be, but it doesn't support japanese unfortunately.

jtwolfe commented 2 months ago

@bunder2015 thanks for the details

I've been adding the f16 and ngpu flags to things to test for 'easy' gpu use and its been kinda hit and miss. eg. vall-e-x for some reason will recognize the ngpu flag but not the fp16 flag and when i try to use the clone process i get some rhylean audio and no gpu offloading, /shrug

perhaps since its a python tool it needs a cuda flag too? maybe??

i do seem to be able to at least run the clone tool now thanks to the change to the versioned image rather than master (like you im using v2.12.4 (still not using the aio image tho) ) still no luck tho with the video gen however, everything loads to the gpu correctly however it still wants an opencv python package

7:04AM DBG GRPC(damo-vilab/text-to-video-ms-1.7b-127.0.0.1:41305): stderr export_to_video requires the OpenCV library but it was not found in your environment. You can install it with pip: `pip

for some reason also ive had a weird issue with the --output flag for curl where for some reason the generated audio files aren't being exported back full and i just get a 504 error output to a .wav file (knowing my luck ill figure that one out in another day)

regarding cloning i have a feeling that it will be easier to just train my own .onyx for piper but we'll see

bunder2015 commented 1 month ago

Hi, ignore my (previously deleted) message about not being able to build 2.16.0, I was able to get it to build by removing GO_TAGS from my docker-compose file. But I'm still having issues, now with diffusers/dreamshaper not working, it says I don't have the nvidia drivers loaded... log file

So I added back GO_TAGS, but removed the quotes, because I started seeing stuff like GO_TAGS=""stablediffusion tinydream tts"", which if I'm not mistaken, would evaluate to blank ""... and I'm seeing this kind of thing in all sorts of places during the build phase (sometimes even with unquoted variables in the docker-compose file). eg: -DAMDGPU_TARGETS=""gfx906"" Any time a quoted variable gets quoted itself, it could just be unsetting itself and leaving cruft in the command line.

I don't think half of this stuff is building right because of it, and I don't know how much I should be taking out of the Makefile because I can't see some of the build phase. I can see slivers of build output in docker compose, up until it stops showing the build output in the api-builder and api-stage6 phases.

NGL I'm not a docker expert, but it would be really helpful if I could see the entire output. @mudler sorry for the ping, but do you have any ideas? Thanks

edit: While I'm asking about Makefile stuff, I don't think a lot of the build phase honours BUILD_PARALLELISM.

edit: I'm gonna start pulling all the images between 2.15.0 and 2.16.0 and see what broke it.

bunder2015 commented 1 month ago

Okay, I think I narrowed it down... sha-e676809-hipblas-ffmpeg works, sha-cf513ef-hipblas-ffmpeg doesn't. So that give us:

cf513ef Update openai-functions.md
9e8b344 Update openai-functions.md
88d0aa1 docs: update function docs
9b09eb0 build: do not specify a BUILD_ID by default (#2284)
4db41b7 models(gallery): add aloe (#2283)
28a421c (origin/build_tag) feat: migrate python backends from conda to uv (#2215)

My bets are on the conda to uv change, but I'm not sure how to debug or fix it at the moment.

bunder2015 commented 1 month ago

I've been trying to add verbose flags to various Makefiles, but it looks like docker compose is ignoring everything that isn't in .env, Dockerfile or docker-compose.yaml... I can't seem to get it to stop adding -s to make. :shrug: Screenshot_20240526_095510

cryptk commented 1 month ago

@bunder2015 can you please give me some exact replication instructions? What exactly can I do to replicate your issue? If this happens when you run a docker-compose, can you please provide the contents of that docker-compose file?

bunder2015 commented 1 month ago

Hi, thanks for the reply... I think this should be sufficient to replicate...

git clone https://github.com/mudler/localai
cd localai
git checkout -b v2.16.0 e0187c2a1a4cde837398ada217d0ad161b7976d6

version: '3.6'

services:
  api:
    # See https://localai.io/basics/getting_started/#container-images for
    # a list of available container images (or build your own with the provided Dockerfile)
    # Available images with CUDA, ROCm, SYCL
    # Image list (quay.io): https://quay.io/repository/go-skynet/local-ai?tab=tags
    # Image list (dockerhub): https://hub.docker.com/r/localai/localai
    image: quay.io/go-skynet/local-ai:v2.16.0-hipblas-ffmpeg
    build:
      context: .
      dockerfile: Dockerfile
      args:
      - IMAGE_TYPE=extras
      - BASE_IMAGE=ubuntu:22.04
    ports:
      - 8080:8080
    env_file:
      - .env
    environment:
      - MODELS_PATH=/models
      - DEBUG=true
      - REBUILD=true
      - BUILD_TYPE=hipblas
      - GPU_TARGETS=gfx906
      - GO_TAGS=stablediffusion tinydream tts
      - BUILD_PARALLELISM=16
      - LOCALAI_THREADS=16
      - LOCALAI_UPLOAD_LIMIT=500
    devices:
      - /dev/dri
      - /dev/kfd
    volumes:
      - ./models:/models:cached
      - ./images/:/tmp/generated/images/
    #command:
    # Here we can specify a list of models to run (see quickstart https://localai.io/basics/getting_started/#running-models )
    # or an URL pointing to a YAML configuration file, for example:
    # - https://gist.githubusercontent.com/mudler/ad601a0488b497b69ec549150d9edd18/raw/a8a8869ef1bb7e3830bf5c0bae29a0cce991ff8d/phi-2.yaml
    #- phi-2

docker compose up

Wait for the container to build, install dreamshaper, and try to use it to generate an image.

I'm using a Radeon VII but if I understand the commit diff correctly, any AMD GPU should suffice with the right GPU_TARGETS value.

If you want me to do some further testing, please let me know. Cheers

bunder2015 commented 1 month ago

@cryptk @mudler I gave sha-ba984c7-hipblas-ffmpeg a try but unfortunately I'm still getting the 'no nvidia driver' error still. Let me know when you want me to try again. :pray: Cheers

Hideman85 commented 1 month ago

The official docker AIO hipblas image isnt working as well, seem to completely fail with the grpc server and loop through all backends...

Logs here (tried with pre-built gpt4)

``` docker run -p 8080:8080 --rm -v ./Documents/AIModels/:/build/models -ti localai/localai:latest-aio-gpu-hipblas ─╯ ===> LocalAI All-in-One (AIO) container starting... NVIDIA GPU detected /aio/entrypoint.sh: line 52: nvidia-smi: command not found NVIDIA GPU detected, but nvidia-smi is not installed. GPU acceleration will not be available. AMD GPU detected Non-NVIDIA GPU detected. Specific GPU memory size detection is not implemented. [...] 10:29AM INF core/startup process completed! 10:29AM INF LocalAI API is listening! Please connect to the endpoint for API documentation. endpoint=http://0.0.0.0:8080 10:30AM INF Success ip=127.0.0.1 latency="46.398µs" method=GET status=200 url=/readyz 10:31AM INF Success ip=127.0.0.1 latency="36.361µs" method=GET status=200 url=/readyz 10:32AM INF Success ip=127.0.0.1 latency="55.766µs" method=GET status=200 url=/readyz 10:32AM INF Success ip=172.17.0.1 latency="409.124µs" method=GET status=200 url=/chat/gpt-4 10:32AM INF Success ip=172.17.0.1 latency="30.469µs" method=GET status=200 url=/static/assets/highlightjs.css 10:32AM INF Success ip=172.17.0.1 latency="70.635µs" method=GET status=200 url=/static/assets/highlightjs.js 10:32AM INF Success ip=172.17.0.1 latency="34.005µs" method=GET status=200 url=/static/general.css 10:32AM INF Success ip=172.17.0.1 latency="21.05µs" method=GET status=200 url=/static/assets/marked.js 10:32AM INF Success ip=172.17.0.1 latency="46.679µs" method=GET status=200 url=/static/assets/alpine.js 10:32AM INF Success ip=172.17.0.1 latency="18.545µs" method=GET status=200 url=/static/assets/purify.js 10:32AM INF Success ip=172.17.0.1 latency="16.452µs" method=GET status=200 url=/static/assets/font2.css 10:32AM INF Success ip=172.17.0.1 latency="89.662µs" method=GET status=200 url=/static/assets/font1.css 10:32AM INF Success ip=172.17.0.1 latency="8.416µs" method=GET status=200 url=/static/assets/tw-elements.css 10:32AM INF Success ip=172.17.0.1 latency="5.751µs" method=GET status=200 url=/static/assets/tailwindcss.js 10:32AM INF Success ip=172.17.0.1 latency="7.324µs" method=GET status=200 url=/static/assets/fontawesome/css/fontawesome.css 10:32AM INF Success ip=172.17.0.1 latency="4.709µs" method=GET status=200 url=/static/chat.js 10:32AM INF Success ip=172.17.0.1 latency="18.245µs" method=GET status=200 url=/static/assets/fontawesome/css/solid.css 10:32AM INF Success ip=172.17.0.1 latency="17.534µs" method=GET status=200 url=/static/assets/htmx.js 10:32AM INF Success ip=172.17.0.1 latency="96.185µs" method=GET status=200 url=/static/assets/fontawesome/css/brands.css 10:32AM INF Success ip=172.17.0.1 latency=1.129971ms method=POST status=200 url=/v1/chat/completions 10:32AM INF Trying to load the model 'b5869d55688a529c3738cb044e92c331' with the backend '[llama-cpp llama-ggml gpt4all llama-cpp-fallback rwkv whisper piper stablediffusion huggingface bert-embeddings /build/backend/python/diffusers/run.sh /build/backend/python/bark/run.sh /build/backend/python/vall-e-x/run.sh /build/backend/python/vllm/run.sh /build/backend/python/exllama/run.sh /build/backend/python/coqui/run.sh /build/backend/python/openvoice/run.sh /build/backend/python/mamba/run.sh /build/backend/python/sentencetransformers/run.sh /build/backend/python/transformers/run.sh /build/backend/python/sentencetransformers/run.sh /build/backend/python/exllama2/run.sh /build/backend/python/petals/run.sh /build/backend/python/autogptq/run.sh /build/backend/python/transformers-musicgen/run.sh /build/backend/python/parler-tts/run.sh /build/backend/python/rerankers/run.sh]' 10:32AM INF [llama-cpp] Attempting to load 10:32AM INF Loading model 'b5869d55688a529c3738cb044e92c331' with backend llama-cpp 10:32AM INF GPU device found but no CUDA backend present 10:32AM INF [llama-cpp] attempting to load with AVX2 variant 10:32AM INF [llama-cpp] Fails: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF 10:32AM INF [llama-ggml] Attempting to load 10:32AM INF Loading model 'b5869d55688a529c3738cb044e92c331' with backend llama-ggml 10:32AM INF [llama-ggml] Fails: could not load model: rpc error: code = Unknown desc = failed loading model 10:32AM INF [gpt4all] Attempting to load 10:32AM INF Loading model 'b5869d55688a529c3738cb044e92c331' with backend gpt4all 10:32AM INF [gpt4all] Fails: could not load model: rpc error: code = Unknown desc = failed loading model 10:32AM INF [llama-cpp-fallback] Attempting to load 10:32AM INF Loading model 'b5869d55688a529c3738cb044e92c331' with backend llama-cpp-fallback 10:32AM INF [llama-cpp-fallback] Fails: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF 10:32AM INF [rwkv] Attempting to load 10:32AM INF Loading model 'b5869d55688a529c3738cb044e92c331' with backend rwkv 10:32AM INF [rwkv] Fails: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF 10:32AM INF [whisper] Attempting to load 10:32AM INF Loading model 'b5869d55688a529c3738cb044e92c331' with backend whisper 10:32AM INF [whisper] Fails: could not load model: rpc error: code = Unknown desc = unable to load model 10:32AM INF [piper] Attempting to load 10:32AM INF Loading model 'b5869d55688a529c3738cb044e92c331' with backend piper 10:32AM INF [piper] Fails: could not load model: rpc error: code = Unknown desc = unsupported model type /build/models/b5869d55688a529c3738cb044e92c331 (should end with .onnx) 10:32AM INF [stablediffusion] Attempting to load 10:32AM INF Loading model 'b5869d55688a529c3738cb044e92c331' with backend stablediffusion 10:33AM INF Success ip=127.0.0.1 latency="42.134µs" method=GET status=200 url=/readyz 10:33AM ERR failed starting/connecting to the gRPC service error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:36157: connect: connection refused\"" 10:33AM INF [stablediffusion] Fails: grpc service not ready 10:33AM INF [huggingface] Attempting to load 10:33AM INF Loading model 'b5869d55688a529c3738cb044e92c331' with backend huggingface 10:33AM INF [huggingface] Fails: could not load model: rpc error: code = Unknown desc = no huggingface token provided 10:33AM INF [bert-embeddings] Attempting to load 10:33AM INF Loading model 'b5869d55688a529c3738cb044e92c331' with backend bert-embeddings 10:33AM INF [bert-embeddings] Fails: could not load model: rpc error: code = Unknown desc = failed loading model 10:33AM INF [/build/backend/python/diffusers/run.sh] Attempting to load 10:33AM INF Loading model 'b5869d55688a529c3738cb044e92c331' with backend /build/backend/python/diffusers/run.sh 10:33AM INF [/build/backend/python/diffusers/run.sh] Fails: grpc process not found: /tmp/localai/backend_data/backend-assets/grpc/build/backend/python/diffusers/run.sh. some backends(stablediffusion, tts) require LocalAI compiled with GO_TAGS 10:33AM INF [/build/backend/python/bark/run.sh] Attempting to load 10:33AM INF Loading model 'b5869d55688a529c3738cb044e92c331' with backend /build/backend/python/bark/run.sh 10:33AM INF [/build/backend/python/bark/run.sh] Fails: grpc process not found: /tmp/localai/backend_data/backend-assets/grpc/build/backend/python/bark/run.sh. some backends(stablediffusion, tts) require LocalAI compiled with GO_TAGS 10:33AM INF [/build/backend/python/vall-e-x/run.sh] Attempting to load 10:33AM INF Loading model 'b5869d55688a529c3738cb044e92c331' with backend /build/backend/python/vall-e-x/run.sh 10:33AM INF [/build/backend/python/vall-e-x/run.sh] Fails: grpc process not found: /tmp/localai/backend_data/backend-assets/grpc/build/backend/python/vall-e-x/run.sh. some backends(stablediffusion, tts) require LocalAI compiled with GO_TAGS 10:33AM INF [/build/backend/python/vllm/run.sh] Attempting to load 10:33AM INF Loading model 'b5869d55688a529c3738cb044e92c331' with backend /build/backend/python/vllm/run.sh 10:33AM INF [/build/backend/python/vllm/run.sh] Fails: grpc process not found: /tmp/localai/backend_data/backend-assets/grpc/build/backend/python/vllm/run.sh. some backends(stablediffusion, tts) require LocalAI compiled with GO_TAGS 10:33AM INF [/build/backend/python/exllama/run.sh] Attempting to load 10:33AM INF Loading model 'b5869d55688a529c3738cb044e92c331' with backend /build/backend/python/exllama/run.sh 10:33AM INF [/build/backend/python/exllama/run.sh] Fails: grpc process not found: /tmp/localai/backend_data/backend-assets/grpc/build/backend/python/exllama/run.sh. some backends(stablediffusion, tts) require LocalAI compiled with GO_TAGS 10:33AM INF [/build/backend/python/coqui/run.sh] Attempting to load 10:33AM INF Loading model 'b5869d55688a529c3738cb044e92c331' with backend /build/backend/python/coqui/run.sh 10:33AM INF [/build/backend/python/coqui/run.sh] Fails: grpc process not found: /tmp/localai/backend_data/backend-assets/grpc/build/backend/python/coqui/run.sh. some backends(stablediffusion, tts) require LocalAI compiled with GO_TAGS 10:33AM INF [/build/backend/python/openvoice/run.sh] Attempting to load 10:33AM INF Loading model 'b5869d55688a529c3738cb044e92c331' with backend /build/backend/python/openvoice/run.sh 10:33AM INF [/build/backend/python/openvoice/run.sh] Fails: grpc process not found: /tmp/localai/backend_data/backend-assets/grpc/build/backend/python/openvoice/run.sh. some backends(stablediffusion, tts) require LocalAI compiled with GO_TAGS 10:33AM INF [/build/backend/python/mamba/run.sh] Attempting to load 10:33AM INF Loading model 'b5869d55688a529c3738cb044e92c331' with backend /build/backend/python/mamba/run.sh 10:33AM INF [/build/backend/python/mamba/run.sh] Fails: grpc process not found: /tmp/localai/backend_data/backend-assets/grpc/build/backend/python/mamba/run.sh. some backends(stablediffusion, tts) require LocalAI compiled with GO_TAGS 10:33AM INF [/build/backend/python/sentencetransformers/run.sh] Attempting to load 10:33AM INF Loading model 'b5869d55688a529c3738cb044e92c331' with backend /build/backend/python/sentencetransformers/run.sh 10:33AM INF [/build/backend/python/sentencetransformers/run.sh] Fails: grpc process not found: /tmp/localai/backend_data/backend-assets/grpc/build/backend/python/sentencetransformers/run.sh. some backends(stablediffusion, tts) require LocalAI compiled with GO_TAGS 10:33AM INF [/build/backend/python/transformers/run.sh] Attempting to load 10:33AM INF Loading model 'b5869d55688a529c3738cb044e92c331' with backend /build/backend/python/transformers/run.sh 10:33AM INF [/build/backend/python/transformers/run.sh] Fails: grpc process not found: /tmp/localai/backend_data/backend-assets/grpc/build/backend/python/transformers/run.sh. some backends(stablediffusion, tts) require LocalAI compiled with GO_TAGS 10:33AM INF [/build/backend/python/sentencetransformers/run.sh] Attempting to load 10:33AM INF Loading model 'b5869d55688a529c3738cb044e92c331' with backend /build/backend/python/sentencetransformers/run.sh 10:33AM INF [/build/backend/python/sentencetransformers/run.sh] Fails: grpc process not found: /tmp/localai/backend_data/backend-assets/grpc/build/backend/python/sentencetransformers/run.sh. some backends(stablediffusion, tts) require LocalAI compiled with GO_TAGS 10:33AM INF [/build/backend/python/exllama2/run.sh] Attempting to load 10:33AM INF Loading model 'b5869d55688a529c3738cb044e92c331' with backend /build/backend/python/exllama2/run.sh 10:33AM INF [/build/backend/python/exllama2/run.sh] Fails: grpc process not found: /tmp/localai/backend_data/backend-assets/grpc/build/backend/python/exllama2/run.sh. some backends(stablediffusion, tts) require LocalAI compiled with GO_TAGS 10:33AM INF [/build/backend/python/petals/run.sh] Attempting to load 10:33AM INF Loading model 'b5869d55688a529c3738cb044e92c331' with backend /build/backend/python/petals/run.sh 10:33AM INF [/build/backend/python/petals/run.sh] Fails: grpc process not found: /tmp/localai/backend_data/backend-assets/grpc/build/backend/python/petals/run.sh. some backends(stablediffusion, tts) require LocalAI compiled with GO_TAGS 10:33AM INF [/build/backend/python/autogptq/run.sh] Attempting to load 10:33AM INF Loading model 'b5869d55688a529c3738cb044e92c331' with backend /build/backend/python/autogptq/run.sh 10:33AM INF [/build/backend/python/autogptq/run.sh] Fails: grpc process not found: /tmp/localai/backend_data/backend-assets/grpc/build/backend/python/autogptq/run.sh. some backends(stablediffusion, tts) require LocalAI compiled with GO_TAGS 10:33AM INF [/build/backend/python/transformers-musicgen/run.sh] Attempting to load 10:33AM INF Loading model 'b5869d55688a529c3738cb044e92c331' with backend /build/backend/python/transformers-musicgen/run.sh 10:33AM INF [/build/backend/python/transformers-musicgen/run.sh] Fails: grpc process not found: /tmp/localai/backend_data/backend-assets/grpc/build/backend/python/transformers-musicgen/run.sh. some backends(stablediffusion, tts) require LocalAI compiled with GO_TAGS 10:33AM INF [/build/backend/python/parler-tts/run.sh] Attempting to load 10:33AM INF Loading model 'b5869d55688a529c3738cb044e92c331' with backend /build/backend/python/parler-tts/run.sh 10:33AM INF [/build/backend/python/parler-tts/run.sh] Fails: grpc process not found: /tmp/localai/backend_data/backend-assets/grpc/build/backend/python/parler-tts/run.sh. some backends(stablediffusion, tts) require LocalAI compiled with GO_TAGS 10:33AM INF [/build/backend/python/rerankers/run.sh] Attempting to load 10:33AM INF Loading model 'b5869d55688a529c3738cb044e92c331' with backend /build/backend/python/rerankers/run.sh 10:33AM INF [/build/backend/python/rerankers/run.sh] Fails: grpc process not found: /tmp/localai/backend_data/backend-assets/grpc/build/backend/python/rerankers/run.sh. some backends(stablediffusion, tts) require LocalAI compiled with GO_TAGS ```

bunder2015 commented 1 month ago

Good morning @Hideman85, how did you set up localai? I would recommend using the docker compose method... Without the devices block, it won't offload anything to the AMD GPU.

That said, I haven't had any issues with text models. Let us know how things turn out. Cheers

Hideman85 commented 1 month ago

Same thing happening for me, try using gpt4 or llama3 7B none works with hipblas

Logs

``` api-1 | I local-ai build info: api-1 | I BUILD_TYPE: hipblas api-1 | I GO_TAGS: stablediffusion tinydream tts api-1 | I LD_FLAGS: -X "github.com/go-skynet/LocalAI/internal.Version=v2.16.0" -X "github.com/go-skynet/LocalAI/internal.Commit=e0187c2a1a4cde837398ada217d0ad161b7976d6" api-1 | CGO_LDFLAGS="-O3 --rtlib=compiler-rt -unwindlib=libgcc -lhipblas -lrocblas --hip-link -L/opt/rocm/lib/llvm/lib" go build -ldflags "-X "github.com/go-skynet/LocalAI/internal.Version=v2.16.0" -X "github.com/go-skynet/LocalAI/internal.Commit=e0187c2a1a4cde837398ada217d0ad161b7976d6"" -tags "stablediffusion tinydream tts" -o local-ai ./ api-1 | 12:16PM INF loading environment variables from file envFile=.env api-1 | 12:16PM DBG Setting logging to debug api-1 | 12:16PM INF Starting LocalAI using 16 threads, with models path: /models api-1 | 12:16PM INF LocalAI version: v2.16.0 (e0187c2a1a4cde837398ada217d0ad161b7976d6) api-1 | 12:16PM DBG CPU capabilities: [3dnowprefetch abm adx aes amd_lbr_v2 aperfmperf apic arat avx avx2 avx512_bf16 avx512_bitalg avx512_vbmi2 avx512_vnni avx512_vpopcntdq avx512bw avx512cd avx512dq avx512f avx512ifma avx512vbmi avx512vl bmi1 bmi2 bpext cat_l3 cdp_l3 clflush clflushopt clwb clzero cmov cmp_legacy constant_tsc cpb cppc cpuid cqm cqm_llc cqm_mbm_local cqm_mbm_total cqm_occup_llc cr8_legacy cx16 cx8 de decodeassists erms extapic extd_apicid f16c flush_l1d flushbyasid fma fpu fsgsbase fxsr fxsr_opt gfni ht hw_pstate ibpb ibrs ibrs_enhanced ibs invpcid irperf lahf_lm lbrv lm mba mca mce misalignsse mmx mmxext monitor movbe msr mtrr mwaitx nonstop_tsc nopl npt nrip_save nx ospke osvw overflow_recov pae pat pausefilter pclmulqdq pdpe1gb perfctr_core perfctr_llc perfctr_nb perfmon_v2 pfthreshold pge pku pni popcnt pse pse36 rapl rdpid rdpru rdrand rdseed rdt_a rdtscp rep_good sep sha_ni skinit smap smca smep ssbd sse sse2 sse4_1 sse4_2 sse4a ssse3 stibp succor svm svm_lock syscall tce topoext tsc tsc_scale umip user_shstk v_spec_ctrl v_vmsave_vmload vaes vgif vmcb_clean vme vmmcall vnmi vpclmulqdq wbnoinvd wdt x2apic x2avic xgetbv1 xsave xsavec xsaveerptr xsaveopt xsaves] api-1 | 12:16PM DBG GPU count: 2 api-1 | 12:16PM DBG GPU: card #0 @0000:01:00.0 -> driver: 'nvidia' class: 'Display controller' vendor: 'NVIDIA Corporation' product: 'unknown' api-1 | 12:16PM DBG GPU: card #1 @0000:c4:00.0 -> driver: 'amdgpu' class: 'Display controller' vendor: 'Advanced Micro Devices, Inc. [AMD/ATI]' product: 'unknown' api-1 | 12:16PM INF Preloading models from /models api-1 | 12:16PM DBG Model: gpt-4 (config: {PredictionOptions:{Model:b5869d55688a529c3738cb044e92c331 Language: N:0 TopP:0xc000b2ecb8 TopK:0xc000b2ecc0 Temperature:0xc000b2ecc8 Maxtokens:0xc000b2ecf8 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc000b2ecf0 TypicalP:0xc000b2ece8 Seed:0xc000b2ed10 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:gpt-4 F16:0xc000b2ecb0 Threads:0xc000b2eca8 Debug:0xc000b2ed08 Roles:map[] Embeddings:false Backend: TemplateConfig:{Chat:{{.Input -}} api-1 | <|im_start|>assistant api-1 | ChatMessage:<|im_start|>{{if eq .RoleName "assistant"}}assistant{{else if eq .RoleName "system"}}system{{else if eq .RoleName "tool"}}tool{{else if eq .RoleName "user"}}user{{end}} api-1 | {{- if .FunctionCall }} api-1 | api-1 | {{- else if eq .RoleName "tool" }} api-1 | api-1 | {{- end }} api-1 | {{- if .Content}} api-1 | {{.Content }} api-1 | {{- end }} api-1 | {{- if .FunctionCall}} api-1 | {{toJson .FunctionCall}} api-1 | {{- end }} api-1 | {{- if .FunctionCall }} api-1 | api-1 | {{- else if eq .RoleName "tool" }} api-1 | api-1 | {{- end }}<|im_end|> api-1 | Completion:{{.Input}} api-1 | Edit: Functions:<|im_start|>system api-1 | You are a function calling AI model. api-1 | Here are the available tools: api-1 | api-1 | {{range .Functions}} api-1 | {'type': 'function', 'function': {'name': '{{.Name}}', 'description': '{{.Description}}', 'parameters': {{toJson .Parameters}} }} api-1 | {{end}} api-1 | api-1 | You should call the tools provided to you sequentially api-1 | Please use XML tags to record your reasoning and planning before you call the functions as follows: api-1 | api-1 | {step-by-step reasoning and plan in bullet points} api-1 | api-1 | For each function call return a json object with function name and arguments within XML tags as follows: api-1 | api-1 | {"arguments": , "name": } api-1 | <|im_end|> api-1 | {{.Input -}} api-1 | <|im_start|>assistant UseTokenizerTemplate:false JoinChatMessagesByCharacter:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:true GrammarConfig:{ParallelCalls:false DisableParallelNewLines:false MixedMode:true NoMixedFreeString:false NoGrammar:false Prefix:} NoActionFunctionName: NoActionDescriptionName: ResponseRegex: JSONRegexMatch:[(?s)(.*?) (?s)(.*?)] ReplaceFunctionResults:[{Key:(?s)^[^{\[]* Value:} {Key:(?s)[^}\]]*$ Value:} {Key:'([^']*?)' Value:_DQUOTE_${1}_DQUOTE_} {Key:\\" Value:__TEMP_QUOTE__} {Key:' Value:'} {Key:_DQUOTE_ Value:"} {Key:__TEMP_QUOTE__ Value:"} {Key:(?s).* Value:}] ReplaceLLMResult:[{Key:(?s).* Value:}] FunctionName:true} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc000b2ece0 MirostatTAU:0xc000b2ecd8 Mirostat:0xc000b2ecd0 NGPULayers:0xc000b2ed00 MMap:0xc000b2ec58 MMlock:0xc000b2ed09 LowVRAM:0xc000b2ed09 Grammar: StopWords:[<|im_end|> <|eot_id|> <|end_of_text|>] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc000b2ec60 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 MMProj: FlashAttention:false NoKVOffloading:false RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[] Description: Usage:}) api-1 | 12:17PM DBG Stream request received api-1 | 12:17PM INF Success ip=172.18.0.1 latency=1.284947ms method=POST status=200 url=/v1/chat/completions api-1 | 12:17PM DBG Sending chunk: {"created":1716812185,"object":"chat.completion.chunk","id":"2af7725f-1353-4598-9ca8-2c5d5cfec26a","model":"gpt-4","choices":[{"index":0,"finish_reason":"","delta":{"role":"assistant","content":""}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}} api-1 | api-1 | 12:17PM DBG Loading from the following backends (in order): [llama-cpp llama-ggml gpt4all llama-cpp-fallback rwkv stablediffusion whisper piper tinydream huggingface bert-embeddings /build/backend/python/rerankers/run.sh /build/backend/python/transformers-musicgen/run.sh /build/backend/python/openvoice/run.sh /build/backend/python/coqui/run.sh /build/backend/python/petals/run.sh /build/backend/python/transformers/run.sh /build/backend/python/bark/run.sh /build/backend/python/exllama2/run.sh /build/backend/python/sentencetransformers/run.sh /build/backend/python/vllm/run.sh /build/backend/python/mamba/run.sh /build/backend/python/exllama/run.sh /build/backend/python/vall-e-x/run.sh /build/backend/python/autogptq/run.sh /build/backend/python/diffusers/run.sh /build/backend/python/parler-tts/run.sh /build/backend/python/sentencetransformers/run.sh] api-1 | 12:17PM INF Trying to load the model 'b5869d55688a529c3738cb044e92c331' with the backend '[llama-cpp llama-ggml gpt4all llama-cpp-fallback rwkv stablediffusion whisper piper tinydream huggingface bert-embeddings /build/backend/python/rerankers/run.sh /build/backend/python/transformers-musicgen/run.sh /build/backend/python/openvoice/run.sh /build/backend/python/coqui/run.sh /build/backend/python/petals/run.sh /build/backend/python/transformers/run.sh /build/backend/python/bark/run.sh /build/backend/python/exllama2/run.sh /build/backend/python/sentencetransformers/run.sh /build/backend/python/vllm/run.sh /build/backend/python/mamba/run.sh /build/backend/python/exllama/run.sh /build/backend/python/vall-e-x/run.sh /build/backend/python/autogptq/run.sh /build/backend/python/diffusers/run.sh /build/backend/python/parler-tts/run.sh /build/backend/python/sentencetransformers/run.sh]' api-1 | 12:17PM INF [llama-cpp] Attempting to load api-1 | 12:17PM INF Loading model 'b5869d55688a529c3738cb044e92c331' with backend llama-cpp api-1 | 12:17PM DBG Loading model in memory from file: /models/b5869d55688a529c3738cb044e92c331 api-1 | 12:17PM DBG Loading Model b5869d55688a529c3738cb044e92c331 with gRPC (file: /models/b5869d55688a529c3738cb044e92c331) (backend: llama-cpp): {backendString:llama-cpp model:b5869d55688a529c3738cb044e92c331 threads:16 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc000a2e008 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh openvoice:/build/backend/python/openvoice/run.sh parler-tts:/build/backend/python/parler-tts/run.sh petals:/build/backend/python/petals/run.sh rerankers:/build/backend/python/rerankers/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false} api-1 | 12:17PM INF GPU device found but no CUDA backend present api-1 | 12:17PM INF [llama-cpp] attempting to load with AVX2 variant api-1 | 12:17PM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama-cpp-avx2 api-1 | 12:17PM DBG GRPC Service for b5869d55688a529c3738cb044e92c331 will be running at: '127.0.0.1:33719' api-1 | 12:17PM DBG GRPC Service state dir: /tmp/go-processmanager3395481196 api-1 | 12:17PM DBG GRPC Service Started api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stdout Server listening on 127.0.0.1:33719 api-1 | 12:17PM DBG GRPC Service Ready api-1 | 12:17PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:} sizeCache:0 unknownFields:[] Model:b5869d55688a529c3738cb044e92c331 ContextSize:8192 Seed:1036094942 NBatch:512 F16Memory:false MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:16 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/b5869d55688a529c3738cb044e92c331 Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type: FlashAttention:false NoKVOffload:false} api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llama_model_loader: loaded meta data with 23 key-value pairs and 291 tensors from /models/b5869d55688a529c3738cb044e92c331 (version GGUF V3 (latest)) api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llama_model_loader: - kv 0: general.architecture str = llama api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llama_model_loader: - kv 1: general.name str = Hermes-2-Pro-Llama-3-8B api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llama_model_loader: - kv 2: llama.block_count u32 = 32 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llama_model_loader: - kv 3: llama.context_length u32 = 8192 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llama_model_loader: - kv 4: llama.embedding_length u32 = 4096 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llama_model_loader: - kv 6: llama.attention.head_count u32 = 32 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llama_model_loader: - kv 10: general.file_type u32 = 15 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llama_model_loader: - kv 11: llama.vocab_size u32 = 128288 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128288] = ["!", "\"", "#", "$", "%", "&", "'", ... api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128288] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128003 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 128001 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llama_model_loader: - kv 21: tokenizer.chat_template str = {{bos_token}}{% for message in messag... api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llama_model_loader: - kv 22: general.quantization_version u32 = 2 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llama_model_loader: - type f32: 65 tensors api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llama_model_loader: - type q4_K: 193 tensors api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llama_model_loader: - type q6_K: 33 tensors api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_vocab: special tokens definition check successful ( 288/128288 ). api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: format = GGUF V3 (latest) api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: arch = llama api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: vocab type = BPE api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: n_vocab = 128288 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: n_merges = 280147 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: n_ctx_train = 8192 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: n_embd = 4096 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: n_head = 32 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: n_head_kv = 8 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: n_layer = 32 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: n_rot = 128 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: n_embd_head_k = 128 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: n_embd_head_v = 128 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: n_gqa = 4 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: n_embd_k_gqa = 1024 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: n_embd_v_gqa = 1024 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: f_norm_eps = 0.0e+00 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: f_norm_rms_eps = 1.0e-05 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: f_clamp_kqv = 0.0e+00 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: f_max_alibi_bias = 0.0e+00 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: f_logit_scale = 0.0e+00 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: n_ff = 14336 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: n_expert = 0 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: n_expert_used = 0 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: causal attn = 1 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: pooling type = 0 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: rope type = 0 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: rope scaling = linear api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: freq_base_train = 500000.0 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: freq_scale_train = 1 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: n_yarn_orig_ctx = 8192 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: rope_finetuned = unknown api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: ssm_d_conv = 0 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: ssm_d_inner = 0 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: ssm_d_state = 0 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: ssm_dt_rank = 0 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: model type = 8B api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: model ftype = Q4_K - Medium api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: model params = 8.03 B api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: model size = 4.58 GiB (4.89 BPW) api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: general.name = Hermes-2-Pro-Llama-3-8B api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: EOS token = 128003 '<|im_end|>' api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: PAD token = 128001 '<|end_of_text|>' api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: LF token = 128 'Ä' api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr llm_load_print_meta: EOT token = 128003 '<|im_end|>' api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr rocBLAS error: Cannot read /opt/rocm-6.1.0/lib/rocblas/library/TensileLibrary.dat: No such file or directory for GPU arch : gfx1103 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr List of available TensileLibrary Files : api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr "/opt/rocm-6.1.0/lib/rocblas/library/TensileLibrary_lazy_gfx941.dat" api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr "/opt/rocm-6.1.0/lib/rocblas/library/TensileLibrary_lazy_gfx940.dat" api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr "/opt/rocm-6.1.0/lib/rocblas/library/TensileLibrary_lazy_gfx900.dat" api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr "/opt/rocm-6.1.0/lib/rocblas/library/TensileLibrary_lazy_gfx1030.dat" api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr "/opt/rocm-6.1.0/lib/rocblas/library/TensileLibrary_lazy_gfx942.dat" api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr "/opt/rocm-6.1.0/lib/rocblas/library/TensileLibrary_lazy_gfx908.dat" api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr "/opt/rocm-6.1.0/lib/rocblas/library/TensileLibrary_lazy_gfx906.dat" api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr "/opt/rocm-6.1.0/lib/rocblas/library/TensileLibrary_lazy_gfx1101.dat" api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr "/opt/rocm-6.1.0/lib/rocblas/library/TensileLibrary_lazy_gfx1102.dat" api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr "/opt/rocm-6.1.0/lib/rocblas/library/TensileLibrary_lazy_gfx1100.dat" api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:33719): stderr "/opt/rocm-6.1.0/lib/rocblas/library/TensileLibrary_lazy_gfx90a.dat" api-1 | 12:17PM INF [llama-cpp] Fails: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF api-1 | 12:17PM INF [llama-ggml] Attempting to load api-1 | 12:17PM INF Loading model 'b5869d55688a529c3738cb044e92c331' with backend llama-ggml api-1 | 12:17PM DBG Loading model in memory from file: /models/b5869d55688a529c3738cb044e92c331 api-1 | 12:17PM DBG Loading Model b5869d55688a529c3738cb044e92c331 with gRPC (file: /models/b5869d55688a529c3738cb044e92c331) (backend: llama-ggml): {backendString:llama-ggml model:b5869d55688a529c3738cb044e92c331 threads:16 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc000a2e008 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh openvoice:/build/backend/python/openvoice/run.sh parler-tts:/build/backend/python/parler-tts/run.sh petals:/build/backend/python/petals/run.sh rerankers:/build/backend/python/rerankers/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false} api-1 | 12:17PM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama-ggml api-1 | 12:17PM DBG GRPC Service for b5869d55688a529c3738cb044e92c331 will be running at: '127.0.0.1:42821' api-1 | 12:17PM DBG GRPC Service state dir: /tmp/go-processmanager2801012293 api-1 | 12:17PM DBG GRPC Service Started api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:42821): stderr 2024/05/27 12:17:04 gRPC Server listening at 127.0.0.1:42821 api-1 | 12:17PM DBG GRPC Service Ready api-1 | 12:17PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:} sizeCache:0 unknownFields:[] Model:b5869d55688a529c3738cb044e92c331 ContextSize:8192 Seed:1036094942 NBatch:512 F16Memory:false MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:16 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/b5869d55688a529c3738cb044e92c331 Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type: FlashAttention:false NoKVOffload:false} api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:42821): stderr create_gpt_params: loading model /models/b5869d55688a529c3738cb044e92c331 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:42821): stderr llama.cpp: loading model from /models/b5869d55688a529c3738cb044e92c331 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:42821): stderr error loading model: unknown (magic, version) combination: 46554747, 00000003; is this really a GGML file? api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:42821): stderr llama_load_model_from_file: failed to load model api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:42821): stderr llama_init_from_gpt_params: error: failed to load model '/models/b5869d55688a529c3738cb044e92c331' api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:42821): stderr load_binding_model: error: unable to load model api-1 | 12:17PM INF [llama-ggml] Fails: could not load model: rpc error: code = Unknown desc = failed loading model api-1 | 12:17PM INF [gpt4all] Attempting to load api-1 | 12:17PM INF Loading model 'b5869d55688a529c3738cb044e92c331' with backend gpt4all api-1 | 12:17PM DBG Loading model in memory from file: /models/b5869d55688a529c3738cb044e92c331 api-1 | 12:17PM DBG Loading Model b5869d55688a529c3738cb044e92c331 with gRPC (file: /models/b5869d55688a529c3738cb044e92c331) (backend: gpt4all): {backendString:gpt4all model:b5869d55688a529c3738cb044e92c331 threads:16 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc000a2e008 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh openvoice:/build/backend/python/openvoice/run.sh parler-tts:/build/backend/python/parler-tts/run.sh petals:/build/backend/python/petals/run.sh rerankers:/build/backend/python/rerankers/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false} api-1 | 12:17PM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/gpt4all api-1 | 12:17PM DBG GRPC Service for b5869d55688a529c3738cb044e92c331 will be running at: '127.0.0.1:41103' api-1 | 12:17PM DBG GRPC Service state dir: /tmp/go-processmanager2003712508 api-1 | 12:17PM DBG GRPC Service Started api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:41103): stderr 2024/05/27 12:17:06 gRPC Server listening at 127.0.0.1:41103 api-1 | 12:17PM DBG GRPC Service Ready api-1 | 12:17PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:} sizeCache:0 unknownFields:[] Model:b5869d55688a529c3738cb044e92c331 ContextSize:8192 Seed:1036094942 NBatch:512 F16Memory:false MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:16 LibrarySearchPath:/tmp/localai/backend_data/backend-assets/gpt4all RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/b5869d55688a529c3738cb044e92c331 Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type: FlashAttention:false NoKVOffload:false} api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:41103): stderr load_model: error 'Model format not supported (no matching implementation found)' api-1 | 12:17PM INF [gpt4all] Fails: could not load model: rpc error: code = Unknown desc = failed loading model api-1 | 12:17PM INF [llama-cpp-fallback] Attempting to load api-1 | 12:17PM INF Loading model 'b5869d55688a529c3738cb044e92c331' with backend llama-cpp-fallback api-1 | 12:17PM DBG Loading model in memory from file: /models/b5869d55688a529c3738cb044e92c331 api-1 | 12:17PM DBG Loading Model b5869d55688a529c3738cb044e92c331 with gRPC (file: /models/b5869d55688a529c3738cb044e92c331) (backend: llama-cpp-fallback): {backendString:llama-cpp-fallback model:b5869d55688a529c3738cb044e92c331 threads:16 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc000a2e008 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh openvoice:/build/backend/python/openvoice/run.sh parler-tts:/build/backend/python/parler-tts/run.sh petals:/build/backend/python/petals/run.sh rerankers:/build/backend/python/rerankers/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false} api-1 | 12:17PM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama-cpp-fallback api-1 | 12:17PM DBG GRPC Service for b5869d55688a529c3738cb044e92c331 will be running at: '127.0.0.1:39359' api-1 | 12:17PM DBG GRPC Service state dir: /tmp/go-processmanager2088006060 api-1 | 12:17PM DBG GRPC Service Started api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stdout Server listening on 127.0.0.1:39359 api-1 | 12:17PM DBG GRPC Service Ready api-1 | 12:17PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:} sizeCache:0 unknownFields:[] Model:b5869d55688a529c3738cb044e92c331 ContextSize:8192 Seed:1036094942 NBatch:512 F16Memory:false MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:16 LibrarySearchPath:/tmp/localai/backend_data/backend-assets/gpt4all RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/b5869d55688a529c3738cb044e92c331 Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type: FlashAttention:false NoKVOffload:false} api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llama_model_loader: loaded meta data with 23 key-value pairs and 291 tensors from /models/b5869d55688a529c3738cb044e92c331 (version GGUF V3 (latest)) api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llama_model_loader: - kv 0: general.architecture str = llama api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llama_model_loader: - kv 1: general.name str = Hermes-2-Pro-Llama-3-8B api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llama_model_loader: - kv 2: llama.block_count u32 = 32 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llama_model_loader: - kv 3: llama.context_length u32 = 8192 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llama_model_loader: - kv 4: llama.embedding_length u32 = 4096 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llama_model_loader: - kv 6: llama.attention.head_count u32 = 32 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llama_model_loader: - kv 10: general.file_type u32 = 15 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llama_model_loader: - kv 11: llama.vocab_size u32 = 128288 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128288] = ["!", "\"", "#", "$", "%", "&", "'", ... api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128288] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128003 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 128001 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llama_model_loader: - kv 21: tokenizer.chat_template str = {{bos_token}}{% for message in messag... api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llama_model_loader: - kv 22: general.quantization_version u32 = 2 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llama_model_loader: - type f32: 65 tensors api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llama_model_loader: - type q4_K: 193 tensors api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llama_model_loader: - type q6_K: 33 tensors api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_vocab: special tokens definition check successful ( 288/128288 ). api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: format = GGUF V3 (latest) api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: arch = llama api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: vocab type = BPE api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: n_vocab = 128288 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: n_merges = 280147 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: n_ctx_train = 8192 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: n_embd = 4096 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: n_head = 32 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: n_head_kv = 8 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: n_layer = 32 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: n_rot = 128 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: n_embd_head_k = 128 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: n_embd_head_v = 128 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: n_gqa = 4 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: n_embd_k_gqa = 1024 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: n_embd_v_gqa = 1024 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: f_norm_eps = 0.0e+00 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: f_norm_rms_eps = 1.0e-05 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: f_clamp_kqv = 0.0e+00 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: f_max_alibi_bias = 0.0e+00 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: f_logit_scale = 0.0e+00 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: n_ff = 14336 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: n_expert = 0 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: n_expert_used = 0 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: causal attn = 1 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: pooling type = 0 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: rope type = 0 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: rope scaling = linear api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: freq_base_train = 500000.0 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: freq_scale_train = 1 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: n_yarn_orig_ctx = 8192 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: rope_finetuned = unknown api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: ssm_d_conv = 0 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: ssm_d_inner = 0 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: ssm_d_state = 0 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: ssm_dt_rank = 0 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: model type = 8B api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: model ftype = Q4_K - Medium api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: model params = 8.03 B api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: model size = 4.58 GiB (4.89 BPW) api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: general.name = Hermes-2-Pro-Llama-3-8B api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: EOS token = 128003 '<|im_end|>' api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: PAD token = 128001 '<|end_of_text|>' api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: LF token = 128 'Ä' api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr llm_load_print_meta: EOT token = 128003 '<|im_end|>' api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr rocBLAS error: Cannot read /opt/rocm-6.1.0/lib/rocblas/library/TensileLibrary.dat: No such file or directory for GPU arch : gfx1103 api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr List of available TensileLibrary Files : api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr "/opt/rocm-6.1.0/lib/rocblas/library/TensileLibrary_lazy_gfx941.dat" api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr "/opt/rocm-6.1.0/lib/rocblas/library/TensileLibrary_lazy_gfx940.dat" api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr "/opt/rocm-6.1.0/lib/rocblas/library/TensileLibrary_lazy_gfx900.dat" api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr "/opt/rocm-6.1.0/lib/rocblas/library/TensileLibrary_lazy_gfx1030.dat" api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr "/opt/rocm-6.1.0/lib/rocblas/library/TensileLibrary_lazy_gfx942.dat" api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr "/opt/rocm-6.1.0/lib/rocblas/library/TensileLibrary_lazy_gfx908.dat" api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr "/opt/rocm-6.1.0/lib/rocblas/library/TensileLibrary_lazy_gfx906.dat" api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr "/opt/rocm-6.1.0/lib/rocblas/library/TensileLibrary_lazy_gfx1101.dat" api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr "/opt/rocm-6.1.0/lib/rocblas/library/TensileLibrary_lazy_gfx1102.dat" api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr "/opt/rocm-6.1.0/lib/rocblas/library/TensileLibrary_lazy_gfx1100.dat" api-1 | 12:17PM DBG GRPC(b5869d55688a529c3738cb044e92c331-127.0.0.1:39359): stderr "/opt/rocm-6.1.0/lib/rocblas/library/TensileLibrary_lazy_gfx90a.dat" api-1 | 12:17PM INF [llama-cpp-fallback] Fails: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF api-1 | 12:17PM INF [rwkv] Attempting to load [...] ```

bunder2015 commented 1 month ago

Hi, I noticed your log says "product unknown", did you set GPU_TARGETS? I would also try enabling DEBUG and REBUILD...

The only other thing that comes to mind is that you also have an Nvidia card in addition to the AMD card, I have a 980 kicking around, but I don't have the PCI-E bandwidth to install it into my threadripper.

I wish I could be of more help. Hopefully someone here might know what's up. Cheers

bunder2015 commented 1 month ago

I just did a quick search for gfx1103, this appears to be a Radeon 780M IGPU, which might not be supported by ROCm... https://github.com/ROCm/ROCm/discussions/2631#discussioncomment-8929948

mudler commented 1 month ago

@bunder2015 I see in the logs you have cuda set to true in the model - can you try by disabling it?

Hideman85 commented 1 month ago

@bunder2015 When you suggested to try docker compose I used the exact same compose you shared above where it does rebuild everything. But even though it does not look to work.

And rocm seem to work

``` docker run -it --device=/dev/kfd --device=/dev/dri --security-opt seccomp=unconfined --group-add video rocm/rocm-terminal rocm-user@4ee82370da0b:~$ rocm-smi =========================================== ROCm System Management Interface =========================================== ===================================================== Concise Info ===================================================== Device Node IDs Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU% (DID, GUID) (Edge) (Socket) (Mem, Compute, ID) ======================================================================================================================== 0 1 0x15bf, 7764 40.0°C 18.076W N/A, N/A, 0 None 400Mhz 0% auto Unsupported 94% 1% ======================================================================================================================== ================================================= End of ROCm SMI Log ================================================== rocm-user@4ee82370da0b:~$ sudo rocminfo ROCk module is loaded ===================== HSA System Attributes ===================== Runtime Version: 1.13 Runtime Ext Version: 1.4 System Timestamp Freq.: 1000.000000MHz Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count) Machine Model: LARGE System Endianness: LITTLE Mwaitx: DISABLED DMAbuf Support: YES ========== HSA Agents ========== ******* Agent 1 ******* Name: AMD Ryzen 9 7940HS w/ Radeon 780M Graphics Uuid: CPU-XX Marketing Name: AMD Ryzen 9 7940HS w/ Radeon 780M Graphics Vendor Name: CPU Feature: None specified Profile: FULL_PROFILE Float Round Mode: NEAR Max Queue Number: 0(0x0) Queue Min Size: 0(0x0) Queue Max Size: 0(0x0) Queue Type: MULTI Node: 0 Device Type: CPU Cache Info: L1: 32768(0x8000) KB Chip ID: 0(0x0) ASIC Revision: 0(0x0) Cacheline Size: 64(0x40) Max Clock Freq. (MHz): 5263 BDFID: 0 Internal Node ID: 0 Compute Unit: 16 SIMDs per CU: 0 Shader Engines: 0 Shader Arrs. per Eng.: 0 WatchPts on Addr. Ranges:1 Features: None Pool Info: Pool 1 Segment: GLOBAL; FLAGS: FINE GRAINED Size: 32022284(0x1e89f0c) KB Allocatable: TRUE Alloc Granule: 4KB Alloc Recommended Granule:4KB Alloc Alignment: 4KB Accessible by all: TRUE Pool 2 Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED Size: 32022284(0x1e89f0c) KB Allocatable: TRUE Alloc Granule: 4KB Alloc Recommended Granule:4KB Alloc Alignment: 4KB Accessible by all: TRUE Pool 3 Segment: GLOBAL; FLAGS: COARSE GRAINED Size: 32022284(0x1e89f0c) KB Allocatable: TRUE Alloc Granule: 4KB Alloc Recommended Granule:4KB Alloc Alignment: 4KB Accessible by all: TRUE ISA Info: ******* Agent 2 ******* Name: gfx1103 Uuid: GPU-XX Marketing Name: AMD Radeon Graphics Vendor Name: AMD Feature: KERNEL_DISPATCH Profile: BASE_PROFILE Float Round Mode: NEAR Max Queue Number: 128(0x80) Queue Min Size: 64(0x40) Queue Max Size: 131072(0x20000) Queue Type: MULTI Node: 1 Device Type: GPU Cache Info: L1: 32(0x20) KB L2: 2048(0x800) KB Chip ID: 5567(0x15bf) ASIC Revision: 9(0x9) Cacheline Size: 64(0x40) Max Clock Freq. (MHz): 2799 BDFID: 50176 Internal Node ID: 1 Compute Unit: 12 SIMDs per CU: 2 Shader Engines: 1 Shader Arrs. per Eng.: 2 WatchPts on Addr. Ranges:4 Coherent Host Access: FALSE Features: KERNEL_DISPATCH Fast F16 Operation: TRUE Wavefront Size: 32(0x20) Workgroup Max Size: 1024(0x400) Workgroup Max Size per Dimension: x 1024(0x400) y 1024(0x400) z 1024(0x400) Max Waves Per CU: 32(0x20) Max Work-item Per CU: 1024(0x400) Grid Max Size: 4294967295(0xffffffff) Grid Max Size per Dimension: x 4294967295(0xffffffff) y 4294967295(0xffffffff) z 4294967295(0xffffffff) Max fbarriers/Workgrp: 32 Packet Processor uCode:: 35 SDMA engine uCode:: 16 IOMMU Support:: None Pool Info: Pool 1 Segment: GLOBAL; FLAGS: COARSE GRAINED Size: 524288(0x80000) KB Allocatable: TRUE Alloc Granule: 4KB Alloc Recommended Granule:2048KB Alloc Alignment: 4KB Accessible by all: FALSE Pool 2 Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED Size: 524288(0x80000) KB Allocatable: TRUE Alloc Granule: 4KB Alloc Recommended Granule:2048KB Alloc Alignment: 4KB Accessible by all: FALSE Pool 3 Segment: GROUP Size: 64(0x40) KB Allocatable: FALSE Alloc Granule: 0KB Alloc Recommended Granule:0KB Alloc Alignment: 0KB Accessible by all: FALSE ISA Info: ISA 1 Name: amdgcn-amd-amdhsa--gfx1103 Machine Models: HSA_MACHINE_MODEL_LARGE Profiles: HSA_PROFILE_BASE Default Rounding Mode: NEAR Default Rounding Mode: NEAR Fast f16: TRUE Workgroup Max Size: 1024(0x400) Workgroup Max Size per Dimension: x 1024(0x400) y 1024(0x400) z 1024(0x400) Grid Max Size: 4294967295(0xffffffff) Grid Max Size per Dimension: x 4294967295(0xffffffff) y 4294967295(0xffffffff) z 4294967295(0xffffffff) FBarrier Max Size: 32 *** Done *** ```

What I do not understand is why the officially built All-In-One images are failing as well, I tried both cublas and hipblas none works, only cublas fallback to cpu that is working but I cannot make the use of any hardware acceleration.

bunder2015 commented 1 month ago

Hi, @mudler, sha-ba984c7-hipblas-ffmpeg with cuda set to false ~~loads the model into GPU memory, but~~ all the work is being done on the CPU and is really slow... I've been waiting 30+ minutes and diffusers/backend.py is just spinning. I don't think it's going to finish, and its not throwing any errors anywhere. log

mudler commented 1 month ago

Hi, @mudler, sha-ba984c7-hipblas-ffmpeg with cuda set to false loads the model into GPU memory, but all the work is being done on the CPU and is really slow... I've been waiting 30+ minutes and diffusers/backend.py is just spinning. I don't think it's going to finish, and its not throwing any errors anywhere.

gotcha - the fact that you can manage to offload to the GPU's ram tells that it can correctly use it, but I think the CUDA flag explicitly forces to cuda.

However, I can't find any docs around diffusers and hipblas directly - What I can tell is that we used to pick up torch from a different pip index here when this feature was first introduced: https://github.com/mudler/LocalAI/commit/fb0a4c5d9a1fa425bb1c61e354faf26efa41154a#diff-01623ead8ec22d05e4d7a70d687c15ea27485959956bc6e864ffb1d8e374afb9R29 , while now we take it from a different url https://github.com/mudler/LocalAI/blob/master/backend/python/diffusers/requirements-hipblas.txt#L1

any chance you can try building a container image with this index https://github.com/mudler/LocalAI/commit/fb0a4c5d9a1fa425bb1c61e354faf26efa41154a#diff-01623ead8ec22d05e4d7a70d687c15ea27485959956bc6e864ffb1d8e374afb9R29 ?

bunder2015 commented 1 month ago

I can't seem to find any images for fb0a4c5 on quay, but it looks like that commit belongs to v2.9.0... I can try that release if you would like, but v2.15.0 also works with cuda set to true... I think I originally started using localai around 2.12.x.

mudler commented 1 month ago

@bunder2015 what I mean is to build an image from current master branch manually, and swapping the index URL out. Sadly without an AMD card around there is going to be a little bit of back and forth:

I'd like to clear out if it's a problem of getting the dependencies from the correct repositories, and try to swap https://github.com/mudler/LocalAI/blob/e9c28a1ed7eef43ac5266029de5d9b3033c0103c/backend/python/diffusers/requirements-hipblas.txt#L1 with

 --pre --extra-index-url https://download.pytorch.org/whl/nightly/

instead

bunder2015 commented 1 month ago

Sadly without an AMD card around there is going to be a little bit of back and forth

That's okay, I don't mind...

what I mean is to build an image from current master branch manually, and swapping the index URL out.

Oh, I see what you mean now... I gave it 5 minutes, but it doesn't look like the model got loaded into memory. As a sanity check, I even pruned all my docker images and tried both urls again with cuda false on 100% fresh ba984c7 builds and got the same thing, I must have loaded another model by mistake prior to testing before. The old url with cuda true also gave me the "no nvidia driver" error.

Sorry for the confusion, and the delay (it takes a while to build from scratch).

mudler / LocalAI

Better Support for AMD and ROCM via docker containers. #1592