kherud / java-llama.cpp

Java Bindings for llama.cpp - A Port of Facebook's LLaMA model in C/C++
MIT License

fail with snowflake-arctic-embed-l #59

Closed fbellomi closed 3 months ago

fbellomi commented 4 months ago

Hello,

I'm trying to use snowflake-arctic-embed-l for embeddings.

I'm using https://huggingface.co/ChristianAzinn/snowflake-arctic-embed-l-gguf

I'm on macOS x86_64 (CPU only, no CUDA), using the Maven dependency directly (no GPU setup).

It fails with the message below.

I'm not sure if this is the problem, but it seems to pick up the AMD Radeon Pro 575X (the graphics accelerator) and tries to use it as a GPU, and I don't know how to disable this.
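(For what it's worth, I would have expected something like the following to force CPU-only inference. This is an untested sketch: `setNGpuLayers` appears in the project README, but the other method names and the model path are assumptions on my part.)

```java
import de.kherud.llama.LlamaModel;
import de.kherud.llama.ModelParameters;

public class CpuOnlyEmbedding {
    public static void main(String[] args) {
        // Offloading zero layers should keep the whole model on the CPU and
        // avoid the Metal backend entirely. setNGpuLayers is taken from the
        // java-llama.cpp README; setModelFilePath/setEmbedding are assumptions.
        ModelParameters params = new ModelParameters()
                .setModelFilePath("models/snowflake-arctic-embed-l-f16.GGUF")
                .setNGpuLayers(0)
                .setEmbedding(true);
        try (LlamaModel model = new LlamaModel(params)) {
            float[] embedding = model.embed("Hello world");
            System.out.println("embedding dimension: " + embedding.length);
        }
    }
}
```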

As a quick test, I tried the latest version of ollama (which uses a more recent build of llama.cpp), and it works fine on my system, so I'm not sure whether the issue is really the llama.cpp version.

Thanks for any help, and for your efforts on java-llama.cpp.

Francesco

/de/kherud/llama/Mac/x86_64
'ggml-metal.metal' not found
Extracted 'libllama.dylib' to '/var/folders/v5/2ptcmns52kdb70jxp6y8lcjw0000gn/T/libllama.dylib'
Extracted 'libjllama.dylib' to '/var/folders/v5/2ptcmns52kdb70jxp6y8lcjw0000gn/T/libjllama.dylib'
{"tid":"0x70000cda8000","timestamp":1714832877,"level":"INFO","function":"Java_de_kherud_llama_LlamaModel_loadModel","line":283,"msg":"build info","build":2702,"commit":"b97bc396"}
{"tid":"0x70000cda8000","timestamp":1714832877,"level":"INFO","function":"Java_de_kherud_llama_LlamaModel_loadModel","line":290,"msg":"system info","n_threads":6,"n_threads_batch":-1,"total_threads":6,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | "}
llama_model_loader: loaded meta data with 23 key-value pairs and 389 tensors from ../ml/models/snowflake-arctic-embed-l-f16.GGUF (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = bert
llama_model_loader: - kv   1:                               general.name str              = snowflake-arctic-embed-l
llama_model_loader: - kv   2:                           bert.block_count u32              = 24
llama_model_loader: - kv   3:                        bert.context_length u32              = 512
llama_model_loader: - kv   4:                      bert.embedding_length u32              = 1024
llama_model_loader: - kv   5:                   bert.feed_forward_length u32              = 4096
llama_model_loader: - kv   6:                  bert.attention.head_count u32              = 16
llama_model_loader: - kv   7:          bert.attention.layer_norm_epsilon f32              = 0.000000
llama_model_loader: - kv   8:                          general.file_type u32              = 1
llama_model_loader: - kv   9:                      bert.attention.causal bool             = false
llama_model_loader: - kv  10:                          bert.pooling_type u32              = 2
llama_model_loader: - kv  11:            tokenizer.ggml.token_type_count u32              = 2
llama_model_loader: - kv  12:                tokenizer.ggml.bos_token_id u32              = 101
llama_model_loader: - kv  13:                tokenizer.ggml.eos_token_id u32              = 102
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = bert
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,30522]   = ["[PAD]", "[unused0]", "[unused1]", "...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,30522]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,30522]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 100
llama_model_loader: - kv  19:          tokenizer.ggml.seperator_token_id u32              = 102
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  21:                tokenizer.ggml.cls_token_id u32              = 101
llama_model_loader: - kv  22:               tokenizer.ggml.mask_token_id u32              = 103
llama_model_loader: - type  f32:  243 tensors
llama_model_loader: - type  f16:  146 tensors
llm_load_vocab: mismatch in special tokens definition ( 7104/30522 vs 5/30522 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = bert
llm_load_print_meta: vocab type       = WPM
llm_load_print_meta: n_vocab          = 30522
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 512
llm_load_print_meta: n_embd           = 1024
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 1.0e-12
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 4096
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 0
llm_load_print_meta: pooling type     = 2
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 512
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 334.09 M
llm_load_print_meta: model size       = 637.85 MiB (16.02 BPW) 
llm_load_print_meta: general.name     = snowflake-arctic-embed-l
llm_load_print_meta: BOS token        = 101 '[CLS]'
llm_load_print_meta: EOS token        = 102 '[SEP]'
llm_load_print_meta: UNK token        = 100 '[UNK]'
llm_load_print_meta: SEP token        = 102 '[SEP]'
llm_load_print_meta: PAD token        = 0 '[PAD]'
llm_load_print_meta: CLS token        = 101 '[CLS]'
llm_load_print_meta: MASK token       = 103 '[MASK]'
llm_load_print_meta: LF token         = 0 '[PAD]'
llm_load_tensors: ggml ctx size =    0.35 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =   637.85 MiB, (  638.02 /  4096.00)
llm_load_tensors: offloading 24 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 25/25 layers to GPU
llm_load_tensors:        CPU buffer size =    60.62 MiB
llm_load_tensors:      Metal buffer size =   637.85 MiB
................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: AMD Radeon Pro 575X
ggml_metal_init: picking default device: AMD Radeon Pro 575X
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   AMD Radeon Pro 575X
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: simdgroup reduction support   = false
ggml_metal_init: simdgroup matrix mul. support = false
ggml_metal_init: hasUnifiedMemory              = false
ggml_metal_init: recommendedMaxWorkingSetSize  =  4294.97 MB
ggml_metal_init: skipping kernel_soft_max                  (not supported)
ggml_metal_init: skipping kernel_soft_max_4                (not supported)
ggml_metal_init: skipping kernel_rms_norm                  (not supported)
ggml_metal_init: skipping kernel_group_norm                (not supported)
ggml_metal_init: skipping kernel_mul_mv_f32_f32            (not supported)
ggml_metal_init: skipping kernel_mul_mv_f16_f16            (not supported)
ggml_metal_init: skipping kernel_mul_mv_f16_f32            (not supported)
ggml_metal_init: skipping kernel_mul_mv_f16_f32_1row       (not supported)
ggml_metal_init: skipping kernel_mul_mv_f16_f32_l4         (not supported)
ggml_metal_init: skipping kernel_mul_mv_q4_0_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mv_q4_1_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mv_q5_0_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mv_q5_1_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mv_q8_0_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mv_q2_K_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mv_q3_K_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mv_q4_K_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mv_q5_K_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mv_q6_K_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mv_iq2_xxs_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mv_iq2_xs_f32         (not supported)
ggml_metal_init: skipping kernel_mul_mv_iq3_xxs_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mv_iq3_s_f32          (not supported)
ggml_metal_init: skipping kernel_mul_mv_iq2_s_f32          (not supported)
ggml_metal_init: skipping kernel_mul_mv_iq1_s_f32          (not supported)
ggml_metal_init: skipping kernel_mul_mv_iq1_m_f32          (not supported)
ggml_metal_init: skipping kernel_mul_mv_iq4_nl_f32         (not supported)
ggml_metal_init: skipping kernel_mul_mv_iq4_xs_f32         (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_f32_f32         (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_f16_f32         (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_q4_0_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_q4_1_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_q5_0_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_q5_1_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_q8_0_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_q2_K_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_q3_K_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_q4_K_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_q5_K_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_q6_K_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_iq2_xxs_f32     (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_iq2_xs_f32      (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_iq3_xxs_f32     (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_iq3_s_f32       (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_iq2_s_f32       (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_iq1_s_f32       (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_iq1_m_f32       (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_iq4_nl_f32      (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_iq4_xs_f32      (not supported)
ggml_metal_init: skipping kernel_mul_mm_f32_f32            (not supported)
ggml_metal_init: skipping kernel_mul_mm_f16_f32            (not supported)
ggml_metal_init: skipping kernel_mul_mm_q4_0_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q4_1_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q5_0_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q5_1_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q8_0_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q2_K_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q3_K_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q4_K_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q5_K_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q6_K_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq2_xxs_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq2_xs_f32         (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq3_xxs_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq3_s_f32          (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq2_s_f32          (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq1_s_f32          (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq1_m_f32          (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq4_nl_f32         (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq4_xs_f32         (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_f32_f32         (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_f16_f32         (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q4_0_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q4_1_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q5_0_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q5_1_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q8_0_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q2_K_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q3_K_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q4_K_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q5_K_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q6_K_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq2_xxs_f32     (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq2_xs_f32      (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq3_xxs_f32     (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq3_s_f32       (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq2_s_f32       (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq1_s_f32       (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq1_m_f32       (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq4_nl_f32      (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq4_xs_f32      (not supported)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =    48.00 MiB, (  702.23 /  4096.00)
llama_kv_cache_init:      Metal KV buffer size =    48.00 MiB
llama_new_context_with_model: KV self size  =   48.00 MiB, K (f16):   24.00 MiB, V (f16):   24.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.00 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =    25.01 MiB, (  727.24 /  4096.00)
llama_new_context_with_model:      Metal compute buffer size =    25.01 MiB
llama_new_context_with_model:        CPU compute buffer size =     5.01 MiB
llama_new_context_with_model: graph nodes  = 849
llama_new_context_with_model: graph splits = 2
ggml_metal_graph_compute_block_invoke: error: unsupported op 'MUL_MAT'
GGML_ASSERT: /Users/runner/work/java-llama.cpp/java-llama.cpp/build/_deps/llama.cpp-src/ggml-metal.m:879: !"unsupported op"
ggml_metal_graph_compute_block_invoke: error: unsupported op 'MUL_MAT'
GGML_ASSERT: /Users/runner/work/java-llama.cpp/java-llama.cpp/build/_deps/llama.cpp-src/ggml-metal.m:879: !"unsupported op"
kherud commented 4 months ago

Hey @fbellomi I just upgraded the binding to the latest llama.cpp version (java-llama.cpp version 3.0.2). Can you please check if the problem persists? If it does, I'll have a closer look.

fbellomi commented 4 months ago

@kherud, thanks for your quick reply

I tried with 3.0.2, but it seems to fail to load the native library.

/de/kherud/llama/Mac/x86_64
'ggml-metal.metal' not found
Extracted 'libllama.dylib' to '/var/folders/h0/xch4xg717wq862s2hppsfyc00000gn/T/libllama.dylib'
/private/var/folders/h0/xch4xg717wq862s2hppsfyc00000gn/T/libllama.dylib: dlopen(/private/var/folders/h0/xch4xg717wq862s2hppsfyc00000gn/T/libllama.dylib, 0x0001): tried: '/private/var/folders/h0/xch4xg717wq862s2hppsfyc00000gn/T/libllama.dylib' (mach-o file, but is an incompatible architecture (have (arm64), need (x86_64h)))
Failed to load native library: /var/folders/h0/xch4xg717wq862s2hppsfyc00000gn/T/libllama.dylib. osinfo: Mac/x86_64
Exception in thread "main" java.lang.UnsatisfiedLinkError: No native library found for os.name=Mac, os.arch=x86_64, paths=[/de/kherud/llama/Mac/x86_64:/Users/francesco/Library/Java/Extensions:/Library/Java/Extensions:/Network/Library/Java/Extensions:/System/Library/Java/Extensions:/usr/lib/java:.]
    at de.kherud.llama.LlamaLoader.loadNativeLibrary(LlamaLoader.java:158)
    at de.kherud.llama.LlamaLoader.initialize(LlamaLoader.java:65)
    at de.kherud.llama.LlamaModel.<clinit>(LlamaModel.java:27)
    at com.creactives.llm.UNSPSCEmbeddingsJLL.main(UNSPSCEmbeddingsJLL.java:40)

It seems to correctly recognize the x86_64 architecture, but it does not like the extracted library.

I re-checked with 3.0.1, and it keeps failing as in my first comment, but only after correctly loading the library; so this issue appears to be specific to 3.0.2. I also tried resetting the Gradle cache and re-downloading the lib.

I checked the downloaded jar on my file system, and it contains the binaries in /de/kherud/llama/Mac/x86_64 (I don't know how to check whether they are well-formed).
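(Side note for anyone hitting the same "incompatible architecture" error: on macOS, `file libllama.dylib` will print the binary's architecture. A small self-contained check along the same lines is sketched below; it only assumes the standard 64-bit Mach-O header layout, i.e. magic `0xFEEDFACF` followed by the CPU type at offset 4.)

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Paths;

public class DylibArchCheck {
    // A 64-bit Mach-O file starts with magic 0xFEEDFACF (stored little-endian
    // on both Intel and Apple Silicon), followed by the CPU type at offset 4.
    public static String describe(byte[] header) {
        if (header.length < 8) return "file too short";
        ByteBuffer buf = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getInt(0) != 0xFEEDFACF) return "not a 64-bit Mach-O file";
        switch (buf.getInt(4)) {
            case 0x01000007: return "x86_64";   // CPU_TYPE_X86 | CPU_ARCH_ABI64
            case 0x0100000C: return "arm64";    // CPU_TYPE_ARM | CPU_ARCH_ABI64
            default:         return "unknown cputype";
        }
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(describe(in.readNBytes(8)));
        }
    }
}
```

Running it against the extracted `/var/folders/.../libllama.dylib` should print `arm64` or `x86_64`, which is exactly the mismatch dlopen complained about.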

Thanks, Francesco

kherud commented 4 months ago

Thanks for the feedback, I'll look into it later today.

kherud commented 4 months ago

Hey @fbellomi sorry for the late reply. I think I found the problem: the GitHub Actions runner macos-latest changed from x86_64 to arm64 at some point, but the build workflow of this repository still used macos-latest in the x86_64 job. That's why you got the UnsatisfiedLinkError: the library was mistakenly built for arm64 and then placed in the x86_64 Java resources directory. I hope everything works for you now with version 3.1.0.

fbellomi commented 4 months ago

Hi, thanks for your support

I upgraded to 3.1.0.

Still no luck loading the native lib, but now with a different error:

/de/kherud/llama/Mac/x86_64
'ggml-metal.metal' not found
Extracted 'libllama.dylib' to '/var/folders/h0/xch4xg717wq862s2hppsfyc00000gn/T/libllama.dylib'
/private/var/folders/h0/xch4xg717wq862s2hppsfyc00000gn/T/libllama.dylib: dlopen(/private/var/folders/h0/xch4xg717wq862s2hppsfyc00000gn/T/libllama.dylib, 0x0001): Symbol not found: (_cblas_sgemm$NEWLAPACK$ILP64)
  Referenced from: '/private/var/folders/h0/xch4xg717wq862s2hppsfyc00000gn/T/libllama.dylib'
  Expected in: '/System/Library/Frameworks/Accelerate.framework/Versions/A/Accelerate'
Failed to load native library: /var/folders/h0/xch4xg717wq862s2hppsfyc00000gn/T/libllama.dylib. osinfo: Mac/x86_64
Exception in thread "main" java.lang.UnsatisfiedLinkError: No native library found for os.name=Mac, os.arch=x86_64, paths=[/de/kherud/llama/Mac/x86_64:/Users/francesco/Library/Java/Extensions:/Library/Java/Extensions:/Network/Library/Java/Extensions:/System/Library/Java/Extensions:/usr/lib/java:.]
    at de.kherud.llama.LlamaLoader.loadNativeLibrary(LlamaLoader.java:158)
    at de.kherud.llama.LlamaLoader.initialize(LlamaLoader.java:65)
    at de.kherud.llama.LlamaModel.<clinit>(LlamaModel.java:22)
    at com.creactives.llm.UNSPSCEmbeddingsJLL.main(UNSPSCEmbeddingsJLL.java:40)

Thanks, Francesco

fbellomi commented 3 months ago

I've tested with version 3.2.1 and it works as expected.

I'm closing this issue as resolved.