janhq / jan

Jan is an open source alternative to ChatGPT that runs 100% offline on your computer. Multiple engine support (llama.cpp, TensorRT-LLM)
https://jan.ai/
GNU Affero General Public License v3.0

bug: Unable to chat with image using Moondream2 Vision model #3705

Open louis-jan opened 1 week ago

louis-jan commented 1 week ago

Jan version

0.5.4

Describe the Bug

I can successfully load the model for chats, but as soon as I send an image, it crashes. Context:

https://huggingface.co/moondream/moondream2-gguf

The same crash occurs on Linux as well: https://discord.com/channels/1107178041848909847/1285784195125219338/1286348026973261835

Steps to Reproduce

  1. Create a model.json file for the model (the full model.json is attached in the comment below), then load the model in a chat.
  2. Send an image and request a description.

Screenshots / Logs

[Screenshot: 2024-09-19 at 19:11:09]

2024-09-19T12:11:04.250Z [CORTEX]::Debug: Request to kill cortex
2024-09-19T12:11:04.254Z [CORTEX]::Debug: 20240919 12:10:46.430861 UTC 3549698 DEBUG [LoadModel] Multi Modal Mode Enabled - llama_server_context.cc:159
20240919 12:10:46.676668 UTC 3549698 DEBUG [LoadModel] Request 4096 for context length for llava-1.6 - llama_server_context.cc:170
20240919 12:10:47.890831 UTC 3549698 DEBUG [Initialize] Available slots: - llama_server_context.cc:225
20240919 12:10:47.890848 UTC 3549698 DEBUG [Initialize] -> Slot 0 - max context: 4096 - llama_server_context.cc:233
20240919 12:10:47.890947 UTC 3549698 INFO Started background task here! - llama_server_context.cc:252
20240919 12:10:47.891006 UTC 3549698 INFO Warm-up model: llava-7b - llama_engine.cc:819
20240919 12:10:47.891010 UTC 3549742 DEBUG [UpdateSlots] all slots are idle and system prompt is empty, clear the KV cache - llama_server_context.cc:1250
20240919 12:10:47.891017 UTC 3549742 DEBUG [KvCacheClear] Clear the entire KV cache - llama_server_context.cc:258
20240919 12:10:47.901986 UTC 3549742 DEBUG [LaunchSlotWithData] slot 0 is processing [task id: 0] - llama_server_context.cc:623
20240919 12:10:47.902059 UTC 3549742 INFO kv cache rm [p0, end) - id_slot: 0, task_id: 0, p0: 0 - llama_server_context.cc:1544
20240919 12:10:48.166076 UTC 3549742 DEBUG [PrintTimings] PrintTimings: prompt eval time = 172.433ms / 2 tokens (86.2165 ms per token, 11.5987079039 tokens per second) - llama_client_slot.cc:79
20240919 12:10:48.166081 UTC 3549742 DEBUG [PrintTimings] PrintTimings: eval time = 91.653 ms / 4 runs (22.91325 ms per token, 43.6428703916 tokens per second)

2024-09-19T12:11:04.294Z [CORTEX]::Debug: cortex process is terminated
2024-09-19T12:11:04.294Z [CORTEX]::Debug: cortex exited with code: 0
2024-09-19T12:11:04.305Z [CORTEX]::CPU information - 10
2024-09-19T12:11:04.305Z [CORTEX]::Debug: Request to kill cortex
2024-09-19T12:11:04.306Z [CORTEX]::Debug: cortex process is terminated
2024-09-19T12:11:04.307Z [CORTEX]::Debug: Spawning cortex subprocess...
2024-09-19T12:11:04.307Z [CORTEX]::Debug: Spawn cortex at path: /Users/louis/Library/Application Support/Jan/jan/extensions/@janhq/inference-cortex-extension/dist/bin/mac-arm64/cortex-cpp, and args: 1,127.0.0.1,3928
2024-09-19T12:11:04.307Z [CORTEX]::Debug: Cortex engine path: /Users/louis/Library/Application Support/Jan/jan/extensions/@janhq/inference-cortex-extension/dist/bin/mac-arm64
2024-09-19T12:11:04.307Z [CORTEX] PATH: /usr/bin:/bin:/usr/sbin:/sbin::/Users/louis/Library/Application Support/Jan/jan/engines/@janhq/inference-cortex-extension/1.0.17:/Users/louis/Library/Application Support/Jan/jan/extensions/@janhq/inference-cortex-extension/dist/bin/mac-arm64:/Users/louis/Library/Application Support/Jan/jan/extensions/@janhq/inference-cortex-extension/dist/bin/mac-arm64
2024-09-19T12:11:04.410Z [CORTEX]::Debug: Loading model with params {"cpu_threads":10,"vision_model":true,"text_model":false,"ctx_len":2048,"prompt_template":"{system_message}\n### Instruction: {prompt}\n### Response:","llama_model_path":"/Users/louis/Library/Application Support/Jan/jan/models/moondream2-f16.gguf/moondream2-f16.gguf","mmproj":"/Users/louis/Library/Application Support/Jan/jan/models/moondream2-f16.gguf/moondream2-mmproj-f16.gguf","system_prompt":"","user_prompt":"\n### Instruction: ","ai_prompt":"\n### Response:","model":"moondream2-f16.gguf","ngl":100}
2024-09-19T12:11:04.410Z [CORTEX]::Debug: cortex is ready
2024-09-19T12:11:04.419Z [CORTEX]::Debug: 20240919 12:11:04.315010 UTC 3550094 INFO cortex-cpp version: 0.5.0 - main.cc:73
20240919 12:11:04.315589 UTC 3550094 INFO Server started, listening at: 127.0.0.1:3928 - main.cc:78
20240919 12:11:04.315590 UTC 3550094 INFO Please load your model - main.cc:79
20240919 12:11:04.315592 UTC 3550094 INFO Number of thread is:10 - main.cc:86
20240919 12:11:04.411469 UTC 3550098 INFO CPU instruction set: fpu = 0| mmx = 0| sse = 0| sse2 = 0| sse3 = 0| ssse3 = 0| sse4_1 = 0| sse4_2 = 0| pclmulqdq = 0| avx = 0| avx2 = 0| avx512_f = 0| avx512_dq = 0| avx512_ifma = 0| avx512_pf = 0| avx512_er = 0| avx512_cd = 0| avx512_bw = 0| has_avx512_vl = 0| has_avx512_vbmi = 0| has_avx512_vbmi2 = 0| avx512_vnni = 0| avx512_bitalg = 0| avx512_vpopcntdq = 0| avx512_4vnniw = 0| avx512_4fmaps = 0| avx512_vp2intersect = 0| aes = 0| f16c = 0| - server.cc:288
20240919 12:11:04.418604 UTC 3550098 INFO Loaded engine: cortex.llamacpp - server.cc:314
20240919 12:11:04.418615 UTC 3550098 INFO cortex.llamacpp version: 0.1.25 - llama_engine.cc:163
20240919 12:11:04.418638 UTC 3550098 INFO MMPROJ FILE detected, multi-model enabled! - llama_engine.cc:300
20240919 12:11:04.418667 UTC 3550098 INFO Number of parallel is set to 1 - llama_engine.cc:352
20240919 12:11:04.418670 UTC 3550098 DEBUG [LoadModelImpl] cache_type: f16 - llama_engine.cc:365
20240919 12:11:04.418672 UTC 3550098 DEBUG [LoadModelImpl] Enabled Flash Attention - llama_engine.cc:374
20240919 12:11:04.418679 UTC 3550098 DEBUG [LoadModelImpl] stop: null

2024-09-19T12:11:04.420Z [CORTEX]::Error: ggml_metal_init: allocating

2024-09-19T12:11:04.431Z [CORTEX]::Error: ggml_metal_init: found device: Apple M2 Pro

2024-09-19T12:11:04.458Z [CORTEX]::Error: ggml_metal_init: picking default device: Apple M2 Pro

2024-09-19T12:11:04.459Z [CORTEX]::Error: ggml_metal_init: using embedded metal library

2024-09-19T12:11:04.462Z [CORTEX]::Error: ggml_metal_init: GPU name: Apple M2 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 22906.50 MB

2024-09-19T12:11:04.841Z [CORTEX]::Error: llama_model_loader: loaded meta data with 19 key-value pairs and 245 tensors from /Users/louis/Library/Application Support/Jan/jan/models/moondream2-f16.gguf/moondream2-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = phi2
llama_model_loader: - kv 1: general.name str = moondream2
llama_model_loader: - kv 2: phi2.context_length u32 = 2048
llama_model_loader: - kv 3: phi2.embedding_length u32 = 2048
llama_model_loader: - kv 4: phi2.feed_forward_length u32 = 8192
llama_model_loader: - kv 5: phi2.block_count u32 = 24
llama_model_loader: - kv 6: phi2.attention.head_count u32 = 32
llama_model_loader: - kv 7: phi2.attention.head_count_kv u32 = 32
llama_model_loader: - kv 8: phi2.attention.layer_norm_epsilon f32 = 0.000010
llama_model_loader: - kv 9: phi2.rope.dimension_count u32 = 32
llama_model_loader: - kv 10: general.file_type u32 = 1
llama_model_loader: - kv 11: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 12: tokenizer.ggml.model str = gpt2

2024-09-19T12:11:04.845Z [CORTEX]::Error: llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,51200] = ["!", "\"", "#", "$", "%", "&", "'", ...

2024-09-19T12:11:04.846Z [CORTEX]::Error: llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,51200] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...

2024-09-19T12:11:04.850Z [CORTEX]::Error: llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,50000] = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 50256
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 50256
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 50256
llama_model_loader: - type f32: 147 tensors
llama_model_loader: - type f16: 98 tensors

2024-09-19T12:11:04.874Z [CORTEX]::Error: llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:
llm_load_vocab: ************************************
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!
llm_load_vocab: CONSIDER REGENERATING THE MODEL
llm_load_vocab: ************************************
llm_load_vocab:

2024-09-19T12:11:04.881Z [CORTEX]::Error: llm_load_vocab: special tokens cache size = 944

2024-09-19T12:11:04.889Z [CORTEX]::Error: llm_load_vocab: token to piece cache size = 0.3151 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = phi2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 51200
llm_load_print_meta: n_merges = 50000
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_layer = 24

2024-09-19T12:11:04.889Z [CORTEX]::Error: llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_rot = 32
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 2048
llm_load_print_meta: n_embd_v_gqa = 2048
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 0.0e+00
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 1B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 1.42 B
llm_load_print_meta: model size = 2.64 GiB (16.01 BPW)
llm_load_print_meta: general.name = moondream2
llm_load_print_meta: BOS token = 50256 '<|endoftext|>'
llm_load_print_meta: EOS token = 50256 '<|endoftext|>'
llm_load_print_meta: UNK token = 50256 '<|endoftext|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 50256 '<|endoftext|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.22 MiB

2024-09-19T12:11:04.890Z [CORTEX]::Error: ggml_backend_metal_log_allocated_size: allocated buffer, size = 2506.30 MiB, ( 3425.89 / 21845.34)
llm_load_tensors: offloading 24 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 25/25 layers to GPU
llm_load_tensors: CPU buffer size = 200.00 MiB
llm_load_tensors: Metal buffer size = 2506.29 MiB

2024-09-19T12:11:04.890Z [CORTEX]::Error: .....................................
2024-09-19T12:11:04.890Z [CORTEX]::Error: .....................
2024-09-19T12:11:04.890Z [CORTEX]::Error: ......................

2024-09-19T12:11:04.892Z [CORTEX]::Error: llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 2048
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating

2024-09-19T12:11:04.893Z [CORTEX]::Error: ggml_metal_init: found device: Apple M2 Pro

2024-09-19T12:11:04.893Z [CORTEX]::Error: ggml_metal_init: picking default device: Apple M2 Pro

2024-09-19T12:11:04.893Z [CORTEX]::Error: ggml_metal_init: using embedded metal library

2024-09-19T12:11:04.894Z [CORTEX]::Error: ggml_metal_init: GPU name: Apple M2 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 22906.50 MB

2024-09-19T12:11:04.928Z [CORTEX]::Error: llama_kv_cache_init: Metal KV buffer size = 384.00 MiB
llama_new_context_with_model: KV self size = 384.00 MiB, K (f16): 192.00 MiB, V (f16): 192.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.20 MiB

2024-09-19T12:11:04.929Z [CORTEX]::Error: llama_new_context_with_model: Metal compute buffer size = 416.00 MiB
llama_new_context_with_model: CPU compute buffer size = 32.02 MiB
llama_new_context_with_model: graph nodes = 826
llama_new_context_with_model: graph splits = 2

2024-09-19T12:11:06.399Z [CORTEX]::Debug: Load model success with response {}
2024-09-19T12:11:06.399Z [CORTEX]::Debug: Validating model moondream2-f16.gguf
2024-09-19T12:11:06.400Z [CORTEX]::Debug: Validate model state with response 200
2024-09-19T12:11:06.401Z [CORTEX]::Debug: Validate model state success with response {"model_data":"{\"frequency_penalty\":0.0,\"grammar\":\"\",\"ignore_eos\":false,\"logit_bias\":[],\"min_p\":0.05000000074505806,\"mirostat\":0,\"mirostat_eta\":0.10000000149011612,\"mirostat_tau\":5.0,\"model\":\"/Users/louis/Library/Application Support/Jan/jan/models/moondream2-f16.gguf/moondream2-f16.gguf\",\"n_ctx\":2048,\"n_keep\":0,\"n_predict\":2,\"n_probs\":0,\"penalize_nl\":false,\"penalty_prompt_tokens\":[],\"presence_penalty\":0.0,\"repeat_last_n\":64,\"repeat_penalty\":1.0,\"seed\":4294967295,\"stop\":[],\"stream\":false,\"temperature\":0.800000011920929,\"tfs_z\":1.0,\"top_k\":40,\"top_p\":0.949999988079071,\"typical_p\":1.0,\"use_penalty_prompt_tokens\":false}","model_loaded":true}
2024-09-19T12:11:06.408Z [CORTEX]::Error: libc++abi: terminating due to uncaught exception of type std::length_error: vector

2024-09-19T12:11:06.408Z [CORTEX]::Debug: 20240919 12:11:04.419177 UTC 3550098 DEBUG [LoadModel] Multi Modal Mode Enabled - llama_server_context.cc:159
20240919 12:11:06.301128 UTC 3550098 DEBUG [Initialize] Available slots: - llama_server_context.cc:225
20240919 12:11:06.301136 UTC 3550098 DEBUG [Initialize] -> Slot 0 - max context: 2048 - llama_server_context.cc:233
20240919 12:11:06.301210 UTC 3550098 INFO Started background task here! - llama_server_context.cc:252
20240919 12:11:06.301254 UTC 3550098 INFO Warm-up model: moondream2-f16.gguf - llama_engine.cc:819
20240919 12:11:06.301257 UTC 3550146 DEBUG [UpdateSlots] all slots are idle and system prompt is empty, clear the KV cache - llama_server_context.cc:1250
20240919 12:11:06.301262 UTC 3550146 DEBUG [KvCacheClear] Clear the entire KV cache - llama_server_context.cc:258
20240919 12:11:06.304526 UTC 3550146 DEBUG [LaunchSlotWithData] slot 0 is processing [task id: 0] - llama_server_context.cc:623
20240919 12:11:06.304589 UTC 3550146 INFO kv cache rm [p0, end) - id_slot: 0, task_id: 0, p0: 0 - llama_server_context.cc:1544
20240919 12:11:06.397659 UTC 3550146 DEBUG [PrintTimings] PrintTimings: prompt eval time = 38.775ms / 1 tokens (38.775 ms per token, 25.7898130239 tokens per second) - llama_client_slot.cc:79
20240919 12:11:06.397667 UTC 3550146 DEBUG [PrintTimings] PrintTimings: eval time = 54.356 ms / 4 runs (13.589 ms per token, 73.5889322246 tokens per second)

2024-09-19T12:11:06.409Z [CORTEX]::Debug: cortex exited with code: null
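The fatal line is the uncaught std::length_error: vector thrown while the warm-up request is being processed. As a purely hypothetical illustration (this is not the actual cortex.llamacpp code path), libc++ raises this exact error, with the message "vector", when a std::vector resize request exceeds max_size(), which is what happens when a signed token or embedding count goes negative (for example from a model/projector dimension mismatch) and is then converted to an unsigned size:

// Hypothetical sketch of the failure mode; NOT the real cortex code.
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <vector>

int main() {
    // Assume a per-image embedding count computed from metadata came out
    // negative because of a mismatch (hypothetical value).
    int n_img_tokens = -1;

    // Converting the negative count to an unsigned size wraps around
    // to an enormous value (SIZE_MAX here).
    std::size_t n = static_cast<std::size_t>(n_img_tokens);

    try {
        std::vector<float> embd;
        embd.resize(n); // exceeds max_size() -> throws std::length_error
    } catch (const std::length_error &e) {
        std::printf("caught: %s\n", e.what()); // libc++ reports just "vector"
    }
    return 0;
}

In the cortex subprocess nothing catches the exception on the inference thread, so libc++ calls std::terminate and the process dies, which matches the "cortex exited with code: null" line above.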

What is your OS?

macOS (Apple Silicon, M2 Pro, per the Metal logs above)

louis-jan commented 1 week ago

model.json

{
  "object": "model",
  "version": "1.0",
  "format": "gguf",
  "sources": [
    {
      "url": "https://huggingface.co/moondream/moondream2-gguf/resolve/main/moondream2-text-model-f16.gguf",
      "filename": "moondream2-f16.gguf"
    },
    {
      "url": "https://huggingface.co/moondream/moondream2-gguf/resolve/main/moondream2-mmproj-f16.gguf",
      "filename": "moondream2-mmproj-f16.gguf"
    }
  ],
  "id": "moondream2-f16.gguf",
  "name": "Moondream 2",
  "created": 1726572950042,
  "description": "User self import model",
  "settings": {
    "vision_model": true,
    "text_model": false,
    "ctx_len": 2048,
    "prompt_template": "{system_message}\n### Instruction: {prompt}\n### Response:",
    "llama_model_path": "moondream2-f16.gguf",
    "mmproj": "moondream2-mmproj-f16.gguf"
  },
  "parameters": {
    "temperature": 0.7,
    "top_p": 0.95,
    "stream": true,
    "max_tokens": 2048,
    "stop": [
      "<|END_OF_TURN_TOKEN|>",
      "<end_of_turn>",
      "[/INST]",
      "<|end_of_text|>",
      "<|eot_id|>",
      "<|im_end|>",
      "<|end|>"
    ],
    "frequency_penalty": 0,
    "presence_penalty": 0
  },
  "metadata": {
    "author": "User",
    "tags": ["gguf", "region:us"],
    "size": "909777984"
  },
  "engine": "nitro"
}
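Note: the relative llama_model_path and mmproj filenames in settings are resolved by Jan to absolute paths inside the model folder at load time (see the "Loading model with params" entry in the log above), and it is the mmproj entry that makes cortex log "MMPROJ FILE detected, multi-model enabled!" and take the multimodal path.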