bug: Unable to chat with image using Moondream2 Vision model

Jan version

0.5.4

Describe the Bug

I can successfully load the model for chats, but as soon as I send an image, it crashes. Context:

I created a model.json to download the text and CLIP models.
I was successfully able to load the model and chat with it.
I've attached an image and sent it, but it crashes.

https://huggingface.co/moondream/moondream2-gguf

Same glitch on Linux here. https://discord.com/channels/1107178041848909847/1285784195125219338/1286348026973261835

Steps to Reproduce

Create a model.json file to interact with the model.
Send an image and request a description.

Screenshots / Logs

2024-09-19T12:11:04.250Z [CORTEX]::Debug: Request to kill cortex 2024-09-19T12:11:04.254Z [CORTEX]::Debug: 20240919 12:10:46.430861 UTC 3549698 DEBUG [LoadModel] Multi Modal Mode Enabled - llama_server_context.cc:159 20240919 12:10:46.676668 UTC 3549698 DEBUG [LoadModel] Request 4096 for context length for llava-1.6 - llama_server_context.cc:170 20240919 12:10:47.890831 UTC 3549698 DEBUG [Initialize] Available slots: - llama_server_context.cc:225 20240919 12:10:47.890848 UTC 3549698 DEBUG [Initialize] -> Slot 0 - max context: 4096 - llama_server_context.cc:233 20240919 12:10:47.890947 UTC 3549698 INFO Started background task here! - llama_server_context.cc:252 20240919 12:10:47.891006 UTC 3549698 INFO Warm-up model: llava-7b - llama_engine.cc:819 20240919 12:10:47.891010 UTC 3549742 DEBUG [UpdateSlots] all slots are idle and system prompt is empty, clear the KV cache - llama_server_context.cc:1250 20240919 12:10:47.891017 UTC 3549742 DEBUG [KvCacheClear] Clear the entire KV cache - llama_server_context.cc:258 20240919 12:10:47.901986 UTC 3549742 DEBUG [LaunchSlotWithData] slot 0 is processing [task id: 0] - llama_server_context.cc:623 20240919 12:10:47.902059 UTC 3549742 INFO kv cache rm [p0, end) - id_slot: 0, task_id: 0, p0: 0 - llama_server_context.cc:1544 20240919 12:10:48.166076 UTC 3549742 DEBUG [PrintTimings] PrintTimings: prompt eval time = 172.433ms / 2 tokens (86.2165 ms per token, 11.5987079039 tokens per second) - llama_client_slot.cc:79 20240919 12:10:48.166081 UTC 3549742 DEBUG [PrintTimings] PrintTimings: eval time = 91.653 ms / 4 runs (22.91325 ms per token, 43.6428703916 tokens per second)

llama_client_slot.cc:86 20240919 12:10:48.166082 UTC 3549742 DEBUG [PrintTimings] PrintTimings: total time = 264.086 ms - llama_client_slot.cc:92 20240919 12:10:48.166116 UTC 3549742 INFO slot released: id_slot: 0, id_task: 0, n_ctx: 4096, n_past: 6, n_system_tokens: 0, n_cache_tokens: 0, truncated: 0 - llama_server_context.cc:1304 20240919 12:10:48.166129 UTC 3549698 INFO {"content":",\nI recently bought","generation_settings":{"frequency_penalty":0.0,"grammar":"","ignore_eos":false,"logit_bias":[],"min_p":0.05000000074505806,"mirostat":0,"mirostat_eta":0.10000000149011612,"mirostat_tau":5.0,"model":"/Users/louis/Library/Application Support/Jan/jan/models/llava-7b/llava-v1.6-mistral-7b.Q4_K_M.gguf","n_ctx":4096,"n_keep":0,"n_predict":2,"n_probs":0,"penalize_nl":false,"penalty_prompt_tokens":[],"presence_penalty":0.0,"repeat_last_n":64,"repeat_penalty":1.0,"seed":4294967295,"stop":[],"stream":false,"temperature":0.800000011920929,"tfs_z":1.0,"top_k":40,"top_p":0.949999988079071,"typical_p":1.0,"use_penalty_prompt_tokens":false},"model":"/Users/louis/Library/Application Support/Jan/jan/models/llava-7b/llava-v1.6-mistral-7b.Q4_K_M.gguf","prompt":"Hello","slot_id":0,"stop":true,"stopped_eos":false,"stopped_limit":true,"stopped_word":false,"stopping_word":"","timings":{"predicted_ms":91.653,"predicted_n":4,"predicted_per_second":43.64287039158565,"predicted_per_token_ms":22.91325,"prompt_ms":172.433,"prompt_n":2,"prompt_per_second":11.598707903939502,"prompt_per_token_ms":86.2165},"tokens_cached":6,"tokens_evaluated":2,"tokens_predicted":4,"truncated":false} - llama_engine.cc:827 20240919 12:10:48.166183 UTC 3549698 INFO Model loaded successfully: llava-7b - llama_engine.cc:216 20240919 12:10:48.171967 UTC 3549699 INFO Model status responded - llama_engine.cc:259 20240919 12:10:48.175867 UTC 3549700 INFO Request 1, model llava-7b: Generating response for inference request - llama_engine.cc:469 20240919 12:10:48.175871 UTC 3549700 INFO Request 1: Stop words:null
llama_engine.cc:486 20240919 12:10:48.175892 UTC 3549700 INFO Request 1: Base64 image detected - llama_engine.cc:549 20240919 12:10:48.179648 UTC 3549700 INFO Request 1: Streamed, waiting for respone - llama_engine.cc:608 20240919 12:10:48.179692 UTC 3549700 DEBUG [makeHeaderString] send stream with transfer-encoding chunked - HttpResponseImpl.cc:535 20240919 12:10:48.182143 UTC 3549742 DEBUG [LaunchSlotWithData] slot 0 - loaded image - llama_server_context.cc:562 20240919 12:10:48.182156 UTC 3549742 DEBUG [LaunchSlotWithData] slot 0 is processing [task id: 1] - llama_server_context.cc:623 20240919 12:10:48.182167 UTC 3549742 INFO kv cache rm [p0, end) - id_slot: 0, task_id: 1, p0: 0 - llama_server_context.cc:1544 20240919 12:10:52.482469 UTC 3549701 INFO Request 2, model llava-7b: Generating response for inference request - llama_engine.cc:469 20240919 12:10:52.482483 UTC 3549701 INFO Request 2: Stop words:null
llama_engine.cc:486 20240919 12:11:04.251929 UTC 3549702 INFO Program is exitting, goodbye! - processManager.cc:8

2024-09-19T12:11:04.294Z [CORTEX]::Debug: cortex process is terminated 2024-09-19T12:11:04.294Z [CORTEX]::Debug: cortex exited with code: 0 2024-09-19T12:11:04.305Z [CORTEX]::CPU information - 10 2024-09-19T12:11:04.305Z [CORTEX]::Debug: Request to kill cortex 2024-09-19T12:11:04.306Z [CORTEX]::Debug: cortex process is terminated 2024-09-19T12:11:04.307Z [CORTEX]::Debug: Spawning cortex subprocess... 2024-09-19T12:11:04.307Z [CORTEX]::Debug: Spawn cortex at path: /Users/louis/Library/Application Support/Jan/jan/extensions/@janhq/inference-cortex-extension/dist/bin/mac-arm64/cortex-cpp, and args: 1,127.0.0.1,3928 2024-09-19T12:11:04.307Z [CORTEX]::Debug: Cortex engine path: /Users/louis/Library/Application Support/Jan/jan/extensions/@janhq/inference-cortex-extension/dist/bin/mac-arm64 2024-09-19T12:11:04.307Z [CORTEX] PATH: /usr/bin:/bin:/usr/sbin:/sbin::/Users/louis/Library/Application Support/Jan/jan/engines/@janhq/inference-cortex-extension/1.0.17:/Users/louis/Library/Application Support/Jan/jan/extensions/@janhq/inference-cortex-extension/dist/bin/mac-arm64:/Users/louis/Library/Application Support/Jan/jan/extensions/@janhq/inference-cortex-extension/dist/bin/mac-arm64 2024-09-19T12:11:04.410Z [CORTEX]::Debug: Loading model with params {"cpu_threads":10,"vision_model":true,"text_model":false,"ctx_len":2048,"prompt_template":"{system_message}\n### Instruction: {prompt}\n### Response:","llama_model_path":"/Users/louis/Library/Application Support/Jan/jan/models/moondream2-f16.gguf/moondream2-f16.gguf","mmproj":"/Users/louis/Library/Application Support/Jan/jan/models/moondream2-f16.gguf/moondream2-mmproj-f16.gguf","system_prompt":"","user_prompt":"\n### Instruction: ","ai_prompt":"\n### Response:","model":"moondream2-f16.gguf","ngl":100} 2024-09-19T12:11:04.410Z [CORTEX]::Debug: cortex is ready 2024-09-19T12:11:04.419Z [CORTEX]::Debug: 20240919 12:11:04.315010 UTC 3550094 INFO cortex-cpp version: 0.5.0 - main.cc:73 20240919 12:11:04.315589 UTC 3550094 INFO Server started, listening at: 127.0.0.1:3928 - main.cc:78 20240919 12:11:04.315590 UTC 3550094 INFO Please load your model - main.cc:79 20240919 12:11:04.315592 UTC 3550094 INFO Number of thread is:10 - main.cc:86 20240919 12:11:04.411469 UTC 3550098 INFO CPU instruction set: fpu = 0| mmx = 0| sse = 0| sse2 = 0| sse3 = 0| ssse3 = 0| sse4_1 = 0| sse4_2 = 0| pclmulqdq = 0| avx = 0| avx2 = 0| avx512_f = 0| avx512_dq = 0| avx512_ifma = 0| avx512_pf = 0| avx512_er = 0| avx512_cd = 0| avx512_bw = 0| has_avx512_vl = 0| has_avx512_vbmi = 0| has_avx512_vbmi2 = 0| avx512_vnni = 0| avx512_bitalg = 0| avx512_vpopcntdq = 0| avx512_4vnniw = 0| avx512_4fmaps = 0| avx512_vp2intersect = 0| aes = 0| f16c = 0| - server.cc:288 20240919 12:11:04.418604 UTC 3550098 INFO Loaded engine: cortex.llamacpp - server.cc:314 20240919 12:11:04.418615 UTC 3550098 INFO cortex.llamacpp version: 0.1.25 - llama_engine.cc:163 20240919 12:11:04.418638 UTC 3550098 INFO MMPROJ FILE detected, multi-model enabled! - llama_engine.cc:300 20240919 12:11:04.418667 UTC 3550098 INFO Number of parallel is set to 1 - llama_engine.cc:352 20240919 12:11:04.418670 UTC 3550098 DEBUG [LoadModelImpl] cache_type: f16 - llama_engine.cc:365 20240919 12:11:04.418672 UTC 3550098 DEBUG [LoadModelImpl] Enabled Flash Attention - llama_engine.cc:374 20240919 12:11:04.418679 UTC 3550098 DEBUG [LoadModelImpl] stop: null

llama_engine.cc:395 {"timestamp":1726747864,"level":"INFO","function":"LoadModelImpl","line":418,"message":"system info","n_threads":10,"total_threads":10,"system_info":"AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "}

2024-09-19T12:11:04.420Z [CORTEX]::Error: ggml_metal_init: allocating

2024-09-19T12:11:04.431Z [CORTEX]::Error: ggml_metal_init: found device: Apple M2 Pro

2024-09-19T12:11:04.458Z [CORTEX]::Error: ggml_metal_init: picking default device: Apple M2 Pro

2024-09-19T12:11:04.459Z [CORTEX]::Error: ggml_metal_init: using embedded metal library

2024-09-19T12:11:04.462Z [CORTEX]::Error: ggml_metal_init: GPU name: Apple M2 Pro ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008) ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003) ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001) ggml_metal_init: simdgroup reduction support = true ggml_metal_init: simdgroup matrix mul. support = true ggml_metal_init: hasUnifiedMemory = true ggml_metal_init: recommendedMaxWorkingSetSize = 22906.50 MB

2024-09-19T12:11:04.841Z [CORTEX]::Error: llama_model_loader: loaded meta data with 19 key-value pairs and 245 tensors from /Users/louis/Library/Application Support/Jan/jan/models/moondream2-f16.gguf/moondream2-f16.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = phi2 llama_model_loader: - kv 1: general.name str = moondream2 llama_model_loader: - kv 2: phi2.context_length u32 = 2048 llama_model_loader: - kv 3: phi2.embedding_length u32 = 2048 llama_model_loader: - kv 4: phi2.feed_forward_length u32 = 8192 llama_model_loader: - kv 5: phi2.block_count u32 = 24 llama_model_loader: - kv 6: phi2.attention.head_count u32 = 32 llama_model_loader: - kv 7: phi2.attention.head_count_kv u32 = 32 llama_model_loader: - kv 8: phi2.attention.layer_norm_epsilon f32 = 0.000010 llama_model_loader: - kv 9: phi2.rope.dimension_count u32 = 32 llama_model_loader: - kv 10: general.file_type u32 = 1 llama_model_loader: - kv 11: tokenizer.ggml.add_bos_token bool = false llama_model_loader: - kv 12: tokenizer.ggml.model str = gpt2

2024-09-19T12:11:04.845Z [CORTEX]::Error: llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,51200] = ["!", "\"", "#", "$", "%", "&", "'", ...

2024-09-19T12:11:04.846Z [CORTEX]::Error: llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,51200] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...

2024-09-19T12:11:04.850Z [CORTEX]::Error: llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,50000] = ["Ġ t", "Ġ a", "h e", "i n", "r e",... llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 50256 llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 50256 llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 50256 llama_model_loader: - type f32: 147 tensors llama_model_loader: - type f16: 98 tensors

2024-09-19T12:11:04.874Z [CORTEX]::Error: llm_load_vocab: missing pre-tokenizer type, using: 'default' llm_load_vocab:
llm_load_vocab: ****
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!
llm_load_vocab: CONSIDER REGENERATING THE MODEL
llm_load_vocab: ****
llm_load_vocab:

2024-09-19T12:11:04.881Z [CORTEX]::Error: llm_load_vocab: special tokens cache size = 944

2024-09-19T12:11:04.889Z [CORTEX]::Error: llm_load_vocab: token to piece cache size = 0.3151 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = phi2 llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 51200 llm_load_print_meta: n_merges = 50000 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 2048 llm_load_print_meta: n_embd = 2048 llm_load_print_meta: n_layer = 24

2024-09-19T12:11:04.889Z [CORTEX]::Error: llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 32 llm_load_print_meta: n_rot = 32 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 64 llm_load_print_meta: n_embd_head_v = 64 llm_load_print_meta: n_gqa = 1 llm_load_print_meta: n_embd_k_gqa = 2048 llm_load_print_meta: n_embd_v_gqa = 2048 llm_load_print_meta: f_norm_eps = 1.0e-05 llm_load_print_meta: f_norm_rms_eps = 0.0e+00 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 8192 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 10000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 2048 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 1B llm_load_print_meta: model ftype = F16 llm_load_print_meta: model params = 1.42 B llm_load_print_meta: model size = 2.64 GiB (16.01 BPW) llm_load_print_meta: general.name = moondream2 llm_load_print_meta: BOS token = 50256 '<|endoftext|>' llm_load_print_meta: EOS token = 50256 '<|endoftext|>' llm_load_print_meta: UNK token = 50256 '<|endoftext|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOT token = 50256 '<|endoftext|>' llm_load_print_meta: max token length = 256 llm_load_tensors: ggml ctx size = 0.22 MiB

2024-09-19T12:11:04.890Z [CORTEX]::Error: ggml_backend_metal_log_allocated_size: allocated buffer, size = 2506.30 MiB, ( 3425.89 / 21845.34) llm_load_tensors: offloading 24 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 25/25 layers to GPU llm_load_tensors: CPU buffer size = 200.00 MiB llm_load_tensors: Metal buffer size = 2506.29 MiB

2024-09-19T12:11:04.890Z [CORTEX]::Error: ..................................... 2024-09-19T12:11:04.890Z [CORTEX]::Error: ..................... 2024-09-19T12:11:04.890Z [CORTEX]::Error: ......................

2024-09-19T12:11:04.892Z [CORTEX]::Error: llama_new_context_with_model: n_ctx = 2048 llama_new_context_with_model: n_batch = 2048 llama_new_context_with_model: n_ubatch = 2048 llama_new_context_with_model: flash_attn = 1 llama_new_context_with_model: freq_base = 10000.0 llama_new_context_with_model: freq_scale = 1 ggml_metal_init: allocating

2024-09-19T12:11:04.893Z [CORTEX]::Error: ggml_metal_init: found device: Apple M2 Pro

2024-09-19T12:11:04.893Z [CORTEX]::Error: ggml_metal_init: picking default device: Apple M2 Pro

2024-09-19T12:11:04.893Z [CORTEX]::Error: ggml_metal_init: using embedded metal library

2024-09-19T12:11:04.894Z [CORTEX]::Error: ggml_metal_init: GPU name: Apple M2 Pro ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008) ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003) ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001) ggml_metal_init: simdgroup reduction support = true ggml_metal_init: simdgroup matrix mul. support = true ggml_metal_init: hasUnifiedMemory = true ggml_metal_init: recommendedMaxWorkingSetSize = 22906.50 MB

2024-09-19T12:11:04.928Z [CORTEX]::Error: llama_kv_cache_init: Metal KV buffer size = 384.00 MiB llama_new_context_with_model: KV self size = 384.00 MiB, K (f16): 192.00 MiB, V (f16): 192.00 MiB llama_new_context_with_model: CPU output buffer size = 0.20 MiB

2024-09-19T12:11:04.929Z [CORTEX]::Error: llama_new_context_with_model: Metal compute buffer size = 416.00 MiB llama_new_context_with_model: CPU compute buffer size = 32.02 MiB llama_new_context_with_model: graph nodes = 826 llama_new_context_with_model: graph splits = 2

2024-09-19T12:11:06.399Z [CORTEX]::Debug: Load model success with response {} 2024-09-19T12:11:06.399Z [CORTEX]::Debug: Validating model moondream2-f16.gguf 2024-09-19T12:11:06.400Z [CORTEX]::Debug: Validate model state with response 200 2024-09-19T12:11:06.401Z [CORTEX]::Debug: Validate model state success with response {"model_data":"{\"frequency_penalty\":0.0,\"grammar\":\"\",\"ignore_eos\":false,\"logit_bias\":[],\"min_p\":0.05000000074505806,\"mirostat\":0,\"mirostat_eta\":0.10000000149011612,\"mirostat_tau\":5.0,\"model\":\"/Users/louis/Library/Application Support/Jan/jan/models/moondream2-f16.gguf/moondream2-f16.gguf\",\"n_ctx\":2048,\"n_keep\":0,\"n_predict\":2,\"n_probs\":0,\"penalize_nl\":false,\"penalty_prompt_tokens\":[],\"presence_penalty\":0.0,\"repeat_last_n\":64,\"repeat_penalty\":1.0,\"seed\":4294967295,\"stop\":[],\"stream\":false,\"temperature\":0.800000011920929,\"tfs_z\":1.0,\"top_k\":40,\"top_p\":0.949999988079071,\"typical_p\":1.0,\"use_penalty_prompt_tokens\":false}","model_loaded":true} 2024-09-19T12:11:06.408Z [CORTEX]::Error: libc++abi: terminating due to uncaught exception of type std::length_error: vector

2024-09-19T12:11:06.408Z [CORTEX]::Debug: 20240919 12:11:04.419177 UTC 3550098 DEBUG [LoadModel] Multi Modal Mode Enabled - llama_server_context.cc:159 20240919 12:11:06.301128 UTC 3550098 DEBUG [Initialize] Available slots: - llama_server_context.cc:225 20240919 12:11:06.301136 UTC 3550098 DEBUG [Initialize] -> Slot 0 - max context: 2048 - llama_server_context.cc:233 20240919 12:11:06.301210 UTC 3550098 INFO Started background task here! - llama_server_context.cc:252 20240919 12:11:06.301254 UTC 3550098 INFO Warm-up model: moondream2-f16.gguf - llama_engine.cc:819 20240919 12:11:06.301257 UTC 3550146 DEBUG [UpdateSlots] all slots are idle and system prompt is empty, clear the KV cache - llama_server_context.cc:1250 20240919 12:11:06.301262 UTC 3550146 DEBUG [KvCacheClear] Clear the entire KV cache - llama_server_context.cc:258 20240919 12:11:06.304526 UTC 3550146 DEBUG [LaunchSlotWithData] slot 0 is processing [task id: 0] - llama_server_context.cc:623 20240919 12:11:06.304589 UTC 3550146 INFO kv cache rm [p0, end) - id_slot: 0, task_id: 0, p0: 0 - llama_server_context.cc:1544 20240919 12:11:06.397659 UTC 3550146 DEBUG [PrintTimings] PrintTimings: prompt eval time = 38.775ms / 1 tokens (38.775 ms per token, 25.7898130239 tokens per second) - llama_client_slot.cc:79 20240919 12:11:06.397667 UTC 3550146 DEBUG [PrintTimings] PrintTimings: eval time = 54.356 ms / 4 runs (13.589 ms per token, 73.5889322246 tokens per second)

llama_client_slot.cc:86 20240919 12:11:06.397668 UTC 3550146 DEBUG [PrintTimings] PrintTimings: total time = 93.131 ms - llama_client_slot.cc:92 20240919 12:11:06.397727 UTC 3550146 INFO slot released: id_slot: 0, id_task: 0, n_ctx: 2048, n_past: 5, n_system_tokens: 0, n_cache_tokens: 0, truncated: 0 - llama_server_context.cc:1304 20240919 12:11:06.397739 UTC 3550098 INFO {"content":", Alien friend! Today","generation_settings":{"frequency_penalty":0.0,"grammar":"","ignore_eos":false,"logit_bias":[],"min_p":0.05000000074505806,"mirostat":0,"mirostat_eta":0.10000000149011612,"mirostat_tau":5.0,"model":"/Users/louis/Library/Application Support/Jan/jan/models/moondream2-f16.gguf/moondream2-f16.gguf","n_ctx":2048,"n_keep":0,"n_predict":2,"n_probs":0,"penalize_nl":false,"penalty_prompt_tokens":[],"presence_penalty":0.0,"repeat_last_n":64,"repeat_penalty":1.0,"seed":4294967295,"stop":[],"stream":false,"temperature":0.800000011920929,"tfs_z":1.0,"top_k":40,"top_p":0.949999988079071,"typical_p":1.0,"use_penalty_prompt_tokens":false},"model":"/Users/louis/Library/Application Support/Jan/jan/models/moondream2-f16.gguf/moondream2-f16.gguf","prompt":"Hello","slot_id":0,"stop":true,"stopped_eos":false,"stopped_limit":true,"stopped_word":false,"stopping_word":"","timings":{"predicted_ms":54.356,"predicted_n":4,"predicted_per_second":73.58893222459342,"predicted_per_token_ms":13.589,"prompt_ms":38.775,"prompt_n":1,"prompt_per_second":25.78981302385558,"prompt_per_token_ms":38.775},"tokens_cached":5,"tokens_evaluated":1,"tokens_predicted":4,"truncated":false} - llama_engine.cc:827 20240919 12:11:06.397784 UTC 3550098 INFO Model loaded successfully: moondream2-f16.gguf - llama_engine.cc:216 20240919 12:11:06.400552 UTC 3550099 INFO Model status responded - llama_engine.cc:259 20240919 12:11:06.402786 UTC 3550100 INFO Request 1, model moondream2-f16.gguf: Generating response for inference request - llama_engine.cc:469 20240919 12:11:06.402791 UTC 3550100 INFO Request 1: Stop words:[ "<|END_OF_TURN_TOKEN|>", "", "[/INST]", "<|end_of_text|>", "<|eot_id|>", "<|im_end|>", "<|end|>" ]
llama_engine.cc:486 20240919 12:11:06.402820 UTC 3550100 INFO Request 1: Base64 image detected - llama_engine.cc:549 20240919 12:11:06.406590 UTC 3550100 INFO Request 1: Streamed, waiting for respone - llama_engine.cc:608 20240919 12:11:06.406633 UTC 3550100 DEBUG [makeHeaderString] send stream with transfer-encoding chunked - HttpResponseImpl.cc:535 20240919 12:11:06.408420 UTC 3550146 DEBUG [LaunchSlotWithData] slot 0 - loaded image - llama_server_context.cc:562 20240919 12:11:06.408434 UTC 3550146 DEBUG [LaunchSlotWithData] slot 0 is processing [task id: 1] - llama_server_context.cc:623 20240919 12:11:06.408442 UTC 3550146 DEBUG [UpdateSlots] slot 0 : we have to evaluate at least 1 token to generate logits - llama_server_context.cc:1496

2024-09-19T12:11:06.409Z [CORTEX]::Debug: cortex exited with code: null

What is your OS?

[X] MacOS
[ ] Windows
[ ] Linux

janhq / jan