janhq / cortex.cpp

Local AI API Platform
https://cortex.so
Apache License 2.0

bug: Cortex process handling issue causes API server to become unresponsive #1667

Open CrazybutSolid opened 3 weeks ago

CrazybutSolid commented 3 weeks ago

Jan version

0.5.7

Describe the Bug

I am using Jan v0.5.7 on a Mac with an M1 processor, running Llama 3.2 3B Instruct Q8 via the API. Occasionally, the server stops responding to POST requests on /v1/chat/completions and returns no data. Restarting Jan does not resolve the issue; only a full Mac restart restores normal behavior.

Steps to Reproduce

1. Set up Jan v0.5.7 on a Mac with an M1 chip.
2. Run Llama 3.2 3B Instruct Q8 through the API.
3. Send multiple POST requests to /v1/chat/completions (a minimal request loop is sketched below).
4. Observe that after an unpredictable period, Jan stops responding.
5. Attempt to restart Jan; note that responses do not resume until the Mac is rebooted.
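For reference, a minimal request loop matching these steps (a sketch: port 1337 is Jan's default API port and llama3.2-3b-instruct is the model ID from the logs below; adjust both for your setup):

    import requests

    # Repro sketch: send chat completions sequentially until the server stops
    # answering. When the bug triggers, the POST blocks indefinitely with no
    # response data, matching the behavior described above.
    URL = "http://localhost:1337/v1/chat/completions"

    for i in range(1000):
        payload = {
            "model": "llama3.2-3b-instruct",
            "messages": [{"role": "user", "content": f"ping {i}"}],
            "stream": False,
        }
        response = requests.post(URL, json=payload)
        print(i, response.status_code)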

Screenshots / Logs

My code to call the LLM:

    import logging
    import requests

    # Prompts and Config are the reporter's own modules (not shown here).
    class LLMClient:
        @staticmethod
        def call_llm(prompt, temperature=0.7):
            payload = {
                "messages": [
                    {"content": Prompts.SYSTEM_PROMPT, "role": "system"},
                    {"content": prompt, "role": "user"}
                ],
                "model": Config.LLM_MODEL,
                "stream": False,
                # Note: 131072 matches the model's training context, but the
                # model is loaded with n_ctx = 4096 (see the logs below).
                "max_tokens": 131072,
                "frequency_penalty": 0,
                "presence_penalty": 0,
                "temperature": temperature,
                "top_p": 0.95
            }

            response = requests.post('http://localhost:1337/v1/chat/completions', json=payload)
            logging.info(f"Received response from LLM: {response.json()}")
            return response.json().get('choices', [{}])[0].get('message', {}).get('content', 'No response from LLM.')
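One client-side hardening worth trying while the server bug is open (a sketch, not a fix for the underlying issue; the 120-second read timeout is an assumption, tune it to your longest normal completion): a timeout plus a status check turns the silent hang into a logged exception instead of blocking forever.

    import logging
    import requests

    def call_llm_with_timeout(payload, read_timeout=120):
        # A read timeout converts the silent server hang into an exception.
        try:
            response = requests.post(
                'http://localhost:1337/v1/chat/completions',
                json=payload,
                timeout=(5, read_timeout),  # (connect, read) timeouts in seconds
            )
            response.raise_for_status()
        except requests.exceptions.RequestException as exc:
            logging.error("LLM request failed or hung: %s", exc)
            return None
        return (response.json()
                .get('choices', [{}])[0]
                .get('message', {})
                .get('content', 'No response from LLM.'))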

The logs:

2024-10-31T14:45:28.608Z [SPECS]::Version: 0.5.7
2024-10-31T14:45:28.609Z [SPECS]::CPUs: [{"model":"Apple M1 Pro","speed":2400,"times":{"user":20660,"nice":0,"sys":20240,"idle":16690,"irq":0}},{"model":"Apple M1 Pro","speed":2400,"times":{"user":21040,"nice":0,"sys":18940,"idle":17620,"irq":0}},{"model":"Apple M1 Pro","speed":2400,"times":{"user":15790,"nice":0,"sys":10250,"idle":31370,"irq":0}},{"model":"Apple M1 Pro","speed":2400,"times":{"user":13690,"nice":0,"sys":9730,"idle":34040,"irq":0}},{"model":"Apple M1 Pro","speed":2400,"times":{"user":12060,"nice":0,"sys":8020,"idle":37450,"irq":0}},{"model":"Apple M1 Pro","speed":2400,"times":{"user":10490,"nice":0,"sys":7130,"idle":39940,"irq":0}},{"model":"Apple M1 Pro","speed":2400,"times":{"user":11040,"nice":0,"sys":5480,"idle":41090,"irq":0}},{"model":"Apple M1 Pro","speed":2400,"times":{"user":9610,"nice":0,"sys":4570,"idle":43450,"irq":0}},{"model":"Apple M1 Pro","speed":2400,"times":{"user":8670,"nice":0,"sys":4160,"idle":44820,"irq":0}},{"model":"Apple M1 Pro","speed":2400,"times":{"user":8090,"nice":0,"sys":4100,"idle":45450,"irq":0}}]
2024-10-31T14:45:28.609Z [SPECS]::Machine: arm64
2024-10-31T14:45:28.609Z [SPECS]::Parallelism: 10
2024-10-31T14:45:28.609Z [SPECS]::Total Mem: 17179869184
2024-10-31T14:45:28.609Z [SPECS]::Free Mem: 3280502784
2024-10-31T14:45:28.609Z [SPECS]::OS Version: Darwin Kernel Version 24.1.0: Thu Oct 10 21:03:15 PDT 2024; root:xnu-11215.41.3~2/RELEASE_ARM64_T6000
2024-10-31T14:45:28.609Z [SPECS]::Endianness: LE
2024-10-31T14:45:28.609Z [SPECS]::OS Release: 24.1.0
2024-10-31T14:45:28.609Z [SPECS]::OS Platform: darwin
2024-10-31T14:45:28.609Z [CORTEX]::Debug: Adding additional dependencies for @janhq/inference-cortex-extension 1.0.20
2024-10-31T14:45:28.609Z [SERVER]::Debug: Starting JAN API server...
2024-10-31T14:45:33.800Z [SERVER]::Debug: JAN API listening at: http://127.0.0.1:1337
2024-10-31T14:45:33.849Z [CORTEX]::CPU information - 10
2024-10-31T14:45:33.849Z [CORTEX]:: Request to kill cortex
2024-10-31T14:45:33.866Z [CORTEX]:: cortex process is terminated
2024-10-31T14:45:33.866Z [CORTEX]:: Spawning cortex subprocess...
2024-10-31T14:45:33.866Z [CORTEX]:: Spawn cortex at path: /Users/antoniomontani/Library/Application Support/Jan/data/extensions/@janhq/inference-cortex-extension/dist/bin/mac-arm64/cortex-cpp, and args: 1,127.0.0.1,3928
2024-10-31T14:45:33.866Z [CORTEX]::Debug: Cortex engine path: /Users/antoniomontani/Library/Application Support/Jan/data/extensions/@janhq/inference-cortex-extension/dist/bin/mac-arm64
2024-10-31T14:45:33.866Z [CORTEX] PATH: /usr/bin:/bin:/usr/sbin:/sbin::/Users/antoniomontani/Library/Application Support/Jan/data/engines/@janhq/inference-cortex-extension/1.0.20:/Users/antoniomontani/Library/Application Support/Jan/data/extensions/@janhq/inference-cortex-extension/dist/bin/mac-arm64
2024-10-31T14:45:33.976Z [CORTEX]:: cortex is ready
2024-10-31T14:45:33.977Z [CORTEX]:: Loading model with params {"cpu_threads":10,"ctx_len":4096,"prompt_template":"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n","llama_model_path":"/Users/antoniomontani/Library/Application Support/Jan/data/models/llama3.2-3b-instruct/Llama-3.2-3B-Instruct-Q8_0.gguf","ngl":33,"system_prompt":"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n","user_prompt":"<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n","ai_prompt":"<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n","model":"llama3.2-3b-instruct"}
2024-10-31T14:45:34.012Z [CORTEX]:: 20241031 14:45:33.899228 UTC 12836 INFO cortex-cpp version: 0.5.0 - main.cc:73
20241031 14:45:33.899653 UTC 12836 INFO Server started, listening at: 127.0.0.1:3928 - main.cc:78
20241031 14:45:33.899654 UTC 12836 INFO Please load your model - main.cc:79
20241031 14:45:33.899656 UTC 12836 INFO Number of thread is:10 - main.cc:86
20241031 14:45:33.981780 UTC 12840 INFO CPU instruction set: fpu = 0| mmx = 0| sse = 0| sse2 = 0| sse3 = 0| ssse3 = 0| sse4_1 = 0| sse4_2 = 0| pclmulqdq = 0| avx = 0| avx2 = 0| avx512_f = 0| avx512_dq = 0| avx512_ifma = 0| avx512_pf = 0| avx512_er = 0| avx512_cd = 0| avx512_bw = 0| has_avx512_vl = 0| has_avx512_vbmi = 0| has_avx512_vbmi2 = 0| avx512_vnni = 0| avx512_bitalg = 0| avx512_vpopcntdq = 0| avx512_4vnniw = 0| avx512_4fmaps = 0| avx512_vp2intersect = 0| aes = 0| f16c = 0| - server.cc:288
20241031 14:45:34.011602 UTC 12840 INFO Loaded engine: cortex.llamacpp - server.cc:314
20241031 14:45:34.011796 UTC 12840 INFO cortex.llamacpp version: 0.1.25 - llama_engine.cc:163
20241031 14:45:34.011899 UTC 12840 INFO Number of parallel is set to 1 - llama_engine.cc:352
20241031 14:45:34.011902 UTC 12840 DEBUG [LoadModelImpl] cache_type: f16 - llama_engine.cc:365
20241031 14:45:34.011945 UTC 12840 DEBUG [LoadModelImpl] Enabled Flash Attention - llama_engine.cc:374
20241031 14:45:34.011951 UTC 12840 DEBUG [LoadModelImpl] stop: null

2024-10-31T14:45:34.045Z [CORTEX]::Error: llama_model_loader: loaded meta data with 35 key-value pairs and 255 tensors from /Users/antoniomontani/Library/Application Support/Jan/data/models/llama3.2-3b-instruct/Llama-3.2-3B-Instruct-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.2 3B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.2
llama_model_loader: - kv 5: general.size_label str = 3B
llama_model_loader: - kv 6: general.license str = llama3.2
llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 9: llama.block_count u32 = 28
llama_model_loader: - kv 10: llama.context_length u32 = 131072
llama_model_loader: - kv 11: llama.embedding_length u32 = 3072
llama_model_loader: - kv 12: llama.feed_forward_length u32 = 8192
llama_model_loader: - kv 13: llama.attention.head_count u32 = 24
llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 17: llama.attention.key_length u32 = 128
llama_model_loader: - kv 18: llama.attention.value_length u32 = 128
llama_model_loader: - kv 19: general.file_type u32 = 7
llama_model_loader: - kv 20: llama.vocab_size u32 = 128256
llama_model_loader: - kv 21: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = llama-bpe

2024-10-31T14:45:34.058Z [CORTEX]::Error: llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...

2024-10-31T14:45:34.062Z [CORTEX]::Error: llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...

2024-10-31T14:45:34.087Z [CORTEX]::Error: llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 29: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - kv 31: quantize.imatrix.file str = /models_out/Llama-3.2-3B-Instruct-GGU...
llama_model_loader: - kv 32: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 33: quantize.imatrix.entries_count i32 = 196
llama_model_loader: - kv 34: quantize.imatrix.chunks_count i32 = 125
llama_model_loader: - type f32: 58 tensors
llama_model_loader: - type q8_0: 197 tensors

2024-10-31T14:45:34.301Z [CORTEX]::Error: llm_load_vocab: special tokens cache size = 256

2024-10-31T14:45:34.331Z [CORTEX]::Error: llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_head = 24
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 3
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0

2024-10-31T14:45:34.331Z [CORTEX]::Error: llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 3.21 B
llm_load_print_meta: model size = 3.18 GiB (8.50 BPW)
llm_load_print_meta: general.name = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.24 MiB

2024-10-31T14:45:34.381Z [CORTEX]::Error: ggml_backend_metal_log_allocated_size: allocated buffer, size = 3255.92 MiB, ( 3255.98 / 10922.67)
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU

2024-10-31T14:45:34.381Z [CORTEX]::Error: llm_load_tensors: CPU buffer size = 399.23 MiB
llm_load_tensors: Metal buffer size = 3255.91 MiB

2024-10-31T14:45:34.381Z [CORTEX]::Error: ...........................
2024-10-31T14:45:34.381Z [CORTEX]::Error: ................................................
2024-10-31T14:45:34.382Z [CORTEX]::Error: ......

2024-10-31T14:45:34.396Z [CORTEX]::Error: llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 2048
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating

2024-10-31T14:45:34.396Z [CORTEX]::Error: ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro

2024-10-31T14:45:34.397Z [CORTEX]::Error: ggml_metal_init: using embedded metal library

2024-10-31T14:45:34.405Z [CORTEX]::Error: ggml_metal_init: GPU name: Apple M1 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 11453.25 MB

2024-10-31T14:45:34.450Z [CORTEX]::Error: llama_kv_cache_init: Metal KV buffer size = 448.00 MiB
llama_new_context_with_model: KV self size = 448.00 MiB, K (f16): 224.00 MiB, V (f16): 224.00 MiB

2024-10-31T14:45:34.450Z [CORTEX]::Error: llama_new_context_with_model: CPU output buffer size = 0.49 MiB

2024-10-31T14:45:34.451Z [CORTEX]::Error: llama_new_context_with_model: Metal compute buffer size = 1026.00 MiB
llama_new_context_with_model: CPU compute buffer size = 56.02 MiB
llama_new_context_with_model: graph nodes = 791
llama_new_context_with_model: graph splits = 2

2024-10-31T14:45:36.203Z [CORTEX]:: Load model success with response {}
2024-10-31T14:45:36.204Z [CORTEX]:: Validating model llama3.2-3b-instruct
2024-10-31T14:45:36.205Z [CORTEX]:: Validate model state with response 200
2024-10-31T14:45:36.207Z [CORTEX]:: Validate model state success with response {"model_data":"{\"frequency_penalty\":0.0,\"grammar\":\"\",\"ignore_eos\":false,\"logit_bias\":[],\"min_p\":0.05000000074505806,\"mirostat\":0,\"mirostat_eta\":0.10000000149011612,\"mirostat_tau\":5.0,\"model\":\"/Users/antoniomontani/Library/Application Support/Jan/data/models/llama3.2-3b-instruct/Llama-3.2-3B-Instruct-Q8_0.gguf\",\"n_ctx\":4096,\"n_keep\":0,\"n_predict\":2,\"n_probs\":0,\"penalize_nl\":false,\"penalty_prompt_tokens\":[],\"presence_penalty\":0.0,\"repeat_last_n\":64,\"repeat_penalty\":1.0,\"seed\":4294967295,\"stop\":[],\"stream\":false,\"temperature\":0.800000011920929,\"tfs_z\":1.0,\"top_k\":40,\"top_p\":0.949999988079071,\"typical_p\":1.0,\"use_penalty_prompt_tokens\":false}","model_loaded":true}
2024-10-31T14:45:55.748Z [SERVER]::{"reqId":"req-1","res":{},"req":{"method":"POST","url":"/v1/chat/completions","hostname":"localhost:1337"},"msg":"incoming request"}
2024-10-31T14:45:55.772Z [CORTEX]::Error: check_double_bos_eos: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?

What is your OS?

macOS (Apple M1 Pro, Darwin 24.1.0)

louis-jan commented 3 weeks ago

Hi @CrazybutSolid, thank you for creating the issue. Could you please also attach the log file from the Jan data folder? I have a feeling the logs shown here in the API Server screen are cut off.

louis-jan commented 3 weeks ago

Hi @CrazybutSolid, could you please try to reproduce the issue again and check whether a dangling process is left running in the background? If a chat/completions request was still running and not cancelled properly, it would keep generating data, and the port would remain owned by the dangling process. That would explain why restarting Jan didn't help; this is likely a dangling cortex process issue.
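For reference, one way to check for such a dangling process on macOS (a sketch: 1337 is the Jan API port and 3928 the cortex port from the logs above; it shells out to lsof, which ships with macOS):

    import subprocess

    def pids_holding_port(port):
        # `lsof -ti tcp:<port>` prints one PID per line for processes bound to the port.
        result = subprocess.run(["lsof", "-ti", f"tcp:{port}"],
                                capture_output=True, text=True)
        return result.stdout.split()

    for port in (1337, 3928):
        pids = pids_holding_port(port)
        if pids:
            print(f"port {port} is still held by PID(s): {', '.join(pids)}")

Any PID still holding port 3928 after Jan has quit would be the dangling cortex process.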

CrazybutSolid commented 3 weeks ago

I think that is indeed the problem. It happens when I stop the server and start it again. So how do I stop the server and make sure it is completely stopped before I start it again?

louis-jan commented 3 weeks ago

The workaround is to kill the process manually, which is frustrating. We will be moving to a new cortex.cpp version; hopefully this will be addressed in the next release.
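Until then, the manual kill can be scripted, for example like this (a sketch: it matches processes by the cortex-cpp binary name seen in the spawn path above, asks them to exit with SIGTERM, then SIGKILLs anything that survives):

    import os
    import signal
    import subprocess
    import time

    def cortex_pids():
        # pgrep -f matches the full command line, so it finds the spawned
        # cortex-cpp subprocess regardless of its working directory.
        result = subprocess.run(["pgrep", "-f", "cortex-cpp"],
                                capture_output=True, text=True)
        return [int(pid) for pid in result.stdout.split()]

    for pid in cortex_pids():
        os.kill(pid, signal.SIGTERM)   # polite shutdown first
    time.sleep(2)
    for pid in cortex_pids():
        os.kill(pid, signal.SIGKILL)   # force anything that ignored SIGTERM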