l241025097 opened 1 week ago
InternalError: Check failed: (e == cudaSuccess || e == cudaErrorCudartUnloading) is false: CUDA: out of memory
:)
max KV cache token capacity will be set to 172266, prefill chunk size 512 is specified by user.
When I reduce the prefill chunk size, the max KV cache token capacity automatically increases, so it still runs out of memory every time.
Also, please take a look at Question 1.
Please check the config arg gpu_memory_utilization; it may need to be set to a smaller number.
As for Q1, try to use the template chatlm. See ref.
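For illustration, a minimal sketch of how that suggestion maps onto the serve command used later in this thread: gpu_memory_utilization is lowered through --overrides (0.7 is the value that ends up working below), and --mode local is the lower-memory alternative the engine log itself recommends.

# Sketch only: same serve invocation as in this issue, but with a smaller
# gpu_memory_utilization; switching --mode server to --mode local would also
# cap the KV cache at a much smaller size.
/opt/miniconda3/envs/python3.11/bin/mlc_llm serve \
    /workspace/models/internlm2_5-20b-q4f32_1-MLC \
    --mode server \
    --host 0.0.0.0 \
    --overrides "tensor_parallel_shards=2;prefill_chunk_size=512;gpu_memory_utilization=0.7;max_num_sequence=32"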
Thank you very much. Problem 2 is solved: with gpu_memory_utilization set to 0.7 or below it runs successfully, while 0.8 still fails. But for Problem 1: I changed the template to chatlm, and mlc_llm serve still reports an error:
[2024-11-12 03:43:06] INFO auto_device.py:79: Found device: cuda:0
[2024-11-12 03:43:06] INFO auto_device.py:79: Found device: cuda:1
[2024-11-12 03:43:08] INFO auto_device.py:88: Not found device: rocm:0
[2024-11-12 03:43:09] INFO auto_device.py:88: Not found device: metal:0
[2024-11-12 03:43:11] INFO auto_device.py:88: Not found device: vulkan:0
[2024-11-12 03:43:12] INFO auto_device.py:88: Not found device: opencl:0
[2024-11-12 03:43:12] INFO auto_device.py:35: Using device: cuda:0
[2024-11-12 03:43:12] INFO engine_base.py:143: Using library model: /workspace/models/internlm2_5-20b-chat-q4f32_1-MLC/libs/internlm2_5-20b-chat-q4f32_1-MLC-cuda.so
[2024-11-12 03:43:12] INFO engine_base.py:192: The selected engine mode is server. We use as much GPU memory as possible (within the limit of gpu_memory_utilization).
[2024-11-12 03:43:12] INFO engine_base.py:200: If you have low concurrent requests and want to use less GPU memory, please select mode "local".
[2024-11-12 03:43:12] INFO engine_base.py:205: If you don't have concurrent requests and only use the engine interactively, please select mode "interactive".
thread '' panicked: called `Result::unwrap()` on an `Err` value: Error("data did not match any variant of untagged enum ModelWrapper", line: 753179, column: 1)
stack backtrace:
0: 0x74b3b405519c -
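As a sketch of what "use template chatlm" involves in practice: the conversation template is baked into mlc-chat-config.json by gen_config, so the config has to be regenerated with the new --conv-template and the model served against the updated config. The template name below is an assumption (MLC ships a chatml conversation template; the reply above spells it chatlm), not something confirmed in this thread.

# Sketch: regenerate the chat config with a different conversation template.
# "chatml" is an assumed template name; substitute whichever name applies.
/opt/miniconda3/envs/python3.11/bin/mlc_llm gen_config /workspace/models/internlm2_5-20b-chat \
    --quantization q4f32_1 \
    --conv-template chatml \
    --tensor-parallel-shards 2 \
    -o /workspace/models/internlm2_5-20b-chat-q4f32_1-MLC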
@l241025097 Thank you for reporting. We'll look into the second issue you mentioned.
Hi @l241025097, we have fixed this issue. Please upgrade the mlc python package to the latest nightly and try again, thanks!
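For reference, a sketch of the suggested upgrade, assuming a CUDA 12.2 build; the exact nightly package names depend on your CUDA version, so check the MLC-LLM installation docs if they differ.

# Sketch: pull the latest MLC nightly wheels (package names assume CUDA 12.2).
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu122 mlc-ai-nightly-cu122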
🐛 Bug
1. Using the ModelScope model Shanghai_AI_Laboratory/internlm2_5-20b-chat (downloaded locally): no matching --conv-template was found when building the config, so it was set to LM. Running mlc_llm serve then fails with an error. 2. Using the Hugging Face model mlc-ai/internlm2_5-20b-q4f32_1-MLC (downloaded locally): running mlc_llm serve fails with an error.
To Reproduce
Steps to reproduce the behavior:
1. Problem 1:
(1) /opt/miniconda3/envs/python3.11/bin/mlc_llm convert_weight /workspace/models/internlm2_5-20b-chat \
      --device cuda:1 \
      --quantization q4f32_1 \
      -o /workspace/models/internlm2_5-20b-chat-q4f32_1-MLC
(2) /opt/miniconda3/envs/python3.11/bin/mlc_llm gen_config /workspace/models/internlm2_5-20b-chat \
      --quantization q4f32_1 \
      --conv-template LM \
      --tensor-parallel-shard 2 \
      -o /workspace/models/internlm2_5-20b-chat-q4f32_1-MLC
(3) mkdir -p /workspace/models/internlm2_5-20b-chat-q4f32_1-MLC/libs
(4) /opt/miniconda3/envs/python3.11/bin/mlc_llm compile /workspace/models/internlm2_5-20b-chat-q4f32_1-MLC/mlc-chat-config.json \
      --device cuda \
      -o /workspace/models/internlm2_5-20b-chat-q4f32_1-MLC/libs/internlm2_5-20b-chat-q4f32_1-MLC-cuda.so
(5) /opt/miniconda3/envs/python3.11/bin/mlc_llm serve \
      /workspace/models/internlm2_5-20b-chat-q4f32_1-MLC \
      --model-lib /workspace/models/internlm2_5-20b-chat-q4f32_1-MLC/libs/internlm2_5-20b-chat-q4f32_1-MLC-cuda.so \
      --mode server \
      --host 0.0.0.0
[2024-11-10 04:28:36] INFO auto_device.py:79: Found device: cuda:0
[2024-11-10 04:28:36] INFO auto_device.py:79: Found device: cuda:1
[2024-11-10 04:28:37] INFO auto_device.py:88: Not found device: rocm:0
[2024-11-10 04:28:39] INFO auto_device.py:88: Not found device: metal:0
[2024-11-10 04:28:41] INFO auto_device.py:88: Not found device: vulkan:0
[2024-11-10 04:28:42] INFO auto_device.py:88: Not found device: opencl:0
[2024-11-10 04:28:42] INFO auto_device.py:35: Using device: cuda:0
[2024-11-10 04:28:42] INFO engine_base.py:143: Using library model: /workspace/models/internlm2_5-20b-chat-q4f32_1-MLC/libs/internlm2_5-20b-chat-q4f32_1-MLC-cuda.so
[2024-11-10 04:28:42] INFO engine_base.py:192: The selected engine mode is server. We use as much GPU memory as possible (within the limit of gpu_memory_utilization).
[2024-11-10 04:28:42] INFO engine_base.py:200: If you have low concurrent requests and want to use less GPU memory, please select mode "local".
[2024-11-10 04:28:42] INFO engine_base.py:205: If you don't have concurrent requests and only use the engine interactively, please select mode "interactive".
thread '' panicked at src/lib.rs:26:50:
called `Result::unwrap()` on an `Err` value: Error("data did not match any variant of untagged enum ModelWrapper", line: 753179, column: 1)
stack backtrace:
0: 0x791f1765519c - ::fmt::h41fa541dc14fbe51
1: 0x791f176a5e70 - core::fmt::write::h0892af1ec116d2e4
2: 0x791f1764a4cd - std::io::Write::write_fmt::hc85c550e5a70f4cf
3: 0x791f17654f84 - std::sys_common::backtrace::print::h5d9aabdcf93aa773
4: 0x791f17657c67 - std::panicking::default_hook::{{closure}}::h6943f7db7ebd9dfa
5: 0x791f176579cf - std::panicking::default_hook::hc843c2a865849d41
6: 0x791f176581a8 - std::panicking::rust_panic_with_hook::hac0a41b89f5ab822
7: 0x791f1765808e - std::panicking::begin_panic_handler::{{closure}}::h1c034067c5755b7e
8: 0x791f17655666 - std::sys_common::backtrace::rust_end_short_backtrace::h3f5e2602c6964099
9: 0x791f17657df2 - rust_begin_unwind
10: 0x791f1729b4c5 - core::panicking::panic_fmt::hcd09b86433080a0a
11: 0x791f1729baf3 - core::result::unwrap_failed::h37e38fafe094d785
12: 0x791f1743f27a - tokenizers_new_from_str
13: 0x791f174369e9 - _ZN10tokenizers9Tokenizer12FromBlobJSONERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
at /workspace/mlc-llm/3rdparty/tokenizers-cpp/src/huggingface_tokenizer.cc:108:63
14: 0x791f17433730 - _ZN3mlc3llm9Tokenizer8FromPathERKN3tvm7runtime6StringESt8optionalINS013TokenizerInfoEE
at /workspace/mlc-llm/cpp/tokenizers/tokenizers.cc:157:57
15: 0x791f17434098 - operator()
at /workspace/mlc-llm/cpp/tokenizers/tokenizers.cc:459:34
16: 0x791f17434098 - run<tvm::runtime::TVMMovableArgValueWithContext>
at /workspace/mlc-llm/3rdparty/tvm/include/tvm/runtime/packed_func.h:1974:11
17: 0x791f17434098 - run<>
at /workspace/mlc-llm/3rdparty/tvm/include/tvm/runtime/packed_func.h:1959:60
18: 0x791f17434098 - unpack_call<mlc::llm::Tokenizer, 1, mlc::llm::<lambda(const tvm::runtime::String&)> >
at /workspace/mlc-llm/3rdparty/tvm/include/tvm/runtime/packed_func.h:1999:46
19: 0x791f17434098 - operator()
at /workspace/mlc-llm/3rdparty/tvm/include/tvm/runtime/packed_func.h:2059:44
20: 0x791f17434098 - Call
at /workspace/mlc-llm/3rdparty/tvm/include/tvm/runtime/packed_func.h:1394:58
21: 0x791f562adeba - TVMFuncCall
22: 0x791fb6df31a5 - _ZL39pyx_f_3tvm_4_ffi_4_cy3_4core_FuncCallPvP7_objectP8TVMValuePi
23: 0x791fb6df3769 - _ZL76pyx_pw_3tvm_4_ffi_4_cy3_4core_10ObjectBase_3__init_handle_by_constructorP7_objectPKS0lS0
24: 0x579946b8a9cc - _PyObject_VectorcallTstate
at /usr/local/src/conda/python-3.11.8/Include/internal/pycore_call.h:92:11
25: 0x579946b8a9cc - PyObject_Vectorcall
at /usr/local/src/conda/python-3.11.8/Objects/call.c:299:12
26: 0x579946b7de36 - _PyEval_EvalFrameDefault
at /usr/local/src/conda/python-3.11.8/Python/ceval.c:4769:23
27: 0x579946ba14c1 - _PyEval_EvalFrame
at /usr/local/src/conda/python-3.11.8/Include/internal/pycore_ceval.h:73:16
28: 0x579946ba14c1 - _PyEval_Vector
at /usr/local/src/conda/python-3.11.8/Python/ceval.c:6434:24
29: 0x579946ba14c1 - _PyFunction_Vectorcall
at /usr/local/src/conda/python-3.11.8/Objects/call.c:393:16
30: 0x579946ba886c - _PyObject_FastCallDictTstate
at /usr/local/src/conda/python-3.11.8/Objects/call.c:141:15
31: 0x579946ba886c - _PyObject_Call_Prepend
at /usr/local/src/conda/python-3.11.8/Objects/call.c:482:24
32: 0x579946ba886c - slot_tp_init
at /usr/local/src/conda/python-3.11.8/Objects/typeobject.c:7854:15
33: 0x579946b70303 - type_call
at /usr/local/src/conda/python-3.11.8/Objects/typeobject.c:1103:19
34: 0x579946b70303 - _PyObject_MakeTpCall
at /usr/local/src/conda/python-3.11.8/Objects/call.c:214:18
35: 0x579946b7de36 - _PyEval_EvalFrameDefault
at /usr/local/src/conda/python-3.11.8/Python/ceval.c:4769:23
36: 0x579946ba14c1 - _PyEval_EvalFrame
at /usr/local/src/conda/python-3.11.8/Include/internal/pycore_ceval.h:73:16
37: 0x579946ba14c1 - _PyEval_Vector
at /usr/local/src/conda/python-3.11.8/Python/ceval.c:6434:24
38: 0x579946ba14c1 - _PyFunction_Vectorcall
at /usr/local/src/conda/python-3.11.8/Objects/call.c:393:16
39: 0x579946ba8933 - _PyObject_FastCallDictTstate
at /usr/local/src/conda/python-3.11.8/Objects/call.c:152:15
40: 0x579946ba8933 - _PyObject_Call_Prepend
at /usr/local/src/conda/python-3.11.8/Objects/call.c:482:24
41: 0x579946ba8933 - slot_tp_init
at /usr/local/src/conda/python-3.11.8/Objects/typeobject.c:7854:15
42: 0x579946b70303 - type_call
at /usr/local/src/conda/python-3.11.8/Objects/typeobject.c:1103:19
43: 0x579946b70303 - _PyObject_MakeTpCall
at /usr/local/src/conda/python-3.11.8/Objects/call.c:214:18
44: 0x579946b7de36 - _PyEval_EvalFrameDefault
at /usr/local/src/conda/python-3.11.8/Python/ceval.c:4769:23
45: 0x579946ba14c1 - _PyEval_EvalFrame
at /usr/local/src/conda/python-3.11.8/Include/internal/pycore_ceval.h:73:16
46: 0x579946ba14c1 - _PyEval_Vector
at /usr/local/src/conda/python-3.11.8/Python/ceval.c:6434:24
47: 0x579946ba14c1 - _PyFunction_Vectorcall
at /usr/local/src/conda/python-3.11.8/Objects/call.c:393:16
48: 0x579946bab1e0 - _PyVectorcall_Call
at /usr/local/src/conda/python-3.11.8/Objects/call.c:257:24
49: 0x579946bab1e0 - _PyObject_Call
at /usr/local/src/conda/python-3.11.8/Objects/call.c:328:16
50: 0x579946bab1e0 - PyObject_Call
at /usr/local/src/conda/python-3.11.8/Objects/call.c:355:12
51: 0x579946b82119 - do_call_core
at /usr/local/src/conda/python-3.11.8/Python/ceval.c:7349:12
52: 0x579946b82119 - _PyEval_EvalFrameDefault
at /usr/local/src/conda/python-3.11.8/Python/ceval.c:5376:22
53: 0x579946c3442d - _PyEval_EvalFrame
at /usr/local/src/conda/python-3.11.8/Include/internal/pycore_ceval.h:73:16
54: 0x579946c3442d - _PyEval_Vector
at /usr/local/src/conda/python-3.11.8/Python/ceval.c:6434:24
55: 0x579946c33abf - PyEval_EvalCode
at /usr/local/src/conda/python-3.11.8/Python/ceval.c:1148:21
56: 0x579946c52a1a - run_eval_code_obj
at /usr/local/src/conda/python-3.11.8/Python/pythonrun.c:1741:9
57: 0x579946c4e593 - run_mod
at /usr/local/src/conda/python-3.11.8/Python/pythonrun.c:1762:19
58: 0x579946c63930 - pyrun_file
at /usr/local/src/conda/python-3.11.8/Python/pythonrun.c:1657:15
59: 0x579946c632ce - _PyRun_SimpleFileObject
at /usr/local/src/conda/python-3.11.8/Python/pythonrun.c:440:13
60: 0x579946c62ff4 - _PyRun_AnyFileObject
at /usr/local/src/conda/python-3.11.8/Python/pythonrun.c:79:15
61: 0x579946c5d6f4 - pymain_run_file_obj
at /usr/local/src/conda/python-3.11.8/Modules/main.c:360:15
62: 0x579946c5d6f4 - pymain_run_file
at /usr/local/src/conda/python-3.11.8/Modules/main.c:379:15
63: 0x579946c5d6f4 - pymain_run_python
at /usr/local/src/conda/python-3.11.8/Modules/main.c:601:21
64: 0x579946c5d6f4 - Py_RunMain
at /usr/local/src/conda/python-3.11.8/Modules/main.c:680:5
65: 0x579946c23a77 - Py_BytesMain
at /usr/local/src/conda/python-3.11.8/Modules/main.c:734:12
66: 0x791fb74d7d90 -
67: 0x791fb74d7e40 - __libc_start_main
68: 0x579946c2391d -
fatal runtime error: failed to initiate panic, error 5
Aborted (core dumped)
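The panic above comes from the bundled HuggingFace tokenizers parser failing on tokenizer.json (the untagged enum ModelWrapper appears to correspond to the model section of that file). As a hedged sketch, one way to isolate the parse failure from MLC is to load the same file with the Python tokenizers package, which wraps the same Rust crate; adjust the path if tokenizer.json lives elsewhere.

# Sketch: try parsing the same tokenizer.json outside MLC to see the raw error.
python -c "from tokenizers import Tokenizer; Tokenizer.from_file('/workspace/models/internlm2_5-20b-chat-q4f32_1-MLC/tokenizer.json')"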
2. Problem 2:
/opt/miniconda3/envs/python3.11/bin/mlc_llm serve \
    /workspace/models/internlm2_5-20b-q4f32_1-MLC \
    --mode server \
    --host 0.0.0.0 \
    --overrides "tensor_parallel_shards=2;prefill_chunk_size=512;gpu_memory_utilization=0.95;max_num_sequence=32"
[2024-11-10 05:05:55] INFO auto_device.py:79: Found device: cuda:0
[2024-11-10 05:05:55] INFO auto_device.py:79: Found device: cuda:1
[2024-11-10 05:05:56] INFO auto_device.py:88: Not found device: rocm:0
[2024-11-10 05:05:58] INFO auto_device.py:88: Not found device: metal:0
[2024-11-10 05:06:00] INFO auto_device.py:88: Not found device: vulkan:0
[2024-11-10 05:06:01] INFO auto_device.py:88: Not found device: opencl:0
[2024-11-10 05:06:01] INFO auto_device.py:35: Using device: cuda:0
[2024-11-10 05:06:01] INFO jit.py:43: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-11-10 05:06:01] INFO jit.py:118: Compiling using commands below:
[2024-11-10 05:06:01] INFO jit.py:119: /opt/miniconda3/envs/python3.11/bin/python3 -m mlc_llm compile /workspace/models/internlm2_5-20b-q4f32_1-MLC --opt 'flashinfer=1;cublas_gemm=1;faster_transformer=0;cudagraph=1;cutlass=1;ipc_allreduce_strategy=NONE' --overrides 'prefill_chunk_size=512;max_batch_size=32;tensor_parallel_shards=2' --device cuda:0 --output /tmp/tmpb33h687o/lib.so
[2024-11-10 05:06:03] INFO auto_config.py:70: Found model configuration: /workspace/models/internlm2_5-20b-q4f32_1-MLC/mlc-chat-config.json
[2024-11-10 05:06:03] INFO auto_target.py:91: Detecting target device: cuda:0
[2024-11-10 05:06:03] INFO auto_target.py:93: Found target: {"thread_warp_size": runtime.BoxInt(32), "arch": "sm_86", "max_threads_per_block": runtime.BoxInt(1024), "max_num_threads": runtime.BoxInt(1024), "kind": "cuda", "max_shared_memory_per_block": runtime.BoxInt(49152), "tag": "", "keys": ["cuda", "gpu"]}
[2024-11-10 05:06:03] INFO auto_target.py:110: Found host LLVM triple: x86_64-redhat-linux-gnu
[2024-11-10 05:06:03] INFO auto_target.py:111: Found host LLVM CPU: haswell
[2024-11-10 05:06:03] INFO auto_target.py:334: Generating code for CUDA architecture: sm_86
[2024-11-10 05:06:03] INFO auto_target.py:335: To produce multi-arch fatbin, set environment variable MLC_MULTI_ARCH. Example: MLC_MULTI_ARCH=70,72,75,80,86,87,89,90a
[2024-11-10 05:06:03] INFO auto_config.py:154: Found model type: internlm2. Use --model-type to override.
Compiling with arguments: --config InternLM2Config(vocab_size=92544, hidden_size=6144, num_hidden_layers=48, num_attention_heads=48, num_key_value_heads=8, rms_norm_eps=1e-05, intermediate_size=16384, bias=False, use_cache=True, rope_theta=50000000, pad_token_id=2, bos_token_id=1, eos_token_id=2, context_window_size=262144, prefill_chunk_size=2048, tensor_parallel_shards=1, max_batch_size=80, head_dim=128, kwargs={}) --quantization GroupQuantize(name='q4f32_1', kind='group-quant', group_size=32, quantize_dtype='int4', storage_dtype='uint32', model_dtype='float32', linear_weight_layout='NK', quantize_embedding=True, quantize_final_fc=True, num_elem_per_storage=8, num_storage_per_group=4, max_int_value=7, tensor_parallel_shards=0) --model-type internlm2 --target {"thread_warp_size": runtime.BoxInt(32), "host": {"mtriple": "x86_64-redhat-linux-gnu", "tag": "", "kind": "llvm", "mcpu": "haswell", "keys": ["cpu"]}, "arch": "sm_86", "max_threads_per_block": runtime.BoxInt(1024), "libs": ["thrust"], "max_num_threads": runtime.BoxInt(1024), "kind": "cuda", "max_shared_memory_per_block": runtime.BoxInt(49152), "tag": "", "keys": ["cuda", "gpu"]} --opt flashinfer=1;cublas_gemm=0;faster_transformer=0;cudagraph=1;cutlass=1;ipc_allreduce_strategy=NONE --system-lib-prefix "" --output /tmp/tmpb33h687o/lib.so --overrides context_window_size=None;sliding_window_size=None;prefill_chunk_size=512;attention_sink_size=None;max_batch_size=32;tensor_parallel_shards=2;pipeline_parallel_stages=None
[2024-11-10 05:06:03] INFO config.py:107: Overriding prefill_chunk_size from 2048 to 512
[2024-11-10 05:06:03] INFO config.py:107: Overriding max_batch_size from 80 to 32
[2024-11-10 05:06:03] INFO config.py:107: Overriding tensor_parallel_shards from 1 to 2
[2024-11-10 05:06:03] INFO compile.py:140: Creating model from: InternLM2Config(vocab_size=92544, hidden_size=6144, num_hidden_layers=48, num_attention_heads=48, num_key_value_heads=8, rms_norm_eps=1e-05, intermediate_size=16384, bias=False, use_cache=True, rope_theta=50000000, pad_token_id=2, bos_token_id=1, eos_token_id=2, context_window_size=262144, prefill_chunk_size=2048, tensor_parallel_shards=1, max_batch_size=80, head_dim=128, kwargs={})
[2024-11-10 05:06:03] INFO compile.py:158: Exporting the model to TVM Unity compiler
[2024-11-10 05:06:07] INFO compile.py:164: Running optimizations using TVM Unity
[2024-11-10 05:06:07] INFO compile.py:185: Registering metadata: {'model_type': 'internlm2', 'quantization': 'q4f32_1', 'context_window_size': 262144, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 512, 'tensor_parallel_shards': 2, 'pipeline_parallel_stages': 1, 'kv_state_kind': 'kv_cache', 'max_batch_size': 32}
[2024-11-10 05:06:09] INFO pipeline.py:54: Running TVM Relax graph-level optimizations
[2024-11-10 05:06:14] INFO pipeline.py:54: Lowering to TVM TIR kernels
[2024-11-10 05:06:25] INFO pipeline.py:54: Running TVM TIR-level optimizations
[2024-11-10 05:07:00] INFO pipeline.py:54: Running TVM Dlight low-level optimizations
[2024-11-10 05:07:02] INFO pipeline.py:54: Lowering to VM bytecode
[2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function `alloc_embedding_tensor`: 12.00 MB
[2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function `argsort_probs`: 0.00 MB
[2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_decode`: 15.05 MB
[2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_prefill`: 71.30 MB
[2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_verify`: 240.75 MB
[2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function `create_tir_paged_kv_cache`: 0.00 MB
[2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function `decode`: 0.47 MB
[2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function `embed`: 12.00 MB
[2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function `multinomial_from_uniform`: 0.00 MB
[2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function `prefill`: 60.38 MB
[2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function `renormalize_by_top_p`: 0.00 MB
[2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function `sample_with_top_p`: 0.00 MB
[2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function `sampler_take_probs`: 0.00 MB
[2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function `sampler_verify_draft_tokens`: 0.00 MB
[2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function `softmax_with_temperature`: 0.00 MB
[2024-11-10 05:07:17] INFO pipeline.py:54: Compiling external modules
[2024-11-10 05:07:17] INFO pipeline.py:54: Compilation complete! Exporting to disk
[2024-11-10 05:07:47] INFO model_metadata.py:95: Total memory usage without KV cache:: 6500.84 MB (Parameters: 6260.09 MB. Temporary buffer: 240.75 MB)
[2024-11-10 05:07:47] INFO model_metadata.py:103: To reduce memory usage, tweak `prefill_chunk_size`, `context_window_size` and `sliding_window_size`
[2024-11-10 05:07:47] INFO compile.py:207: Generated: /tmp/tmpb33h687o/lib.so
[2024-11-10 05:07:48] INFO jit.py:126: Using compiled model lib: /root/.cache/mlc_llm/model_lib/0eb086393737fad474780549d0878131.so
[2024-11-10 05:07:48] INFO engine_base.py:192: The selected engine mode is server. We use as much GPU memory as possible (within the limit of gpu_memory_utilization).
[2024-11-10 05:07:48] INFO engine_base.py:200: If you have low concurrent requests and want to use less GPU memory, please select mode "local".
[2024-11-10 05:07:48] INFO engine_base.py:205: If you don't have concurrent requests and only use the engine interactively, please select mode "interactive".
[05:07:50] /workspace/mlc-llm/cpp/serve/config.cc:688: Under mode "local", max batch size 32 is specified by user, max KV cache token capacity will be set to 8192, prefill chunk size 512 is specified by user.
[05:07:50] /workspace/mlc-llm/cpp/serve/config.cc:688: Under mode "interactive", max batch size 32 is specified by user, max KV cache token capacity will be set to 172266, prefill chunk size 512 is specified by user.
[05:07:50] /workspace/mlc-llm/cpp/serve/config.cc:688: Under mode "server", max batch size 32 is specified by user, max KV cache token capacity will be set to 172266, prefill chunk size 512 is specified by user.
[05:07:50] /workspace/mlc-llm/cpp/serve/config.cc:769: The actual engine mode is "server". So max batch size is 32, max KV cache token capacity is 172266, prefill chunk size is 512.
[05:07:50] /workspace/mlc-llm/cpp/serve/config.cc:774: Estimated total single GPU memory usage: 23037.859 MB (Parameters: 6260.086 MB. KVCache: 16226.536 MB. Temporary buffer: 551.237 MB). The actual usage might be slightly larger than the estimated number.
[05:07:50] /workspace/mlc-llm/cpp/multi_gpu/multi_gpu_loader.cc:150: [Worker #0] Loading model to device: cuda:0
[05:07:50] /workspace/mlc-llm/cpp/multi_gpu/multi_gpu_loader.cc:150: [Worker #1] Loading model to device: cuda:1
[05:07:50] /workspace/mlc-llm/cpp/multi_gpu/multi_gpu_loader.cc:175: Loading parameters... [==================================================================================================>] [485/485]
[05:08:04] /workspace/mlc-llm/cpp/multi_gpu/multi_gpu_loader.cc:203: Loading done. Time used: Loading 11.741 s Preprocessing 2.289 s.
terminate called after throwing an instance of 'tvm::runtime::InternalError'
what(): [05:08:05] /workspace/tvm/src/runtime/cuda/cuda_device_api.cc:145: InternalError: Check failed: (e == cudaSuccess || e == cudaErrorCudartUnloading) is false: CUDA: out of memory
Stack trace:
0: _ZN3tvm7runtime6deta
1: tvm::runtime::CUDADeviceAPI::AllocDataSpace(DLDevice, unsigned long, unsigned long, DLDataType)
2: tvm::runtime::DeviceAPI::AllocDataSpace(DLDevice, int, long const*, DLDataType, tvm::runtime::Optional<tvm::runtime::String>)
3: tvm::runtime::NDArray::Empty(tvm::runtime::ShapeTuple, DLDataType, DLDevice, tvm::runtime::Optional)
4: tvm::runtime::relax_vm::PagedAttentionKVCacheObj::PagedAttentionKVCacheObj(long, long, long, long, long, long, long, long, long, bool, tvm::runtime::relax_vm::RoPEMode, double, double, tvm::runtime::Optional, DLDataType, DLDevice, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::Optional, tvm::runtime::Optional, tvm::runtime::Optional, tvm::runtime::Optional, tvm::runtime::Optional, tvm::runtime::Optional, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::Optional)
5: tvm::runtime::relax_vm::__mk_TVM2::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue )#1}::operator()(tvm::runtime::relax_vm::__mk_TVM2, tvm::runtime::TVMRetValue) const [clone .constprop.0]
6: tvm::runtime::relax_vm::VirtualMachineImpl::RunInstrCall(tvm::runtime::relax_vm::VMFrame, tvm::runtime::relax_vm::Instruction)
7: tvm::runtime::relax_vm::VirtualMachineImpl::RunLoop()
8: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeBytecode(long, std::vector<tvm::runtime::TVMRetValue, std::allocator > const&)
9: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::relax_vm::VirtualMachineImpl::GetClosureInternal(tvm::runtime::String const&, bool)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue )#1}> >::Call(tvm::runtime::PackedFuncObj const, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue)
10: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue)
11: tvm::runtime::DiscoWorker::Impl::CallPacked(tvm::runtime::DiscoWorker, long, tvm::runtime::PackedFunc, tvm::runtime::TVMArgs const&)
12: tvm::runtime::DiscoWorker::Impl::MainLoop(tvm::runtime::DiscoWorker*)
13: execute_native_thread_routine
at ../../../../../libstdc++-v3/src/c++11/thread.cc:104
14: 0x0000755e4b02fac2
15: __clone
16: 0xffffffffffffffff
Aborted (core dumped)
root@34981ae00917:/# Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in _run_code
File "/opt/miniconda3/envs/python3.11/lib/python3.11/site-packages/mlc_llm/cli/worker.py", line 58, in
main()
File "/opt/miniconda3/envs/python3.11/lib/python3.11/site-packages/mlc_llm/cli/worker.py", line 53, in main
worker_func(worker_id, num_workers, num_groups, reader, writer)
File "tvm/_ffi/_cython/./packed_func.pxi", line 339, in tvm._ffi._cy3.core.PackedFuncBase.call
File "tvm/_ffi/_cython/./packed_func.pxi", line 284, in tvm._ffi._cy3.core.FuncCall
File "tvm/_ffi/_cython/./base.pxi", line 185, in tvm._ffi._cy3.core.CHECK_CALL
File "/opt/miniconda3/envs/python3.11/lib/python3.11/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
raise py_err
tvm.error.InternalError: Traceback (most recent call last):
14: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (int, int, int, long, long)>::AssignTypedLambda<void ()(int, int, int, long, long)>(void ()(int, int, int, long, long), std::cxx11::basic_string<char, std::char_traits, std::allocator >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue)#1}> >::Call(tvm::runtime::PackedFuncObj const, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue)
13: tvm::runtime::WorkerProcess(int, int, int, long, long)
12: tvm::runtime::DiscoWorker::Impl::MainLoop(tvm::runtime::DiscoWorker)
11: tvm::runtime::DiscoWorker::Impl::CallPacked(tvm::runtime::DiscoWorker, long, tvm::runtime::PackedFunc, tvm::runtime::TVMArgs const&)
10: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue)
9: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::relax_vm::VirtualMachineImpl::GetClosureInternal(tvm::runtime::String const&, bool)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue)#1}> >::Call(tvm::runtime::PackedFuncObj const, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue)
8: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeBytecode(long, std::vector<tvm::runtime::TVMRetValue, std::allocator > const&)
7: tvm::runtime::relax_vm::VirtualMachineImpl::RunLoop()
6: tvm::runtime::relax_vm::VirtualMachineImpl::RunInstrCall(tvm::runtime::relax_vm::VMFrame , tvm::runtime::relax_vm::Instruction)
5: tvm::runtime::relax_vm::__mk_TVM2::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue)#1}::operator()(tvm::runtime::relax_vm::__mk_TVM2, tvm::runtime::TVMRetValue) const [clone .constprop.0]
4: tvm::runtime::relax_vm::PagedAttentionKVCacheObj::PagedAttentionKVCacheObj(long, long, long, long, long, long, long, long, long, bool, tvm::runtime::relax_vm::RoPEMode, double, double, tvm::runtime::Optional, DLDataType, DLDevice, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::Optional, tvm::runtime::Optional, tvm::runtime::Optional, tvm::runtime::Optional, tvm::runtime::Optional, tvm::runtime::Optional, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::Optional)
3: tvm::runtime::NDArray::Empty(tvm::runtime::ShapeTuple, DLDataType, DLDevice, tvm::runtime::Optional)
2: tvm::runtime::DeviceAPI::AllocDataSpace(DLDevice, int, long const*, DLDataType, tvm::runtime::Optional<tvm::runtime::String>)
1: tvm::runtime::CUDADeviceAPI::AllocDataSpace(DLDevice, unsigned long, unsigned long, DLDataType)
0: _ZN3tvm7runtime6deta
File "/workspace/tvm/src/runtime/cuda/cuda_device_api.cc", line 145
InternalError: Check failed: (e == cudaSuccess || e == cudaErrorCudartUnloading) is false: CUDA: out of memory
Expected behavior
Environment
Additional context