mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Bug] internlm2_5 model fails when running mlc_llm serve #3016

Open l241025097 opened 1 week ago

l241025097 commented 1 week ago

🐛 Bug

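1. Using the ModelScope model Shanghai_AI_Laboratory/internlm2_5-20b-chat (downloaded locally), I could not find a matching --conv-template when compiling, so I set it to LM. Running mlc_llm serve then fails.
2. Using the Hugging Face model mlc-ai/internlm2_5-20b-q4f32_1-MLC (downloaded locally), running mlc_llm serve fails.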

To Reproduce

Steps to reproduce the behavior:

1. Problem 1:

(1)
/opt/miniconda3/envs/python3.11/bin/mlc_llm convert_weight /workspace/models/internlm2_5-20b-chat \
    --device cuda:1 \
    --quantization q4f32_1 \
    -o /workspace/models/internlm2_5-20b-chat-q4f32_1-MLC

(2)
/opt/miniconda3/envs/python3.11/bin/mlc_llm gen_config /workspace/models/internlm2_5-20b-chat \
    --quantization q4f32_1 \
    --conv-template LM \
    --tensor-parallel-shard 2 \
    -o /workspace/models/internlm2_5-20b-chat-q4f32_1-MLC

(3)
mkdir -p /workspace/models/internlm2_5-20b-chat-q4f32_1-MLC/libs

(4)
/opt/miniconda3/envs/python3.11/bin/mlc_llm compile /workspace/models/internlm2_5-20b-chat-q4f32_1-MLC/mlc-chat-config.json \
    --device cuda \
    -o /workspace/models/internlm2_5-20b-chat-q4f32_1-MLC/libs/internlm2_5-20b-chat-q4f32_1-MLC-cuda.so

(5)
/opt/miniconda3/envs/python3.11/bin/mlc_llm serve \
    /workspace/models/internlm2_5-20b-chat-q4f32_1-MLC \
    --model-lib /workspace/models/internlm2_5-20b-chat-q4f32_1-MLC/libs/internlm2_5-20b-chat-q4f32_1-MLC-cuda.so \
    --mode server \
    --host 0.0.0.0

[2024-11-10 04:28:36] INFO auto_device.py:79: Found device: cuda:0
[2024-11-10 04:28:36] INFO auto_device.py:79: Found device: cuda:1
[2024-11-10 04:28:37] INFO auto_device.py:88: Not found device: rocm:0
[2024-11-10 04:28:39] INFO auto_device.py:88: Not found device: metal:0
[2024-11-10 04:28:41] INFO auto_device.py:88: Not found device: vulkan:0
[2024-11-10 04:28:42] INFO auto_device.py:88: Not found device: opencl:0
[2024-11-10 04:28:42] INFO auto_device.py:35: Using device: cuda:0
[2024-11-10 04:28:42] INFO engine_base.py:143: Using library model: /workspace/models/internlm2_5-20b-chat-q4f32_1-MLC/libs/internlm2_5-20b-chat-q4f32_1-MLC-cuda.so
[2024-11-10 04:28:42] INFO engine_base.py:192: The selected engine mode is server. We use as much GPU memory as possible (within the limit of gpu_memory_utilization).
[2024-11-10 04:28:42] INFO engine_base.py:200: If you have low concurrent requests and want to use less GPU memory, please select mode "local".
[2024-11-10 04:28:42] INFO engine_base.py:205: If you don't have concurrent requests and only use the engine interactively, please select mode "interactive".
thread '' panicked at src/lib.rs:26:50: called Result::unwrap() on an Err value: Error("data did not match any variant of untagged enum ModelWrapper", line: 753179, column: 1) stack backtrace: 0: 0x791f1765519c - ::fmt::h41fa541dc14fbe51 1: 0x791f176a5e70 - core::fmt::write::h0892af1ec116d2e4 2: 0x791f1764a4cd - std::io::Write::write_fmt::hc85c550e5a70f4cf 3: 0x791f17654f84 - std::sys_common::backtrace::print::h5d9aabdcf93aa773 4: 0x791f17657c67 - std::panicking::default_hook::{{closure}}::h6943f7db7ebd9dfa 5: 0x791f176579cf - std::panicking::default_hook::hc843c2a865849d41 6: 0x791f176581a8 - std::panicking::rust_panic_with_hook::hac0a41b89f5ab822 7: 0x791f1765808e - std::panicking::begin_panic_handler::{{closure}}::h1c034067c5755b7e 8: 0x791f17655666 - std::sys_common::backtrace::rust_end_short_backtrace::h3f5e2602c6964099 9: 0x791f17657df2 - rust_begin_unwind 10: 0x791f1729b4c5 - core::panicking::panic_fmt::hcd09b86433080a0a 11: 0x791f1729baf3 - core::result::unwrap_failed::h37e38fafe094d785 12: 0x791f1743f27a - tokenizers_new_from_str 13: 0x791f174369e9 - _ZN10tokenizers9Tokenizer12FromBlobJSONERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE at /workspace/mlc-llm/3rdparty/tokenizers-cpp/src/huggingface_tokenizer.cc:108:63 14: 0x791f17433730 - _ZN3mlc3llm9Tokenizer8FromPathERKN3tvm7runtime6StringESt8optionalINS013TokenizerInfoEE at /workspace/mlc-llm/cpp/tokenizers/tokenizers.cc:157:57 15: 0x791f17434098 - operator() at /workspace/mlc-llm/cpp/tokenizers/tokenizers.cc:459:34 16: 0x791f17434098 - run<tvm::runtime::TVMMovableArgValueWithContext> at /workspace/mlc-llm/3rdparty/tvm/include/tvm/runtime/packed_func.h:1974:11 17: 0x791f17434098 - run<> at /workspace/mlc-llm/3rdparty/tvm/include/tvm/runtime/packed_func.h:1959:60 18: 0x791f17434098 - unpack_call<mlc::llm::Tokenizer, 1, mlc::llm::<lambda(const tvm::runtime::String&)> > at /workspace/mlc-llm/3rdparty/tvm/include/tvm/runtime/packed_func.h:1999:46 19: 0x791f17434098 - operator() at 
/workspace/mlc-llm/3rdparty/tvm/include/tvm/runtime/packed_func.h:2059:44 20: 0x791f17434098 - Call at /workspace/mlc-llm/3rdparty/tvm/include/tvm/runtime/packed_func.h:1394:58 21: 0x791f562adeba - TVMFuncCall 22: 0x791fb6df31a5 - _ZL39pyx_f_3tvm_4_ffi_4_cy3_4core_FuncCallPvP7_objectP8TVMValuePi 23: 0x791fb6df3769 - _ZL76pyx_pw_3tvm_4_ffi_4_cy3_4core_10ObjectBase_3__init_handle_by_constructorP7_objectPKS0lS0 24: 0x579946b8a9cc - _PyObject_VectorcallTstate at /usr/local/src/conda/python-3.11.8/Include/internal/pycore_call.h:92:11 25: 0x579946b8a9cc - PyObject_Vectorcall at /usr/local/src/conda/python-3.11.8/Objects/call.c:299:12 26: 0x579946b7de36 - _PyEval_EvalFrameDefault at /usr/local/src/conda/python-3.11.8/Python/ceval.c:4769:23 27: 0x579946ba14c1 - _PyEval_EvalFrame at /usr/local/src/conda/python-3.11.8/Include/internal/pycore_ceval.h:73:16 28: 0x579946ba14c1 - _PyEval_Vector at /usr/local/src/conda/python-3.11.8/Python/ceval.c:6434:24 29: 0x579946ba14c1 - _PyFunction_Vectorcall at /usr/local/src/conda/python-3.11.8/Objects/call.c:393:16 30: 0x579946ba886c - _PyObject_FastCallDictTstate at /usr/local/src/conda/python-3.11.8/Objects/call.c:141:15 31: 0x579946ba886c - _PyObject_Call_Prepend at /usr/local/src/conda/python-3.11.8/Objects/call.c:482:24 32: 0x579946ba886c - slot_tp_init at /usr/local/src/conda/python-3.11.8/Objects/typeobject.c:7854:15 33: 0x579946b70303 - type_call at /usr/local/src/conda/python-3.11.8/Objects/typeobject.c:1103:19 34: 0x579946b70303 - _PyObject_MakeTpCall at /usr/local/src/conda/python-3.11.8/Objects/call.c:214:18 35: 0x579946b7de36 - _PyEval_EvalFrameDefault at /usr/local/src/conda/python-3.11.8/Python/ceval.c:4769:23 36: 0x579946ba14c1 - _PyEval_EvalFrame at /usr/local/src/conda/python-3.11.8/Include/internal/pycore_ceval.h:73:16 37: 0x579946ba14c1 - _PyEval_Vector at /usr/local/src/conda/python-3.11.8/Python/ceval.c:6434:24 38: 0x579946ba14c1 - _PyFunction_Vectorcall at /usr/local/src/conda/python-3.11.8/Objects/call.c:393:16 
39: 0x579946ba8933 - _PyObject_FastCallDictTstate at /usr/local/src/conda/python-3.11.8/Objects/call.c:152:15 40: 0x579946ba8933 - _PyObject_Call_Prepend at /usr/local/src/conda/python-3.11.8/Objects/call.c:482:24 41: 0x579946ba8933 - slot_tp_init at /usr/local/src/conda/python-3.11.8/Objects/typeobject.c:7854:15 42: 0x579946b70303 - type_call at /usr/local/src/conda/python-3.11.8/Objects/typeobject.c:1103:19 43: 0x579946b70303 - _PyObject_MakeTpCall at /usr/local/src/conda/python-3.11.8/Objects/call.c:214:18 44: 0x579946b7de36 - _PyEval_EvalFrameDefault at /usr/local/src/conda/python-3.11.8/Python/ceval.c:4769:23 45: 0x579946ba14c1 - _PyEval_EvalFrame at /usr/local/src/conda/python-3.11.8/Include/internal/pycore_ceval.h:73:16 46: 0x579946ba14c1 - _PyEval_Vector at /usr/local/src/conda/python-3.11.8/Python/ceval.c:6434:24 47: 0x579946ba14c1 - _PyFunction_Vectorcall at /usr/local/src/conda/python-3.11.8/Objects/call.c:393:16 48: 0x579946bab1e0 - _PyVectorcall_Call at /usr/local/src/conda/python-3.11.8/Objects/call.c:257:24 49: 0x579946bab1e0 - _PyObject_Call at /usr/local/src/conda/python-3.11.8/Objects/call.c:328:16 50: 0x579946bab1e0 - PyObject_Call at /usr/local/src/conda/python-3.11.8/Objects/call.c:355:12 51: 0x579946b82119 - do_call_core at /usr/local/src/conda/python-3.11.8/Python/ceval.c:7349:12 52: 0x579946b82119 - _PyEval_EvalFrameDefault at /usr/local/src/conda/python-3.11.8/Python/ceval.c:5376:22 53: 0x579946c3442d - _PyEval_EvalFrame at /usr/local/src/conda/python-3.11.8/Include/internal/pycore_ceval.h:73:16 54: 0x579946c3442d - _PyEval_Vector at /usr/local/src/conda/python-3.11.8/Python/ceval.c:6434:24 55: 0x579946c33abf - PyEval_EvalCode at /usr/local/src/conda/python-3.11.8/Python/ceval.c:1148:21 56: 0x579946c52a1a - run_eval_code_obj at /usr/local/src/conda/python-3.11.8/Python/pythonrun.c:1741:9 57: 0x579946c4e593 - run_mod at /usr/local/src/conda/python-3.11.8/Python/pythonrun.c:1762:19 58: 0x579946c63930 - pyrun_file at 
/usr/local/src/conda/python-3.11.8/Python/pythonrun.c:1657:15 59: 0x579946c632ce - _PyRun_SimpleFileObject at /usr/local/src/conda/python-3.11.8/Python/pythonrun.c:440:13 60: 0x579946c62ff4 - _PyRun_AnyFileObject at /usr/local/src/conda/python-3.11.8/Python/pythonrun.c:79:15 61: 0x579946c5d6f4 - pymain_run_file_obj at /usr/local/src/conda/python-3.11.8/Modules/main.c:360:15 62: 0x579946c5d6f4 - pymain_run_file at /usr/local/src/conda/python-3.11.8/Modules/main.c:379:15 63: 0x579946c5d6f4 - pymain_run_python at /usr/local/src/conda/python-3.11.8/Modules/main.c:601:21 64: 0x579946c5d6f4 - Py_RunMain at /usr/local/src/conda/python-3.11.8/Modules/main.c:680:5 65: 0x579946c23a77 - Py_BytesMain at /usr/local/src/conda/python-3.11.8/Modules/main.c:734:12 66: 0x791fb74d7d90 - 67: 0x791fb74d7e40 - __libc_start_main 68: 0x579946c2391d - fatal runtime error: failed to initiate panic, error 5 Aborted (core dumped)
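
The backtrace above shows the panic starting in tokenizers_new_from_str, reached through tokenizers::Tokenizer::FromBlobJSON, so the failure happens while the tokenizer JSON is being deserialized rather than while weights are loaded. A minimal sketch for inspecting that file with only the Python standard library follows. It assumes the file is a Hugging Face style tokenizer.json in the converted model directory (the trace does not name the file), and describe_tokenizer_json is a hypothetical helper, not part of mlc_llm. It only reports what the file contains and does not by itself confirm the root cause.

```python
import json
import sys
from pathlib import Path


def describe_tokenizer_json(path):
    """Summarize the fields of a tokenizer JSON file that a parser has to match."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    model = data.get("model") or {}
    merges = model.get("merges") or []

    # Merges may be stored either as "a b" strings or as ["a", "b"] pairs,
    # depending on which library version wrote the file (an assumption to verify).
    if not merges:
        merges_format = "none"
    elif isinstance(merges[0], str):
        merges_format = "string"
    else:
        merges_format = "pair"

    return {
        "model_type": model.get("type"),
        "vocab_size": len(model.get("vocab") or {}),
        "merges_format": merges_format,
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python inspect_tokenizer.py /path/to/model-dir/tokenizer.json
    print(describe_tokenizer_json(sys.argv[1]))
```

Comparing this summary for the failing model against one from a model that loads correctly may help narrow down which part of the file the bundled parser rejects.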

2. Problem 2:

/opt/miniconda3/envs/python3.11/bin/mlc_llm serve \
    /workspace/models/internlm2_5-20b-q4f32_1-MLC \
    --mode server \
    --host 0.0.0.0 \
    --overrides "tensor_parallel_shards=2;prefill_chunk_size=512;gpu_memory_utilization=0.95;max_num_sequence=32"

[2024-11-10 05:05:55] INFO auto_device.py:79: Found device: cuda:0 [2024-11-10 05:05:55] INFO auto_device.py:79: Found device: cuda:1 [2024-11-10 05:05:56] INFO auto_device.py:88: Not found device: rocm:0 [2024-11-10 05:05:58] INFO auto_device.py:88: Not found device: metal:0 [2024-11-10 05:06:00] INFO auto_device.py:88: Not found device: vulkan:0 [2024-11-10 05:06:01] INFO auto_device.py:88: Not found device: opencl:0 [2024-11-10 05:06:01] INFO auto_device.py:35: Using device: cuda:0 [2024-11-10 05:06:01] INFO jit.py:43: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY [2024-11-10 05:06:01] INFO jit.py:118: Compiling using commands below: [2024-11-10 05:06:01] INFO jit.py:119: /opt/miniconda3/envs/python3.11/bin/python3 -m mlc_llm compile /workspace/models/internlm2_5-20b-q4f32_1-MLC --opt 'flashinfer=1;cublas_gemm=1;faster_transformer=0;cudagraph=1;cutlass=1;ipc_allreduce_strategy=NONE' --overrides 'prefill_chunk_size=512;max_batch_size=32;tensor_parallel_shards=2' --device cuda:0 --output /tmp/tmpb33h687o/lib.so [2024-11-10 05:06:03] INFO auto_config.py:70: Found model configuration: /workspace/models/internlm2_5-20b-q4f32_1-MLC/mlc-chat-config.json [2024-11-10 05:06:03] INFO auto_target.py:91: Detecting target device: cuda:0 [2024-11-10 05:06:03] INFO auto_target.py:93: Found target: {"thread_warp_size": runtime.BoxInt(32), "arch": "sm_86", "max_threads_per_block": runtime.BoxInt(1024), "max_num_threads": runtime.BoxInt(1024), "kind": "cuda", "max_shared_memory_per_block": runtime.BoxInt(49152), "tag": "", "keys": ["cuda", "gpu"]} [2024-11-10 05:06:03] INFO auto_target.py:110: Found host LLVM triple: x86_64-redhat-linux-gnu [2024-11-10 05:06:03] INFO auto_target.py:111: Found host LLVM CPU: haswell [2024-11-10 05:06:03] INFO auto_target.py:334: Generating code for CUDA architecture: sm_86 [2024-11-10 05:06:03] INFO auto_target.py:335: To produce multi-arch fatbin, set environment variable MLC_MULTI_ARCH. 
Example: MLC_MULTI_ARCH=70,72,75,80,86,87,89,90a [2024-11-10 05:06:03] INFO auto_config.py:154: Found model type: internlm2. Use --model-type to override. Compiling with arguments: --config InternLM2Config(vocab_size=92544, hidden_size=6144, num_hidden_layers=48, num_attention_heads=48, num_key_value_heads=8, rms_norm_eps=1e-05, intermediate_size=16384, bias=False, use_cache=True, rope_theta=50000000, pad_token_id=2, bos_token_id=1, eos_token_id=2, context_window_size=262144, prefill_chunk_size=2048, tensor_parallel_shards=1, max_batch_size=80, head_dim=128, kwargs={}) --quantization GroupQuantize(name='q4f32_1', kind='group-quant', group_size=32, quantize_dtype='int4', storage_dtype='uint32', model_dtype='float32', linear_weight_layout='NK', quantize_embedding=True, quantize_final_fc=True, num_elem_per_storage=8, num_storage_per_group=4, max_int_value=7, tensor_parallel_shards=0) --model-type internlm2 --target {"thread_warp_size": runtime.BoxInt(32), "host": {"mtriple": "x86_64-redhat-linux-gnu", "tag": "", "kind": "llvm", "mcpu": "haswell", "keys": ["cpu"]}, "arch": "sm_86", "max_threads_per_block": runtime.BoxInt(1024), "libs": ["thrust"], "max_num_threads": runtime.BoxInt(1024), "kind": "cuda", "max_shared_memory_per_block": runtime.BoxInt(49152), "tag": "", "keys": ["cuda", "gpu"]} --opt flashinfer=1;cublas_gemm=0;faster_transformer=0;cudagraph=1;cutlass=1;ipc_allreduce_strategy=NONE --system-lib-prefix "" --output /tmp/tmpb33h687o/lib.so --overrides context_window_size=None;sliding_window_size=None;prefill_chunk_size=512;attention_sink_size=None;max_batch_size=32;tensor_parallel_shards=2;pipeline_parallel_stages=None [2024-11-10 05:06:03] INFO config.py:107: Overriding prefill_chunk_size from 2048 to 512 [2024-11-10 05:06:03] INFO config.py:107: Overriding max_batch_size from 80 to 32 [2024-11-10 05:06:03] INFO config.py:107: Overriding tensor_parallel_shards from 1 to 2 [2024-11-10 05:06:03] INFO compile.py:140: Creating model from: 
InternLM2Config(vocab_size=92544, hidden_size=6144, num_hidden_layers=48, num_attention_heads=48, num_key_value_heads=8, rms_norm_eps=1e-05, intermediate_size=16384, bias=False, use_cache=True, rope_theta=50000000, pad_token_id=2, bos_token_id=1, eos_token_id=2, context_window_size=262144, prefill_chunk_size=2048, tensor_parallel_shards=1, max_batch_size=80, head_dim=128, kwargs={}) [2024-11-10 05:06:03] INFO compile.py:158: Exporting the model to TVM Unity compiler [2024-11-10 05:06:07] INFO compile.py:164: Running optimizations using TVM Unity [2024-11-10 05:06:07] INFO compile.py:185: Registering metadata: {'model_type': 'internlm2', 'quantization': 'q4f32_1', 'context_window_size': 262144, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 512, 'tensor_parallel_shards': 2, 'pipeline_parallel_stages': 1, 'kv_state_kind': 'kv_cache', 'max_batch_size': 32} [2024-11-10 05:06:09] INFO pipeline.py:54: Running TVM Relax graph-level optimizations [2024-11-10 05:06:14] INFO pipeline.py:54: Lowering to TVM TIR kernels [2024-11-10 05:06:25] INFO pipeline.py:54: Running TVM TIR-level optimizations [2024-11-10 05:07:00] INFO pipeline.py:54: Running TVM Dlight low-level optimizations [2024-11-10 05:07:02] INFO pipeline.py:54: Lowering to VM bytecode [2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function alloc_embedding_tensor: 12.00 MB [2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function argsort_probs: 0.00 MB [2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function batch_decode: 15.05 MB [2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function batch_prefill: 71.30 MB [2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function batch_verify: 240.75 MB [2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function create_tir_paged_kv_cache: 0.00 MB [2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] 
Function decode: 0.47 MB [2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function embed: 12.00 MB [2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function multinomial_from_uniform: 0.00 MB [2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function prefill: 60.38 MB [2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function renormalize_by_top_p: 0.00 MB [2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function sample_with_top_p: 0.00 MB [2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function sampler_take_probs: 0.00 MB [2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function sampler_verify_draft_tokens: 0.00 MB [2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function softmax_with_temperature: 0.00 MB [2024-11-10 05:07:17] INFO pipeline.py:54: Compiling external modules [2024-11-10 05:07:17] INFO pipeline.py:54: Compilation complete! Exporting to disk [2024-11-10 05:07:47] INFO model_metadata.py:95: Total memory usage without KV cache:: 6500.84 MB (Parameters: 6260.09 MB. Temporary buffer: 240.75 MB) [2024-11-10 05:07:47] INFO model_metadata.py:103: To reduce memory usage, tweak prefill_chunk_size, context_window_size and sliding_window_size [2024-11-10 05:07:47] INFO compile.py:207: Generated: /tmp/tmpb33h687o/lib.so [2024-11-10 05:07:48] INFO jit.py:126: Using compiled model lib: /root/.cache/mlc_llm/model_lib/0eb086393737fad474780549d0878131.so [2024-11-10 05:07:48] INFO engine_base.py:192: The selected engine mode is server. We use as much GPU memory as possible (within the limit of gpu_memory_utilization). [2024-11-10 05:07:48] INFO engine_base.py:200: If you have low concurrent requests and want to use less GPU memory, please select mode "local". 
[2024-11-10 05:07:48] INFO engine_base.py:205: If you don't have concurrent requests and only use the engine interactively, please select mode "interactive". [05:07:50] /workspace/mlc-llm/cpp/serve/config.cc:688: Under mode "local", max batch size 32 is specified by user, max KV cache token capacity will be set to 8192, prefill chunk size 512 is specified by user. [05:07:50] /workspace/mlc-llm/cpp/serve/config.cc:688: Under mode "interactive", max batch size 32 is specified by user, max KV cache token capacity will be set to 172266, prefill chunk size 512 is specified by user. [05:07:50] /workspace/mlc-llm/cpp/serve/config.cc:688: Under mode "server", max batch size 32 is specified by user, max KV cache token capacity will be set to 172266, prefill chunk size 512 is specified by user. [05:07:50] /workspace/mlc-llm/cpp/serve/config.cc:769: The actual engine mode is "server". So max batch size is 32, max KV cache token capacity is 172266, prefill chunk size is 512. [05:07:50] /workspace/mlc-llm/cpp/serve/config.cc:774: Estimated total single GPU memory usage: 23037.859 MB (Parameters: 6260.086 MB. KVCache: 16226.536 MB. Temporary buffer: 551.237 MB). The actual usage might be slightly larger than the estimated number. [05:07:50] /workspace/mlc-llm/cpp/multi_gpu/multi_gpu_loader.cc:150: [Worker #0] Loading model to device: cuda:0 [05:07:50] /workspace/mlc-llm/cpp/multi_gpu/multi_gpu_loader.cc:150: [Worker #1] Loading model to device: cuda:1 [05:07:50] /workspace/mlc-llm/cpp/multi_gpu/multi_gpu_loader.cc:175: Loading parameters... [==================================================================================================>] [485/485] [05:08:04] /workspace/mlc-llm/cpp/multi_gpu/multi_gpu_loader.cc:203: Loading done. Time used: Loading 11.741 s Preprocessing 2.289 s. 
terminate called after throwing an instance of 'tvm::runtime::InternalError' what(): [05:08:05] /workspace/tvm/src/runtime/cuda/cuda_device_api.cc:145: InternalError: Check failed: (e == cudaSuccess || e == cudaErrorCudartUnloading) is false: CUDA: out of memory Stack trace: 0: _ZN3tvm7runtime6deta 1: tvm::runtime::CUDADeviceAPI::AllocDataSpace(DLDevice, unsigned long, unsigned long, DLDataType) 2: tvm::runtime::DeviceAPI::AllocDataSpace(DLDevice, int, long const, DLDataType, tvm::runtime::Optional) 3: tvm::runtime::NDArray::Empty(tvm::runtime::ShapeTuple, DLDataType, DLDevice, tvm::runtime::Optional) 4: tvm::runtime::relax_vm::PagedAttentionKVCacheObj::PagedAttentionKVCacheObj(long, long, long, long, long, long, long, long, long, bool, tvm::runtime::relax_vm::RoPEMode, double, double, tvm::runtime::Optional, DLDataType, DLDevice, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::Optional, tvm::runtime::Optional, tvm::runtime::Optional, tvm::runtime::Optional, tvm::runtime::Optional, tvm::runtime::Optional, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::Optional) 5: tvm::runtime::relax_vm::__mk_TVM2::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue)#1}::operator()(tvm::runtime::relax_vm::__mk_TVM2, tvm::runtime::TVMRetValue) const [clone .constprop.0] 6: tvm::runtime::relax_vm::VirtualMachineImpl::RunInstrCall(tvm::runtime::relax_vm::VMFrame, tvm::runtime::relax_vm::Instruction) 7: tvm::runtime::relax_vm::VirtualMachineImpl::RunLoop() 8: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeBytecode(long, std::vector<tvm::runtime::TVMRetValue, std::allocator > const&) 9: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::relax_vm::VirtualMachineImpl::GetClosureInternal(tvm::runtime::String 
const&, bool)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue)#1}> >::Call(tvm::runtime::PackedFuncObj const, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue) 10: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue) 11: tvm::runtime::DiscoWorker::Impl::CallPacked(tvm::runtime::DiscoWorker, long, tvm::runtime::PackedFunc, tvm::runtime::TVMArgs const&) 12: tvm::runtime::DiscoWorker::Impl::MainLoop(tvm::runtime::DiscoWorker*) 13: execute_native_thread_routine at ../../../../../libstdc++-v3/src/c++11/thread.cc:104 14: 0x0000755e4b02fac2 15: __clone 16: 0xffffffffffffffff

Aborted (core dumped) root@34981ae00917:/# Traceback (most recent call last): File "", line 198, in _run_module_as_main File "", line 88, in _run_code File "/opt/miniconda3/envs/python3.11/lib/python3.11/site-packages/mlc_llm/cli/worker.py", line 58, in main() File "/opt/miniconda3/envs/python3.11/lib/python3.11/site-packages/mlc_llm/cli/worker.py", line 53, in main worker_func(worker_id, num_workers, num_groups, reader, writer) File "tvm/_ffi/_cython/./packed_func.pxi", line 339, in tvm._ffi._cy3.core.PackedFuncBase.call File "tvm/_ffi/_cython/./packed_func.pxi", line 284, in tvm._ffi._cy3.core.FuncCall File "tvm/_ffi/_cython/./base.pxi", line 185, in tvm._ffi._cy3.core.CHECK_CALL File "/opt/miniconda3/envs/python3.11/lib/python3.11/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error raise py_err tvm.error.InternalError: Traceback (most recent call last): 14: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (int, int, int, long, long)>::AssignTypedLambda<void ()(int, int, int, long, long)>(void ()(int, int, int, long, long), std::cxx11::basic_string<char, std::char_traits, std::allocator >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue)#1}> >::Call(tvm::runtime::PackedFuncObj const, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue) 13: tvm::runtime::WorkerProcess(int, int, int, long, long) 12: tvm::runtime::DiscoWorker::Impl::MainLoop(tvm::runtime::DiscoWorker) 11: tvm::runtime::DiscoWorker::Impl::CallPacked(tvm::runtime::DiscoWorker, long, tvm::runtime::PackedFunc, tvm::runtime::TVMArgs const&) 10: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue) 9: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::relax_vm::VirtualMachineImpl::GetClosureInternal(tvm::runtime::String const&, bool)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue)#1}> 
>::Call(tvm::runtime::PackedFuncObj const, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue) 8: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeBytecode(long, std::vector<tvm::runtime::TVMRetValue, std::allocator > const&) 7: tvm::runtime::relax_vm::VirtualMachineImpl::RunLoop() 6: tvm::runtime::relax_vm::VirtualMachineImpl::RunInstrCall(tvm::runtime::relax_vm::VMFrame, tvm::runtime::relax_vm::Instruction) 5: tvm::runtime::relax_vm::mk_TVM2::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue)#1}::operator()(tvm::runtime::relax_vm::__mk_TVM2, tvm::runtime::TVMRetValue) const [clone .constprop.0] 4: tvm::runtime::relax_vm::PagedAttentionKVCacheObj::PagedAttentionKVCacheObj(long, long, long, long, long, long, long, long, long, bool, tvm::runtime::relax_vm::RoPEMode, double, double, tvm::runtime::Optional, DLDataType, DLDevice, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::Optional, tvm::runtime::Optional, tvm::runtime::Optional, tvm::runtime::Optional, tvm::runtime::Optional, tvm::runtime::Optional, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::Optional) 3: tvm::runtime::NDArray::Empty(tvm::runtime::ShapeTuple, DLDataType, DLDevice, tvm::runtime::Optional) 2: tvm::runtime::DeviceAPI::AllocDataSpace(DLDevice, int, long const, DLDataType, tvm::runtime::Optional) 1: tvm::runtime::CUDADeviceAPI::AllocDataSpace(DLDevice, unsigned long, unsigned long, DLDataType) 0: _ZN3tvm7runtime6deta File "/workspace/tvm/src/runtime/cuda/cuda_device_api.cc", line 145 InternalError: Check failed: (e == cudaSuccess || e == cudaErrorCudartUnloading) is false: CUDA: out of memory

Expected behavior

Environment

Additional context

Hzfengsy commented 6 days ago

InternalError: Check failed: (e == cudaSuccess || e == cudaErrorCudartUnloading) is false: CUDA: out of memory

:)

l241025097 commented 6 days ago

InternalError: Check failed: (e == cudaSuccess || e == cudaErrorCudartUnloading) is false: CUDA: out of memory

:)

max KV cache token capacity will be set to 172266, prefill chunk size 512 is specified by user.

When I reduce the prefill chunk size, the max KV cache token capacity increases automatically, so it always ends in OOM.

Also, please take a look at problem 1.
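
The behavior described in this comment is consistent with the engine log, which says that in server mode the engine uses as much GPU memory as possible within the limit of gpu_memory_utilization. A rough sketch of that budget follows. It assumes a simplified model in which the KV cache receives whatever remains after parameters and temporary buffers (an assumption, not the actual logic in config.cc), and a hypothetical 24576 MB card, since the thread does not state the GPU memory size. The three component figures are taken from the config.cc:774 log line above.

```python
def kv_cache_budget_mb(total_gpu_mb, gpu_memory_utilization, params_mb, temp_buffer_mb):
    """Memory left for the KV cache under the simplified budget model (an assumption)."""
    return total_gpu_mb * gpu_memory_utilization - params_mb - temp_buffer_mb


# Figures reported in the log for the failing run (gpu_memory_utilization=0.95).
PARAMS_MB = 6260.086
KV_CACHE_MB = 16226.536
TEMP_BUFFER_MB = 551.237

# The three components add up to the "Estimated total single GPU memory usage".
estimated_total_mb = PARAMS_MB + KV_CACHE_MB + TEMP_BUFFER_MB

# Hypothetical card size, used only for illustration.
TOTAL_GPU_MB = 24576.0

if __name__ == "__main__":
    print(f"estimated total from log: {estimated_total_mb:.3f} MB")
    for util in (0.95, 0.8, 0.7):
        budget = kv_cache_budget_mb(TOTAL_GPU_MB, util, PARAMS_MB, TEMP_BUFFER_MB)
        print(f"utilization {util}: about {budget:.0f} MB left for the KV cache")
```

Under this model, shrinking prefill_chunk_size only reduces the temporary buffer, and the KV cache grows to fill the freed space, so the total stays near the same limit. Lowering gpu_memory_utilization is the setting that actually reduces the total.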

Hzfengsy commented 5 days ago

Please check the config argument gpu_memory_utilization; it may need to be set to a smaller number.

As for Q1, try using the template chatlm. See ref.
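
One way to apply the first suggestion is to rewrite the --overrides string passed to mlc_llm serve. The sketch below parses the key=value;key=value string from the original command and lowers gpu_memory_utilization. The helper names parse_overrides and format_overrides are hypothetical, and 0.7 is used only because a later comment in this thread reports that value working on the reporter's machine.

```python
def parse_overrides(text):
    """Split a 'key=value;key=value' overrides string into an ordered dict of strings."""
    pairs = [item.split("=", 1) for item in text.split(";") if item.strip()]
    return {key.strip(): value.strip() for key, value in pairs}


def format_overrides(overrides):
    """Join the dict back into the 'key=value;key=value' form."""
    return ";".join(f"{key}={value}" for key, value in overrides.items())


# The overrides string from the failing command in this issue.
original = (
    "tensor_parallel_shards=2;prefill_chunk_size=512;"
    "gpu_memory_utilization=0.95;max_num_sequence=32"
)

overrides = parse_overrides(original)
overrides["gpu_memory_utilization"] = "0.7"  # smaller value, as suggested
updated = format_overrides(overrides)

if __name__ == "__main__":
    print(f'--overrides "{updated}"')
```

The other keys are left untouched, so the resulting string can be dropped into the same mlc_llm serve command.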

l241025097 commented 5 days ago

Please check the config argument gpu_memory_utilization; it may need to be set to a smaller number.

As for Q1, try using the template chatlm. See ref.

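Thank you very much. Problem 2 is solved: with gpu_memory_utilization set to 0.7 or lower it runs successfully, but at 0.8 it fails. For problem 1, however, I changed the template to chatlm, and mlc_llm serve still fails: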

[2024-11-12 03:43:06] INFO auto_device.py:79: Found device: cuda:0
[2024-11-12 03:43:06] INFO auto_device.py:79: Found device: cuda:1
[2024-11-12 03:43:08] INFO auto_device.py:88: Not found device: rocm:0
[2024-11-12 03:43:09] INFO auto_device.py:88: Not found device: metal:0
[2024-11-12 03:43:11] INFO auto_device.py:88: Not found device: vulkan:0
[2024-11-12 03:43:12] INFO auto_device.py:88: Not found device: opencl:0
[2024-11-12 03:43:12] INFO auto_device.py:35: Using device: cuda:0
[2024-11-12 03:43:12] INFO engine_base.py:143: Using library model: /workspace/models/internlm2_5-20b-chat-q4f32_1-MLC/libs/internlm2_5-20b-chat-q4f32_1-MLC-cuda.so
[2024-11-12 03:43:12] INFO engine_base.py:192: The selected engine mode is server. We use as much GPU memory as possible (within the limit of gpu_memory_utilization).
[2024-11-12 03:43:12] INFO engine_base.py:200: If you have low concurrent requests and want to use less GPU memory, please select mode "local".
[2024-11-12 03:43:12] INFO engine_base.py:205: If you don't have concurrent requests and only use the engine interactively, please select mode "interactive".
thread '' panicked at src/lib.rs:26:50: called Result::unwrap() on an Err value: Error("data did not match any variant of untagged enum ModelWrapper", line: 753179, column: 1) stack backtrace: 0: 0x74b3b405519c - ::fmt::h41fa541dc14fbe51 1: 0x74b3b40a5e70 - core::fmt::write::h0892af1ec116d2e4 2: 0x74b3b404a4cd - std::io::Write::write_fmt::hc85c550e5a70f4cf 3: 0x74b3b4054f84 - std::sys_common::backtrace::print::h5d9aabdcf93aa773 4: 0x74b3b4057c67 - std::panicking::default_hook::{{closure}}::h6943f7db7ebd9dfa 5: 0x74b3b40579cf - std::panicking::default_hook::hc843c2a865849d41 6: 0x74b3b40581a8 - std::panicking::rust_panic_with_hook::hac0a41b89f5ab822 7: 0x74b3b405808e - std::panicking::begin_panic_handler::{{closure}}::h1c034067c5755b7e 8: 0x74b3b4055666 - std::sys_common::backtrace::rust_end_short_backtrace::h3f5e2602c6964099 9: 0x74b3b4057df2 - rust_begin_unwind 10: 0x74b3b3c9b4c5 - core::panicking::panic_fmt::hcd09b86433080a0a 11: 0x74b3b3c9baf3 - core::result::unwrap_failed::h37e38fafe094d785 12: 0x74b3b3e3f27a - tokenizers_new_from_str 13: 0x74b3b3e369e9 - _ZN10tokenizers9Tokenizer12FromBlobJSONERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE at /workspace/mlc-llm/3rdparty/tokenizers-cpp/src/huggingface_tokenizer.cc:108:63 14: 0x74b3b3e33730 - _ZN3mlc3llm9Tokenizer8FromPathERKN3tvm7runtime6StringESt8optionalINS013TokenizerInfoEE at /workspace/mlc-llm/cpp/tokenizers/tokenizers.cc:157:57 15: 0x74b3b3e34098 - operator() at /workspace/mlc-llm/cpp/tokenizers/tokenizers.cc:459:34 16: 0x74b3b3e34098 - run<tvm::runtime::TVMMovableArgValueWithContext> at /workspace/mlc-llm/3rdparty/tvm/include/tvm/runtime/packed_func.h:1974:11 17: 0x74b3b3e34098 - run<> at /workspace/mlc-llm/3rdparty/tvm/include/tvm/runtime/packed_func.h:1959:60 18: 0x74b3b3e34098 - unpack_call<mlc::llm::Tokenizer, 1, mlc::llm::<lambda(const tvm::runtime::String&)> > at /workspace/mlc-llm/3rdparty/tvm/include/tvm/runtime/packed_func.h:1999:46 19: 0x74b3b3e34098 - operator() at 
/workspace/mlc-llm/3rdparty/tvm/include/tvm/runtime/packed_func.h:2059:44 20: 0x74b3b3e34098 - Call at /workspace/mlc-llm/3rdparty/tvm/include/tvm/runtime/packed_func.h:1394:58 21: 0x74b3f2cadeba - TVMFuncCall 22: 0x74b4536ec1a5 - _ZL39pyx_f_3tvm_4_ffi_4_cy3_4core_FuncCallPvP7_objectP8TVMValuePi 23: 0x74b4536ec769 - _ZL76pyx_pw_3tvm_4_ffi_4_cy3_4core_10ObjectBase_3__init_handle_by_constructorP7_objectPKS0lS0 24: 0x652d2438e9cc - _PyObject_VectorcallTstate at /usr/local/src/conda/python-3.11.8/Include/internal/pycore_call.h:92:11 25: 0x652d2438e9cc - PyObject_Vectorcall at /usr/local/src/conda/python-3.11.8/Objects/call.c:299:12 26: 0x652d24381e36 - _PyEval_EvalFrameDefault at /usr/local/src/conda/python-3.11.8/Python/ceval.c:4769:23 27: 0x652d243a54c1 - _PyEval_EvalFrame at /usr/local/src/conda/python-3.11.8/Include/internal/pycore_ceval.h:73:16 28: 0x652d243a54c1 - _PyEval_Vector at /usr/local/src/conda/python-3.11.8/Python/ceval.c:6434:24 29: 0x652d243a54c1 - _PyFunction_Vectorcall at /usr/local/src/conda/python-3.11.8/Objects/call.c:393:16 30: 0x652d243ac86c - _PyObject_FastCallDictTstate at /usr/local/src/conda/python-3.11.8/Objects/call.c:141:15 31: 0x652d243ac86c - _PyObject_Call_Prepend at /usr/local/src/conda/python-3.11.8/Objects/call.c:482:24 32: 0x652d243ac86c - slot_tp_init at /usr/local/src/conda/python-3.11.8/Objects/typeobject.c:7854:15 33: 0x652d24374303 - type_call at /usr/local/src/conda/python-3.11.8/Objects/typeobject.c:1103:19 34: 0x652d24374303 - _PyObject_MakeTpCall at /usr/local/src/conda/python-3.11.8/Objects/call.c:214:18 35: 0x652d24381e36 - _PyEval_EvalFrameDefault at /usr/local/src/conda/python-3.11.8/Python/ceval.c:4769:23 36: 0x652d243a54c1 - _PyEval_EvalFrame at /usr/local/src/conda/python-3.11.8/Include/internal/pycore_ceval.h:73:16 37: 0x652d243a54c1 - _PyEval_Vector at /usr/local/src/conda/python-3.11.8/Python/ceval.c:6434:24 38: 0x652d243a54c1 - _PyFunction_Vectorcall at /usr/local/src/conda/python-3.11.8/Objects/call.c:393:16 
39: 0x652d243ac933 - _PyObject_FastCallDictTstate at /usr/local/src/conda/python-3.11.8/Objects/call.c:152:15 40: 0x652d243ac933 - _PyObject_Call_Prepend at /usr/local/src/conda/python-3.11.8/Objects/call.c:482:24 41: 0x652d243ac933 - slot_tp_init at /usr/local/src/conda/python-3.11.8/Objects/typeobject.c:7854:15 42: 0x652d24374303 - type_call at /usr/local/src/conda/python-3.11.8/Objects/typeobject.c:1103:19 43: 0x652d24374303 - _PyObject_MakeTpCall at /usr/local/src/conda/python-3.11.8/Objects/call.c:214:18 44: 0x652d24381e36 - _PyEval_EvalFrameDefault at /usr/local/src/conda/python-3.11.8/Python/ceval.c:4769:23 45: 0x652d243a54c1 - _PyEval_EvalFrame at /usr/local/src/conda/python-3.11.8/Include/internal/pycore_ceval.h:73:16 46: 0x652d243a54c1 - _PyEval_Vector at /usr/local/src/conda/python-3.11.8/Python/ceval.c:6434:24 47: 0x652d243a54c1 - _PyFunction_Vectorcall at /usr/local/src/conda/python-3.11.8/Objects/call.c:393:16 48: 0x652d243af1e0 - _PyVectorcall_Call at /usr/local/src/conda/python-3.11.8/Objects/call.c:257:24 49: 0x652d243af1e0 - _PyObject_Call at /usr/local/src/conda/python-3.11.8/Objects/call.c:328:16 50: 0x652d243af1e0 - PyObject_Call at /usr/local/src/conda/python-3.11.8/Objects/call.c:355:12 51: 0x652d24386119 - do_call_core at /usr/local/src/conda/python-3.11.8/Python/ceval.c:7349:12 52: 0x652d24386119 - _PyEval_EvalFrameDefault at /usr/local/src/conda/python-3.11.8/Python/ceval.c:5376:22 53: 0x652d2443842d - _PyEval_EvalFrame at /usr/local/src/conda/python-3.11.8/Include/internal/pycore_ceval.h:73:16 54: 0x652d2443842d - _PyEval_Vector at /usr/local/src/conda/python-3.11.8/Python/ceval.c:6434:24 55: 0x652d24437abf - PyEval_EvalCode at /usr/local/src/conda/python-3.11.8/Python/ceval.c:1148:21 56: 0x652d24456a1a - run_eval_code_obj at /usr/local/src/conda/python-3.11.8/Python/pythonrun.c:1741:9 57: 0x652d24452593 - run_mod at /usr/local/src/conda/python-3.11.8/Python/pythonrun.c:1762:19 58: 0x652d24467930 - pyrun_file at 
/usr/local/src/conda/python-3.11.8/Python/pythonrun.c:1657:15 59: 0x652d244672ce - _PyRun_SimpleFileObject at /usr/local/src/conda/python-3.11.8/Python/pythonrun.c:440:13 60: 0x652d24466ff4 - _PyRun_AnyFileObject at /usr/local/src/conda/python-3.11.8/Python/pythonrun.c:79:15 61: 0x652d244616f4 - pymain_run_file_obj at /usr/local/src/conda/python-3.11.8/Modules/main.c:360:15 62: 0x652d244616f4 - pymain_run_file at /usr/local/src/conda/python-3.11.8/Modules/main.c:379:15 63: 0x652d244616f4 - pymain_run_python at /usr/local/src/conda/python-3.11.8/Modules/main.c:601:21 64: 0x652d244616f4 - Py_RunMain at /usr/local/src/conda/python-3.11.8/Modules/main.c:680:5 65: 0x652d24427a77 - Py_BytesMain at /usr/local/src/conda/python-3.11.8/Modules/main.c:734:12 66: 0x74b453dd0d90 - 67: 0x74b453dd0e40 - __libc_start_main 68: 0x652d2442791d - fatal runtime error: failed to initiate panic, error 5 Aborted (core dumped)

MasterJH5574 commented 2 days ago

@l241025097 Thank you for reporting. We'll look into the second issue you mentioned.

MasterJH5574 commented 1 day ago

Hi @l241025097, we have fixed this issue. Please upgrade the mlc python package to the latest nightly and try again, thanks!