l241025097 opened 1 week ago
InternalError: Check failed: (e == cudaSuccess || e == cudaErrorCudartUnloading) is false: CUDA: out of memory
:)
max KV cache token capacity will be set to 172266, prefill chunk size 512 is specified by user.
When I reduce the prefill chunk size, the max KV cache token capacity automatically increases, so it still runs out of memory every time.
Also, please take a look at Question 1.
Please check the config arg gpu_memory_utilization; it may need to be set to a smaller number.
As for Q1, try to use the template chatlm. See ref.
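For illustration, a minimal sketch of how that suggestion maps onto the serve command used later in this thread: gpu_memory_utilization is lowered through --overrides (0.7 is the value that ends up working below), and --mode local is the lower-memory alternative the engine log itself recommends.

# Sketch only: same serve invocation as in this issue, but with a smaller
# gpu_memory_utilization; switching --mode server to --mode local would also
# cap the KV cache at a much smaller size.
/opt/miniconda3/envs/python3.11/bin/mlc_llm serve \
    /workspace/models/internlm2_5-20b-q4f32_1-MLC \
    --mode server \
    --host 0.0.0.0 \
    --overrides "tensor_parallel_shards=2;prefill_chunk_size=512;gpu_memory_utilization=0.7;max_num_sequence=32"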
Thank you very much. Problem 2 is solved: with gpu_memory_utilization set to 0.7 or below it runs successfully, while 0.8 still fails. But for Problem 1: I changed the template to chatlm, and mlc_llm serve still reports an error:
[2024-11-12 03:43:06] INFO auto_device.py:79: Found device: cuda:0
[2024-11-12 03:43:06] INFO auto_device.py:79: Found device: cuda:1
[2024-11-12 03:43:08] INFO auto_device.py:88: Not found device: rocm:0
[2024-11-12 03:43:09] INFO auto_device.py:88: Not found device: metal:0
[2024-11-12 03:43:11] INFO auto_device.py:88: Not found device: vulkan:0
[2024-11-12 03:43:12] INFO auto_device.py:88: Not found device: opencl:0
[2024-11-12 03:43:12] INFO auto_device.py:35: Using device: cuda:0
[2024-11-12 03:43:12] INFO engine_base.py:143: Using library model: /workspace/models/internlm2_5-20b-chat-q4f32_1-MLC/libs/internlm2_5-20b-chat-q4f32_1-MLC-cuda.so
[2024-11-12 03:43:12] INFO engine_base.py:192: The selected engine mode is server. We use as much GPU memory as possible (within the limit of gpu_memory_utilization).
[2024-11-12 03:43:12] INFO engine_base.py:200: If you have low concurrent requests and want to use less GPU memory, please select mode "local".
[2024-11-12 03:43:12] INFO engine_base.py:205: If you don't have concurrent requests and only use the engine interactively, please select mode "interactive".
thread '' panicked: called `Result::unwrap()` on an `Err` value: Error("data did not match any variant of untagged enum ModelWrapper", line: 753179, column: 1)
stack backtrace:
0: 0x74b3b405519c -
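As a sketch of what "use template chatlm" involves in practice: the conversation template is baked into mlc-chat-config.json by gen_config, so the config has to be regenerated with the new --conv-template and the model served against the updated config. The template name below is an assumption (MLC ships a chatml conversation template; the reply above spells it chatlm), not something confirmed in this thread.

# Sketch: regenerate the chat config with a different conversation template.
# "chatml" is an assumed template name; substitute whichever name applies.
/opt/miniconda3/envs/python3.11/bin/mlc_llm gen_config /workspace/models/internlm2_5-20b-chat \
    --quantization q4f32_1 \
    --conv-template chatml \
    --tensor-parallel-shards 2 \
    -o /workspace/models/internlm2_5-20b-chat-q4f32_1-MLC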
@l241025097 Thank you for reporting. We'll look into the second issue you mentioned.
Hi @l241025097, we have fixed this issue. Please upgrade the mlc python package to the latest nightly and try again, thanks!
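For reference, a sketch of the suggested upgrade, assuming a CUDA 12.2 build; the exact nightly package names depend on your CUDA version, so check the MLC-LLM installation docs if they differ.

# Sketch: pull the latest MLC nightly wheels (package names assume CUDA 12.2).
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu122 mlc-ai-nightly-cu122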
🐛 Bug
1. Using the ModelScope model Shanghai_AI_Laboratory/internlm2_5-20b-chat (downloaded locally): no matching --conv-template was found when building the config, so it was set to LM. Running mlc_llm serve then fails with an error. 2. Using the Hugging Face model mlc-ai/internlm2_5-20b-q4f32_1-MLC (downloaded locally): running mlc_llm serve fails with an error.
To Reproduce
Steps to reproduce the behavior:
1. Problem 1:
(1) /opt/miniconda3/envs/python3.11/bin/mlc_llm convert_weight /workspace/models/internlm2_5-20b-chat \
      --device cuda:1 \
      --quantization q4f32_1 \
      -o /workspace/models/internlm2_5-20b-chat-q4f32_1-MLC
(2) /opt/miniconda3/envs/python3.11/bin/mlc_llm gen_config /workspace/models/internlm2_5-20b-chat \
      --quantization q4f32_1 \
      --conv-template LM \
      --tensor-parallel-shard 2 \
      -o /workspace/models/internlm2_5-20b-chat-q4f32_1-MLC
(3) mkdir -p /workspace/models/internlm2_5-20b-chat-q4f32_1-MLC/libs
(4) /opt/miniconda3/envs/python3.11/bin/mlc_llm compile /workspace/models/internlm2_5-20b-chat-q4f32_1-MLC/mlc-chat-config.json \
      --device cuda \
      -o /workspace/models/internlm2_5-20b-chat-q4f32_1-MLC/libs/internlm2_5-20b-chat-q4f32_1-MLC-cuda.so
(5) /opt/miniconda3/envs/python3.11/bin/mlc_llm serve \
      /workspace/models/internlm2_5-20b-chat-q4f32_1-MLC \
      --model-lib /workspace/models/internlm2_5-20b-chat-q4f32_1-MLC/libs/internlm2_5-20b-chat-q4f32_1-MLC-cuda.so \
      --mode server \
      --host 0.0.0.0
[2024-11-10 04:28:36] INFO auto_device.py:79: Found device: cuda:0
[2024-11-10 04:28:36] INFO auto_device.py:79: Found device: cuda:1
[2024-11-10 04:28:37] INFO auto_device.py:88: Not found device: rocm:0
[2024-11-10 04:28:39] INFO auto_device.py:88: Not found device: metal:0
[2024-11-10 04:28:41] INFO auto_device.py:88: Not found device: vulkan:0
[2024-11-10 04:28:42] INFO auto_device.py:88: Not found device: opencl:0
[2024-11-10 04:28:42] INFO auto_device.py:35: Using device: cuda:0
[2024-11-10 04:28:42] INFO engine_base.py:143: Using library model: /workspace/models/internlm2_5-20b-chat-q4f32_1-MLC/libs/internlm2_5-20b-chat-q4f32_1-MLC-cuda.so
[2024-11-10 04:28:42] INFO engine_base.py:192: The selected engine mode is server. We use as much GPU memory as possible (within the limit of gpu_memory_utilization).
[2024-11-10 04:28:42] INFO engine_base.py:200: If you have low concurrent requests and want to use less GPU memory, please select mode "local".
[2024-11-10 04:28:42] INFO engine_base.py:205: If you don't have concurrent requests and only use the engine interactively, please select mode "interactive".
thread '' panicked at src/lib.rs:26:50:
called `Result::unwrap()` on an `Err` value: Error("data did not match any variant of untagged enum ModelWrapper", line: 753179, column: 1)
stack backtrace:
0: 0x791f1765519c - ::fmt::h41fa541dc14fbe51
1: 0x791f176a5e70 - core::fmt::write::h0892af1ec116d2e4
2: 0x791f1764a4cd - std::io::Write::write_fmt::hc85c550e5a70f4cf
3: 0x791f17654f84 - std::sys_common::backtrace::print::h5d9aabdcf93aa773
4: 0x791f17657c67 - std::panicking::default_hook::{{closure}}::h6943f7db7ebd9dfa
5: 0x791f176579cf - std::panicking::default_hook::hc843c2a865849d41
6: 0x791f176581a8 - std::panicking::rust_panic_with_hook::hac0a41b89f5ab822
7: 0x791f1765808e - std::panicking::begin_panic_handler::{{closure}}::h1c034067c5755b7e
8: 0x791f17655666 - std::sys_common::backtrace::rust_end_short_backtrace::h3f5e2602c6964099
9: 0x791f17657df2 - rust_begin_unwind
10: 0x791f1729b4c5 - core::panicking::panic_fmt::hcd09b86433080a0a
11: 0x791f1729baf3 - core::result::unwrap_failed::h37e38fafe094d785
12: 0x791f1743f27a - tokenizers_new_from_str
13: 0x791f174369e9 - _ZN10tokenizers9Tokenizer12FromBlobJSONERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
at /workspace/mlc-llm/3rdparty/tokenizers-cpp/src/huggingface_tokenizer.cc:108:63
14: 0x791f17433730 - _ZN3mlc3llm9Tokenizer8FromPathERKN3tvm7runtime6StringESt8optionalINS013TokenizerInfoEE
at /workspace/mlc-llm/cpp/tokenizers/tokenizers.cc:157:57
15: 0x791f17434098 - operator()
at /workspace/mlc-llm/cpp/tokenizers/tokenizers.cc:459:34
16: 0x791f17434098 - run<tvm::runtime::TVMMovableArgValueWithContext>
at /workspace/mlc-llm/3rdparty/tvm/include/tvm/runtime/packed_func.h:1974:11
17: 0x791f17434098 - run<>
at /workspace/mlc-llm/3rdparty/tvm/include/tvm/runtime/packed_func.h:1959:60
18: 0x791f17434098 - unpack_call<mlc::llm::Tokenizer, 1, mlc::llm::<lambda(const tvm::runtime::String&)> >
at /workspace/mlc-llm/3rdparty/tvm/include/tvm/runtime/packed_func.h:1999:46
19: 0x791f17434098 - operator()
at /workspace/mlc-llm/3rdparty/tvm/include/tvm/runtime/packed_func.h:2059:44
20: 0x791f17434098 - Call
at /workspace/mlc-llm/3rdparty/tvm/include/tvm/runtime/packed_func.h:1394:58
21: 0x791f562adeba - TVMFuncCall
22: 0x791fb6df31a5 - _ZL39pyx_f_3tvm_4_ffi_4_cy3_4core_FuncCallPvP7_objectP8TVMValuePi
23: 0x791fb6df3769 - _ZL76pyx_pw_3tvm_4_ffi_4_cy3_4core_10ObjectBase_3__init_handle_by_constructorP7_objectPKS0lS0
24: 0x579946b8a9cc - _PyObject_VectorcallTstate
at /usr/local/src/conda/python-3.11.8/Include/internal/pycore_call.h:92:11
25: 0x579946b8a9cc - PyObject_Vectorcall
at /usr/local/src/conda/python-3.11.8/Objects/call.c:299:12
26: 0x579946b7de36 - _PyEval_EvalFrameDefault
at /usr/local/src/conda/python-3.11.8/Python/ceval.c:4769:23
27: 0x579946ba14c1 - _PyEval_EvalFrame
at /usr/local/src/conda/python-3.11.8/Include/internal/pycore_ceval.h:73:16
28: 0x579946ba14c1 - _PyEval_Vector
at /usr/local/src/conda/python-3.11.8/Python/ceval.c:6434:24
29: 0x579946ba14c1 - _PyFunction_Vectorcall
at /usr/local/src/conda/python-3.11.8/Objects/call.c:393:16
30: 0x579946ba886c - _PyObject_FastCallDictTstate
at /usr/local/src/conda/python-3.11.8/Objects/call.c:141:15
31: 0x579946ba886c - _PyObject_Call_Prepend
at /usr/local/src/conda/python-3.11.8/Objects/call.c:482:24
32: 0x579946ba886c - slot_tp_init
at /usr/local/src/conda/python-3.11.8/Objects/typeobject.c:7854:15
33: 0x579946b70303 - type_call
at /usr/local/src/conda/python-3.11.8/Objects/typeobject.c:1103:19
34: 0x579946b70303 - _PyObject_MakeTpCall
at /usr/local/src/conda/python-3.11.8/Objects/call.c:214:18
35: 0x579946b7de36 - _PyEval_EvalFrameDefault
at /usr/local/src/conda/python-3.11.8/Python/ceval.c:4769:23
36: 0x579946ba14c1 - _PyEval_EvalFrame
at /usr/local/src/conda/python-3.11.8/Include/internal/pycore_ceval.h:73:16
37: 0x579946ba14c1 - _PyEval_Vector
at /usr/local/src/conda/python-3.11.8/Python/ceval.c:6434:24
38: 0x579946ba14c1 - _PyFunction_Vectorcall
at /usr/local/src/conda/python-3.11.8/Objects/call.c:393:16
39: 0x579946ba8933 - _PyObject_FastCallDictTstate
at /usr/local/src/conda/python-3.11.8/Objects/call.c:152:15
40: 0x579946ba8933 - _PyObject_Call_Prepend
at /usr/local/src/conda/python-3.11.8/Objects/call.c:482:24
41: 0x579946ba8933 - slot_tp_init
at /usr/local/src/conda/python-3.11.8/Objects/typeobject.c:7854:15
42: 0x579946b70303 - type_call
at /usr/local/src/conda/python-3.11.8/Objects/typeobject.c:1103:19
43: 0x579946b70303 - _PyObject_MakeTpCall
at /usr/local/src/conda/python-3.11.8/Objects/call.c:214:18
44: 0x579946b7de36 - _PyEval_EvalFrameDefault
at /usr/local/src/conda/python-3.11.8/Python/ceval.c:4769:23
45: 0x579946ba14c1 - _PyEval_EvalFrame
at /usr/local/src/conda/python-3.11.8/Include/internal/pycore_ceval.h:73:16
46: 0x579946ba14c1 - _PyEval_Vector
at /usr/local/src/conda/python-3.11.8/Python/ceval.c:6434:24
47: 0x579946ba14c1 - _PyFunction_Vectorcall
at /usr/local/src/conda/python-3.11.8/Objects/call.c:393:16
48: 0x579946bab1e0 - _PyVectorcall_Call
at /usr/local/src/conda/python-3.11.8/Objects/call.c:257:24
49: 0x579946bab1e0 - _PyObject_Call
at /usr/local/src/conda/python-3.11.8/Objects/call.c:328:16
50: 0x579946bab1e0 - PyObject_Call
at /usr/local/src/conda/python-3.11.8/Objects/call.c:355:12
51: 0x579946b82119 - do_call_core
at /usr/local/src/conda/python-3.11.8/Python/ceval.c:7349:12
52: 0x579946b82119 - _PyEval_EvalFrameDefault
at /usr/local/src/conda/python-3.11.8/Python/ceval.c:5376:22
53: 0x579946c3442d - _PyEval_EvalFrame
at /usr/local/src/conda/python-3.11.8/Include/internal/pycore_ceval.h:73:16
54: 0x579946c3442d - _PyEval_Vector
at /usr/local/src/conda/python-3.11.8/Python/ceval.c:6434:24
55: 0x579946c33abf - PyEval_EvalCode
at /usr/local/src/conda/python-3.11.8/Python/ceval.c:1148:21
56: 0x579946c52a1a - run_eval_code_obj
at /usr/local/src/conda/python-3.11.8/Python/pythonrun.c:1741:9
57: 0x579946c4e593 - run_mod
at /usr/local/src/conda/python-3.11.8/Python/pythonrun.c:1762:19
58: 0x579946c63930 - pyrun_file
at /usr/local/src/conda/python-3.11.8/Python/pythonrun.c:1657:15
59: 0x579946c632ce - _PyRun_SimpleFileObject
at /usr/local/src/conda/python-3.11.8/Python/pythonrun.c:440:13
60: 0x579946c62ff4 - _PyRun_AnyFileObject
at /usr/local/src/conda/python-3.11.8/Python/pythonrun.c:79:15
61: 0x579946c5d6f4 - pymain_run_file_obj
at /usr/local/src/conda/python-3.11.8/Modules/main.c:360:15
62: 0x579946c5d6f4 - pymain_run_file
at /usr/local/src/conda/python-3.11.8/Modules/main.c:379:15
63: 0x579946c5d6f4 - pymain_run_python
at /usr/local/src/conda/python-3.11.8/Modules/main.c:601:21
64: 0x579946c5d6f4 - Py_RunMain
at /usr/local/src/conda/python-3.11.8/Modules/main.c:680:5
65: 0x579946c23a77 - Py_BytesMain
at /usr/local/src/conda/python-3.11.8/Modules/main.c:734:12
66: 0x791fb74d7d90 -
67: 0x791fb74d7e40 - __libc_start_main
68: 0x579946c2391d -
fatal runtime error: failed to initiate panic, error 5
Aborted (core dumped)
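The panic above comes from the bundled HuggingFace tokenizers parser failing on tokenizer.json (the untagged enum ModelWrapper appears to correspond to the model section of that file). As a hedged sketch, one way to isolate the parse failure from MLC is to load the same file with the Python tokenizers package, which wraps the same Rust crate; adjust the path if tokenizer.json lives elsewhere.

# Sketch: try parsing the same tokenizer.json outside MLC to see the raw error.
python -c "from tokenizers import Tokenizer; Tokenizer.from_file('/workspace/models/internlm2_5-20b-chat-q4f32_1-MLC/tokenizer.json')"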
2. Problem 2:
/opt/miniconda3/envs/python3.11/bin/mlc_llm serve \
    /workspace/models/internlm2_5-20b-q4f32_1-MLC \
    --mode server \
    --host 0.0.0.0 \
    --overrides "tensor_parallel_shards=2;prefill_chunk_size=512;gpu_memory_utilization=0.95;max_num_sequence=32"
[2024-11-10 05:05:55] INFO auto_device.py:79: Found device: cuda:0
[2024-11-10 05:05:55] INFO auto_device.py:79: Found device: cuda:1
[2024-11-10 05:05:56] INFO auto_device.py:88: Not found device: rocm:0
[2024-11-10 05:05:58] INFO auto_device.py:88: Not found device: metal:0
[2024-11-10 05:06:00] INFO auto_device.py:88: Not found device: vulkan:0
[2024-11-10 05:06:01] INFO auto_device.py:88: Not found device: opencl:0
[2024-11-10 05:06:01] INFO auto_device.py:35: Using device: cuda:0
[2024-11-10 05:06:01] INFO jit.py:43: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-11-10 05:06:01] INFO jit.py:118: Compiling using commands below:
[2024-11-10 05:06:01] INFO jit.py:119: /opt/miniconda3/envs/python3.11/bin/python3 -m mlc_llm compile /workspace/models/internlm2_5-20b-q4f32_1-MLC --opt 'flashinfer=1;cublas_gemm=1;faster_transformer=0;cudagraph=1;cutlass=1;ipc_allreduce_strategy=NONE' --overrides 'prefill_chunk_size=512;max_batch_size=32;tensor_parallel_shards=2' --device cuda:0 --output /tmp/tmpb33h687o/lib.so
[2024-11-10 05:06:03] INFO auto_config.py:70: Found model configuration: /workspace/models/internlm2_5-20b-q4f32_1-MLC/mlc-chat-config.json
[2024-11-10 05:06:03] INFO auto_target.py:91: Detecting target device: cuda:0
[2024-11-10 05:06:03] INFO auto_target.py:93: Found target: {"thread_warp_size": runtime.BoxInt(32), "arch": "sm_86", "max_threads_per_block": runtime.BoxInt(1024), "max_num_threads": runtime.BoxInt(1024), "kind": "cuda", "max_shared_memory_per_block": runtime.BoxInt(49152), "tag": "", "keys": ["cuda", "gpu"]}
[2024-11-10 05:06:03] INFO auto_target.py:110: Found host LLVM triple: x86_64-redhat-linux-gnu
[2024-11-10 05:06:03] INFO auto_target.py:111: Found host LLVM CPU: haswell
[2024-11-10 05:06:03] INFO auto_target.py:334: Generating code for CUDA architecture: sm_86
[2024-11-10 05:06:03] INFO auto_target.py:335: To produce multi-arch fatbin, set environment variable MLC_MULTI_ARCH. Example: MLC_MULTI_ARCH=70,72,75,80,86,87,89,90a
[2024-11-10 05:06:03] INFO auto_config.py:154: Found model type: internlm2. Use --model-type to override.
Compiling with arguments: --config InternLM2Config(vocab_size=92544, hidden_size=6144, num_hidden_layers=48, num_attention_heads=48, num_key_value_heads=8, rms_norm_eps=1e-05, intermediate_size=16384, bias=False, use_cache=True, rope_theta=50000000, pad_token_id=2, bos_token_id=1, eos_token_id=2, context_window_size=262144, prefill_chunk_size=2048, tensor_parallel_shards=1, max_batch_size=80, head_dim=128, kwargs={}) --quantization GroupQuantize(name='q4f32_1', kind='group-quant', group_size=32, quantize_dtype='int4', storage_dtype='uint32', model_dtype='float32', linear_weight_layout='NK', quantize_embedding=True, quantize_final_fc=True, num_elem_per_storage=8, num_storage_per_group=4, max_int_value=7, tensor_parallel_shards=0) --model-type internlm2 --target {"thread_warp_size": runtime.BoxInt(32), "host": {"mtriple": "x86_64-redhat-linux-gnu", "tag": "", "kind": "llvm", "mcpu": "haswell", "keys": ["cpu"]}, "arch": "sm_86", "max_threads_per_block": runtime.BoxInt(1024), "libs": ["thrust"], "max_num_threads": runtime.BoxInt(1024), "kind": "cuda", "max_shared_memory_per_block": runtime.BoxInt(49152), "tag": "", "keys": ["cuda", "gpu"]} --opt flashinfer=1;cublas_gemm=0;faster_transformer=0;cudagraph=1;cutlass=1;ipc_allreduce_strategy=NONE --system-lib-prefix "" --output /tmp/tmpb33h687o/lib.so --overrides context_window_size=None;sliding_window_size=None;prefill_chunk_size=512;attention_sink_size=None;max_batch_size=32;tensor_parallel_shards=2;pipeline_parallel_stages=None
[2024-11-10 05:06:03] INFO config.py:107: Overriding prefill_chunk_size from 2048 to 512
[2024-11-10 05:06:03] INFO config.py:107: Overriding max_batch_size from 80 to 32
[2024-11-10 05:06:03] INFO config.py:107: Overriding tensor_parallel_shards from 1 to 2
[2024-11-10 05:06:03] INFO compile.py:140: Creating model from: InternLM2Config(vocab_size=92544, hidden_size=6144, num_hidden_layers=48, num_attention_heads=48, num_key_value_heads=8, rms_norm_eps=1e-05, intermediate_size=16384, bias=False, use_cache=True, rope_theta=50000000, pad_token_id=2, bos_token_id=1, eos_token_id=2, context_window_size=262144, prefill_chunk_size=2048, tensor_parallel_shards=1, max_batch_size=80, head_dim=128, kwargs={})
[2024-11-10 05:06:03] INFO compile.py:158: Exporting the model to TVM Unity compiler
[2024-11-10 05:06:07] INFO compile.py:164: Running optimizations using TVM Unity
[2024-11-10 05:06:07] INFO compile.py:185: Registering metadata: {'model_type': 'internlm2', 'quantization': 'q4f32_1', 'context_window_size': 262144, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 512, 'tensor_parallel_shards': 2, 'pipeline_parallel_stages': 1, 'kv_state_kind': 'kv_cache', 'max_batch_size': 32}
[2024-11-10 05:06:09] INFO pipeline.py:54: Running TVM Relax graph-level optimizations
[2024-11-10 05:06:14] INFO pipeline.py:54: Lowering to TVM TIR kernels
[2024-11-10 05:06:25] INFO pipeline.py:54: Running TVM TIR-level optimizations
[2024-11-10 05:07:00] INFO pipeline.py:54: Running TVM Dlight low-level optimizations
[2024-11-10 05:07:02] INFO pipeline.py:54: Lowering to VM bytecode
[2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function `alloc_embedding_tensor`: 12.00 MB
[2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function `argsort_probs`: 0.00 MB
[2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_decode`: 15.05 MB
[2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_prefill`: 71.30 MB
[2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_verify`: 240.75 MB
[2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function `create_tir_paged_kv_cache`: 0.00 MB
[2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function `decode`: 0.47 MB
[2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function `embed`: 12.00 MB
[2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function `multinomial_from_uniform`: 0.00 MB
[2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function `prefill`: 60.38 MB
[2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function `renormalize_by_top_p`: 0.00 MB
[2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function `sample_with_top_p`: 0.00 MB
[2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function `sampler_take_probs`: 0.00 MB
[2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function `sampler_verify_draft_tokens`: 0.00 MB
[2024-11-10 05:07:08] INFO estimate_memory_usage.py:58: [Memory usage] Function `softmax_with_temperature`: 0.00 MB
[2024-11-10 05:07:17] INFO pipeline.py:54: Compiling external modules
[2024-11-10 05:07:17] INFO pipeline.py:54: Compilation complete! Exporting to disk
[2024-11-10 05:07:47] INFO model_metadata.py:95: Total memory usage without KV cache:: 6500.84 MB (Parameters: 6260.09 MB. Temporary buffer: 240.75 MB)
[2024-11-10 05:07:47] INFO model_metadata.py:103: To reduce memory usage, tweak `prefill_chunk_size`, `context_window_size` and `sliding_window_size`
[2024-11-10 05:07:47] INFO compile.py:207: Generated: /tmp/tmpb33h687o/lib.so
[2024-11-10 05:07:48] INFO jit.py:126: Using compiled model lib: /root/.cache/mlc_llm/model_lib/0eb086393737fad474780549d0878131.so
[2024-11-10 05:07:48] INFO engine_base.py:192: The selected engine mode is server. We use as much GPU memory as possible (within the limit of gpu_memory_utilization).
[2024-11-10 05:07:48] INFO engine_base.py:200: If you have low concurrent requests and want to use less GPU memory, please select mode "local".
[2024-11-10 05:07:48] INFO engine_base.py:205: If you don't have concurrent requests and only use the engine interactively, please select mode "interactive".
[05:07:50] /workspace/mlc-llm/cpp/serve/config.cc:688: Under mode "local", max batch size 32 is specified by user, max KV cache token capacity will be set to 8192, prefill chunk size 512 is specified by user.
[05:07:50] /workspace/mlc-llm/cpp/serve/config.cc:688: Under mode "interactive", max batch size 32 is specified by user, max KV cache token capacity will be set to 172266, prefill chunk size 512 is specified by user.
[05:07:50] /workspace/mlc-llm/cpp/serve/config.cc:688: Under mode "server", max batch size 32 is specified by user, max KV cache token capacity will be set to 172266, prefill chunk size 512 is specified by user.
[05:07:50] /workspace/mlc-llm/cpp/serve/config.cc:769: The actual engine mode is "server". So max batch size is 32, max KV cache token capacity is 172266, prefill chunk size is 512.
[05:07:50] /workspace/mlc-llm/cpp/serve/config.cc:774: Estimated total single GPU memory usage: 23037.859 MB (Parameters: 6260.086 MB. KVCache: 16226.536 MB. Temporary buffer: 551.237 MB). The actual usage might be slightly larger than the estimated number.
[05:07:50] /workspace/mlc-llm/cpp/multi_gpu/multi_gpu_loader.cc:150: [Worker #0] Loading model to device: cuda:0
[05:07:50] /workspace/mlc-llm/cpp/multi_gpu/multi_gpu_loader.cc:150: [Worker #1] Loading model to device: cuda:1
[05:07:50] /workspace/mlc-llm/cpp/multi_gpu/multi_gpu_loader.cc:175: Loading parameters... [==================================================================================================>] [485/485]
[05:08:04] /workspace/mlc-llm/cpp/multi_gpu/multi_gpu_loader.cc:203: Loading done. Time used: Loading 11.741 s Preprocessing 2.289 s.
terminate called after throwing an instance of 'tvm::runtime::InternalError'
what(): [05:08:05] /workspace/tvm/src/runtime/cuda/cuda_device_api.cc:145: InternalError: Check failed: (e == cudaSuccess || e == cudaErrorCudartUnloading) is false: CUDA: out of memory
Stack trace:
0: _ZN3tvm7runtime6deta
1: tvm::runtime::CUDADeviceAPI::AllocDataSpace(DLDevice, unsigned long, unsigned long, DLDataType)
2: tvm::runtime::DeviceAPI::AllocDataSpace(DLDevice, int, long const*, DLDataType, tvm::runtime::Optional<tvm::runtime::String>)
3: tvm::runtime::NDArray::Empty(tvm::runtime::ShapeTuple, DLDataType, DLDevice, tvm::runtime::Optional)
4: tvm::runtime::relax_vm::PagedAttentionKVCacheObj::PagedAttentionKVCacheObj(long, long, long, long, long, long, long, long, long, bool, tvm::runtime::relax_vm::RoPEMode, double, double, tvm::runtime::Optional, DLDataType, DLDevice, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::Optional, tvm::runtime::Optional, tvm::runtime::Optional, tvm::runtime::Optional, tvm::runtime::Optional, tvm::runtime::Optional, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::Optional)
5: tvm::runtime::relax_vm::__mk_TVM2::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue )#1}::operator()(tvm::runtime::relax_vm::__mk_TVM2, tvm::runtime::TVMRetValue) const [clone .constprop.0]
6: tvm::runtime::relax_vm::VirtualMachineImpl::RunInstrCall(tvm::runtime::relax_vm::VMFrame, tvm::runtime::relax_vm::Instruction)
7: tvm::runtime::relax_vm::VirtualMachineImpl::RunLoop()
8: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeBytecode(long, std::vector<tvm::runtime::TVMRetValue, std::allocator > const&)
9: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::relax_vm::VirtualMachineImpl::GetClosureInternal(tvm::runtime::String const&, bool)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue )#1}> >::Call(tvm::runtime::PackedFuncObj const, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue)
10: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue)
11: tvm::runtime::DiscoWorker::Impl::CallPacked(tvm::runtime::DiscoWorker, long, tvm::runtime::PackedFunc, tvm::runtime::TVMArgs const&)
12: tvm::runtime::DiscoWorker::Impl::MainLoop(tvm::runtime::DiscoWorker*)
13: execute_native_thread_routine
at ../../../../../libstdc++-v3/src/c++11/thread.cc:104
14: 0x0000755e4b02fac2
15: __clone
16: 0xffffffffffffffff
Aborted (core dumped)
root@34981ae00917:/# Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in _run_code
File "/opt/miniconda3/envs/python3.11/lib/python3.11/site-packages/mlc_llm/cli/worker.py", line 58, in
main()
File "/opt/miniconda3/envs/python3.11/lib/python3.11/site-packages/mlc_llm/cli/worker.py", line 53, in main
worker_func(worker_id, num_workers, num_groups, reader, writer)
File "tvm/_ffi/_cython/./packed_func.pxi", line 339, in tvm._ffi._cy3.core.PackedFuncBase.call
File "tvm/_ffi/_cython/./packed_func.pxi", line 284, in tvm._ffi._cy3.core.FuncCall
File "tvm/_ffi/_cython/./base.pxi", line 185, in tvm._ffi._cy3.core.CHECK_CALL
File "/opt/miniconda3/envs/python3.11/lib/python3.11/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
raise py_err
tvm.error.InternalError: Traceback (most recent call last):
14: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (int, int, int, long, long)>::AssignTypedLambda<void ()(int, int, int, long, long)>(void ()(int, int, int, long, long), std::cxx11::basic_string<char, std::char_traits, std::allocator >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue)#1}> >::Call(tvm::runtime::PackedFuncObj const, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue)
13: tvm::runtime::WorkerProcess(int, int, int, long, long)
12: tvm::runtime::DiscoWorker::Impl::MainLoop(tvm::runtime::DiscoWorker)
11: tvm::runtime::DiscoWorker::Impl::CallPacked(tvm::runtime::DiscoWorker, long, tvm::runtime::PackedFunc, tvm::runtime::TVMArgs const&)
10: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue)
9: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::relax_vm::VirtualMachineImpl::GetClosureInternal(tvm::runtime::String const&, bool)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue)#1}> >::Call(tvm::runtime::PackedFuncObj const, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue)
8: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeBytecode(long, std::vector<tvm::runtime::TVMRetValue, std::allocator > const&)
7: tvm::runtime::relax_vm::VirtualMachineImpl::RunLoop()
6: tvm::runtime::relax_vm::VirtualMachineImpl::RunInstrCall(tvm::runtime::relax_vm::VMFrame , tvm::runtime::relax_vm::Instruction)
5: tvm::runtime::relax_vm::__mk_TVM2::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue)#1}::operator()(tvm::runtime::relax_vm::__mk_TVM2, tvm::runtime::TVMRetValue) const [clone .constprop.0]
4: tvm::runtime::relax_vm::PagedAttentionKVCacheObj::PagedAttentionKVCacheObj(long, long, long, long, long, long, long, long, long, bool, tvm::runtime::relax_vm::RoPEMode, double, double, tvm::runtime::Optional, DLDataType, DLDevice, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::Optional, tvm::runtime::Optional, tvm::runtime::Optional, tvm::runtime::Optional, tvm::runtime::Optional, tvm::runtime::Optional, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::PackedFunc, tvm::runtime::Optional)
3: tvm::runtime::NDArray::Empty(tvm::runtime::ShapeTuple, DLDataType, DLDevice, tvm::runtime::Optional)
2: tvm::runtime::DeviceAPI::AllocDataSpace(DLDevice, int, long const*, DLDataType, tvm::runtime::Optional<tvm::runtime::String>)
1: tvm::runtime::CUDADeviceAPI::AllocDataSpace(DLDevice, unsigned long, unsigned long, DLDataType)
0: _ZN3tvm7runtime6deta
File "/workspace/tvm/src/runtime/cuda/cuda_device_api.cc", line 145
InternalError: Check failed: (e == cudaSuccess || e == cudaErrorCudartUnloading) is false: CUDA: out of memory
Expected behavior
Environment
Additional context